CN113791976A - Method and device for enhancing defect positioning based on program dependence

Info

Publication number: CN113791976A
Application number: CN202111056342.1A
Authority: CN
Inventors: 张天; 潘敏学; 罗雯波
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-09-09
Filing date: 2021-09-09
Publication date: 2021-12-14
Anticipated expiration: 2041-09-09
Also published as: CN113791976B

Abstract

The invention discloses a method and a device for enhancing defect positioning based on program dependence. The method analyzes the statement data dependency relationship of the source code and combines it with a statement defect suspicion degree list, obtained by analyzing the source code with an existing software tool, to form for each statement a characteristic value vector consisting of N+1 suspicion values. The characteristic value vector serves as the sample of that statement and is input into a support vector machine for analysis, yielding an optimized defect suspicion degree list. The positioning result is thereby strengthened, and defect positioning becomes more accurate.

Description

Method and device for enhancing defect positioning based on program dependence

Technical Field

The invention relates to a program code analysis technology, in particular to a defect analysis technology of a program software source code.

Background

Software inevitably contains defects. In software product projects, the testing period is often much longer than the development period, because a great deal of labor is needed to repair the defects found during testing. The first difficulty in defect repair is defect localization, i.e., finding where the software code is defective. When the software is large in scale, finding the defective code is especially time-consuming.

Automated software defect localization methods and tools are therefore gaining increasing attention. In the prior art, defect localization is mainly achieved in two ways: the first traverses the program code with a static inspection tool to find suspicious defects; the second finds suspected defective statements from the statement coverage information collected during program test execution, as in the program-spectrum-based automatic defect localization tools represented by Ochiai.

Disclosure of Invention

The problem to be solved by the invention is as follows: to further analyze the suspected statement defects found by existing tools, so that the suspected positions of statement defects become more accurate.

In order to solve the problems, the invention adopts the following scheme:

The invention discloses a method for enhancing defect location based on program dependence, which comprises the following steps: a data acquisition step, a dependence analysis step, a model training step and a defect analysis step;

the data acquisition step is for: acquiring training data and data to be evaluated;

the training data comprises a source code for training, a defect suspicion degree list for training and a known defect list;

the data to be evaluated comprises a source code to be evaluated and a defect suspicion degree list to be evaluated;

the defect suspicion degree list for training and the defect suspicion degree list to be evaluated are both defect suspicion degree lists;

the dependence analysis step: constructing a control flow graph according to input source codes, and then constructing a statement data dependency relationship among nodes of the control flow graph through data flow analysis of the control flow graph;

the model training step comprises the following steps:

step ST 1: performing statement data dependency relationship analysis on the source code for training through the dependency analysis step to obtain a statement data dependency relationship for training;

step ST 2: constructing a corresponding defect positioning training sample for each statement of the source code for training according to the statement data dependency relationship for training, the defect suspicion degree list for training and the known defect list; the defect positioning training sample comprises a characteristic value vector and a defect label;

step ST 3: inputting the characteristic value vector and the defect label in each defect positioning training sample into a support vector machine for training;

the defect analyzing step includes the steps of:

step SA 1: analyzing the statement data dependency relationship of the source code to be evaluated through the dependency analysis step to obtain the statement data dependency relationship to be evaluated;

step SA 2: constructing a corresponding defect positioning evaluation sample for each statement of the source code to be evaluated according to the statement data dependency relationship to be evaluated and the list of the defect suspicion degree to be evaluated; the defect localization evaluation sample comprises a feature value vector;

step SA 3: inputting the characteristic value vector in each defect positioning evaluation sample into a support vector machine trained in the model training step for evaluation to obtain a new defect suspicion degree list;

in the above steps:

the defect suspicion degree list is a set of suspicious defect positioning information;

the suspicious defect positioning information at least comprises a statement position and a suspicious value;

the known defect table is a set of known defect location information;

the known defect positioning information at least comprises a statement position;

the statement position is used for indicating the position of the current statement in the source code;

the defect label is used for indicating whether the current statement has known defects or not and determining according to the known defect table;

the characteristic value vector is a vector consisting of N+1 suspicion values; the first suspicion value is that of the current statement, and the other N are the N highest suspicion values among the statements that have a dependency relationship with the current statement;

the dependency relationship is determined according to the statement data dependency relationship;

the suspicion value of a statement is determined according to the defect suspicion degree list.

Further, according to the method for enhancing defect localization based on program dependence of the present invention, the dependence analysis step includes the steps of:

step SY 1: constructing a control flow graph according to the input source code;

in the control flow graph, each node corresponds to a source code statement;

step SY 2: analyzing the related variables for each node in the control flow graph to obtain a set of node variable information; the node variable information corresponds to variables and comprises variable information, variable value types and statement positions; the variable value type is divided into change and reference, if the corresponding variable is changed in the corresponding statement, the variable value type is change; if the corresponding variable is used in the corresponding statement, the variable value type is a reference;

step SY 3: merging the set of the node variable information of each precursor node corresponding to each node in the control flow graph into the current node to form a new set of the node variable information in an iterative mode until the set of the node variable information of each node is not changed any more;

step SY 4: extracting the set of node variable information of each node, eliminating the node variable information whose variable value type is reference, and, after simplification processing, forming the data dependency information corresponding to each node as the statement data dependency relationship;

the data dependency information is a set of variable value change information corresponding to the variables carried forward to the current node;

the variable value change information includes change position information;

the change position information indicates the statement position at which the variable value of a variable is changed in a preceding node;

a preceding node is a node located before the current node on a path represented by the control flow graph;

a predecessor node is a preceding node directly connected to the current node.

Further, according to the method for enhancing defect localization based on program dependence of the present invention, in the step SY4, the data dependence information corresponding to each node is converted into a statement list having dependence relationship with the current statement as its statement data dependence relationship.
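As an illustration only, the conversion described in step SY4 might be sketched in Python as follows. The tuple layout `(variable, kind, line)`, the function names, and the choice to exclude the current statement from its own dependency list are assumptions made for this sketch; the "simplification processing" is not specified in the text and is not modelled.

```python
# Entries are (variable, kind, line) tuples; kind is "change" or "reference".
# This layout is a hypothetical encoding of the node variable information.

def data_dependency_info(node_info):
    """Keep only the 'change' entries and drop the variable value type."""
    return {(var, line) for (var, kind, line) in node_info if kind == "change"}

def dependent_statements(node_info, current_line):
    """Statement-list form: lines whose changes reach the current statement."""
    lines = {line for (_, line) in data_dependency_info(node_info)}
    # Assumption: a statement is not reported as a dependency of itself.
    lines.discard(current_line)
    return sorted(lines)

# A node at line 2 that has inherited two changes from line 1.
line2 = {("a", "change", 1), ("b", "change", 1), ("a", "change", 2)}
print(dependent_statements(line2, current_line=2))  # [1]
```

Reference entries never contribute to the result, which is why they can be discarded before the statement list is built.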

The invention relates to a device for enhancing defect location based on program dependence, which comprises: a data acquisition module, a dependence analysis module, a model training module and a defect analysis module;

the data acquisition module is configured to: acquiring training data and data to be evaluated;

the dependency analysis module: constructing a control flow graph according to input source codes, and then constructing a statement data dependency relationship among nodes of the control flow graph through data flow analysis of the control flow graph;

the model training module comprises the following modules:

module MT 1: performing statement data dependency relationship analysis on the source code for training through the dependency analysis module to obtain a statement data dependency relationship for training;

module MT 2: constructing a corresponding defect positioning training sample for each statement of the source code for training according to the statement data dependency relationship for training, the defect suspicion degree list for training and the known defect list; the defect positioning training sample comprises a characteristic value vector and a defect label;

module MT 3: inputting the characteristic value vector and the defect label in each defect positioning training sample into a support vector machine for training;

the defect analysis module comprises the following modules:

module MA 1: analyzing the statement data dependency relationship of the source code to be evaluated through the dependency analysis module to obtain the statement data dependency relationship to be evaluated;

module MA 2: constructing a corresponding defect positioning evaluation sample for each statement of the source code to be evaluated according to the statement data dependency relationship to be evaluated and the list of the defect suspicion degree to be evaluated; the defect localization evaluation sample comprises a feature value vector;

module MA 3: inputting the characteristic value vector in each defect positioning evaluation sample into a support vector machine trained by the model training module for evaluation to obtain a new defect suspicion degree list;

in the above modules:

the known defect table is a set of known defect location information;

Further, according to the apparatus for enhancing defect localization based on program dependence of the present invention, the dependence analysis module comprises the following modules:

module MY 1: constructing a control flow graph according to the input source code;

in the control flow graph, each node corresponds to a source code statement;

module MY 2: analyzing the related variables for each node in the control flow graph to obtain a set of node variable information; the node variable information corresponds to variables and comprises variable information, variable value types and statement positions; the variable value type is divided into change and reference, if the corresponding variable is changed in the corresponding statement, the variable value type is change; if the corresponding variable is used in the corresponding statement, the variable value type is a reference;

module MY 3: merging the set of the node variable information of each precursor node corresponding to each node in the control flow graph into the current node to form a new set of the node variable information in an iterative mode until the set of the node variable information of each node is not changed any more;

module MY 4: extracting the set of node variable information of each node, eliminating the node variable information whose variable value type is reference, and, after simplification processing, forming the data dependency information corresponding to each node as the statement data dependency relationship;

the variable value change information includes change position information;

a predecessor node is a preceding node directly connected to the current node.

Further, according to the apparatus for enhancing defect localization based on program dependence of the present invention, in the module MY4, the data dependence information corresponding to each node is converted into a statement list having dependence relationship with the current statement as its statement data dependence relationship.

The invention has the following technical effects: according to the invention, through a machine learning technology, the original defect positioning result and the data dependency relationship of the program are comprehensively analyzed, and the positioning result is strengthened, so that the defect positioning is more accurate.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention.

Where SY is the dependency analysis step, ST is the model training step, and SA is the defect analysis step.

FIG. 2 is an exemplary source code for an embodiment of the present invention.

FIG. 3 is a control flow graph constructed from the example source code of FIG. 2 in accordance with an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

As shown in fig. 1, the method for enhancing defect localization based on program dependence of the present invention mainly includes a model training step ST and a defect analysis step SA.

The data acquisition step represents the input of the invention; how the data is acquired is not described in detail. The inputs of the invention are training data and data to be evaluated. The training data is used in the model training step ST and includes a source code for training, a defect suspicion degree list for training, and a known defect list. The data to be evaluated is used in the defect analysis step SA and comprises a source code to be evaluated and a defect suspicion degree list to be evaluated.

The source code for training and the source code to be evaluated are both program source codes. The program source code may be written in any of various programming languages, such as C, C++, Java, Ada, Go, or Python; the programming language used is not limited. It is emphasized, however, that the source code for training and the source code to be evaluated typically need to be written in the same programming language.

The defect suspicion degree list for training and the defect suspicion degree list to be evaluated are both defect suspicion degree lists. A defect suspicion degree list is a collection of suspicious defect positioning information, each item of which includes at least a statement position and a suspicion value. The statement position indicates the position of the current statement in the source code; it is usually given as a file name and a line number, and may also be combined with information such as the function, class, or method containing the statement. The suspicion value is a measure in the range of 0 to 1: a value of 1 indicates that the statement has a defect, and a value of 0 indicates that it does not. The defect suspicion degree list is typically generated by other software tools, such as the aforementioned Ochiai.
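For orientation, a suspicion value in the range of 0 to 1 can be produced from test coverage with the standard Ochiai formula, ef / sqrt((ef + nf) * (ef + ep)), where ef and ep count the failing and passing tests that execute a statement and ef + nf is the total number of failing tests. The following Python sketch is not part of the invention; the input layout (`runs` as pairs of covered line numbers and a pass flag) and the use of bare line numbers as statement positions are assumptions made for illustration.

```python
import math

def ochiai(ef, ep, total_failed):
    """Ochiai suspicion value: ef / sqrt(total_failed * (ef + ep))."""
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0

def suspicion_list(runs):
    """runs: list of (set of covered line numbers, passed flag)."""
    total_failed = sum(1 for _, passed in runs if not passed)
    lines = set()
    for covered, _ in runs:
        lines |= covered
    result = {}
    for line in sorted(lines):
        ef = sum(1 for cov, passed in runs if line in cov and not passed)
        ep = sum(1 for cov, passed in runs if line in cov and passed)
        result[line] = ochiai(ef, ep, total_failed)
    return result

# One passing and one failing test over a hypothetical four-line program.
runs = [({1, 2, 3}, True), ({1, 2, 4}, False)]
scores = suspicion_list(runs)
print(scores[4], scores[3])  # 1.0 0.0
```

A statement executed only by failing tests receives the value 1, and a statement executed only by passing tests receives 0, which matches the range described above.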

The known defect table is a collection of known defect positioning information, each item of which includes at least a statement position. Of course, those skilled in the art understand that the known defect table may also be merged directly into the defect suspicion degree list for training; in the merged table, the items with a suspicion value of 1 then correspond to the known defects. The two are kept separate here only for convenience of description.

The model training step ST includes the steps of:

Step ST1, that is, statement data dependency analysis of the source code for training, specifically: analyzing the statement data dependency relationship of the source code for training to obtain the statement data dependency relationship for training.

Step ST2, that is, constructing the defect positioning training samples, specifically: constructing a corresponding defect positioning training sample for each statement of the source code for training according to the statement data dependency relationship for training, the defect suspicion degree list for training and the known defect list. The defect positioning training sample comprises a characteristic value vector and a defect label. The defect label indicates whether the current statement has a known defect; it is determined according to the known defect table and is generally represented as 0 or 1. The characteristic value vector is a vector consisting of N+1 suspicion values: the first is the suspicion value of the current statement, and the other N are the N highest suspicion values among the statements that have a dependency relationship with the current statement. The dependency relationship here is determined by the input statement data dependency relationship, and the suspicion value of a statement is determined according to the input defect suspicion degree list.

Step ST3, that is, training the support vector machine, specifically: inputting the characteristic value vector and the defect label of each defect positioning training sample into a support vector machine for training.
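The sample construction of step ST2 can be sketched in Python as follows. This is a minimal illustration rather than the claimed implementation: the dictionary layouts, the function names, and in particular the zero padding used when a statement has fewer than N dependent statements are assumptions, since the text does not say how that case is handled.

```python
def feature_vector(line, suspicion, deps, n):
    """N+1 values: own suspicion, then the N highest among dependent statements."""
    dep_values = sorted(
        (suspicion.get(d, 0.0) for d in deps.get(line, [])), reverse=True
    )
    top = dep_values[:n]
    # Assumption: pad with 0.0 when fewer than N dependent statements exist.
    top += [0.0] * (n - len(top))
    return [suspicion.get(line, 0.0)] + top

def training_samples(lines, suspicion, deps, known_defects, n):
    """Pair each statement's feature vector with its 0/1 defect label."""
    return [
        (feature_vector(l, suspicion, deps, n), 1 if l in known_defects else 0)
        for l in lines
    ]

# Hypothetical inputs: suspicion values per line and a dependency list per line.
suspicion = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}
deps = {4: [1, 2, 3]}
samples = training_samples([1, 4], suspicion, deps, known_defects={4}, n=2)
print(samples)  # [([0.2, 0.0, 0.0], 0), ([0.7, 0.9, 0.5], 1)]
```

The same `feature_vector` function would serve for evaluation samples, which carry a feature vector but no label.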

The defect analysis step SA includes the steps of:

Step SA1, that is, statement data dependency analysis of the source code to be evaluated, specifically: analyzing the statement data dependency relationship of the source code to be evaluated to obtain the statement data dependency relationship to be evaluated.

Step SA2: constructing a corresponding defect positioning evaluation sample for each statement of the source code to be evaluated according to the statement data dependency relationship to be evaluated and the defect suspicion degree list to be evaluated; the defect positioning evaluation sample includes a characteristic value vector.

Step SA3: inputting the characteristic value vector of each defect positioning evaluation sample into the support vector machine trained in the model training step for evaluation to obtain a new defect suspicion degree list.
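To make the training and evaluation flow concrete, the sketch below trains a small linear soft-margin support vector machine by subgradient descent on the hinge loss and then re-ranks evaluation samples. It is a stand-in written in plain Python: the text does not specify the kernel, the library, the hyperparameters, or how the machine's output is mapped to a new suspicion value, so the linear model, the values of `epochs`, `lr` and `reg`, and the use of the decision value as the new score are all assumptions.

```python
def train_linear_svm(samples, epochs=200, lr=0.1, reg=0.01):
    """samples: list of (feature vector, label) with label in {0, 1}."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, label in samples:
            y = 1.0 if label == 1 else -1.0
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights at every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def decision_value(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def rerank(model, eval_samples):
    """eval_samples: {statement position: feature vector}; most suspicious first."""
    scored = [(pos, decision_value(model, x)) for pos, x in eval_samples.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical training samples with N = 1 (two values per feature vector).
train = [([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.1, 0.2], 0), ([0.2, 0.1], 0)]
model = train_linear_svm(train)
new_list = rerank(model, {10: [0.85, 0.9], 11: [0.15, 0.1]})
print([pos for pos, _ in new_list])
```

In practice an off-the-shelf support vector machine implementation would normally replace the hand-written trainer; the point of the sketch is only the data flow from labelled feature vectors to a re-ordered list.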

In the above steps, step ST1 is substantially the same as step SA1; they differ only in the program source code they take as input. In this embodiment, both step ST1 and step SA1 are realized by means of the dependency analysis step SY. In an implementation, the dependency analysis step SY may be embodied as a routine invoked through a function call. The dependency analysis step SY analyzes the data dependency relationship between statements and outputs the statement data dependency relationship. The statement data dependency relationship for training and the statement data dependency relationship to be evaluated are both statement data dependency relationships output by the dependency analysis step SY.

The dependency analysis step builds a control flow graph from the input source code and then builds the statement data dependency relationship among the nodes of the control flow graph through data flow analysis of that graph. Here, the data shared between statements are program variables, the data flow is the way in which program variables are changed and referenced, and the statement data dependency relationship is therefore the dependency between statements through their variables. In this embodiment, the dependency analysis step is implemented as follows:

step SY 1: constructing a control flow graph according to the input source code;

step SY 2: analyzing related variables for each node in a control flow graph to obtain a set of node variable information;

step SY 3: merging the set of node variable information of each precursor node corresponding to each node in the control flow graph into the current node to form a new set of node variable information in an iterative mode until the set of node variable information of each node is not changed any more;

step SY 4: extracting the set of node variable information of each node, eliminating the node variable information whose variable value type is reference, and, after simplification processing, forming the data dependency information corresponding to each node as the statement data dependency relationship.

In step SY1, constructing a control flow graph from source code is familiar to those skilled in the art. It should be noted that, in a conventional control flow graph, each node is a basic block of the program. For example, the code illustrated in fig. 2 is a program segment written in C++ that lists the first 100 values of the Fibonacci sequence. There, lines 1, 2, 3 and 4 form one basic block, and hence one node of a conventional control flow graph; lines 10, 11, 12 and 13 likewise form one basic block and one node. In this embodiment, to make the positioning analysis easier, each basic block is split into separate statements; that is, each statement serves as one node of the control flow graph. Thus the basic block formed by lines 1, 2, 3 and 4 is decomposed into four nodes of the control flow graph, and so is the basic block formed by lines 10, 11, 12 and 13. Fig. 3 shows the control flow graph constructed in this embodiment, with statements as nodes. Because each statement in the example code occupies its own line and each line holds only one statement, the control flow graph of fig. 3 uses line numbers as node identifiers: the node identifiers Line1 to Line16 are the line numbers of the corresponding statements.
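Splitting basic blocks into statement-level nodes can be sketched in Python as follows. The block identifiers and the block edges in the example are hypothetical and do not reproduce the full graph of fig. 3; the function name and the predecessor-map output format are likewise assumptions, and every block is assumed to be non-empty.

```python
def split_basic_blocks(blocks, block_edges):
    """blocks: {block id: [line numbers in order]}; block_edges: [(src, dst)].

    Returns {line: set of predecessor lines} with one node per statement.
    """
    preds = {line: set() for lines in blocks.values() for line in lines}
    for lines in blocks.values():
        # Inside a former basic block, control flows straight down.
        for prev, cur in zip(lines, lines[1:]):
            preds[cur].add(prev)
    for src, dst in block_edges:
        # The last statement of src flows into the first statement of dst.
        preds[blocks[dst][0]].add(blocks[src][-1])
    return preds

# Hypothetical layout: an entry block, a loop header, and a loop body.
blocks = {"B1": [1, 2, 3, 4], "B2": [5], "B3": [10, 11, 12, 13]}
edges = [("B1", "B2"), ("B2", "B3"), ("B3", "B2")]
preds = split_basic_blocks(blocks, edges)
print(sorted(preds[5]))  # [4, 13]
```

Keeping the graph as a predecessor map is convenient because the later merging step only ever asks which nodes flow into the current one.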

In step SY2, each item of node variable information corresponds to a variable and includes variable information, a variable value type and a statement position. The variable information is usually a variable name; when the code base is large, the function name, method name, and class name may be combined with it so that the variable is identified unambiguously. For example, if the code illustrated in fig. 2 belongs to a function named fibonacci100, the variables a, b, i, and t may be referred to as fibonacci100_a, fibonacci100_b, fibonacci100_i, and fibonacci100_t, respectively, as variable information. The variable value type is either change or reference: if the variable is assigned or created in the statement, its variable value type is change; if the variable is used in the statement, its variable value type is reference. For example, if the statement corresponding to node Line2 is a = 1, the set of node variable information of that node may be represented as { { a, change, Line2} }, where a is the variable name, change indicates that the value of the variable a is assigned and thus changed in the statement, and Line2 is the statement position given as a line number.
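The initialization of step SY2 can be illustrated with a short Python sketch. It uses Python's own `ast` module as a stand-in parser, which works only because simple statements such as `a = 1` or `t = a + b` happen to be valid Python as well; a real implementation for C++ source would need a C++ front end. The tuple layout and the decision to ignore the name of a called function are assumptions of this sketch.

```python
import ast

def node_variable_info(statement, line):
    """Return {(variable, kind, line)} for one simple statement."""
    tree = ast.parse(statement)
    # Names used as the callee of a call are functions, not variables.
    callees = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    info = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and id(node) not in callees:
            kind = "change" if isinstance(node.ctx, ast.Store) else "reference"
            info.add((node.id, kind, line))
    return info

print(node_variable_info("a = 1", 2))           # {('a', 'change', 2)}
print(node_variable_info('print("done")', 4))   # set()
```

A statement that assigns one variable from others, such as `t = a + b`, yields one change entry for `t` and one reference entry for each of `a` and `b`.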

For simplicity of processing, in this embodiment a control flow graph is built per function or per class method. If a function or method call occurs in the code, the call is treated as a statement and a node is constructed for it. At the call, the variables passed as parameters are treated as being assigned in that statement, since the parameters of a function or method may be assigned values during its execution. In addition, a function or method call may involve global variables or class members. One approach is to add the node variable information of all global variables and class members to the set of node variable information of the statement; the other is to add only the node variable information of the global variables and class members actually involved in the called function or method.

In addition, it should be noted that not every statement involves a variable. For example, the statements corresponding to nodes Line4 and Line16 are calls of the function print that involve no variable, so the corresponding sets of node variable information are empty.

Step SY3 is a loop iteration process, for which step SY2 serves as the initialization of the set of node variable information of each node. In each iteration of step SY3, every node is visited according to the control flow graph, and the sets of node variable information of all predecessor nodes of the node are merged into the current node to form a new set of node variable information. A predecessor node is a preceding node directly connected to the current node, and a preceding node is a node located before the current node on a path represented by the control flow graph. For example, in the control flow graph illustrated in fig. 3, the predecessor nodes of node Line5 are node Line4, node Line8, and node Line13.
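The iteration of step SY3 amounts to a fixed-point computation, which might be sketched in Python as follows. The data layout, the function name, the in-place update, and the visiting order (ascending node identifier) are assumptions of this sketch, and the small loop graph in the example is hypothetical rather than the graph of fig. 3.

```python
def propagate(initial, preds):
    """Merge predecessor sets into each node until nothing changes.

    initial: {node: set of (variable, kind, line)}
    preds:   {node: iterable of predecessor nodes}
    """
    info = {node: set(entries) for node, entries in initial.items()}
    changed = True
    while changed:
        changed = False
        for node in sorted(info):  # assumption: visit nodes in line order
            merged = set(info[node])
            for p in preds.get(node, ()):
                merged |= info[p]
            if merged != info[node]:
                info[node] = merged
                changed = True
    return info

# Hypothetical graph with a loop: 1 -> 2 -> 3 -> 2, and 2 -> 4.
initial = {
    1: {("a", "change", 1)},
    2: {("a", "reference", 2)},
    3: {("a", "change", 3)},
    4: set(),
}
preds = {2: [1, 3], 3: [2], 4: [2]}
result = propagate(initial, preds)
print(sorted(result[4]))
```

Because the sets only ever grow and the number of possible entries is finite, the loop is guaranteed to terminate.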

Take the code illustrated in fig. 2 and the control flow graph illustrated in fig. 3 as an example. After step SY2, the sets of node variable information of each node of the control flow graph are respectively:

node Line1: { { a, change, Line1}, { b, change, Line1} };

node Line2: { { a, change, Line2} };

node Line3: { { b, change, Line3} };

node Line4: { };

node Line5: { { i, change, Line5} };

node Line7: { { a, reference, Line7} };

node Line8: { { a, reference, Line8} };

node Line10: { { a, reference, Line10}, { b, reference, Line10}, { t, change, Line10} };

node Line11: { { a, reference, Line11}, { b, change, Line11} };

node Line12: { { t, reference, Line12}, { a, change, Line12} };

node Line13: { { i, reference, Line13}, { a, reference, Line13} };

node Line16: { }.

After the control flow graph traversal of the first round of step SY3 is performed, the sets of node variable information of each node of the control flow graph are respectively:

node Line 1: { { a, Change, Line1}, { b, Change, Line1} };

node Line 2: { { a, Change, Line1}, { b, Change, Line1}, { a, Change, Line2} };

node Line 3: { { a, Change, Line1}, { b, Change, Line1}, { a, Change, Line2}, { b, Change, Line3} };

node Line 4: { { a, Change, Line1}, { b, Change, Line1}, { a, Change, Line2}, { b, Change, Line3} };

node Line 5: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13} };

node Line 7: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, quote, Line8}, { i, quote, Line13}, { a, quote, Line13}, { a, quote, Line7} };

node Line 8: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, quote, Line8}, { i, quote, Line13}, { a, quote, Line13}, { a, quote, Line7}, { a, quote, Line8} };

node Line 10: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { t, change, Line10} };

node Line 11: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { t, change, Line10}, { a, reference, Line11}, { b, change, Line11} };

node Line 12: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { t, change, Line10}, { a, reference, Line11}, { b, change, Line11}, { t, reference, Line12}, and { a, change, Line12 };

node Line 13: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { t, change, Line10}, { a, reference, Line11}, { b, change, Line11}, { t, reference, Line12}, and { a, change, Line12 };

node Line 16: { { a, Change, Line1}, { b, Change, Line1}, { a, Change, Line2}, { b, Change, Line3}, { a, reference, Line8}, { a, reference, Line13} }.

After the second round of traversal of the control flow graph of step SY3, the sets of node variable information of each node of the control flow graph are respectively:

node Line 1: { { a, Change, Line1}, { b, Change, Line1} };

node Line 2: { { a, Change, Line1}, { b, Change, Line1}, { a, Change, Line2} };

node Line 5: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { a, reference, Line11}, { b, change, Line11}, { a, change, Line12} };

node Line 7: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { a, reference, Line11}, { b, change, Line11}, { a, change, Line12} };

node Line 8: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { a, reference, Line11}, { b, change, Line11}, { a, change, Line12} };

node Line 10: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { t, change, Line10}, { a, reference, Line11}, { b, change, Line11}, { t, reference, Line12}, and { a, change, Line12 };

node Line 11: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { t, change, Line10}, { a, reference, Line11}, { b, change, Line11}, { t, reference, Line12}, and { a, change, Line12 };

node Line 16: { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { a, reference, Line8}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { a, reference, Line11}, { b, change, Line11}, { a, change, Line12} }.

After two rounds of traversing the control flow graph in step SY3, the set of node variable information of each node of the control flow graph no longer changes, and the loop iteration of step SY3 ends.
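The iteration to a stable state described above can be sketched as follows. This is only a minimal illustration, not the embodiment's implementation: the function name `propagate`, the tuple layout `(variable, type, line)` and the four-node control flow graph (a loop 2 -> 3 -> 2 with exit node 4) are assumptions made for the example and are not the code of fig. 2. Scope handling is deliberately left out.

```python
def propagate(preds, own):
    """Merge the sets of all predecessor nodes into each node until no set changes.

    preds: hypothetical mapping node -> list of predecessor nodes
    own:   hypothetical mapping node -> set of (variable, type, line) tuples
           produced by the statement of that node itself
    """
    info = {node: set(entries) for node, entries in own.items()}
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        for node in sorted(info):
            merged = set(info[node])
            for pred in preds.get(node, ()):
                merged |= info[pred]
            if merged != info[node]:
                info[node] = merged
                changed = True
    return info, rounds


# Assumed toy graph: 1 -> 2 -> 3 -> 2 (loop back edge), 2 -> 4
preds = {2: [1, 3], 3: [2], 4: [2]}
own = {
    1: {("a", "change", 1)},
    2: {("a", "reference", 2)},
    3: {("a", "change", 3)},
    4: {("a", "reference", 4)},
}
info, rounds = propagate(preds, own)
```

In this toy graph the first round changes the sets of nodes 2, 3 and 4, and the second round only confirms that nothing changes any more, so the loop stops.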

Furthermore, those skilled in the art will understand that the scope of a variable needs to be considered when merging the sets of node variable information of a node's predecessor nodes into the current node. For example, in the set of node variable information of node Line5, { { a, change, Line1}, { b, change, Line1}, { a, change, Line2}, { b, change, Line3}, { i, change, Line5}, { a, reference, Line8}, { i, reference, Line13}, { a, reference, Line13}, { a, reference, Line7}, { a, reference, Line10}, { b, reference, Line10}, { a, reference, Line11}, { b, change, Line11}, { a, change, Line12} }, the variable i is defined by the for loop and its scope is limited to the for loop body, whereas node Line16 is outside the for loop body. Therefore, when the set of node variable information of node Line5 is merged into node Line16, the node variable information related to the variable i should be removed from the set of node Line16. As another example, the scope of the temporary variable t is limited to the statements Line10, Line11, Line12 and Line13, so when the set of node variable information of the predecessor node Line13 of node Line5 is merged into node Line5, the node variable information related to the temporary variable t should be removed.

In step SY4, the data dependency information is the set of variable value change information for the variables recorded at the current node; the variable value change information includes a set of change position information; the change position information indicates the statement position at which the value of the variable was changed in a preceding node. Continuing the previous example, from the set of node variable information of each node formed from the code illustrated in fig. 2, the node variable information whose variable value type is reference is removed. Since all remaining entries then have the variable value type change, the type parameter carries no further information and is removed as well. The data dependency information of each node is then obtained as follows:

node Line 1: { { a, Line1}, { b, Line1} };

node Line 2: { { a, Line1}, { b, Line1}, { a, Line2} };

node Line 3: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3} };

node Line 4: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3} };

node Line 5: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { i, Line5}, { b, Line11}, { a, Line12} };

node Line 7: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { i, Line5}, { b, Line11}, { a, Line12} };

node Line 8: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { i, Line5}, { b, Line11}, { a, Line12} };

node Line 10: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { i, Line5}, { t, Line10}, { b, Line11}, { a, Line12} };

node Line 11: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { i, Line5}, { t, Line10}, { b, Line11}, { a, Line12} };

node Line 12: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { i, Line5}, { t, Line10}, { b, Line11}, { a, Line12} };

node Line 13: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { i, Line5}, { t, Line10}, { b, Line11}, { a, Line12} };

node Line 16: { { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { b, Line11}, { a, Line12} }.
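As an illustration of this filtering in step SY4, the following minimal sketch drops the reference entries and the type field. The function name `to_data_dependency` and the tuple representation are assumptions made for the example; the input is the set of node Line16 listed above after the second traversal round.

```python
def to_data_dependency(node_info):
    """Keep only 'change' entries and drop the type field.

    node_info: set of (variable, type, line) tuples of one node
    returns:   set of (variable, line) tuples
    """
    return {(var, line) for (var, kind, line) in node_info if kind == "change"}


# Node variable information of node Line16 from the example above
line16_info = {
    ("a", "change", 1), ("b", "change", 1), ("a", "change", 2),
    ("b", "change", 3), ("a", "reference", 8), ("a", "reference", 13),
    ("a", "reference", 7), ("a", "reference", 10), ("b", "reference", 10),
    ("a", "reference", 11), ("b", "change", 11), ("a", "change", 12),
}
line16_dep = to_data_dependency(line16_info)
```

The result contains the six entries { a, Line1}, { b, Line1}, { a, Line2}, { b, Line3}, { b, Line11} and { a, Line12}, which agrees with the data dependency information of node Line16 given above.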

The data dependency information is then simplified as follows:

1. removing the statement to which the node itself corresponds;

2. removing the variable information from the data dependency information.

The data dependency information of each node, which represents the statement data dependency relationship, is then obtained as follows:

node Line 1: { };

node Line 2: { Line1 };

node Line 3: { Line1, Line2 };

node Line 4: { Line1, Line2, Line3 };

node Line 5: { Line1, Line2, Line3, Line11, Line12 };

node Line 7: { Line1, Line2, Line3, Line5, Line11, Line12 };

node Line 8: { Line1, Line2, Line3, Line5, Line11, Line12 };

node Line 10: { Line1, Line2, Line3, Line5, Line11, Line12 };

node Line 11: { Line1, Line2, Line3, Line5, Line10, Line12 };

node Line 12: { Line1, Line2, Line3, Line5, Line10, Line11 };

node Line 13: { Line1, Line2, Line3, Line5, Line10, Line11, Line12 };

node Line 16: { Line1, Line2, Line3, Line11, Line12 }.

In the above statement data dependency relationship, for example, the data dependency information of node Line12 is { Line1, Line2, Line3, Line5, Line10, Line11}, which means that the statement Line12 depends on the statements Line1, Line2, Line3, Line5, Line10 and Line11; conversely, the statements Line1, Line2, Line3, Line5, Line10 and Line11 are depended on by the statement Line12.
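The two simplification steps can be sketched as below. This is a minimal illustration under assumed names (`statement_dependencies`, integer line numbers); the inputs are the data dependency information of nodes Line11 and Line2 listed earlier.

```python
def statement_dependencies(node_line, data_dep):
    """Reduce (variable, line) pairs to a sorted list of statement lines.

    Step 1: drop the line of the node itself.
    Step 2: drop the variable names; duplicates collapse in the set.
    """
    return sorted({line for (_var, line) in data_dep if line != node_line})


# Data dependency information of node Line11 from the example above
line11_dep = {("a", 1), ("b", 1), ("a", 2), ("b", 3),
              ("i", 5), ("t", 10), ("b", 11), ("a", 12)}
# Data dependency information of node Line2 from the example above
line2_dep = {("a", 1), ("b", 1), ("a", 2)}

deps_11 = statement_dependencies(11, line11_dep)  # [1, 2, 3, 5, 10, 12]
deps_2 = statement_dependencies(2, line2_dep)     # [1]
```

Both results agree with the sets listed above for node Line11 and node Line2.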

Step ST2 may be specifically divided into two steps:

step ST21: constructing a feature value vector for each statement according to the statement data dependency relationship for training and the defect suspicion degree list for training;

step ST22: marking each statement with a defect label indicating whether it is defective according to the known defect table.

Step SA2 specifically constructs a feature value vector for each statement according to the statement data dependency relationship to be evaluated and the defect suspicion degree list to be evaluated.

Step SA2 is therefore substantially the same as step ST21: in short, the feature value vector of each statement is constructed from the statement data dependency relationship and the defect suspicion degree list. Specifically, the statements of the source code are traversed, and each traversed statement is processed as follows:

First, the defect suspicion degree list is searched for a suspicion value of the current statement. If one is found, it is taken as the first element of the statement's feature value vector; if not, the first element of the statement's feature value vector is set to 0.

Then, the statements that have a dependency relationship with the current statement are found according to the statement data dependency relationship, their suspicion values are extracted from the defect suspicion degree list, and the N highest suspicion values are selected as the last N elements of the statement's feature value vector. If fewer than N statements have a dependency relationship with the current statement, or fewer than N of those statements have a suspicion value in the defect suspicion degree list, the remaining elements are filled with 0. In this embodiment, N is 6; those skilled in the art may also use 7, 8 or another value.

That is, the feature value vector of a statement is a vector consisting of N + 1 suspicion values.
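The construction of the feature value vector can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `feature_vector`, the dictionary representation of the two inputs and all suspicion values are hypothetical; only the dependency sets of Line12 and Line2 and the value N = 6 are taken from this embodiment.

```python
def feature_vector(stmt, deps, suspicion, n=6):
    """Build the N + 1 element feature value vector of one statement.

    deps:      mapping statement -> statements it has a dependency relationship with
    suspicion: mapping statement -> suspicion value (defect suspicion degree list)
    """
    # First element: the statement's own suspicion value, or 0 if it has none.
    first = suspicion.get(stmt, 0.0)
    # Suspicion values of related statements that appear in the list, highest first.
    related = sorted(
        (suspicion[d] for d in deps.get(stmt, ()) if d in suspicion),
        reverse=True,
    )
    top = related[:n]
    # Pad with 0 when fewer than N values are available.
    top += [0.0] * (n - len(top))
    return [first] + top


deps = {
    "Line12": ["Line1", "Line2", "Line3", "Line5", "Line10", "Line11"],
    "Line2": ["Line1"],
}
# Hypothetical suspicion values; Line3 and Line11 are absent from the list.
suspicion = {"Line1": 0.2, "Line2": 0.5, "Line5": 0.7, "Line10": 0.9, "Line12": 0.6}

vec12 = feature_vector("Line12", deps, suspicion)
vec2 = feature_vector("Line2", deps, suspicion)
```

With these assumed values, `vec12` is `[0.6, 0.9, 0.7, 0.5, 0.2, 0.0, 0.0]`: the statement's own value, then the four available values of its related statements in descending order, then two padding zeros.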

The core of steps ST3 and SA3 is the support vector machine. Support vector machines (SVMs) are a class of generalized linear classifiers that perform binary classification of data by supervised learning. In this embodiment, the support vector machine is a linear support vector machine, that is, a Linear SVM. The linear support vector machine is familiar to those skilled in the art and is not described in detail herein.

In addition, step SA2, which essentially corresponds to step ST21, needs to find the statements that have a dependency relationship with the current statement, whereas in the present embodiment the data dependency information of each statement is the list of statements on which that statement depends. To simplify the processing in step SA2 and step ST21, the data dependency information of each statement may be converted into a list of statements having a dependency relationship with the current statement.
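One possible reading of this conversion is sketched below. It is an assumption of this example, not something the text states, that "having a dependency relationship" covers both directions, i.e. the statements a statement depends on as well as the statements that depend on it. The function name `related_statements` and the integer line numbers are likewise hypothetical; the input uses the dependency sets of nodes Line1 to Line3 given above.

```python
def related_statements(depends_on):
    """Turn a 'depends on' mapping into a symmetric 'is related to' mapping.

    depends_on: mapping statement -> set of statements it depends on
    returns:    mapping statement -> set of statements related in either direction
    """
    related = {stmt: set(deps) for stmt, deps in depends_on.items()}
    for stmt, deps in depends_on.items():
        for dep in deps:
            # Assumed reading: the depended-on statement is also related to stmt.
            related.setdefault(dep, set()).add(stmt)
    return related


# Dependency sets of nodes Line1, Line2 and Line3 from the example above
depends_on = {1: set(), 2: {1}, 3: {1, 2}}
related = related_statements(depends_on)
```

Under this reading, statement Line1, which itself depends on nothing, becomes related to Line2 and Line3, so the suspicion values of those statements could contribute to its feature value vector.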

Claims

1. A method for enhancing defect localization based on program dependence, the method comprising: a data acquisition step, a dependence analysis step, a model training step and a defect analysis step;

the model training step comprises the following steps:

the defect analyzing step includes the steps of:

in the above-mentioned steps, the above-mentioned step,

the known defect table is a set of known defect location information;

2. The method for enhancing defect localization based on program dependence of claim 1, wherein the dependence analysis step comprises the steps of:

in the control flow graph, each node corresponds to a source code statement;

the variable value change information includes change position information;

the precursor node refers to a preamble node connected with the current node.

3. The method for enhancing defect localization based on program dependence of claim 2, wherein in step SY4 the data dependency information corresponding to each node is converted into a statement list having a dependency relationship with the current statement as its statement data dependency relationship.

4. An apparatus for enhancing defect localization based on program dependence, the apparatus comprising: a data acquisition module,

the model training module comprises the following modules:

the defect analysis module comprises the following modules:

in the above-mentioned modules, the modules,

the known defect table is a set of known defect location information;

5. The apparatus according to claim 4, wherein the dependency analysis module comprises:

in the control flow graph, each node corresponds to a source code statement;

the variable value change information includes change position information;

the precursor node refers to a preamble node connected with the current node.

6. The apparatus for enhancing defect localization based on program dependence according to claim 5, wherein the module MY4 is configured to convert the data dependence information corresponding to each node into a statement list having dependence relationship with the current statement as its statement data dependence relationship.