CN113434418A - Knowledge-driven software defect detection and analysis method and system - Google Patents

Publication number
CN113434418A
Authority
CN
China
Prior art keywords
defect
knowledge
code
graph
defects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110726068.8A
Other languages
Chinese (zh)
Inventor
薄莉莉
曹思聪
孙小兵
李世豪
李斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University
Priority to CN202110726068.8A
Publication of CN113434418A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a knowledge-driven software defect detection and analysis method and system, which mainly comprise the following steps: constructing a defect data set; constructing a code feature graph to model the code and embedding the code into a vector space; learning implicit features of the defect code through a graph neural network and training a defect detection model; detecting defects in a project with the detection model, and matching each detected defect to similar known defects in the defect data set through similarity calculation; and extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, and constructing a software defect knowledge graph to help developers understand the problems the detected defects may cause. The invention addresses the high false alarm rate of traditional defect detection techniques, which stems from the difficulty of formulating effective and comprehensive defect patterns, and analyzes the detected defects from a knowledge perspective, so that it is applicable to a wider range of practical scenarios, with higher precision and stronger interpretability.

Description

Knowledge-driven software defect detection and analysis method and system
Technical Field
The invention belongs to the field of software security, and particularly relates to a knowledge-driven software defect detection and analysis method and system.
Background
A software defect refers to a case where the results of a software product do not meet the software requirements or end-user expectations. It often causes the software to produce incorrect or unexpected results or to behave in unexpected ways. The increasing complexity and dependence of software make it harder to achieve high quality, low cost and maintainability, and increase the likelihood of introducing software defects. Most traditional defect detection methods target certain specific types of defects and use regular expressions to match statements in the code that may be defective. However, it is impractical to rely entirely on defect rules written by human experts to cover the growing variety of software defects.
Some existing work uses machine learning methods to detect software defects. Hogiyuan et al. propose a search-based semi-supervised ensemble method, which integrates base classifiers built from a small number of labeled target instances through a genetic algorithm with global search capability, and thereby significantly improves the performance of cross-project defect detection. Chen Shu et al. propose combining instance-weighted domain adaptation with the training process of a machine learning prediction model: weights associated with the target project samples are constructed, and these instance weights influence the parameter learning of the prediction model so that the distribution characteristics of the defect data in the target project are adapted to the training data set, thereby achieving reuse of defect data samples and cross-project software defect prediction. However, most machine-learning-based software defect detection methods work at file-level granularity, so when they are deployed in real application scenarios, such coarse-grained detection results give users little help in maintaining the code and assuring its quality.
Disclosure of Invention
The purpose of the invention is as follows: in view of the problems in the prior art, the invention aims to provide a knowledge-driven software defect detection and analysis method and system that offer a wider application field, higher precision and stronger interpretability.
The technical scheme adopted for realizing the purpose of the invention is as follows: a knowledge-driven software defect detection and analysis method, the method comprising the steps of:
step 1, collecting public defect data and constructing a defect data set;
step 2, preprocessing the defect codes, modeling the codes through a code feature graph and embedding the codes into a vector space; the code feature graph is constructed from an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG, the edge types of the graph are AST, CFG and PDG, and the node set is the node set of the abstract syntax tree AST; a defect feature subgraph related to the defect code lines is extracted from the code feature graph for graph embedding;
step 3, learning implicit features of the defect codes through a graph neural network, and training a defect detection model;
step 4, processing the project to be detected, inputting the processed project into the optimal detection model, detecting the defects in the project, performing similarity calculation between each detected defect and the known defects of the same defect type in the defect data set, and matching the known defect with the highest similarity;
step 5, extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, constructing a software defect knowledge graph, and mining key information in the similar defect reports to assist a developer in understanding the detected defect.
Further, the step 1 of collecting public defect data and constructing a defect data set includes the following specific processes:
step 1-1, collecting public defect data including defect reports and defect codes from a defect tracking library and an open source code library;
step 1-2, preprocessing the acquired defect data, extracting defect codes at the function level, cleaning the function-level defect codes, and removing redundant information including code comments and declared global parameters to obtain a defect data set.
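The cleaning in step 1-2 can be illustrated with a minimal Python sketch. It covers only the removal of code comments from C-style function code, not the removal of declared global parameters; the function name `strip_comments` and the sample snippet are assumptions made for illustration and are not part of the described method.

```python
import re

# Matches a C-style comment, or a string/char literal that must be preserved.
_TOKEN = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)


def strip_comments(source: str) -> str:
    """Remove // and /* */ comments while leaving string literals untouched."""

    def _replace(match: re.Match) -> str:
        text = match.group(0)
        # Comments start with "/", literals start with a quote character.
        return " " if text.startswith("/") else text

    return _TOKEN.sub(_replace, source)


# Hypothetical function-level snippet used only for illustration.
code = 'int f(int a) { /* add one */ return a + 1; // done\n}'
cleaned = strip_comments(code)
```

Matching string literals in the same pattern keeps a `//` that appears inside a quoted string from being mistaken for a comment.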
Further, the step 2 of preprocessing the defect code, modeling the code through a code feature map and embedding the code into a vector space includes:
step 2-1, classifying the defects according to description information in the defect report and by combining the reasons of the defects, merging the defect sub-types through an abstract relation, and obtaining a defect type table, wherein the defect types comprise functional defects, interface defects, logic defects, calculation defects and data defects;
step 2-2, classifying the defect data in the defect data set, constructing a corpus, and dividing a training set and a verification set;
step 2-3, performing code representation on the defect codes in the training set, representing the defect codes as an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG, synthesizing the three graphs, and constructing a code characteristic graph, wherein the edge type of the graph is formed by the AST, the CFG and the PDG, and the node set is formed by a node set of the abstract syntax tree AST;
step 2-4, segmenting the defect code characteristic graph obtained in the step 2-3 through program slicing to obtain a defect characteristic subgraph related to a defect code line;
step 2-5, identifying variable names and function names in the defect feature subgraph, renaming them according to their order of appearance within a single function, thereby standardizing the code and eliminating noise caused by human naming;
step 2-6, performing graph embedding on the leaf nodes and the statement nodes in the defect feature subgraph by using Word2Vec and Doc2Vec respectively, to obtain vector representations that can be used as input of a graph neural network.
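As an illustration of the code feature graph described in step 2-3, the following Python sketch stores one node set (the AST nodes) together with edges typed as AST, CFG or PDG. The class name `CodeFeatureGraph`, the node ids and the hand-written edges are hypothetical; in practice the three graphs would be produced by a program analysis tool rather than written by hand.

```python
from dataclasses import dataclass, field

EDGE_TYPES = ("AST", "CFG", "PDG")


@dataclass
class CodeFeatureGraph:
    nodes: dict = field(default_factory=dict)  # node id -> code text of the AST node
    edges: set = field(default_factory=set)    # (source id, target id, edge type)

    def add_edges(self, edge_type, pairs):
        """Add typed edges; both endpoints must already be AST nodes."""
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        for src, dst in pairs:
            if src not in self.nodes or dst not in self.nodes:
                raise KeyError(f"edge ({src}, {dst}) refers to a non-AST node")
            self.edges.add((src, dst, edge_type))

    def neighbors(self, node, edge_type):
        """Targets reachable from `node` over edges of one type."""
        return sorted(d for s, d, t in self.edges if s == node and t == edge_type)


# Hypothetical graph for: int f(int a) { int b = a + 1; return b; }
graph = CodeFeatureGraph(nodes={0: "f", 1: "int b = a + 1", 2: "return b", 3: "a", 4: "1"})
graph.add_edges("AST", [(0, 1), (0, 2), (1, 3), (1, 4)])  # syntactic structure
graph.add_edges("CFG", [(1, 2)])                          # execution order
graph.add_edges("PDG", [(1, 2)])                          # data dependence on b
```

Keeping the edge type on every edge is what later allows a relational graph network to treat AST, CFG and PDG neighbors differently.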
Further, the step 3 of learning the implicit features of the defect codes through a graph neural network and training the defect detection model comprises the following specific processes:
step 3-1, inputting the defect feature subgraph into a relational graph convolutional network for training; the graph embedding vectors obtained in step 2 serve as the initial node vectors required by the iteration of the relational graph convolutional network;
step 3-2, iteratively aggregating the node features through the relational graph convolutional network, wherein the l-th aggregation step is calculated by the following formula:
$$h_i^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_s^{(l)} h_i^{(l)}\right)$$
wherein $h_i^{(l)}$ is the feature representation of node i in iteration l; $N_i^r$ represents the set of neighbor nodes of node i under the relation r ∈ R (i.e., edge types AST, CFG and PDG), R being the set of relations; $c_{i,r}$ is the number of neighbor nodes having the relation r with node i; $W_r^{(l)}$ represents the transformation matrix for relation r; $W_s^{(l)}$ is the self-connection term, which ensures that the feature representation of the node in iteration l can also affect its feature representation in iteration l+1; σ(·) is the activation function;
step 3-3, summing the obtained node feature vectors to aggregate the nodes at the graph level, training a classifier with a softmax multi-class cross-entropy loss function to obtain a defect detection model, and then applying the code representation processing of steps 2-3 to 2-6 to the defect codes in the verification set of step 2-2 and inputting them into the trained defect detection model for model optimization.
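A single aggregation step of the relational graph convolution in step 3-2 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: features are small Python lists, ReLU is chosen as the activation function, and the names `rgcn_layer`, `W_rel` and `W_self` as well as the toy graph are hypothetical. A real implementation would use a deep learning framework and learn the matrices during training.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(value):
    return max(0.0, value)


def rgcn_layer(h, edges, W_rel, W_self, activation=relu):
    """One relational aggregation step.

    h:      dict node -> feature vector from the previous iteration
    edges:  list of (source, target, relation); source is a neighbor of target
    W_rel:  dict relation -> transformation matrix for that relation
    W_self: matrix of the self-connection term
    """
    out = {}
    for i, h_i in h.items():
        total = matvec(W_self, h_i)  # self-connection keeps the node's own features
        for r, W_r in W_rel.items():
            neighbors = [s for s, d, rel in edges if d == i and rel == r]
            if not neighbors:
                continue
            c_ir = len(neighbors)  # normalization: neighbor count under relation r
            for j in neighbors:
                message = matvec(W_r, h[j])
                total = [t + m / c_ir for t, m in zip(total, message)]
        out[i] = [activation(t) for t in total]
    return out


# Toy example with identity matrices (assumed values, not learned weights).
identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 4.0]}
edges = [(1, 0, "AST"), (2, 0, "AST"), (2, 1, "CFG")]
h1 = rgcn_layer(h0, edges, {"AST": identity, "CFG": identity, "PDG": identity}, identity)
```

With identity matrices, node 0 receives the mean of its two AST neighbors added to its own features, which makes the normalization by the neighbor count easy to check by hand.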
Further, the step 4 of processing the project to be detected, inputting the processed project into the optimal detection model, detecting the defects in the project, performing similarity calculation between each detected defect and the known defects of the same defect type in the defect data set, and matching the known defect with the highest similarity comprises the following specific processes:
step 4-1, splitting the project to be detected into a plurality of function-level code units, applying the code representation processing of step 2, inputting the result into the optimal detection model trained in step 3 for detection, and outputting the defective functions and their corresponding defect types;
step 4-2, comparing the defect type obtained in step 4-1 with the defect types in the defect data set, selecting from the data set the set of defect codes whose type is consistent with that of the detected defect code as the search space, and inputting the code feature graphs obtained by code representation of that defect code set and of the detected defect code into a graph similarity function;
step 4-3, calculating the similarity between the graphs through the graph similarity function by using the following formula:
Figure BDA0003137633890000035
wherein S(G, G') represents the similarity of graphs G and G', φ(G) and φ(G') represent the graph embeddings of the code feature graphs, and d(φ(G), φ(G')) represents the spatial distance between φ(G) and φ(G');
step 4-4, ranking the calculated inter-graph similarities, selecting the known defect in the data set with the highest similarity as the matched similar defect, and outputting the similar defect ID.
Further, the step 5 of extracting defect knowledge with semantic value from the similar defect report, performing knowledge fusion and detection, and constructing a software defect knowledge graph comprises the following specific processes:
step 5-1, acquiring the corresponding defect report according to the similar defect ID, preprocessing the defect report, and extracting information with important semantic value, wherein the information comprises the defect number, defect title, product, component, severity, modification state, defect handler, defect reporter and defect description information;
step 5-2, knowledge extraction: applying a named entity recognition network to classify the defect entities and identify entity relations in the defect titles and defect description information extracted in step 5-1, and extracting the defect entities, attributes and the interrelations among the entities from the knowledge graph corresponding to the defect data set;
step 5-3, after the defect knowledge is acquired, performing knowledge fusion to eliminate entity ambiguity, and adding the defect knowledge entities that cause no semantic contradiction to the defect knowledge graph.
Further, the method also comprises a step 6 of constructing a user portrait according to the questions about the software defect field asked by the user, performing semantic understanding on the questions, performing graph search on the constructed defect knowledge graph according to the entity knowledge in the questions, generating an answer list, and outputting a result in combination with the user portrait, wherein the specific process comprises the following steps:
step 6-1, preprocessing the user question: parsing, completing and standardizing the question in combination with the question-and-answer context information, and extracting the defect entities and relations in the user's question;
step 6-2, graph search and reasoning: mapping the extracted defect entities and relations into structured query statements of a Neo4j graph database for graph search of the knowledge graph;
step 6-3, answer ranking: combining the user features obtained from the user portrait with the candidate answer list obtained by sub-graph search, scoring the candidate answers through a RankNet model, and ranking the candidate answers according to the scores.
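The mapping in step 6-2 from an extracted entity and relation to a graph query can be sketched as follows, using Cypher, the query language of Neo4j. The node label `Defect`, the property `name`, the relation vocabulary in `ALLOWED_RELATIONS` and the function `build_cypher` are all hypothetical and chosen only for illustration; the schema of the actual defect knowledge graph is not specified in the text.

```python
# Hypothetical mapping from relations found in a question to relationship types.
ALLOWED_RELATIONS = {
    "cause": "CAUSED_BY",
    "fix": "FIXED_BY",
    "component": "AFFECTS",
}


def build_cypher(entity: str, relation: str, limit: int = 10):
    """Return a parameterized Cypher query and its parameters."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unsupported relation: {relation}")
    # Relationship types cannot be passed as Cypher parameters, so the type is
    # taken from a fixed whitelist; the entity name is passed as a parameter.
    rel_type = ALLOWED_RELATIONS[relation]
    query = (
        "MATCH (d:Defect {name: $name})-[:" + rel_type + "]->(x) "
        "RETURN x.name AS answer LIMIT $limit"
    )
    return query, {"name": entity, "limit": limit}


query, params = build_cypher("null pointer dereference", "cause", limit=5)
```

Passing the entity name as a query parameter instead of concatenating it into the query string avoids injection problems when the name comes from free-form user questions.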
Based on the same inventive concept, the invention provides an interpretable software vulnerability detection and recommendation system, which comprises:
the defect data preparation module is used for collecting public defect data, including a defect report and a defect code, and constructing a defect data set;
the code feature modeling module is used for preprocessing the defect codes, modeling the codes through a code feature graph and embedding the codes into a vector space; the code feature graph is constructed from an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG, the edge types of the graph are AST, CFG and PDG, and the node set is the node set of the abstract syntax tree AST; a defect feature subgraph related to the defect code lines is extracted from the code feature graph for graph embedding;
the defect detection model building module is used for learning the implicit features of the defect codes through a graph neural network and training a defect detection model;
the defect detection and matching module is used for processing the project to be detected, inputting it into the optimal detection model, detecting the defects in the project, performing similarity calculation between each detected defect and the known defects of the same defect type in the defect data set, and matching the known defect with the highest similarity;
and the software defect knowledge graph construction module is used for extracting the defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, constructing a software defect knowledge graph, and mining the key information in the similar defect reports to assist a developer in understanding the detected defect.
Furthermore, the system also comprises a question-answer construction module, which constructs a user portrait according to the questions about the software defect field asked by the user, performs semantic understanding on the questions, performs graph search on the constructed defect knowledge graph according to the entity knowledge in the questions, generates an answer list, and outputs a result in combination with the user portrait.
Advantageous effects: compared with the prior art, the invention has the following remarkable advantages: 1) the implicit features of the defect code are described from the perspective of graphs, and the multi-dimensional semantic features of the defect code are better utilized by constructing a code feature graph; 2) by introducing a relational graph convolutional neural network, the edge type is also embedded into the node feature learning process as a special attribute, which improves the training effect of the model; 3) similarity matching of the defect code feature subgraphs is used to mine the associations among defect codes as fully as possible; 4) existing knowledge graph and NLP techniques are combined to provide textual description and visual display for the matched similar defects, which gives the detection result fine-grained and interpretable auxiliary information to a certain extent and lays a foundation for subsequent applied research on software defect localization and repair.
Drawings
FIG. 1 is a general flow diagram of a knowledge-driven software bug detection and analysis method in one embodiment.
FIG. 2 is a detailed flow diagram of a knowledge-driven software defect detection and analysis method in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, with reference to fig. 1, the present invention provides a knowledge-driven software defect detection and analysis method, including the following steps:
step 1, collecting public defect data and constructing a defect data set;
step 2, preprocessing the defect codes, modeling the codes through a code feature graph and embedding the codes into a vector space;
step 3, learning implicit features of the defect codes through a graph neural network, and training a defect detection model;
step 4, processing the project to be detected, inputting the processed project into the optimal detection model, detecting the defects in the project, performing similarity calculation between each detected defect and the known defects of the same defect type in the defect data set, and matching the known defect with the highest similarity;
step 5, extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, constructing a software defect knowledge graph, and mining key information (such as defect type, defect cause and the like) in the similar defect reports to assist a developer in understanding the detected defects.
Further, in one embodiment, as shown in fig. 2, the step 1 of constructing the defect data set includes:
step 1-1, collecting public defect data including defect reports (such as defect description, defect types and the like) and defect codes from a defect tracking library (such as Bugzilla and the like) and an open source code library (such as GitHub and the like);
step 1-2, preprocessing the acquired defect data, extracting defect codes at the function level, cleaning the function-level defect codes, and removing redundant information including code comments, declared global parameters and the like to obtain a defect data set.
Further, in one embodiment, the step 2 of preprocessing the defect code, modeling the code through a code feature map, and embedding the code into a vector space includes:
step 2-1, classifying the defects according to description information in the defect report and by combining the reasons of the defects, merging the defect sub-types through an abstract relation, and obtaining a defect type table, wherein the defect types comprise functional defects, interface defects, logic defects, calculation defects and data defects; the defect type table of this example is shown in table 1 below.
TABLE 1 Defect types Table
Figure BDA0003137633890000071
step 2-2, classifying the defect data in the defect data set according to Table 1, constructing a corpus, and dividing it into a training set and a verification set at a ratio of 8:2;
step 2-3, performing code representation on the defect codes in the training set, representing the defect codes as an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG through the open source tool Joern, and combining the three graphs to construct a code feature graph, wherein the edge types of the graph are AST, CFG and PDG, and the node set is the node set of the abstract syntax tree AST;
step 2-4, identifying sensitive operations in the defect code, and segmenting the code feature graph obtained in step 2-3 through program slicing to obtain a defect feature subgraph closely related to the defect code lines;
step 2-5, identifying variable names and function names in the defect feature subgraph, and renaming them according to their order of appearance within a single function as Var1, Var2, ... and Func1, Func2, ..., so as to standardize the code and eliminate noise caused by human naming;
step 2-6, performing graph embedding on the leaf nodes and the statement nodes in the defect feature subgraph by using Word2Vec and Doc2Vec respectively, to obtain vector representations that can be used as input of a graph neural network.
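The renaming in step 2-5 can be approximated at the token level with the following Python sketch. It is a simplification under stated assumptions: it works on raw function text instead of the defect feature subgraph, treats an identifier directly followed by "(" as a function name, and uses a small hypothetical keyword list; the function name `normalize_identifiers` is likewise an assumption.

```python
import re

# Small, non-exhaustive keyword list assumed for this illustration.
KEYWORDS = {
    "int", "char", "float", "double", "void", "long", "short", "unsigned",
    "const", "struct", "if", "else", "for", "while", "return", "sizeof",
    "break", "continue",
}

_IDENT = re.compile(r"\b[A-Za-z_]\w*")


def normalize_identifiers(code: str) -> str:
    """Rename identifiers to Var1, Var2, ... and Func1, Func2, ... by first appearance."""
    var_map, func_map = {}, {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in KEYWORDS:
            return name
        # An identifier directly followed by "(" is treated as a function name.
        if code[match.end():].lstrip().startswith("("):
            return func_map.setdefault(name, f"Func{len(func_map) + 1}")
        return var_map.setdefault(name, f"Var{len(var_map) + 1}")

    return _IDENT.sub(rename, code)


source = "int add(int count, int step) { int total = count + step; return helper(total); }"
normalized = normalize_identifiers(source)
# normalized == "int Func1(int Var1, int Var2) { int Var3 = Var1 + Var2; return Func2(Var3); }"
```

Because the mapping is built per call, two functions that differ only in their naming produce the same normalized text.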
Further, in one embodiment, the step 3 of learning the implicit features of the defect code through a graph neural network and training the defect detection model includes:
step 3-1, inputting the defect feature subgraph into a relational graph convolutional network for training, wherein the graph embedding vectors obtained in step 2-6 serve as the initial node vectors required by the iteration of the relational graph convolutional network;
step 3-2, iteratively aggregating the node features through the relational graph convolutional network, wherein the l-th aggregation step is calculated by the following formula:
$$h_i^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_s^{(l)} h_i^{(l)}\right)$$
wherein $h_i^{(l)}$ is the feature representation of node i in iteration l; $N_i^r$ represents the set of neighbor nodes of node i under the relation r ∈ R (i.e., edge types AST, CFG and PDG), R being the set of relations; $c_{i,r}$ is the number of neighbor nodes having the relation r with node i; $W_r^{(l)}$ represents the transformation matrix for relation r; $W_s^{(l)}$ is the self-connection term, which ensures that the feature representation of the node in iteration l can also affect its feature representation in iteration l+1; σ(·) is the activation function;
step 3-3, summing the obtained node feature vectors to aggregate the nodes at the graph level, training a classifier with a softmax multi-class cross-entropy loss function to obtain a defect detection model, and then applying the code representation processing of steps 2-3 to 2-6 to the defect codes in the verification set of step 2-2 and inputting them into the trained defect detection model for model optimization.
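The graph-level aggregation and loss in step 3-3 can be illustrated with a short Python sketch. The helper names `sum_readout`, `softmax` and `cross_entropy` and the sample values are assumptions for illustration; the classifier that maps the graph vector to one score per defect type is omitted.

```python
import math


def sum_readout(node_features):
    """Sum all node feature vectors into a single graph-level vector."""
    vectors = list(node_features.values())
    return [sum(vec[k] for vec in vectors) for k in range(len(vectors[0]))]


def softmax(logits):
    """Convert class scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Multi-class cross-entropy loss for one graph with true class index `label`."""
    return -math.log(softmax(logits)[label])


node_features = {0: [1.0, 2.0], 1: [3.0, 4.0]}
graph_vector = sum_readout(node_features)  # [4.0, 6.0]

# Hypothetical scores for the five defect types (functional, interface,
# logic, calculation, data); equal scores give a loss of log(5).
loss = cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], label=2)
```

With equal scores the model is maximally uncertain, so the loss equals the logarithm of the number of classes, which is a convenient sanity check.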
Further, in one embodiment, the step 4 of processing the project to be detected, inputting the processed project into the optimal detection model, detecting the defects in the project, performing similarity calculation between each detected defect and the known defects of the same defect type in the defect data set, and matching the known defect with the highest similarity includes the following specific process:
step 4-1, splitting the project to be detected into a plurality of function-level code units, applying the code representation processing of steps 2-3 to 2-6, inputting the result into the optimal detection model obtained through the optimization in step 3-3 for detection, and outputting the defective functions and their corresponding defect types;
step 4-2, comparing the defect type obtained in step 4-1 with the defect types of the corpus (including the training set and the verification set) in step 2-2, selecting from the corpus the set of defect codes whose type is consistent with that of the detected defect code as the search space, and inputting the code feature graphs obtained by code representation of that defect code set and of the detected defect code into a graph similarity function;
step 4-3, calculating the similarity between the graphs through the graph similarity function by using the following formula:
Figure BDA0003137633890000091
wherein S(G, G') represents the similarity of graphs G and G', φ(G) and φ(G') represent the graph embeddings of the code feature graphs, and d(φ(G), φ(G')) represents the spatial distance between φ(G) and φ(G');
step 4-4, ranking the calculated inter-graph similarities, selecting the known defect in the corpus with the highest similarity as the matched similar defect, and outputting the similar defect ID.
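Steps 4-2 to 4-4 can be sketched in Python as follows. The exact form of S(G, G') is given only as a formula image, so this sketch assumes a Euclidean distance d and the conversion S = 1 / (1 + d); both are assumptions, as are the record layout, the defect IDs and the function names.

```python
import math


def euclidean(u, v):
    """Spatial distance between two graph embeddings (assumed to be Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(u, v):
    """ASSUMED conversion from distance to similarity: S = 1 / (1 + d)."""
    return 1.0 / (1.0 + euclidean(u, v))


def match_similar_defects(query_embedding, query_type, corpus, top_k=1):
    """Restrict the search space to the same defect type, then rank by similarity."""
    candidates = [d for d in corpus if d["type"] == query_type]
    ranked = sorted(
        candidates,
        key=lambda d: similarity(query_embedding, d["embedding"]),
        reverse=True,
    )
    return [d["id"] for d in ranked[:top_k]]


# Hypothetical corpus of known defects with precomputed graph embeddings.
corpus = [
    {"id": "BUG-1", "type": "logic", "embedding": [0.0, 0.0]},
    {"id": "BUG-2", "type": "logic", "embedding": [1.0, 1.0]},
    {"id": "BUG-3", "type": "data", "embedding": [0.9, 1.0]},
]
best = match_similar_defects([0.9, 1.0], "logic", corpus)
```

Filtering by defect type first means that BUG-3, although closest in embedding space, is never considered for a defect classified as a logic defect.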
Further, in one embodiment, the step 5 of extracting defect knowledge with semantic value in the similar defect report, performing knowledge fusion and detection, and constructing a software defect knowledge graph includes the specific processes:
step 5-1: acquiring a corresponding defect report according to the similar defect ID, carrying out natural language preprocessing, and extracting information with important semantic value from the similar defect report, wherein the information comprises a defect number, a defect title, a product, a component, severity, a modification state, a defect handler, a defect reporter, defect description information and the like;
step 5-2: defect knowledge extraction, namely, applying a named entity recognition network (such as a deep neural network Bi-LSTM + CRF combined with the Attention) to classify defect entities and recognize entity relations from the defect titles and the defect description information extracted in the step 5-1, and then extracting defect entities, attributes and interrelations among the entities from a knowledge graph corresponding to a defect data set (namely, extracting knowledge related to similar defect reports from a complete knowledge graph constructed based on the defect data acquired in the step 1);
step 5-3: knowledge fusion and detection: after the defect knowledge is acquired, entity alignment is performed to eliminate entity ambiguity; the similarity between each pair of entities is calculated using the Levenshtein distance, i.e. the minimum edit distance, as follows:
$$\mathrm{sim}_1 = 1 - \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max(|a|, |b|)}$$
wherein a and b are two entity character strings; if sim1 is larger than the set threshold, the two entities are regarded as the same entity; after quality evaluation, the qualified parts, namely defect knowledge entities that introduce no semantic contradiction, are added to the knowledge graph.
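The entity alignment of step 5-3 can be sketched as follows. The edit distance is the standard dynamic-programming Levenshtein distance; the threshold value 0.8 and the function names are assumptions chosen for illustration, since the source only refers to "the set threshold".

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def entity_similarity(a: str, b: str) -> float:
    """sim1 = 1 - lev(a, b) / max(|a|, |b|); two empty strings count as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two entity strings as one entity when sim1 exceeds the (assumed) threshold."""
    return entity_similarity(a, b) > threshold
```

For example, "NullPointerException" and the misspelling "NullPointerExeption" differ by one deletion over 20 characters, giving sim1 = 0.95, so they would be merged under the assumed threshold.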
By adopting the scheme of this embodiment, the defect reports are converted into a defect knowledge graph that supports the subsequent defect question-answering stage, which works better than the traditional information-retrieval-based approach.
Further, in one embodiment, the method further comprises a step 6 of constructing a user portrait according to the problems related to the software defect field asked by the user, performing semantic understanding on the problems, performing map search on the constructed defect knowledge map according to the entity knowledge in the problems, generating an answer list, and outputting a result by combining the user portrait. The method specifically comprises the following steps:
step 6-1, preprocessing the user question: analyzing, completing and standardizing the question sentence in combination with the question-and-answer context information, extracting defect entities and relations, and deleting meaningless words;
step 6-2, graph searching and reasoning: mapping the extracted defect entities and relations into a statement in Cypher, the structured query language of the Neo4j graph database, to perform a graph search of the knowledge graph;
step 6-3, answer ranking: a user portrait is generated; the emotion calculation for the user can be represented by a triple, expressed as follows:
$ST = \langle T, C, I \rangle$
where T denotes the set of user information, i.e. $T = \{t_1, t_2, \dots, t_n\}$; these information carriers are the software-defect questions posed by the user. C denotes the set of emotion categories, or of different tendency classifications, i.e. $C = \{c_1, c_2, \dots, c_n\}$. For example, the user may show "urgency" or "frustration" when repeated queries fail to return the desired result; in that case the interaction with the user should be reduced, or the user should be politely told that no answer can be found, rather than being given an irrelevant answer. The set can express discrete emotional features, and more complex emotions can be composed from basic ones; emotional features can therefore be divided into two or more categories according to the application purpose, yielding different emotion classification models. The model directly reflects the basic understanding of emotion granularity. I denotes the set of emotional feature intensities, i.e. $I = \{i_1, i_2, \dots, i_n\}$. Intensity can generally be divided into 3 grades (high, medium and low) or into 5 grades (extremely high, high, medium, low and extremely low); the intensity features combined with the emotional features form the core and foundation of emotion calculation.
According to this definition, the calculation of user emotion can be expressed as acquiring and identifying the software defect knowledge in the user's input question, so that the user emotion function is calculated over different dimensions. The computation of emotion can thus be expressed as a state space formed by the Cartesian product of the three elements described above, i.e.
$ST = T \times C \times I$
Through the emotion calculation, the system can extract the emotion characteristics of the user, and establishes the portrait of the user by qualitative and quantitative analysis and behavior modeling, so that preparation is made for ordering the candidate answers in question answering.
The candidate answers to be ranked are scored by a RankNet model that combines the user features obtained from the user portrait with the candidate answer list, and the candidate answers are ranked according to their scores.
By adopting the scheme of this embodiment, the defect-knowledge-graph-based approach is combined with query templates, so that user defect questions are matched and answers or solutions are queried efficiently and reliably; combined with the user portrait, the intention behind the user's question is understood accurately, accurate answers are given, and user satisfaction is improved.
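The mapping of extracted entities and relations to a Cypher statement in step 6-2 can be sketched as a query template. The node label `Defect`, the property `name`, the relationship type `HAS_FIX` and the function name are hypothetical and do not come from the source. Cypher parameters cannot stand in for relationship types, so the relation is validated and interpolated into the text, while the entity name is passed as a parameter.

```python
import re

# Relationship types are interpolated into the query text, so only plain identifiers are allowed.
_SAFE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_cypher_query(entity_name, relation, limit=10):
    """Return (query_text, parameters) for a one-hop search from a defect entity.

    The label `Defect` and the property `name` are illustrative assumptions.
    """
    if not _SAFE_IDENTIFIER.match(relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    query = (
        "MATCH (d:Defect {name: $name})-[:" + relation + "]->(answer) "
        "RETURN answer LIMIT $limit"
    )
    return query, {"name": entity_name, "limit": limit}


query, params = build_cypher_query("NullPointerException", "HAS_FIX")
```

The returned text and parameter map would then be handed to whichever Neo4j client the system uses; the rows it returns form the candidate answer list that step 6-3 ranks.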
In one embodiment, a knowledge-driven software bug detection and analysis system is presented, the system comprising:
the defect data preparation module is used for collecting public defect data, including a defect report and a defect code, and constructing a defect data set;
the code characteristic modeling module is used for preprocessing the defect codes, modeling the codes through a code characteristic diagram and embedding the codes into a vector space; the code characteristic diagram is constructed according to an abstract syntax tree AST, a control flow diagram CFG and a program dependency diagram PDG, the edge type of the diagram is composed of the AST, the CFG and the PDG, and the node set is composed of the node set of the abstract syntax tree AST; extracting a defect feature subgraph related to a defect code line from the code feature graph for graph embedding;
the defect detection model building module is used for learning the implicit characteristics of the defect codes through a graph neural network and training a defect detection model;
the defect detection and matching module is used for processing the item to be detected, inputting the item to be detected into the optimal detection model, detecting the defects in the item, carrying out similarity calculation on the detected defects according to the defect types and the known defects of the same type in the defect data set, and matching the known defects with the highest similarity;
and the software defect knowledge map construction module is used for extracting the defect knowledge with semantic value in the similar defect report, performing knowledge fusion and detection, constructing a software defect knowledge map, and mining the key information in the similar defect report to assist a developer in understanding the detected defect.
Furthermore, the system also comprises a question-answer construction module, which constructs a user portrait according to the problems related to the software defect field asked by the user, carries out semantic understanding on the problems, carries out map search on the constructed defect knowledge map according to the entity knowledge in the problems, generates an answer list, and outputs a result by combining the user portrait.
The system embodiment and the method embodiment belong to the same method concept, and specific implementation of each module refers to the method embodiment, which is not described herein again.
The invention makes better use of the implicit features of defect codes and fully mines the association relations among similar defect codes. On the basis of identifying multiple defect types in code, it further analyzes detected defects by matching them with known defects, and therefore has stronger generality. It alleviates the high false-alarm rate that arises in traditional defect detection techniques because effective and comprehensive defect patterns are difficult to establish, and it analyzes detected defects from the knowledge perspective, giving it a wider range of practical application, higher precision and stronger interpretability.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. A knowledge-driven software defect detection and analysis method is characterized by comprising the following steps:
step 1, collecting public defect data including a defect report and a defect code, and constructing a defect data set;
step 2, preprocessing the defect codes, modeling the codes through a code characteristic diagram and embedding the codes into a vector space; the code characteristic diagram is constructed according to an abstract syntax tree AST, a control flow diagram CFG and a program dependency diagram PDG, the edge type of the diagram is composed of the AST, the CFG and the PDG, and the node set is composed of the node set of the abstract syntax tree AST; extracting a defect feature subgraph related to a defect code line from the code feature graph for graph embedding;
step 3, learning implicit characteristics of the defect codes through a graph neural network, and training a defect detection model;
step 4, processing the item to be detected, inputting the processed item into an optimal detection model, detecting the defects in the item, carrying out similarity calculation on the detected defects according to the defect types and the known defects of the same type in the defect data set, and matching the known defects with the highest similarity;
step 5, extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, constructing a software defect knowledge graph, and mining key information in the similar defect reports to assist developers in understanding the detected defects.
2. The knowledge-driven software defect detection and analysis method according to claim 1, wherein the step 1 of collecting the public defect data and constructing the defect data set comprises the following specific processes:
step 1-1, collecting public defect data including a defect report and a defect code from a defect tracking library and an open source code library;
step 1-2, preprocessing the acquired defect data, extracting defect codes from function levels, cleaning the function level defect codes, and removing redundant information including code annotations and declared global parameters to obtain a defect data set.
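The cleaning in step 1-2 can be illustrated with a minimal sketch that removes code comments from a function-level snippet. The source does not name a programming language, so C-style comment syntax is assumed here; the function name is hypothetical, and the regular expression deliberately ignores the corner case of comment markers inside string literals.

```python
import re

# Matches /* block comments */ (possibly spanning lines) and // line comments.
_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def strip_comments(function_source):
    """Remove C-style comments and drop lines that become empty as a result."""
    without_comments = _COMMENT.sub("", function_source)
    lines = [line.rstrip() for line in without_comments.splitlines()]
    return "\n".join(line for line in lines if line.strip())


sample = (
    "int add(int a, int b) {\n"
    "    /* sum\n"
    "       two ints */\n"
    "    return a + b; // result\n"
    "}\n"
)
cleaned = strip_comments(sample)
```

A production pipeline would more likely rely on a real lexer for the target language, which also handles string literals and preprocessor directives correctly.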
3. The knowledge-driven software defect detection and analysis method of claim 1, wherein the step 2 preprocesses the defect code, models the code through a code feature map, and embeds the code into a vector space, and the specific process includes:
step 2-1, classifying the defects according to description information in the defect report and by combining the reasons of the defects, merging the defect sub-types through an abstract relation, and obtaining a defect type table, wherein the defect types comprise functional defects, interface defects, logic defects, calculation defects and data defects;
step 2-2, classifying the defect data in the defect data set, constructing a corpus, and dividing a training set and a verification set;
step 2-3, performing code representation on the defect codes in the training set, representing the defect codes as an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG, synthesizing the three graphs, and constructing a code characteristic graph, wherein the edge type of the graph is formed by the AST, the CFG and the PDG, and the node set is formed by a node set of the abstract syntax tree AST;
step 2-4, segmenting the defect code characteristic graph obtained in the step 2-3 through program slicing to obtain a defect characteristic subgraph related to a defect code line;
step 2-5, recognizing variable names and function names in the defect characteristic subgraph, renaming according to the occurrence sequence in a single function, standardizing codes, and eliminating noise caused by artificial naming;
step 2-6, performing graph embedding on the leaf nodes and the statement nodes in the defect feature subgraph using Word2Vec and Doc2Vec respectively, to obtain vector representations that can be used as input to the graph neural network.
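The renaming described in step 2-5 can be sketched on source text as follows. This is a simplified illustration under stated assumptions: the code is C-like, an identifier directly followed by an opening parenthesis is treated as a function name, the keyword set is deliberately small, and string literals are not protected. The placeholder scheme `FUN1`, `VAR1` and the function name are hypothetical.

```python
import re

# A small, non-exhaustive keyword set for C-like code (assumption).
KEYWORDS = {
    "break", "case", "char", "const", "continue", "default", "do", "double",
    "else", "float", "for", "if", "int", "long", "return", "short", "sizeof",
    "static", "struct", "switch", "unsigned", "void", "while",
}

# An identifier, optionally followed by "(" which marks it as a function name.
_IDENTIFIER = re.compile(r"\b([A-Za-z_]\w*)(\s*\()?")


def normalize_identifiers(function_source):
    """Rename identifiers to FUNn / VARn in order of first appearance within one function."""
    mapping = {}
    counters = {"FUN": 0, "VAR": 0}

    def rename(match):
        name, call_suffix = match.group(1), match.group(2) or ""
        if name in KEYWORDS:
            return match.group(0)  # leave language keywords untouched
        if name not in mapping:
            kind = "FUN" if call_suffix else "VAR"
            counters[kind] += 1
            mapping[name] = f"{kind}{counters[kind]}"
        return mapping[name] + call_suffix

    return _IDENTIFIER.sub(rename, function_source), mapping


source = "int sum(int count) { int total = 0; total = total + helper(count); return total; }"
normalized, mapping = normalize_identifiers(source)
```

Because names are assigned in order of appearance, two functions that differ only in naming normalize to the same text, which removes the naming noise mentioned in step 2-5.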
4. The knowledge-driven software defect detection and analysis method of claim 1, wherein step 3 learns the implicit features of the defect codes through the neural network, trains the defect detection model, and comprises the following specific processes:
step 3-1, inputting the defect characteristic subgraph into a relational graph convolution network for training; the graph embedding vector obtained in the step 2 represents an initial node vector required by the iteration of the relational graph convolution network;
step 3-2, iteratively aggregating the node features through the relational graph convolutional network, wherein the l-th aggregation is calculated by the following formula:
$$h_i^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\right)$$
wherein $N_i^r$ represents the set of neighbor nodes of node i under relation $r \in R$, R being the relation set comprising the edge types AST, CFG and PDG; $c_{i,r}$ is the number of neighbor nodes having relation r with node i; $W_r^{(l)}$ denotes the transformation matrix for relation r; $W_0^{(l)} h_i^{(l)}$ is the self-connection term, which ensures that the feature representation of a node in the l-th iteration also influences its feature representation in the (l+1)-th iteration; and σ(·) is the activation function;
step 3-3, based on the obtained node features, summing the feature embedding vectors to aggregate the nodes at the graph level, and training a classifier using softmax with a multi-class cross-entropy loss function to obtain the defect detection model.
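One aggregation step of the kind described in step 3-2 can be sketched in plain Python: for every node, messages from neighbors are transformed by a relation-specific matrix, averaged per relation, added to a self-connection term, and passed through an activation. The choice of ReLU as the activation, the edge direction (messages flow from source to target), the function names and the toy weights are all assumptions made for illustration.

```python
from collections import defaultdict


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Assumed activation function sigma."""
    return [max(0.0, x) for x in vector]


def rgcn_layer(features, edges, relation_weights, self_weight):
    """One relational graph convolution step.

    features: {node: vector}; edges: (source, relation, target) triples;
    relation_weights: {relation: matrix}; self_weight: matrix for the self-connection.
    """
    neighbours = defaultdict(list)
    for source, relation, target in edges:
        neighbours[(target, relation)].append(source)

    updated = {}
    for node, h_i in features.items():
        total = matvec(self_weight, h_i)  # self-connection term
        for relation, w_r in relation_weights.items():
            nodes_r = neighbours.get((node, relation), [])
            c_ir = len(nodes_r)  # number of neighbours under this relation
            for j in nodes_r:
                message = matvec(w_r, features[j])
                total = [t + m / c_ir for t, m in zip(total, message)]
        updated[node] = relu(total)
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {"AST": identity, "CFG": [[2.0, 0.0], [0.0, 2.0]], "PDG": identity}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = [(1, "AST", 0), (2, "AST", 0), (2, "CFG", 1)]
updated = rgcn_layer(features, edges, weights, identity)
```

Stacking several such steps and summing the resulting node vectors would give the graph-level representation that step 3-3 feeds to the softmax classifier.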
5. The knowledge-driven software defect detection and analysis method of claim 1, wherein the step 4 of processing the item to be detected, inputting the processed item into an optimal detection model, detecting the defects existing in the item, performing similarity calculation on the detected defects and known defects of the same type in the defect data set according to the defect type of the detected defects, and matching the known defects with the highest similarity comprises the following specific steps:
step 4-1, segmenting the item to be tested to obtain codes of a plurality of function levels, inputting the codes into the optimal detection model trained in the step 3 for detection by combining the code representation processing in the step 2, and outputting the function with the defects and the defect types corresponding to the function;
step 4-2, comparing the defect type obtained in step 4-1 with the defect types in the defect data set, selecting from the data set the set of defect codes whose type matches the detected defect type as the search space, and inputting the code feature graphs obtained by code representation of this defect code set and of the detected defect code into a graph similarity function;
step 4-3, calculating the similarity between the graphs through the graph similarity function using the following formula:
[Formula image FDA0003137633880000031: S(G, G') as a function of d(φ(G), φ(G'))]
wherein S(G, G') represents the similarity of graphs G and G', φ(G) and φ(G') represent the graph embeddings of the code feature graphs, and d(φ(G), φ(G')) represents the spatial distance between φ(G) and φ(G');
step 4-4, ranking the calculated inter-graph similarities, selecting the known defect in the data set with the highest similarity as the matched similar defect, and outputting the similar defect ID.
6. The knowledge-driven software defect detection and analysis method according to claim 1, wherein the step 5 of extracting defect knowledge with semantic value in similar defect reports, performing knowledge fusion and detection, and constructing a software defect knowledge graph comprises the following specific processes:
step 5-1, acquiring a corresponding defect report according to the similar defect ID, preprocessing the defect report, and extracting information with important semantic value, wherein the information comprises a plurality of defect numbers, defect titles, products, components, severity, modification states, defect handlers, defect reporters and defect description information;
step 5-2, extracting knowledge, namely, performing entity classification and entity relationship identification by applying a named entity identification network in the defect title and defect description information extracted in the step 5-1, and extracting entities, attributes and mutual relationships among the entities from a knowledge graph corresponding to a defect data set;
step 5-3, after acquiring defect knowledge, carrying out knowledge fusion to eliminate entity ambiguity; and adding a defect knowledge entity with no contradiction in semantics into the defect knowledge map.
7. The knowledge-driven software defect detection and analysis method of claim 1, further comprising step 6 of constructing a user portrait according to the related problem of the software defect field asked by the user, performing semantic understanding of the problem, performing a map search on the constructed defect knowledge map according to the entity knowledge in the problem, generating an answer list, and outputting a result in combination with the user portrait.
8. The knowledge-driven software defect detection and analysis method of claim 7, wherein step 6 is to construct a user portrait according to the related problem in the software defect field asked by the user, to understand the semantics of the problem, to perform a map search on the constructed defect knowledge map according to the entity knowledge in the problem, to generate an answer list, and to output the result in combination with the user portrait, and the specific process includes:
step 6-1, preprocessing the user question: analyzing, completing and standardizing the question sentence in combination with the question-and-answer context information, and extracting the defect entities and relations in the user's question sentence;
step 6-2, map searching and reasoning: mapping the extracted defect entities and the relations into a structured query statement of a Neo4j graph database to perform subgraph search operation;
step 6-3, answer ranking: scoring the candidate answers to be ranked by a RankNet model that combines the user features obtained from the user portrait with the candidate answer list, and ranking the candidate answers according to their scores.
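The answer ranking of step 6-3 can be sketched as follows. RankNet learns a scoring function and models the probability that one item should rank above another as a logistic function of the score difference. The linear scorer, the feature layout (user features concatenated with answer features), the weights and the function names below are hypothetical stand-ins for a trained model.

```python
import math


def score(weights, features):
    """Hypothetical linear scoring function standing in for a trained RankNet scorer."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_probability(score_i, score_j):
    """RankNet's modelled probability that item i should rank above item j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def rank_answers(weights, user_features, candidates):
    """Score each candidate on user features + answer features, best first.

    candidates is a list of (answer_id, answer_features) pairs.
    """
    scored = [
        (answer_id, score(weights, user_features + answer_features))
        for answer_id, answer_features in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


weights = [0.5, 1.0, 2.0]          # assumed, not learned here
user_features = [1.0]              # e.g. one value derived from the user portrait
candidates = [("A", [0.2, 0.1]), ("B", [0.1, 0.9])]
ranked = rank_answers(weights, user_features, candidates)
```

During training, the pairwise probability is compared with the known preference between two answers through a cross-entropy loss; at query time only the scores are needed to sort the candidate list.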
9. A knowledge-driven software bug detection and analysis system, the system comprising:
the defect data preparation module is used for collecting public defect data, including a defect report and a defect code, and constructing a defect data set;
the code characteristic modeling module is used for preprocessing the defect codes, modeling the codes through a code characteristic diagram and embedding the codes into a vector space; the code characteristic diagram is constructed according to an abstract syntax tree AST, a control flow diagram CFG and a program dependency diagram PDG, the edge type of the diagram is composed of the AST, the CFG and the PDG, and the node set is composed of the node set of the abstract syntax tree AST; extracting a defect feature subgraph related to a defect code line from the code feature graph for graph embedding;
the defect detection model building module is used for learning the implicit characteristics of the defect codes through a graph neural network and training a defect detection model;
the defect detection and matching module is used for processing the item to be detected, inputting the item to be detected into the optimal detection model, detecting the defects in the item, carrying out similarity calculation on the detected defects according to the defect types and the known defects of the same type in the defect data set, and matching the known defects with the highest similarity;
and the software defect knowledge map construction module is used for extracting the defect knowledge with semantic value in the similar defect report, performing knowledge fusion and detection, constructing a software defect knowledge map, and mining the key information in the similar defect report to assist a developer in understanding the detected defect.
10. The knowledge-driven software defect detection and analysis system of claim 9, wherein the system further comprises:
and the question-answer construction module is used for constructing a user portrait according to the software defect field related questions asked by the user, performing semantic understanding on the questions, performing map search on the constructed defect knowledge map according to the entity knowledge in the questions, generating an answer list, and outputting results by combining the user portrait.
CN202110726068.8A 2021-06-29 2021-06-29 Knowledge-driven software defect detection and analysis method and system Pending CN113434418A (en)

Publications (1)

Publication Number Publication Date
CN113434418A true CN113434418A (en) 2021-09-24

Family

ID=77757494

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination