CN113434418A - Knowledge-driven software defect detection and analysis method and system

Info

Publication number: CN113434418A
Application number: CN202110726068.8A
Authority: CN
Inventors: 薄莉莉; 曹思聪; 孙小兵; 李世豪; 李斌
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-24

Abstract

The invention discloses a knowledge-driven software defect detection and analysis method and system, which mainly comprise the following steps: constructing a defect data set; constructing a code feature graph to model the code and embed it into a vector space; learning implicit features of the defect code through a graph neural network and training a defect detection model; detecting defects in a project with the detection model and matching each detected defect to similar known defects in the defect data set through similarity calculation; and extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, and constructing a software defect knowledge graph to help developers understand the problems that a detected defect may cause. The invention addresses the high false alarm rate that arises in traditional defect detection techniques because effective and comprehensive defect patterns are difficult to formulate, and it analyzes detected defects from a knowledge perspective, so that it applies to a wider range of practical settings with higher precision and stronger interpretability.

Description

Knowledge-driven software defect detection and analysis method and system

Technical Field

The invention belongs to the field of software security, and particularly relates to a knowledge-driven software defect detection and analysis method and system.

Background

A software defect (bug) is a flaw in a software product that causes its results to fall short of the software requirements or end-user expectations. It often produces incorrect or unexpected results, or causes the software to behave in unintended ways. The growing complexity and interdependence of software make high quality, low cost and maintainability harder to achieve, and increase the likelihood of introducing defects. Most traditional defect detection methods target certain specific defect types and use regular expressions to match statements in the code that may be defective. However, it is impractical to rely entirely on defect rules written by human experts to cover the growing number of software defects.

Some existing work uses machine learning methods to detect software defects. Hogiyuan et al. propose a search-based semi-supervised ensemble method that combines base classifiers, built from a small number of labeled target instances, through a genetic algorithm with global search capability, which significantly improves cross-project defect detection performance. Chen Shu et al. propose combining instance-weighted domain adaptation with the training of a machine learning prediction model: weights are constructed for the target project samples, and these instance weights influence the parameter learning of the prediction model so that the distribution characteristics of the defect data in the target project are adapted to the training data set, thereby reusing defect data samples and achieving cross-project software defect prediction. However, most machine-learning-based software defect detection methods work at file-level granularity, so when they are deployed in real application scenarios, users cannot rely on such coarse-grained detection results to help maintain and safeguard the code.

Disclosure of Invention

Purpose of the invention: in view of the problems in the prior art, the invention aims to provide a knowledge-driven software defect detection and analysis method and system with a wider field of application, higher precision and stronger interpretability.

Technical solution: to achieve the above purpose, the invention adopts the following technical solution: a knowledge-driven software defect detection and analysis method, comprising the following steps:

step 1, collecting public defect data and constructing a defect data set;

step 2, preprocessing the defect code, modeling the code through a code feature graph and embedding the code into a vector space; the code feature graph is constructed from the abstract syntax tree (AST), the control flow graph (CFG) and the program dependency graph (PDG), its edge types are those of the AST, the CFG and the PDG, and its node set is the node set of the AST; a defect feature subgraph related to the defective code lines is extracted from the code feature graph for graph embedding;

step 3, learning implicit features of the defect code through a graph neural network, and training a defect detection model;

step 4, processing the project to be detected and inputting it into the optimal detection model to detect the defects in the project; for each detected defect, calculating its similarity to the known defects of the same type in the defect data set according to its defect type, and matching the known defect with the highest similarity;

step 5, extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, constructing a software defect knowledge graph, and mining key information in the similar defect reports to help developers understand the detected defects.

Further, the step 1 of collecting public defect data and constructing a defect data set includes the following specific process:

step 1-1, collecting public defect data including a defect report and a defect code from a defect tracking library and an open source code library;

step 1-2, preprocessing the collected defect data, extracting the defect code at the function level, cleaning the function-level defect code, and removing redundant information, including code comments and declared global parameters, to obtain the defect data set.
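The cleaning in step 1-2 can be illustrated with a minimal sketch that removes comments from function-level code. The source does not name the programming language of the defect code or the cleaning tool, so the C-style comment syntax and the function name `strip_comments` below are assumptions made only for illustration.

```python
import re

# One pattern that matches either a literal (kept) or a comment (dropped).
# Matching literals first prevents "//" inside a string from being removed.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)


def strip_comments(code: str) -> str:
    """Return `code` with C-style comments replaced by a single space."""
    def keep_or_drop(match):
        text = match.group(0)
        # Comments start with "/", literals start with a quote character.
        return " " if text.startswith("/") else text

    return _TOKEN.sub(keep_or_drop, code)


if __name__ == "__main__":
    sample = 'int f(int a) { // add one\n  char *s = "// kept"; /* gone */ return a + 1; }'
    print(strip_comments(sample))
```

Removing declared global parameters would need a real parser rather than a regular expression, so it is left out of this sketch.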

Further, the step 2 of preprocessing the defect code, modeling the code through a code feature graph and embedding the code into a vector space includes:

step 2-1, classifying the defects according to description information in the defect report and by combining the reasons of the defects, merging the defect sub-types through an abstract relation, and obtaining a defect type table, wherein the defect types comprise functional defects, interface defects, logic defects, calculation defects and data defects;

step 2-2, classifying the defect data in the defect data set, constructing a corpus, and dividing a training set and a verification set;

step 2-3, performing code representation on the defect code in the training set by representing it as an abstract syntax tree (AST), a control flow graph (CFG) and a program dependency graph (PDG), and combining the three graphs into a code feature graph, wherein the edge types of the graph are those of the AST, the CFG and the PDG, and the node set is the node set of the AST;

step 2-4, segmenting the defect code feature graph obtained in step 2-3 through program slicing to obtain a defect feature subgraph related to the defective code lines;

step 2-5, identifying the variable names and function names in the defect feature subgraph and renaming them in order of occurrence within a single function, so as to normalize the code and eliminate the noise introduced by human naming;

step 2-6, performing graph embedding on the leaf nodes and the statement nodes of the defect feature subgraph using Word2Vec and Doc2Vec respectively, to obtain vector representations that can be used as input to the graph neural network.
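As a rough illustration of step 2-3, the sketch below merges three edge lists over a shared AST node set into one graph with typed edges. The function name `build_code_feature_graph`, the tuple-based edge format and the toy node names are hypothetical; a real pipeline would obtain the AST, CFG and PDG from a code analysis tool.

```python
def build_code_feature_graph(ast_nodes, ast_edges, cfg_edges, pdg_edges):
    """Combine AST, CFG and PDG edges over the AST node set.

    Returns (nodes, typed_edges), where typed_edges maps each edge type
    to a list of (source, target) pairs.
    """
    nodes = set(ast_nodes)
    typed_edges = {"AST": [], "CFG": [], "PDG": []}
    sources = {"AST": ast_edges, "CFG": cfg_edges, "PDG": pdg_edges}
    for edge_type, edges in sources.items():
        for src, dst in edges:
            # Every edge must connect nodes that exist in the AST node set.
            if src not in nodes or dst not in nodes:
                raise ValueError(f"{edge_type} edge uses unknown node: {(src, dst)}")
            typed_edges[edge_type].append((src, dst))
    return nodes, typed_edges


if __name__ == "__main__":
    # Toy function body: declare x, test x, return x.
    nodes = ["func", "decl_x", "if_x", "return_x"]
    ast = [("func", "decl_x"), ("func", "if_x"), ("func", "return_x")]
    cfg = [("decl_x", "if_x"), ("if_x", "return_x")]
    pdg = [("decl_x", "if_x"), ("decl_x", "return_x")]
    print(build_code_feature_graph(nodes, ast, cfg, pdg)[1])
```

Keeping the edge type as the dictionary key means the same pair of nodes can be connected under several relations, which is what the relation-aware aggregation in step 3 relies on.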

Further, the step 3 of learning the implicit features of the defect code through a graph neural network and training the defect detection model includes the following specific process:

step 3-1, inputting the defect feature subgraph into a relational graph convolutional network for training, wherein the graph embedding vectors obtained in step 2 serve as the initial node vectors required by the iterations of the relational graph convolutional network;

step 3-2, iteratively aggregating the node features through the relational graph convolutional network, wherein the aggregation at the l-th iteration is calculated by the following formula:

$$h_i^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_s^{(l)} h_i^{(l)} \right)$$

wherein $h_i^{(l)}$ denotes the feature representation of node i at iteration l; $N_i^r$ denotes the set of neighbor nodes of node i under relation $r \in R$, R being the set of relations (i.e., the edge types AST, CFG and PDG); $c_{i,r}$ is the number of neighbor nodes that have relation r with node i; $W_r^{(l)}$ denotes the transformation matrix for relation r; $W_s^{(l)}$ is the self-connection term, which ensures that the feature representation of a node at iteration l also influences its feature representation at iteration l + 1; and $\sigma(\cdot)$ is the activation function;

step 3-3, summing the feature embedding vectors of the obtained node features to aggregate the nodes at the graph level, and training a classifier with softmax and a multi-class cross-entropy loss function to obtain the defect detection model; the defect code in the verification set of step 2-2 is processed by the code representation of steps 2-3 to 2-6 and input into the trained defect detection model for model optimization.
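Steps 3-2 and 3-3 can be sketched in plain Python as follows. Each node adds its own self-connection transform to the relation-specific transforms of its neighbors, averaged per relation, and the node vectors are then summed and passed through softmax. The choice of ReLU for the activation, the edge direction (messages flow from source to target) and all function names are assumptions; the source only states that an activation function and a softmax classifier are used.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rgcn_layer(h, edges, w_rel, w_self):
    """One relational graph convolution step.

    h:      dict node -> feature vector at iteration l
    edges:  dict relation -> list of (source, target) pairs
    w_rel:  dict relation -> transformation matrix for that relation
    w_self: matrix of the self-connection term
    """
    out = {}
    for i in h:
        acc = matvec(w_self, h[i])                      # self-connection term
        for relation, pairs in edges.items():
            neighbors = [src for src, dst in pairs if dst == i]
            if not neighbors:
                continue
            c = len(neighbors)                          # normalization constant
            for j in neighbors:
                message = matvec(w_rel[relation], h[j])
                acc = [a + m / c for a, m in zip(acc, message)]
        out[i] = [max(0.0, a) for a in acc]             # ReLU (assumed activation)
    return out


def classify_graph(h, w_out):
    """Sum node vectors into a graph vector and apply a softmax classifier."""
    dim = len(next(iter(h.values())))
    graph_vec = [sum(vec[k] for vec in h.values()) for k in range(dim)]
    logits = matvec(w_out, graph_vec)
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]          # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    h0 = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    edges = {"AST": [(0, 1)], "CFG": [(1, 0)], "PDG": []}
    weights = {"AST": identity, "CFG": identity, "PDG": identity}
    h1 = rgcn_layer(h0, edges, weights, identity)
    print(h1, classify_graph(h1, identity))
```

Training the weight matrices with a cross-entropy loss is omitted here; in practice a deep learning framework would handle the gradients.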

Further, the step 4 of processing the project to be detected, inputting it into the optimal detection model, detecting the defects in the project, calculating the similarity between each detected defect and the known defects of the same type in the defect data set according to the defect type, and matching the known defect with the highest similarity includes the following specific process:

step 4-1, splitting the project to be detected into function-level code fragments, processing them with the code representation of step 2, inputting them into the optimal detection model trained in step 3 for detection, and outputting the defective functions and their corresponding defect types;

step 4-2, comparing the defect type obtained in step 4-1 with the defect types in the defect data set, selecting from the data set the set of defect codes whose type matches that of the detected defect code as the search space, and inputting the code feature graphs obtained by code representation of this defect code set and of the detected defect code into a graph similarity function;

step 4-3, calculating the similarity between the graphs with the graph similarity function using the following formula:

wherein S(G, G') denotes the similarity between graphs G and G', φ(G) and φ(G') denote the graph embeddings of the code feature graphs, and d(φ(G), φ(G')) denotes the spatial distance between φ(G) and φ(G');

step 4-4, ranking the calculated graph similarities, selecting the known defect in the data set with the highest similarity as the matched similar defect, and outputting the similar defect ID.
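The matching in steps 4-2 to 4-4 can be sketched as below. The similarity formula itself is not reproduced in this text, so the sketch assumes one common form, S = 1 / (1 + d) with d the Euclidean distance between graph embeddings; this form, the record layout and the function names are assumptions, not the formula of the invention.

```python
import math


def graph_similarity(embedding_a, embedding_b):
    """Assumed similarity: 1 / (1 + Euclidean distance between embeddings)."""
    distance = math.dist(embedding_a, embedding_b)
    return 1.0 / (1.0 + distance)


def most_similar_defect(detected_embedding, detected_type, known_defects):
    """Return the ID of the most similar known defect of the same type.

    known_defects is a list of dicts with keys "id", "type" and "embedding".
    Returns None when no known defect shares the detected type.
    """
    # Step 4-2: restrict the search space to defects of the same type.
    candidates = [d for d in known_defects if d["type"] == detected_type]
    # Steps 4-3 and 4-4: score every candidate and rank by similarity.
    ranked = sorted(
        candidates,
        key=lambda d: graph_similarity(detected_embedding, d["embedding"]),
        reverse=True,
    )
    return ranked[0]["id"] if ranked else None


if __name__ == "__main__":
    known = [
        {"id": "BUG-1", "type": "logic", "embedding": [0.0, 0.0]},
        {"id": "BUG-2", "type": "logic", "embedding": [1.0, 1.0]},
        {"id": "BUG-3", "type": "data", "embedding": [0.9, 0.9]},
    ]
    print(most_similar_defect([0.9, 1.0], "logic", known))
```

Any monotonically decreasing function of the distance yields the same ranking, so the exact form of S matters only if similarity scores are compared against a fixed threshold.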

Further, the step 5 of extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, and constructing a software defect knowledge graph includes the following specific process:

step 5-1, acquiring the corresponding defect report according to the similar defect ID, preprocessing the defect report, and extracting the information with important semantic value, including the defect number, defect title, product, component, severity, modification state, defect handler, defect reporter and defect description information;

step 5-2, knowledge extraction: applying a named entity recognition network to classify defect entities and identify entity relations in the defect titles and defect description information extracted in step 5-1, and extracting the defect entities, their attributes and the relations among the entities from the knowledge graph corresponding to the defect data set;

step 5-3, after the defect knowledge is acquired, performing knowledge fusion to eliminate entity ambiguity, and adding the defect knowledge entities that introduce no semantic contradiction into the defect knowledge graph.

Further, the method also comprises a step 6 of constructing a user portrait from the questions about the software defect domain asked by the user, performing semantic understanding of the questions, searching the constructed defect knowledge graph according to the entity knowledge in the questions, generating an answer list, and outputting a result in combination with the user portrait, wherein the specific process includes:

step 6-1, preprocessing the user question: parsing, completing and normalizing the question in combination with the question-and-answer context, and extracting the defect entities and relations in the user's question;

step 6-2, graph search and reasoning: mapping the extracted defect entities and relations into structured query statements of a Neo4j graph database to search the knowledge graph;

step 6-3, answer ranking: combining the user features obtained from the user portrait with the candidate answer list obtained by the subgraph search, scoring the candidate answers through a RankNet model, and ranking the candidate answers by score.
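As an illustration of step 6-2, the sketch below maps one extracted entity and one relation to a parameterized Cypher query string. The node label `Defect`, the property `name`, the relationship types and the function name `build_cypher_query` are hypothetical; the source does not describe the schema of the knowledge graph. The sketch only builds the query text and its parameters and does not connect to a database.

```python
import re

# Hypothetical relationship types; a real graph schema would define its own.
ALLOWED_RELATIONS = {"HAS_CAUSE", "HAS_FIX", "AFFECTS_COMPONENT"}


def build_cypher_query(entity_name, relation, limit=10):
    """Map an extracted (entity, relation) pair to a Cypher query.

    The entity name is passed as the query parameter $name. The relationship
    type is inserted into the query text, so it is validated first.
    """
    if relation not in ALLOWED_RELATIONS or not re.fullmatch(r"[A-Z_]+", relation):
        raise ValueError(f"unsupported relation: {relation!r}")
    query = (
        "MATCH (d:Defect {name: $name})-[:" + relation + "]->(answer) "
        "RETURN answer LIMIT " + str(int(limit))
    )
    return query, {"name": entity_name}


if __name__ == "__main__":
    query, params = build_cypher_query("buffer overflow", "HAS_CAUSE")
    print(query)
    print(params)
```

Passing the entity name as a parameter rather than concatenating it into the query keeps user-supplied text out of the query string.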

Based on the same inventive concept, the invention provides an interpretable software vulnerability detection and recommendation system, which comprises:

the defect data preparation module is used for collecting public defect data, including a defect report and a defect code, and constructing a defect data set;

the code feature modeling module is used for preprocessing the defect code, modeling the code through a code feature graph and embedding the code into a vector space; the code feature graph is constructed from the abstract syntax tree (AST), the control flow graph (CFG) and the program dependency graph (PDG), its edge types are those of the AST, the CFG and the PDG, and its node set is the node set of the AST; a defect feature subgraph related to the defective code lines is extracted from the code feature graph for graph embedding;

the defect detection model building module is used for learning the implicit features of the defect code through a graph neural network and training a defect detection model;

the defect detection and matching module is used for processing the project to be detected, inputting it into the optimal detection model, detecting the defects in the project, calculating the similarity between each detected defect and the known defects of the same type in the defect data set according to the defect type, and matching the known defect with the highest similarity;

and the software defect knowledge graph construction module is used for extracting the defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, constructing a software defect knowledge graph, and mining the key information in the similar defect reports to help developers understand the detected defects.

Furthermore, the system also comprises a question-answering module, which constructs a user portrait from the questions about the software defect domain asked by the user, performs semantic understanding of the questions, searches the constructed defect knowledge graph according to the entity knowledge in the questions, generates an answer list, and outputs a result in combination with the user portrait.

Advantageous effects: compared with the prior art, the invention has the following notable advantages: 1) it describes the implicit features of defect code from a graph perspective, and makes better use of the multi-dimensional semantic features of defect code by constructing a code feature graph; 2) by introducing a relational graph convolutional network, the edge type is embedded into the node feature learning process as a special attribute, which improves the training effect of the model; 3) similarity matching of defect code feature subgraphs is used to mine the associations among defect codes as fully as possible; 4) existing knowledge graph and NLP techniques are combined to provide textual descriptions and visual displays for the matched similar defects, which supplies fine-grained and interpretable auxiliary information for the detection results to a certain extent and lays a foundation for subsequent applied research on software defect localization and repair.

Drawings

FIG. 1 is a general flow diagram of a knowledge-driven software bug detection and analysis method in one embodiment.

FIG. 2 is a detailed flow diagram of a knowledge-driven software defect detection and analysis method in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In one embodiment, with reference to fig. 1, the present invention provides a knowledge-driven software defect detection and analysis method, including the following steps:

step 1, collecting public defect data and constructing a defect data set;

step 2, preprocessing the defect code, modeling the code through a code feature graph and embedding the code into a vector space;

step 5, extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, constructing a software defect knowledge graph, and mining key information (such as the defect type and the defect cause) in the similar defect reports to help developers understand the detected defects.

Further, in one embodiment, as shown in fig. 2, the step 1 of constructing the defect data set includes:

step 1-1, collecting public defect data including defect reports (such as defect description, defect types and the like) and defect codes from a defect tracking library (such as Bugzilla and the like) and an open source code library (such as GitHub and the like);

step 1-2, preprocessing the collected defect data, extracting the defect code at the function level, cleaning the function-level defect code, and removing redundant information, including code comments, declared global parameters and the like, to obtain the defect data set.

Further, in one embodiment, the step 2 of preprocessing the defect code, modeling the code through a code feature map, and embedding the code into a vector space includes:

step 2-1, classifying the defects according to description information in the defect report and by combining the reasons of the defects, merging the defect sub-types through an abstract relation, and obtaining a defect type table, wherein the defect types comprise functional defects, interface defects, logic defects, calculation defects and data defects; the defect type table of this example is shown in table 1 below.

TABLE 1 Defect types Table

step 2-2, classifying the defect data in the defect data set according to Table 1, constructing a corpus, and dividing it into a training set and a verification set at a ratio of 8:2;

step 2-3, performing code representation on the defect code in the training set by representing it, with the open source tool Joern, as an abstract syntax tree (AST), a control flow graph (CFG) and a program dependency graph (PDG), and combining the three graphs into a code feature graph, wherein the edge types of the graph are those of the AST, the CFG and the PDG, and the node set is the node set of the AST;

step 2-4, identifying the sensitive operations in the defect code, and segmenting the defect code feature graph obtained in step 2-3 through program slicing to obtain a defect feature subgraph closely related to the defective code lines;

step 2-5, identifying the variable names and function names in the defect feature subgraph and renaming them, in order of occurrence within a single function, as Var1, Var2, ... and Func1, Func2, ... respectively, so as to normalize the code and eliminate the noise introduced by human naming;
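The renaming of step 2-5 can be sketched as follows. The sketch assumes that the sets of variable names and function names have already been identified (for example from the AST) and are passed in by the caller; the function name `normalize_identifiers` is hypothetical.

```python
import re


def normalize_identifiers(code, variables, functions):
    """Rename identifiers to Var1, Var2, ... and Func1, Func2, ... in order
    of first occurrence. Returns the rewritten code and the name mapping."""
    mapping = {}
    counters = {"Var": 0, "Func": 0}

    def rename(match):
        name = match.group(0)
        if name in variables:
            kind = "Var"
        elif name in functions:
            kind = "Func"
        else:
            return name                      # keywords, types, library calls
        if name not in mapping:
            counters[kind] += 1
            mapping[name] = f"{kind}{counters[kind]}"
        return mapping[name]

    # Visit identifier-shaped tokens from left to right.
    return re.sub(r"[A-Za-z_]\w*", rename, code), mapping


if __name__ == "__main__":
    source = "int total = add(count, step); total = scale(total);"
    normalized, names = normalize_identifiers(
        source, {"total", "count", "step"}, {"add", "scale"}
    )
    print(normalized)
    print(names)
```

This token-level rewrite would also rename matching words inside string literals, so a production version should operate on AST leaf nodes instead of raw text.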

Further, in one embodiment, the step 3 of learning the implicit features of the defect code through a graph neural network and training the defect detection model includes:

step 3-1, inputting the defect feature subgraph into a relational graph convolutional network for training, wherein the graph embedding vectors obtained in step 2-6 serve as the initial node vectors required by the iterations of the relational graph convolutional network;

Further, in one embodiment, the step 4 of processing the project to be detected, inputting it into the optimal detection model, detecting the defects in the project, calculating the similarity between each detected defect and the known defects of the same type in the defect data set according to the defect type, and matching the known defect with the highest similarity includes the following specific process:

step 4-1, splitting the project to be detected into function-level code fragments, processing them with the code representation of steps 2-3 to 2-6, inputting them into the optimal detection model obtained through the optimization in step 3-3 for detection, and outputting the defective functions and their corresponding defect types;

step 4-2, comparing the defect type obtained in step 4-1 with the defect types of the corpus of step 2-2 (including the training set and the verification set), selecting from the corpus the set of defect codes whose type matches that of the detected defect code as the search space, and inputting the code feature graphs obtained by code representation of this defect code set and of the detected defect code into the graph similarity function;

step 4-4, ranking the calculated graph similarities, selecting the known defect in the corpus with the highest similarity as the matched similar defect, and outputting the similar defect ID.

Further, in one embodiment, the step 5 of extracting defect knowledge with semantic value in the similar defect report, performing knowledge fusion and detection, and constructing a software defect knowledge graph includes the specific processes:

step 5-1: acquiring a corresponding defect report according to the similar defect ID, carrying out natural language preprocessing, and extracting information with important semantic value from the similar defect report, wherein the information comprises a defect number, a defect title, a product, a component, severity, a modification state, a defect handler, a defect reporter, defect description information and the like;

step 5-2: defect knowledge extraction, namely, applying a named entity recognition network (such as a deep neural network Bi-LSTM + CRF combined with the Attention) to classify defect entities and recognize entity relations from the defect titles and the defect description information extracted in the step 5-1, and then extracting defect entities, attributes and interrelations among the entities from a knowledge graph corresponding to a defect data set (namely, extracting knowledge related to similar defect reports from a complete knowledge graph constructed based on the defect data acquired in the step 1);

step 5-3: knowledge fusion and detection: after the defect knowledge is acquired, entity alignment is performed to eliminate entity ambiguity, and the similarity between every two entities is calculated using the Levenshtein distance, i.e., the minimum edit distance, according to the following formula:

$$\mathrm{sim}_1 = 1 - \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max(|a|, |b|)}$$

wherein a and b are two entity strings, |a| and |b| are their lengths, and $\mathrm{lev}_{a,b}(|a|, |b|)$ is the Levenshtein distance between them; if $\mathrm{sim}_1$ is greater than the set threshold, the two entities are treated as the same entity; after quality evaluation, the qualified part, namely the defect knowledge entities that introduce no semantic contradiction, is added to the knowledge graph.
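The entity alignment test of step 5-3 can be sketched directly from the formula: compute the Levenshtein distance with the standard dynamic-programming recurrence, normalize by the longer string length, and compare against a threshold. The threshold value 0.8 and the function names are assumptions; the source only refers to "the set threshold".

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # deletion
                current[j - 1] + 1,       # insertion
                previous[j - 1] + cost,   # substitution or match
            ))
        previous = current
    return previous[-1]


def entity_similarity(a, b):
    """sim1 = 1 - lev(a, b) / max(|a|, |b|); two empty strings count as equal."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def same_entity(a, b, threshold=0.8):
    """Treat two entity strings as the same entity when sim1 exceeds the threshold."""
    return entity_similarity(a, b) > threshold


if __name__ == "__main__":
    print(entity_similarity("kitten", "sitting"))
    print(same_entity("NullPointerException", "NullPointerExeption"))
```

A purely string-based measure merges spelling variants but not synonyms, so a real pipeline would likely combine it with other alignment signals.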

With the scheme of this embodiment, the defect reports are converted into a defect knowledge graph that supports the defect question answering of the next stage, which works better than the traditional approach based on information retrieval.

Further, in one embodiment, the method further comprises a step 6 of constructing a user portrait from the questions about the software defect domain asked by the user, performing semantic understanding of the questions, searching the constructed defect knowledge graph according to the entity knowledge in the questions, generating an answer list, and outputting a result in combination with the user portrait. The step specifically comprises:

step 6-1, preprocessing the user question: parsing, completing and normalizing the question in combination with the question-and-answer context, extracting the defect entities and relations, and deleting meaningless words;

step 6-2, graph search and reasoning: mapping the extracted defect entities and relations into Cypher, the structured query language of the Neo4j graph database, to search the knowledge graph;

step 6-3, answer ranking: a user portrait is generated, wherein the emotion computation for the user can be represented by a triple, expressed as follows:

$$ST = \langle T, C, I \rangle$$

where T denotes the set of user information items, i.e., $T = \{t_1, t_2, \ldots, t_n\}$; these information carriers are the software defect questions posed by the user. C denotes the set of emotion categories or tendency classes, i.e., $C = \{c_1, c_2, \ldots, c_n\}$. For example, a user who fails to obtain the desired result after multiple queries may show some "urgency" or "frustration"; in that case the system should reduce its interaction with the user, or politely tell the user that no answer can be found, rather than return an irrelevant answer. This representation can express discrete emotional features and can compose more complex emotions from basic ones, so the emotional features can be divided into two or more categories according to the application purpose in order to create different emotion classification models. The model directly reflects the basic understanding of emotion granularity. I denotes the set of emotional feature intensities, i.e., $I = \{i_1, i_2, \ldots, i_n\}$. Intensity is generally divided into three levels (high, medium and low) or five levels (extremely high, high, medium, low and extremely low). Intensity features combined with emotional features form the core and foundation of emotion computation.

According to this definition, computing the user's emotion amounts to acquiring and identifying the software defect knowledge in the question entered by the user, so that the user's emotion can be computed along different dimensions. The emotion computation can therefore be expressed as the state space formed by the Cartesian product of the three elements above, i.e.,

$$ST = T \times C \times I$$

Through this emotion computation, the system can extract the user's emotional features and build the user portrait through qualitative and quantitative analysis and behavior modeling, in preparation for ranking the candidate answers during question answering.

The user features obtained from the user portrait are combined with the candidate answer list, the candidate answers are scored through a RankNet model, and the candidate answers are ranked by score.

With the scheme of this embodiment, the defect knowledge graph is combined with query templates so that user defect questions are matched and answers or solutions are queried efficiently and reliably; combined with the user portrait, the intent of the user's question is understood accurately and accurate answers are given, which improves user satisfaction.
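The scoring idea behind RankNet can be sketched as follows: a scoring function assigns each candidate a score, and the probability that candidate i should rank above candidate j is the logistic function of the score difference, which is trained with a cross-entropy loss. RankNet normally uses a neural network as the scorer; the linear scorer, the feature layout (user features concatenated with answer features) and all names below are simplifying assumptions.

```python
import math


def score(weights, features):
    """Linear scoring function (a stand-in for the RankNet scoring network)."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_probability(score_i, score_j):
    """Modelled probability that candidate i ranks above candidate j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def ranknet_loss(score_i, score_j, target):
    """Cross-entropy between the target preference (1.0 if i should rank
    above j, 0.0 otherwise) and the modelled pairwise probability."""
    p = pairwise_probability(score_i, score_j)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def rank_answers(weights, user_features, candidates):
    """Rank (answer, answer_features) pairs by descending score."""
    scored = [
        (score(weights, user_features + answer_features), answer)
        for answer, answer_features in candidates
    ]
    return [answer for _, answer in sorted(scored, key=lambda t: t[0], reverse=True)]


if __name__ == "__main__":
    weights = [0.5, 1.0, 2.0]               # hypothetical learned weights
    user = [1.0]                            # hypothetical user portrait feature
    answers = [("A", [0.2, 0.1]), ("B", [0.1, 0.9])]
    print(rank_answers(weights, user, answers))
```

Because the user features are identical for every candidate of one query, they change the ranking only when the scorer models interactions between user and answer features, which a neural scorer can do and this linear stand-in cannot.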

In one embodiment, a knowledge-driven software bug detection and analysis system is presented, the system comprising:

The system embodiment and the method embodiment belong to the same inventive concept; for the specific implementation of each module, refer to the method embodiment, which is not repeated here.

The invention makes better use of the implicit features of defect code and fully mines the associations among similar defect codes. On the basis of identifying multiple defect types in the code, it further analyzes the detected defects by matching them with known defects, which gives it stronger generality. It alleviates the high false alarm rate that arises in traditional defect detection techniques because effective and comprehensive defect patterns are difficult to formulate, and it analyzes detected defects from a knowledge perspective, so that it applies to a wider range of practical settings with higher precision and stronger interpretability.

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A knowledge-driven software defect detection and analysis method is characterized by comprising the following steps:

step 1, collecting public defect data including a defect report and a defect code, and constructing a defect data set;

step 5, extracting defect knowledge with semantic value from the similar defect reports, performing knowledge fusion and detection, constructing a software defect knowledge graph, and mining key information in the similar defect reports to help developers understand the detected defects.

2. The knowledge-driven software defect detection and analysis method according to claim 1, wherein the step 1 of collecting the public defect data and constructing the defect data set comprises the following specific processes:

3. The knowledge-driven software defect detection and analysis method of claim 1, wherein the step 2 preprocesses the defect code, models the code through a code feature map, and embeds the code into a vector space, and the specific process includes:

4. The knowledge-driven software defect detection and analysis method of claim 1, wherein step 3 of learning the implicit features of the defect code through the graph neural network and training the defect detection model comprises the following specific process:

wherein $N_i^r$ denotes the set of neighbor nodes of node $i$ under relation $r \in R$, where $R$ is the relation set comprising the edge types AST, CFG and PDG; $c_{i,r}$ is the number of neighbor nodes having relation $r$ with node $i$; $W_r^{(l)}$ denotes the transition matrix for relation $r$; the self-connection term $W_0^{(l)} h_i^{(l)}$ ensures that the feature representation of the node at iteration $l$ also influences its feature representation at iteration $l+1$; and $\sigma(\cdot)$ is the activation function;

step 3-3, summing the node feature embedding vectors obtained above to aggregate the nodes at the graph level, and training a classifier with softmax and a multi-class cross-entropy loss function to obtain the defect detection model.
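The node update and graph-level aggregation described in this claim can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the update follows the standard relational graph convolution form, uses ReLU as the activation function, and all function names and the `neighbors` data layout are hypothetical.

```python
import math

RELATIONS = ("AST", "CFG", "PDG")  # edge types named in the claim

def matvec(vec, mat):
    """Row vector times matrix, with the matrix given as a list of rows."""
    return [sum(v * row[k] for v, row in zip(vec, mat)) for k in range(len(mat[0]))]

def rgcn_layer(h, neighbors, w_rel, w_self):
    """One iteration: features at iteration l in, features at iteration l + 1 out.

    h:         list of node feature vectors
    neighbors: relation -> {node index -> list of neighbor indices} (assumed layout)
    w_rel:     relation -> transition matrix
    w_self:    matrix of the self-connection term
    """
    new_h = []
    for i, h_i in enumerate(h):
        total = matvec(h_i, w_self)  # self-connection term
        for rel in RELATIONS:
            js = neighbors.get(rel, {}).get(i, [])
            if not js:
                continue
            c_ir = len(js)  # number of neighbors of node i under this relation
            # Averaging first and multiplying once equals summing (1 / c_ir) * W_r * h_j.
            mean = [sum(h[j][k] for j in js) / c_ir for k in range(len(h_i))]
            msg = matvec(mean, w_rel[rel])
            total = [t + m for t, m in zip(total, msg)]
        new_h.append([max(0.0, t) for t in total])  # ReLU, an assumed activation
    return new_h

def graph_readout(h):
    """Graph-level aggregation: sum the node embeddings component-wise."""
    return [sum(col) for col in zip(*h)]

def softmax_cross_entropy(logits, label):
    """Multi-class cross-entropy of softmax(logits) against the true class index."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[label]
```

In such a sketch, the class logits would be obtained by passing `graph_readout(h)` through a final linear layer, for example `matvec(graph_readout(h), w_cls)` with a hypothetical classifier matrix `w_cls`, and the loss would be minimized by gradient descent over all weight matrices.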

5. The knowledge-driven software defect detection and analysis method of claim 1, wherein step 4 of processing the project to be detected, inputting the processed project into the optimal detection model, detecting the defects present in the project, calculating, according to the type of each detected defect, its similarity to the known defects of the same type in the defect data set, and matching the known defect with the highest similarity comprises the following specific steps:
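The matching described in this claim (restrict the candidates to known defects of the same type, then keep the most similar one) might look like the sketch below. The similarity measure is not fixed by the text above; cosine similarity over embedding vectors is an assumption, and the record fields `type`, `vector` and `id` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_known_defect(detected, known_defects):
    """Return the known defect of the same type with the highest similarity, or None."""
    same_type = [k for k in known_defects if k["type"] == detected["type"]]
    if not same_type:
        return None
    return max(
        same_type,
        key=lambda k: cosine_similarity(detected["vector"], k["vector"]),
    )
```

Filtering by type before scoring keeps a highly similar defect of a different type from being returned, which mirrors the "known defects of the same type" wording of the claim.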

6. The knowledge-driven software defect detection and analysis method according to claim 1, wherein the step 5 of extracting defect knowledge with semantic value in similar defect reports, performing knowledge fusion and detection, and constructing a software defect knowledge graph comprises the following specific processes:

step 5-1, acquiring the corresponding defect report according to the ID of the similar defect, preprocessing the defect report, and extracting the information with important semantic value, wherein the information comprises the defect number, defect title, product, component, severity, modification status, defect handler, defect reporter and defect description information;

step 5-2, knowledge extraction: applying a named entity recognition network to the defect title and defect description information extracted in step 5-1 to perform entity classification and entity relationship recognition, and extracting the entities, attributes and inter-entity relationships of the knowledge graph corresponding to the defect data set;
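As a rough illustration of how the structured fields listed in step 5-1 could become knowledge graph triples, consider the sketch below. It covers only the structured fields; the named entity recognition network applied to the title and description in step 5-2 is not shown. The field keys and relation names are hypothetical and are not taken from the source.

```python
# Hypothetical mapping from relation name to the report field that supplies its object.
FIELD_RELATIONS = {
    "has_title": "title",
    "affects_product": "product",
    "in_component": "component",
    "has_severity": "severity",
    "has_status": "status",
    "handled_by": "handler",
    "reported_by": "reporter",
}

def report_to_triples(report):
    """Turn one preprocessed defect report into (subject, relation, object) triples."""
    subject = report["bug_id"]
    triples = []
    for relation, field in FIELD_RELATIONS.items():
        value = report.get(field)
        if value:  # skip fields that are missing or empty
            triples.append((subject, relation, value))
    return triples

def merge_triples(*triple_lists):
    """Combine triples from several reports, dropping exact duplicates in order."""
    seen = set()
    merged = []
    for triples in triple_lists:
        for triple in triples:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged
```

Exact-duplicate removal is only the simplest form of fusion; resolving different surface names that refer to the same entity would need an additional alignment step.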

7. The knowledge-driven software defect detection and analysis method of claim 1, further comprising step 6 of constructing a user portrait according to the software-defect-related questions asked by the user, performing semantic understanding of each question, performing a graph search on the constructed defect knowledge graph according to the entity knowledge in the question, generating an answer list, and outputting a result in combination with the user portrait.

8. The knowledge-driven software defect detection and analysis method of claim 7, wherein step 6 of constructing a user portrait according to the software-defect-related questions asked by the user, performing semantic understanding of each question, performing a graph search on the constructed defect knowledge graph according to the entity knowledge in the question, generating an answer list, and outputting a result in combination with the user portrait comprises the following specific process:

step 6-1, preprocessing the user question: parsing, completing and normalizing the question sentence in combination with the question-and-answer context information, and extracting the defect entities and relations in the user's question sentence;

step 6-2, graph search and reasoning: mapping the extracted defect entities and relations into a structured query statement for the Neo4j graph database to perform a subgraph search;

step 6-3, answer ranking: scoring the candidate answers with a RankNet model that combines the user features obtained from the user portrait with the candidate answer list, and ranking the candidate answers by score.
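Steps 6-2 and 6-3 can be sketched as follows. The graph schema (a `Defect` label, the relationship types, the `name` and `id` properties) is hypothetical, and the linear scorer stands in for the neural scoring function that a RankNet model would learn. Only the query text and its parameters are built here; running it against a database is left out.

```python
import math

# Hypothetical relationship types; Cypher cannot take these as parameters,
# so they are checked against a fixed set before being placed in the query text.
ALLOWED_RELATIONS = {"AFFECTS", "REPORTED_BY", "HAS_SEVERITY"}

def build_cypher(entity_name, relation):
    """Map an extracted (entity, relation) pair to a parameterized Cypher query."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError("unknown relation: " + relation)
    query = (
        "MATCH (d:Defect)-[:" + relation + "]->(e {name: $name}) "
        "RETURN d.id AS defect_id"
    )
    return query, {"name": entity_name}

def score(user_features, answer_features, weights):
    """Stand-in scorer: weighted sum over user and answer features."""
    x = list(user_features) + list(answer_features)
    return sum(w * v for w, v in zip(weights, x))

def ranknet_probability(s_i, s_j):
    """RankNet's modeled probability that item i should rank above item j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def rank_answers(user_features, candidates, weights):
    """Order the candidate answers from highest to lowest score."""
    return sorted(
        candidates,
        key=lambda c: score(user_features, c["features"], weights),
        reverse=True,
    )
```

Passing the entity name as the `$name` parameter rather than concatenating it into the query keeps user-supplied text out of the query string. During training, RankNet would fit the scorer by minimizing the cross-entropy between `ranknet_probability` and the known pairwise preferences.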

9. A knowledge-driven software defect detection and analysis system, the system comprising:

the code feature modeling module is used for preprocessing the defect code, modeling the code through a code feature graph and embedding the code into a vector space; the code feature graph is constructed from the abstract syntax tree (AST), the control flow graph (CFG) and the program dependence graph (PDG), its edge types are AST, CFG and PDG, and its node set is the node set of the AST; a defect feature subgraph related to the defective code lines is extracted from the code feature graph for graph embedding;

the defect detection model building module is used for learning the implicit characteristics of the defect codes through a graph neural network and training a defect detection model;
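The code feature graph handled by the code feature modeling module can be pictured with the sketch below: AST nodes form the node set, and edges carry one of the three types AST, CFG or PDG. How the defect feature subgraph is cut out is not specified in the text above; taking every node within a fixed number of hops of the defective lines is an assumption, as are the function names and the node attribute `line`.

```python
from collections import deque

def build_code_feature_graph(ast_nodes, ast_edges, cfg_edges, pdg_edges):
    """Merge the three edge sets over the AST node set into one typed graph."""
    graph = {"nodes": dict(ast_nodes), "edges": []}
    for etype, pairs in (("AST", ast_edges), ("CFG", cfg_edges), ("PDG", pdg_edges)):
        for src, dst in pairs:
            # Keep only edges whose endpoints are AST nodes.
            if src in graph["nodes"] and dst in graph["nodes"]:
                graph["edges"].append((src, dst, etype))
    return graph

def defect_subgraph(graph, defect_lines, hops=1):
    """Keep the nodes within `hops` edges of any node on a defective line."""
    adjacency = {}
    for src, dst, _ in graph["edges"]:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    seeds = [n for n, attrs in graph["nodes"].items() if attrs["line"] in defect_lines]
    depth = {n: 0 for n in seeds}
    queue = deque(seeds)
    while queue:  # breadth-first search, ignoring edge direction
        node = queue.popleft()
        if depth[node] == hops:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    kept = set(depth)
    return {
        "nodes": {n: graph["nodes"][n] for n in kept},
        "edges": [e for e in graph["edges"] if e[0] in kept and e[1] in kept],
    }
```

Keeping the edge type on every edge is what lets a relation-aware graph neural network apply a separate transformation per edge type during embedding.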

10. The knowledge-driven software defect detection and analysis system of claim 9, wherein the system further comprises:

a question-answering module, used for constructing a user portrait according to the software-defect-related questions asked by the user, performing semantic understanding of each question, performing a graph search on the constructed defect knowledge graph according to the entity knowledge in the question, generating an answer list, and outputting a result in combination with the user portrait.