CN111611010B - Interpretable method for code modification real-time defect prediction - Google Patents
Interpretable method for code modification real-time defect prediction
- Publication number
- CN111611010B
- Authority
- CN
- China
- Prior art keywords
- modification
- solution
- modified
- modifications
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/72—Code refactoring
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Biology (AREA)
- Software Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Genetics & Genomics (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Physiology (AREA)
- Stored Programmes (AREA)
Abstract
The invention discloses an interpretable method for real-time defect prediction of code modifications, comprising the following steps: 1) collecting historical modification information of the software code; 2) extracting, from the collected information, the feature data of each modification and the specific feature values corresponding to that feature data; 3) generating an initial population of feasible solutions by combining the extracted features, setting a fitness function, and evolving the population through repeated crossover and mutation while eliminating feasible solutions with low fitness, to obtain an optimal feasible solution; 4) calculating a score for each modification to be predicted from the optimal solution and that modification's specific feature values, dividing the score by the corresponding inspection cost, sorting the modifications in descending order of the quotient, and returning a set number of the top-ranked modifications to the developer as the modifications to be inspected. The predicted results are readily interpretable and easy for developers to accept, and they help programmers reduce the cost of repairing defects while maintaining software.
Description
Technical Field
The invention relates to computer software technology, and in particular to an interpretable method for real-time defect prediction of code modifications.
Background
Society today is highly informatized and software is ubiquitous. Transnational enterprises and small companies alike rely on software of every size, and ensuring software quality in order to maintain operating efficiency is one of their main concerns. Over its whole life cycle, software is modified many times to change functions or repair defects, and each modification carries the risk of introducing a defect. Repairing defects is labor-intensive: a latent defect is easier to repair if it is discovered early, and becomes harder to repair the later the defect-causing code is found. Therefore, if it can be predicted whether a modification introduces a defect, the range of defect localization can be narrowed and costs can be greatly reduced.
In the field of defect prediction, real-time defect prediction (just-in-time defect prediction) has shown clear advantages. In real-time defect prediction, once software has been modified, the modification information is used to predict promptly whether the modification has introduced a defect. The technique can find defects early, narrow the range of defect localization, and help programmers locate defects, which saves labor while seeking to find as many defects as possible within a limited cost. The technology therefore has broad application prospects and is favored by many internet companies; for example, Cisco has introduced it into its own software development.
Despite these advantages, current commercial real-time defect prediction techniques lack good interpretability, which limits their application. This patent provides an interpretable method: a genetic programming algorithm is used to improve the fitness of feasible solutions to their environment and thereby find a suitable optimal solution, and the optimal solution is then used to find the modifications most likely to introduce defects and recommend them to the developer for inspection. A feasible solution is a mathematically meaningful expression formed by selecting, with certain probabilities, code-modification features, operators, or constant values from -1 to 1. The fitness is determined as follows: the score of each modification is calculated from the feasible solution and the modification's specific feature values, all modifications are sorted in ascending order of score, and the sum of the serial numbers (positions in this ordering) of the defect-introducing modifications is taken as the fitness. Unlike black-box machine learning, the method uses a genetic algorithm grounded in biological theory to obtain the optimal solution (a mathematical expression). The optimal solution and its results help developers understand which features, when modified, are more likely to introduce defects, so the algorithm has good interpretability and is easy for developers to accept.
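The fitness definition in the paragraph above can be illustrated with a minimal Python sketch. The function name `rank_sum_fitness`, the example expression, and the four sample modifications are hypothetical and chosen only for illustration; the handling of tied scores (ties keep their input order) is also an assumption, since the text does not specify it.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]


def rank_sum_fitness(solution: Callable[[Features], float],
                     changes: List[Features],
                     labels: List[bool]) -> int:
    """Sum of the 1-based positions of defect-introducing changes
    after sorting all changes by ascending score."""
    scores = [solution(c) for c in changes]
    # Stable sort: tied scores keep their input order (an assumption).
    order = sorted(range(len(changes)), key=lambda i: scores[i])
    return sum(rank for rank, i in enumerate(order, start=1) if labels[i])


# Hypothetical feasible solution: the expression LA * 0.5 + NF.
def example_solution(f: Features) -> float:
    return f["LA"] * 0.5 + f["NF"]


# Invented sample data: True marks a defect-introducing modification.
changes = [
    {"LA": 2.0, "NF": 1.0},   # score 2.0
    {"LA": 10.0, "NF": 3.0},  # score 8.0
    {"LA": 4.0, "NF": 1.0},   # score 3.0
    {"LA": 8.0, "NF": 2.0},   # score 6.0
]
labels = [False, True, False, True]

fitness = rank_sum_fitness(example_solution, changes, labels)
print(fitness)  # the two defect-introducing changes land in positions 3 and 4
```

An expression that pushes defect-introducing modifications toward the end of the ascending ordering earns a larger sum, which is why a higher value means a fitter solution.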
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the shortcomings of the prior art, an interpretable method for real-time defect prediction of code modifications.
The technical scheme adopted by the invention for solving the technical problems is as follows: an interpretable method of code modification real-time defect prediction, comprising the steps of:
1) collecting historical modification information of the software code; the information is obtained from a code hosting system such as CVS, SVN, or Git;
the historical modification information includes the following feature data: NS, the number of subsystems involved in the modification; NF, the total number of files modified; Entropy, the distribution of the modification across the files; LT, the number of lines in the file before the modification; LA, the number of lines added by the modification; LD, the number of lines deleted by the modification; FIX, whether the current modification is intended to repair a defect; NEDV, the total number of developers who have modified the file; AGE, the time interval between the current modification and the previous one; NUC, the number of files involved in the current modification but not in the previous one; EXP, the developer's experience; SEXP, the developer's experience on the subsystem containing the modified file;
2) extracting, from the collected information, the feature data of the modifications and the specific feature values corresponding to that feature data;
3) generating an initial population of feasible solutions by combining the extracted features, setting a fitness function, and evolving the initial population through repeated crossover and mutation while eliminating feasible solutions with low fitness;
3.1) from all the feature data and the four operators of addition, subtraction, multiplication, and division, randomly selecting several features (no fewer than 2; a feature may be selected more than once) and combining them with the corresponding number of operators to form a mathematically meaningful expression as a feasible solution, thereby generating an initial population consisting of a set of feasible solutions;
3.2) calculating the score of each modification from its specific feature values and the mathematical expression, sorting all modifications in ascending order of score, and calculating the sum of the serial numbers of the defect-introducing modifications; the larger the sum, the higher the fitness;
3.3) repeatedly applying crossover and mutation to the population, eliminating solutions with low fitness, and selecting the solution with the highest fitness in the last generation as the optimal feasible solution;
4) after iterating to the last generation and selecting the solution with the highest fitness as the optimal solution, calculating the score of each modification to be predicted from the optimal solution and that modification's specific feature values, dividing the score by the corresponding inspection cost (the total number of lines added and deleted by the modification), sorting the modifications in descending order of the quotient, and returning a set number of the top-ranked modifications to the developer as the modifications to be inspected.
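Steps 3.1) to 3.3) above can be sketched in Python as a small genetic programming loop. This is an illustrative sketch under stated assumptions, not the patented implementation: the tuple-based expression trees, the selection probabilities, the population size, the depth cap `MAX_DEPTH`, the protected division, and the "keep the fitter half" elimination rule are all assumptions, and the sketch does not enforce the "no fewer than 2 features" condition of step 3.1). The data is synthetic.

```python
import random

# Feature names follow the list in step 1); everything else is hypothetical.
FEATURES = ["NS", "NF", "Entropy", "LT", "LA", "LD",
            "FIX", "NEDV", "AGE", "NUC", "EXP", "SEXP"]
OPS = ("+", "-", "*", "/")
MAX_DEPTH = 6  # assumed cap that keeps expressions short and readable

def random_tree(rng, depth=3):
    """Step 3.1: grow an expression; leaves are features or constants in [-1, 1]."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.8:
            return ("feat", rng.choice(FEATURES))
        return ("const", rng.uniform(-1.0, 1.0))
    return (rng.choice(OPS), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, row):
    """Score one modification (a dict of feature values) with an expression."""
    kind = tree[0]
    if kind == "feat":
        return row[tree[1]]
    if kind == "const":
        return tree[1]
    a, b = evaluate(tree[1], row), evaluate(tree[2], row)
    if kind == "/":
        return a / b if abs(b) > 1e-9 else 1.0  # protected division (assumption)
    return {"+": a + b, "-": a - b, "*": a * b}[kind]

def fitness(tree, rows, labels):
    """Step 3.2: sum of 1-based positions of defect-introducing rows, ascending score."""
    scores = [evaluate(tree, r) for r in rows]
    order = sorted(range(len(rows)), key=lambda i: scores[i])
    return sum(rank for rank, i in enumerate(order, start=1) if labels[i])

def tree_depth(tree):
    return 0 if tree[0] not in OPS else 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))

def paths(tree, prefix=()):
    """Yield the index path of every node in the tree."""
    yield prefix
    if tree[0] in OPS:
        yield from paths(tree[1], prefix + (1,))
        yield from paths(tree[2], prefix + (2,))

def subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def graft(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = graft(tree[path[0]], path[1:], new)
    return tuple(parts)

def crossover(a, b, rng):
    """Copy a random subtree of b into a random position of a."""
    child = graft(a, rng.choice(list(paths(a))), subtree(b, rng.choice(list(paths(b)))))
    return child if tree_depth(child) <= MAX_DEPTH else a

def mutate(tree, rng):
    """Replace a random node with a freshly grown subtree."""
    child = graft(tree, rng.choice(list(paths(tree))), random_tree(rng, depth=2))
    return child if tree_depth(child) <= MAX_DEPTH else tree

def evolve(rows, labels, pop_size=30, generations=20, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Step 3.3: keep the fitter half, refill by crossover and mutation.
        population.sort(key=lambda t: fitness(t, rows, labels), reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            child = crossover(rng.choice(survivors), rng.choice(survivors), rng)
            if rng.random() < 0.2:
                child = mutate(child, rng)
            children.append(child)
        population = survivors + children
    return max(population, key=lambda t: fitness(t, rows, labels))

# Synthetic data, invented for this demo only.
demo = random.Random(1)
rows = [{f: demo.uniform(0.0, 5.0) for f in FEATURES} for _ in range(40)]
labels = [r["LA"] + r["NF"] > 5.0 for r in rows]
best = evolve(rows, labels)
print(best, fitness(best, rows, labels))
```

Because the result is an explicit expression tree over named features, it can be printed and read directly, which is the interpretability property the method relies on.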
According to the above scheme, the labeled data obtained in step 2) (each modification is labeled as defect-introducing or not) is used as a training set and processed as follows: to avoid data skew caused by the majority class (modifications that do not introduce defects), where an overly large class in the training set biases the result, modifications that do not introduce defects are randomly deleted until their number equals the number of defect-introducing modifications; for the features, LT is replaced by LT/NF, LA by LA/LT, and LD by LD/LT; finally, every feature value except FIX is increased by 1 and its base-2 logarithm is taken.
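The preprocessing described above can be sketched in Python as follows. The function names `undersample` and `transform` are hypothetical. Two points are assumptions rather than statements from the text: LA and LD are divided by the original LT (before LT itself is replaced by LT/NF), and a zero divisor leaves the value unchanged.

```python
import math
import random


def undersample(changes, labels, seed=0):
    """Randomly drop non-defect changes until both classes have equal size."""
    rng = random.Random(seed)
    buggy = [i for i, y in enumerate(labels) if y]
    clean = [i for i, y in enumerate(labels) if not y]
    keep = sorted(buggy + rng.sample(clean, min(len(buggy), len(clean))))
    return [changes[i] for i in keep], [labels[i] for i in keep]


def transform(change):
    """Normalise LT, LA, LD, then apply log2(x + 1) to every feature except FIX."""
    out = dict(change)
    lt_raw, nf = change["LT"], change["NF"]
    # Zero guards are an assumption; the text does not cover empty divisors.
    out["LT"] = lt_raw / nf if nf else lt_raw
    out["LA"] = change["LA"] / lt_raw if lt_raw else change["LA"]
    out["LD"] = change["LD"] / lt_raw if lt_raw else change["LD"]
    for name in list(out):
        if name != "FIX":
            out[name] = math.log2(out[name] + 1)
    return out


# Invented example values: LT = 300 and NF = 3 give LT/NF = 100.
raw = {"NS": 1, "NF": 3, "Entropy": 0.5, "LT": 300, "LA": 30, "LD": 0, "FIX": 1}
print(transform(raw))
```

The log transform compresses the long tail of size-related features, so that a few very large modifications do not dominate the evolved expressions.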
According to the above scheme, the set number of top-ranked modifications selected in step 4) is returned to the developer as follows:
all modification scores under the optimal solution are sorted, and the median score is selected as the threshold;
the score of each modification in the training data is calculated from its specific feature values and the optimal solution; a modification whose score is greater than the threshold has a predicted value of 1, and otherwise 0; the modifications are sorted in descending order of each modification's value divided by (LA + LD), and modifications are selected from the front of this ordering until the lines they modify account for 20% of the total number of modified lines; the selected modifications are the ones to be inspected.
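The threshold and budget rule above can be sketched in Python as follows. The function name `select_for_inspection` and the sample numbers are hypothetical. The sketch rests on three assumptions: the "value" being divided by (LA + LD) is read as the 0/1 predicted value, a modification with zero changed lines is given a cost of 1, and selection stops at the first modification that would push the inspected lines past the 20% budget.

```python
from statistics import median


def select_for_inspection(scores, la, ld, budget=0.20):
    """Return indices of modifications to inspect within the line budget."""
    threshold = median(scores)
    predicted = [1 if s > threshold else 0 for s in scores]
    # Inspection cost = lines added + lines deleted (floor of 1 is assumed).
    cost = [max(a + d, 1) for a, d in zip(la, ld)]
    # Descending predicted/cost; ties keep their input order.
    order = sorted(range(len(scores)),
                   key=lambda i: predicted[i] / cost[i], reverse=True)
    limit = budget * sum(cost)
    chosen, spent = [], 0
    for i in order:
        if spent + cost[i] > limit:
            break  # assumed stopping rule once the budget would be exceeded
        chosen.append(i)
        spent += cost[i]
    return chosen


# Invented sample: six modifications with their scores and changed lines.
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
la = [5, 50, 10, 20, 100, 15]
ld = [5, 0, 0, 0, 0, 0]
print(select_for_inspection(scores, la, ld))
```

In this sample the total cost is 205 lines, so the budget is 41 lines; the two small, high-scoring modifications fit, while the 100-line one does not.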
The invention has the following beneficial effects:
The invention uses a genetic algorithm grounded in biological theory to obtain an optimal solution (a mathematical expression). The optimal solution and its results help developers understand which features, when modified, are more likely to introduce defects, so the predicted results have good interpretability and are easy for developers to accept. The invention improves the interpretability of real-time defect prediction and effectively reduces the costs that enterprises or companies incur in managing software.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a flow chart of a method of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Software undergoes many modifications over its life cycle, and the purpose of the invention is to judge, immediately after the software is modified, which modifications are most likely to introduce defects. As shown in FIG. 1 and FIG. 2, the invention provides an interpretable method for real-time defect prediction of code modifications, comprising the following steps:
1) collecting historical modification information of the software code; the information is obtained from a code hosting system such as CVS, SVN, or Git;
the historical modification information includes the following feature data: NS, the number of subsystems involved in the modification; NF, the total number of files modified; Entropy, the distribution of the modification across the files; LT, the number of lines in the file before the modification; LA, the number of lines added by the modification; LD, the number of lines deleted by the modification; FIX, whether the current modification is intended to repair a defect; NEDV, the total number of developers who have modified the file; AGE, the time interval between the current modification and the previous one; NUC, the number of files involved in the current modification but not in the previous one; EXP, the developer's experience; SEXP, the developer's experience on the subsystem containing the modified file;
2) extracting, from the collected information, the feature data of the modifications and the specific feature values corresponding to that feature data;
the data is preprocessed: modifications that do not introduce defects are randomly deleted until their number equals the number of defect-introducing modifications; for the features, LT is replaced by LT/NF (for example, if LT is 300 and NF is 3, LT is replaced by 100, i.e. 300/3); LA is then replaced by LA/LT and LD by LD/LT in the same manner; finally, every feature value is increased by 1 and its base-2 logarithm is taken (FIX is excluded from this processing);
3) generating an initial population of feasible solutions by combining the extracted features, setting a fitness function, and evolving the initial population through repeated crossover and mutation while eliminating feasible solutions with low fitness;
3.1) from all the feature data and the four operations of addition, subtraction, multiplication, and division, selecting features and operators according to set probabilities to form a mathematically meaningful expression as a feasible solution, thereby generating an initial population consisting of a set of feasible solutions;
3.2) calculating the score of each modification from its specific feature values and the mathematical expression, sorting all modifications in ascending order of score, and calculating the sum of the serial numbers of the defect-introducing modifications; the larger the sum, the higher the fitness;
the fitness of each feasible solution is determined as follows: a score is calculated for each modification from the feasible solution and the modification's specific feature values; all modifications in the training set (whose labels are assigned manually) are sorted in ascending order of score; and the sum of the serial numbers of all defect-introducing modifications in the training set is the fitness of the feasible solution;
3.3) repeatedly applying crossover and mutation to the population, eliminating solutions with low fitness, and selecting the solution with the highest fitness in the last generation as the optimal feasible solution; the scores of all modifications under the optimal solution are then sorted, and the median score is selected as the threshold;
4) after iterating to the last generation and selecting the solution with the highest fitness as the optimal solution, calculating the score of each modification to be predicted from the optimal solution and that modification's specific feature values, dividing the score by the corresponding inspection cost (the number of lines added and deleted by the modification), sorting the modifications in descending order of the quotient, and returning the top-ranked modifications to the developer as the modifications to be inspected.
The specific method is as follows: the score of each modification in the training data is calculated from its specific feature values and the optimal solution; a modification whose score is greater than the threshold has a predicted value of 1, and otherwise 0; the modifications are sorted in descending order of each modification's value divided by (LA + LD), and modifications are selected from the front of this ordering until the lines they modify account for 20% of the total number of modified lines; the selected modifications are the ones to be inspected.
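The basic ranking in step 4), which divides each score by the inspection cost and returns a set number of top-ranked modifications, can be sketched in Python as follows. The function name `rank_by_score_per_cost`, the sample values, and the cost floor of 1 for a modification with no changed lines are assumptions made for illustration.

```python
def rank_by_score_per_cost(scores, costs, top_n):
    """Return the indices of the top_n modifications by score / inspection cost."""
    # Cost is LA + LD; the floor of 1 avoids division by zero (assumption).
    quotients = [s / max(c, 1) for s, c in zip(scores, costs)]
    # Descending quotient; ties keep their input order.
    order = sorted(range(len(scores)), key=lambda i: quotients[i], reverse=True)
    return order[:top_n]


# Invented sample: scores under the optimal solution and changed-line counts.
scores = [4.0, 9.0, 1.0, 6.0]
costs = [2, 90, 1, 3]
print(rank_by_score_per_cost(scores, costs, top_n=2))
```

Here the modification with the highest raw score (9.0) ranks last because inspecting its 90 changed lines is expensive, which illustrates why the quotient rather than the score drives the ordering.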
Based on real-time information about software modifications, the invention can judge which modifications are most likely to introduce defects and recommend them to the developer for inspection. The method can greatly reduce the effort programmers spend repairing defects.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.
Claims (3)
1. An interpretable method of code modification real-time defect prediction, comprising the steps of:
1) collecting historical modification information of the software code; the historical modification information includes the following feature data: NS, the number of subsystems involved in the modification; NF, the total number of files modified; Entropy, the distribution of the modification across the files; LT, the number of lines in the file before the modification; LA, the number of lines added by the modification; LD, the number of lines deleted by the modification; FIX, whether the current modification is intended to repair a defect; NEDV, the total number of developers who have modified the file; AGE, the time interval between the current modification and the previous one; NUC, the number of files involved in the current modification but not in the previous one; EXP, the developer's experience; SEXP, the developer's experience on the subsystem containing the modified file;
2) extracting, from the collected information, the feature data of the modifications and the specific feature values corresponding to that feature data;
3) generating an initial population of feasible solutions by combining the extracted features, setting a fitness function, and evolving the initial population through repeated crossover and mutation while eliminating feasible solutions with low fitness, to obtain an optimal feasible solution;
the specific method for obtaining the optimal feasible solution is as follows:
3.1) from all the feature data and the four operators of addition, subtraction, multiplication, and division, randomly selecting several features and combining them with the corresponding number of operators to form a mathematically meaningful expression as a feasible solution, thereby generating an initial population consisting of a set of feasible solutions;
3.2) calculating the score of each modification from its specific feature values and the mathematical expression, sorting all modifications in ascending order of score, and calculating the sum of the serial numbers of the defect-introducing modifications; the larger the sum, the higher the fitness;
3.3) repeatedly applying crossover and mutation to the population, eliminating solutions with low fitness, and selecting the solution with the highest fitness in the last generation as the optimal feasible solution;
4) after iterating to the last generation and selecting the solution with the highest fitness as the optimal solution, calculating the score of each modification to be predicted from the optimal solution and that modification's specific feature values, dividing the score by the corresponding inspection cost, sorting the modifications in descending order of the quotient, and returning a set number of the top-ranked modifications to the developer as the modifications to be inspected.
2. The interpretable method of code modification real-time defect prediction according to claim 1, wherein the data obtained in step 2) is further processed as follows:
the data obtained in step 2), labeled as to whether each modification introduces a defect, is used as a training set, and modifications in the training set that do not introduce defects are randomly deleted until their number equals the number of defect-introducing modifications; for the features, LT is replaced by LT/NF, LA by LA/LT, and LD by LD/LT; finally, every feature value except FIX, after this replacement, is increased by 1 and its base-2 logarithm is taken.
3. The interpretable method of code modification real-time defect prediction according to claim 1, wherein the set number of top-ranked modifications selected in step 4) is returned to the developer as follows:
all modification scores under the optimal solution are sorted, and the median score is selected as the threshold; the score of each modification in the training data is calculated from its specific feature values and the optimal solution; a modification whose score is greater than the threshold has a predicted value of 1, and otherwise 0; the modifications are sorted in descending order of each modification's value divided by (LA + LD), and modifications are selected from the front of this ordering until the lines they modify account for 20% of the total number of modified lines; the selected modifications are the ones to be inspected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010332906.9A CN111611010B (en) | 2020-04-24 | 2020-04-24 | Interpretable method for code modification real-time defect prediction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111611010A (en) | 2020-09-01 |
CN111611010B (en) | 2021-10-08 |
Family
ID=72194894
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010332906.9A Active CN111611010B (en) | 2020-04-24 | 2020-04-24 | Interpretable method for code modification real-time defect prediction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111611010B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9141920B2 (en) * | 2013-05-17 | 2015-09-22 | International Business Machines Corporation | Project modeling using iterative variable defect forecasts |
CN105302724A (en) * | 2015-11-05 | 2016-02-03 | 南京大学 | Instant defect predicting method based on mixed effect removing |
CN106991047A (en) * | 2017-03-27 | 2017-07-28 | 中国电力科学研究院 | A kind of method and system for being predicted to object-oriented software defect |
CN109948791A (en) * | 2017-12-21 | 2019-06-28 | 河北科技大学 | Utilize the method for genetic algorithm optimization BP neural network and its application in positioning |
CN110378513A (en) * | 2019-06-04 | 2019-10-25 | 中国人民解放军海军航空大学青岛校区 | A kind of optimal solution for project planning determines method and system, equipment, medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018116545A (en) * | 2017-01-19 | 2018-07-26 | オムロン株式会社 | Prediction model creating device, production facility monitoring system, and production facility monitoring method |
CN108665112A (en) * | 2018-05-16 | 2018-10-16 | 东华大学 | Photovoltaic fault detection method based on Modified particle swarm optimization Elman networks |
CN110471854B (en) * | 2019-08-20 | 2023-02-03 | 大连海事大学 | Defect report assignment method based on high-dimensional data hybrid reduction |
CN110610320A (en) * | 2019-09-18 | 2019-12-24 | 江苏满运软件科技有限公司 | Financial risk level prediction method, device, electronic equipment and storage medium |
-
2020
- 2020-04-24 CN CN202010332906.9A patent/CN111611010B/en active Active
Non-Patent Citations (1)
Title |
---|
Research progress on automatic program repair methods; Xuan Jifeng et al.; Journal of Software (《软件学报》); 20160430 (No. 04); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111611010A (en) | 2020-09-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108563555B (en) | Fault change code prediction method based on four-target optimization | |
CN112699052B (en) | Software test case evolution generation method based on relevant input variables | |
CN112070138A (en) | Multi-label mixed classification model construction method, news classification method and system | |
CN110414003B (en) | Method, device, medium and computing equipment for establishing text generation model | |
Nagwani et al. | Predicting expert developers for newly reported bugs using frequent terms similarities of bug attributes | |
CN112232564A (en) | Label processing device and method | |
CN111008706A (en) | Processing method for automatically labeling, training and predicting mass data | |
CN111611010B (en) | Interpretable method for code modification real-time defect prediction | |
CN112017730B (en) | Cell screening method and device based on expression quantity prediction model | |
CN112346974B (en) | Depth feature embedding-based cross-mobile application program instant defect prediction method | |
CN113010420A (en) | Method and terminal equipment for promoting collaborative evolution of test codes and product codes | |
Rosli et al. | The design of a software fault prone application using evolutionary algorithm | |
CN116756041A (en) | Code defect prediction and positioning method and device, storage medium and computer equipment | |
CN116932762A (en) | Small sample financial text classification method, system, medium and equipment | |
CN111459787A (en) | Test plagiarism detection method based on machine learning | |
Zhang et al. | Condition-guided adversarial generative testing for deep learning systems | |
CN115495085A (en) | Generation method and device based on deep learning fine-grained code template | |
CN112148605A (en) | Software defect prediction method based on spectral clustering and semi-supervised learning | |
Ronchieri et al. | Sentiment analysis for software code assessment | |
CN113656279B (en) | Code odor detection method based on residual network and metric attention mechanism | |
CN115762683B (en) | Method and device for processing fuel cell design data and electronic equipment | |
CN117807545B (en) | Abnormality detection method and system based on data mining | |
CN117193733B (en) | Method for constructing and using example library and method for evaluating generated example code | |
CN116681266B (en) | Production scheduling method and system of mirror surface electric discharge machine | |
Patnaik et al. | Sentiment Analysis of Software Project Code Commits |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||