CN111611010B - Interpretable method for code modification real-time defect prediction - Google Patents

Interpretable method for code modification real-time defect prediction

Info

Publication number
CN111611010B
Authority
CN
China
Prior art keywords
modification
solution
modified
modifications
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010332906.9A
Other languages
Chinese (zh)
Other versions
CN111611010A (en)
Inventor
玄跻峰
程航远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202010332906.9A
Publication of CN111611010A
Application granted
Publication of CN111611010B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/72 Code refactoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Physiology (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an interpretable method for code modification real-time defect prediction, comprising the following steps: 1) collecting historical modification information of the software code; 2) extracting, from the collected information, the features of each modification and the specific values of those features; 3) generating an initial population of feasible solutions by combining the extracted features, setting a fitness function, and evolving the population through repeated crossover and mutation while eliminating feasible solutions with low fitness, to obtain an optimal feasible solution; 4) calculating a score for each modification to be predicted from the optimal solution and that modification's feature values, dividing the score by the corresponding inspection cost, sorting the modifications in descending order of the quotient, and returning a set number of the top-ranked modifications to the developer as the modifications to be inspected. The predictions are interpretable and therefore easier for developers to accept, and they help programmers reduce the cost of repairing defects during software maintenance.

Description

Interpretable method for code modification real-time defect prediction
Technical Field
The invention relates to computer software technology, and in particular to an interpretable method for code modification real-time defect prediction.
Background
Society today is highly computerized and software is ubiquitous: from multinational corporations to small companies, organizations depend on software of every size to operate efficiently, so ensuring software quality is one of their major concerns. Over its life cycle, software is modified many times to change functionality or repair defects, and each modification carries the risk of introducing new defects. Repairing defects is labor-intensive: a latent defect is easier to repair the earlier it is discovered, and harder to repair the later the defect-causing code is found. Therefore, if it can be predicted whether a modification introduces a defect, the range in which defects must be located can be narrowed and costs can be substantially reduced.
In the field of defect prediction, real-time defect prediction (just-in-time defect prediction) has shown clear advantages. After software is modified, real-time defect prediction uses the modification information to predict promptly whether the modification introduces a defect. The technique finds defects early, narrows the range in which defects must be located, helps programmers locate them, saves labor, and aims to find as many defects as possible within a limited cost. It therefore has broad application prospects and is favored by many Internet companies; Cisco, for example, has introduced it into its own software development.
Despite these advantages, current commercial real-time defect prediction techniques lack good interpretability, which limits their application. This patent provides an interpretable method: a genetic programming algorithm is used to improve the fitness of feasible solutions and thereby find a suitable optimal solution, and the optimal solution is then used to find the modifications most likely to introduce defects and recommend them to developers for inspection. Here, a feasible solution is a well-formed mathematical expression assembled by selecting, with certain probabilities, code-modification features, operators, and constants between -1 and 1. The fitness is defined as follows: a score is calculated for each modification from the feasible solution and the modification's feature values, all modifications are sorted in ascending order of score, and the sum of the rank positions of the defect-introducing modifications is the fitness. Unlike black-box machine learning, the method uses a biologically inspired genetic algorithm to obtain an optimal solution in the form of a mathematical expression. The optimal solution and its results help developers understand which features make a modification more likely to introduce defects, so the algorithm is interpretable and easier for developers to accept.
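The fitness described above can be written compactly. The symbols used here (e for a feasible solution, c_i for the i-th modification, x_{i,j} for its feature values, y_i for its label, r_e for the rank) are notation introduced for illustration and do not appear in the patent:

```latex
\[
  s_e(c_i) = e\left(x_{i,1}, \dots, x_{i,m}\right), \qquad
  \operatorname{fitness}(e) = \sum_{i=1}^{n} y_i \, r_e(c_i)
\]
```

Here r_e(c_i) is the position (from 1 to n) of modification c_i when all n modifications are sorted in ascending order of the score s_e, and y_i is 1 if c_i introduced a defect and 0 otherwise. The sum is largest when the defect-introducing modifications receive the highest scores, which is why a larger sum means a fitter expression.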
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the shortcomings of the prior art, an interpretable method for code modification real-time defect prediction.
The technical solution adopted by the invention to solve this problem is an interpretable method for code modification real-time defect prediction, comprising the following steps:
1) collecting historical modification information of the software code; the information is obtained from a code hosting system such as CVS, SVN, or Git;
the historical modification information includes the following feature data: NS, the number of subsystems involved in the modification; NF, the total number of files modified; Entropy, the distribution of the modification across the files; LT, the number of lines in the file before the modification; LA, the number of lines added by the modification; LD, the number of lines deleted by the modification; FIX, whether the modification is intended to repair a defect; NEDV, the total number of developers who have taken part in modifying the file; AGE, the time interval between the current modification and the previous one; NUC, the number of files involved in the current modification but not involved in the previous one; EXP, the developer's experience; SEXP, the developer's development experience on the subsystem containing the modified file;
2) extracting, from the collected information, the features of each modification and the specific values of those features;
3) generating an initial population of feasible solutions by combining the extracted features, setting a fitness function, and evolving the population through repeated crossover and mutation while eliminating feasible solutions with low fitness;
3.1) from all the feature data and the four operators of addition, subtraction, multiplication and division, randomly selecting several features (at least 2; a feature may be selected more than once) and combining them with the corresponding number of operators to form a well-formed mathematical expression as a feasible solution, thereby generating an initial population made up of a set of feasible solutions;
3.2) calculating the score of each modification from its feature values and the mathematical expression, sorting all modifications in ascending order of score, and summing the rank positions of the defect-introducing modifications; the larger the sum, the higher the fitness;
3.3) repeatedly applying crossover and mutation to the population, eliminating solutions with low fitness, and selecting the solution with the highest fitness in the last generation as the optimal feasible solution;
4) after iterating to the last generation and selecting the solution with the highest fitness as the optimal solution, calculating a score for each modification to be predicted from the optimal solution and that modification's feature values, dividing the score by the corresponding inspection cost (the total number of lines the modification adds and deletes), sorting the modifications in descending order of the quotient, and returning a set number of the top-ranked modifications to the developer as the modifications to be inspected.
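Steps 3.1) and 3.2) can be illustrated with a minimal Python sketch. It assumes a feasible solution is stored as a nested tuple (operator, left, right) whose leaves are feature names or constants; this representation, the function names, and the protected division are assumptions made for illustration, since the patent does not say how an expression is stored or how division by zero is handled.

```python
import operator

def protected_div(a, b):
    # Assumption: dividing by (near) zero returns 1.0, a common genetic
    # programming convention; the patent does not cover this case.
    return a / b if abs(b) > 1e-9 else 1.0

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": protected_div}

def evaluate(expr, features):
    """Score one modification: expr is a feature name, a constant, or (op, left, right)."""
    if isinstance(expr, str):
        return features[expr]
    if isinstance(expr, (int, float)):
        return float(expr)
    op, left, right = expr
    return OPS[op](evaluate(left, features), evaluate(right, features))

def rank_sum_fitness(expr, changes, labels):
    """Sort modifications by ascending score and sum the ranks of the defective ones."""
    scores = [evaluate(expr, c) for c in changes]
    # Ties keep their original order because sorted() is stable (an assumption).
    order = sorted(range(len(changes)), key=lambda i: scores[i])
    return sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
```

With three modifications of which only the second is defective, the expression ("+", "LA", "NF") reaches the maximum fitness of 3 whenever it gives that modification the highest score, so the evolved expression itself shows which features push a modification toward the top of the list.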
According to the above scheme, the labeled data obtained in step 2) (each modification is labeled as to whether it introduced a defect) is used as the training set and is processed as follows: to avoid data skew caused by the majority class (modifications that do not introduce defects), where an over-represented class in the training set biases the result, modifications that do not introduce defects are randomly deleted until their number equals the number of modifications that introduce defects; for the features, LT is replaced by LT/NF, LA by LA/LT, and LD by LD/LT; finally, 1 is added to the value of every feature except FIX and the base-2 logarithm is taken.
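The preprocessing just described can be sketched as follows. The function names and the dictionary layout are hypothetical. Two further points are assumptions because the text leaves them open: LA and LD are divided by the original LT (not by LT/NF), and a zero denominator yields 0.0.

```python
import math
import random

def balance_by_undersampling(changes, labels, seed=0):
    """Randomly drop non-defective modifications until both classes have the same size."""
    rng = random.Random(seed)
    buggy = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    if len(clean) > len(buggy):
        clean = rng.sample(clean, len(buggy))
    keep = sorted(buggy + clean)  # preserve the original order
    return [changes[i] for i in keep], [labels[i] for i in keep]

def transform_features(raw):
    """Apply LT/NF, LA/LT and LD/LT, then log2(x + 1) to every feature except FIX."""
    lt, nf = raw["LT"], raw["NF"]
    scaled = dict(raw)
    # Assumption: original LT as denominator; zero denominators map to 0.0.
    scaled["LT"] = lt / nf if nf else 0.0
    scaled["LA"] = raw["LA"] / lt if lt else 0.0
    scaled["LD"] = raw["LD"] / lt if lt else 0.0
    return {k: v if k == "FIX" else math.log2(v + 1) for k, v in scaled.items()}
```

For a modification with LT = 300 and NF = 3, LT first becomes 100 and is then stored as log2(101); FIX is passed through unchanged.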
According to the above scheme, the set number of top-ranked modifications in step 4) is selected and returned to the developer as follows:
the scores of all modifications under the optimal solution are sorted, and the median score is taken as the threshold;
the score of each modification in the training data is calculated from its feature values and the optimal solution; a modification whose score is greater than the threshold is given a predicted value of 1, and otherwise 0; the modifications are sorted in descending order of each modification's value divided by (LA + LD), and modifications are selected from the top of the list until the lines they modify account for 20% of all modified lines; the selected modifications are the ones to be inspected.
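The thresholding and the 20% selection can be sketched as below. The function names and the dictionary keys 'id', 'value', 'LA' and 'LD' are hypothetical. The text does not make clear whether the value divided by (LA + LD) is the raw score or the 0/1 predicted value, so the sketch accepts either under 'value'. Stopping at the first modification that would exceed the budget, and treating a zero-line modification as costing one line when ranking, are also assumptions.

```python
import statistics

def predict_with_median_threshold(train_scores, scores):
    """Label a modification 1 if its score exceeds the median training score, else 0."""
    threshold = statistics.median(train_scores)
    return [1 if s > threshold else 0 for s in scores]

def select_for_inspection(changes, budget_ratio=0.2):
    """Pick top-ranked modifications until 20% of all modified lines are covered."""
    total_lines = sum(c["LA"] + c["LD"] for c in changes)
    budget = budget_ratio * total_lines
    # Rank by value per inspected line; max(..., 1) guards against zero-line changes.
    ranked = sorted(changes, key=lambda c: c["value"] / max(c["LA"] + c["LD"], 1), reverse=True)
    selected, used = [], 0
    for c in ranked:
        cost = c["LA"] + c["LD"]
        if used + cost > budget:
            break  # assumption: stop at the first change that would exceed the budget
        selected.append(c["id"])
        used += cost
    return selected
```

Dividing by LA + LD favors small modifications that are predicted to be defective, so more suspicious modifications fit into the same inspection budget. LA and LD here are the raw line counts, not the log-transformed features.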
The invention has the following beneficial effects:
The invention uses a biologically inspired genetic algorithm to obtain an optimal solution in the form of a mathematical expression. The optimal solution and its results help developers understand which features make a modification more likely to introduce defects, so the predictions are interpretable and easier for developers to accept. The invention improves the interpretability of real-time defect prediction and reduces the cost that enterprises incur in managing software.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a flow chart of a method of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Every piece of software undergoes many modifications during its life cycle, and the invention aims to determine, immediately after the software is modified, which modifications are most likely to introduce defects. As shown in FIG. 1 and FIG. 2, the invention provides an interpretable method for code modification real-time defect prediction, comprising the following steps:
1) collecting historical modification information of the software code; the information is obtained from a code hosting system such as CVS, SVN, or Git;
the historical modification information includes the following feature data: NS, the number of subsystems involved in the modification; NF, the total number of files modified; Entropy, the distribution of the modification across the files; LT, the number of lines in the file before the modification; LA, the number of lines added by the modification; LD, the number of lines deleted by the modification; FIX, whether the modification is intended to repair a defect; NEDV, the total number of developers who have taken part in modifying the file; AGE, the time interval between the current modification and the previous one; NUC, the number of files involved in the current modification but not involved in the previous one; EXP, the developer's experience; SEXP, the developer's development experience on the subsystem containing the modified file;
2) extracting, from the collected information, the features of each modification and the specific values of those features;
the data is then preprocessed: modifications that do not introduce defects are randomly deleted until their number equals the number of modifications that introduce defects; for the features, LT is replaced by LT/NF (for example, if LT is 300 and NF is 3, LT is replaced by 300/3 = 100); LA is then replaced by LA/LT and LD by LD/LT in the same way; finally, 1 is added to the value of every feature and the base-2 logarithm is taken (FIX is excluded from this processing);
3) generating an initial population of feasible solutions by combining the extracted features, setting a fitness function, and evolving the population through repeated crossover and mutation while eliminating feasible solutions with low fitness;
3.1) from all the feature data and the four operations of addition, subtraction, multiplication and division, selecting features and operators with set probabilities to form a well-formed mathematical expression as a feasible solution, thereby generating an initial population made up of a set of feasible solutions;
3.2) calculating the score of each modification from its feature values and the mathematical expression, sorting all modifications in ascending order of score, and summing the rank positions of the defect-introducing modifications; the larger the sum, the higher the fitness;
the fitness of each feasible solution is obtained as follows: a score is calculated for each modification from the feasible solution and the modification's feature values; all modifications in the training set (whose labels are assigned manually) are sorted in ascending order of score; the sum of the rank positions of all defect-introducing modifications in the training set is the fitness of the feasible solution;
3.3) repeatedly applying crossover and mutation to the population, eliminating solutions with low fitness, and selecting the solution with the highest fitness in the last generation as the optimal feasible solution; the scores of all modifications under the optimal solution are then sorted, and the median score is taken as the threshold.
4) after iterating to the last generation and selecting the solution with the highest fitness as the optimal solution, calculating a score for each modification to be predicted from the optimal solution and that modification's feature values, dividing the score by the corresponding inspection cost (the number of lines the modification adds and deletes), sorting the modifications in descending order of the quotient, and returning the top-ranked modifications to the developer as the modifications to be inspected.
The specific method is as follows: the score of each modification in the training data is calculated from its feature values and the optimal solution; a modification whose score is greater than the threshold is given a predicted value of 1, and otherwise 0; the modifications are sorted in descending order of each modification's value divided by (LA + LD), and modifications are selected from the top of the list until the lines they modify account for 20% of all modified lines; the selected modifications are the ones to be inspected.
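The crossover, mutation and elimination of step 3.3) can be sketched as a small genetic programming loop. Everything specific here is an assumption made for illustration: expressions are nested tuples (operator, left, right), only a subset of the features is used, the fitter half of each generation survives, and the population size, number of generations and mutation rate are arbitrary. `fitness_fn` is a hypothetical callable that scores one expression, for example the rank-sum fitness of step 3.2).

```python
import random

FEATURES = ["NS", "NF", "LT", "LA", "LD"]  # illustrative subset of the features
OPERATORS = ["+", "-", "*", "/"]

def random_tree(rng, depth=2, leaf_prob=0.3):
    """Grow a random expression; a leaf is a feature name or a constant in [-1, 1]."""
    if depth == 0 or rng.random() < leaf_prob:
        return rng.choice(FEATURES) if rng.random() < 0.8 else rng.uniform(-1.0, 1.0)
    return (rng.choice(OPERATORS), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def paths(tree, prefix=()):
    """Yield the index path to every node of the tree, root first."""
    yield prefix
    if isinstance(tree, tuple):
        yield from paths(tree[1], prefix + (1,))
        yield from paths(tree[2], prefix + (2,))

def subtree(tree, path):
    """Return the node reached by following path from the root."""
    for index in path:
        tree = tree[index]
    return tree

def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)

def crossover(a, b, rng):
    """Replace a random non-root subtree of a with a random subtree of b."""
    target = rng.choice([p for p in paths(a) if p])
    donor = rng.choice(list(paths(b)))
    return replace(a, target, subtree(b, donor))

def mutate(tree, rng):
    """Replace a random non-root subtree with a freshly grown one."""
    target = rng.choice([p for p in paths(tree) if p])
    return replace(tree, target, random_tree(rng, depth=1))

def evolve(fitness_fn, pop_size=20, generations=10, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # leaf_prob=0.0 forces an operator at the root, so every expression has two operands.
    population = [random_tree(rng, depth=2, leaf_prob=0.0) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness_fn, reverse=True)
        survivors = population[: pop_size // 2]  # eliminate the low-fitness half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = crossover(a, b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        population = survivors + children
    return max(population, key=fitness_fn)
```

Because the survivors are carried over unchanged, the best fitness never decreases from one generation to the next. The sketch does not limit tree depth, so a practical version would probably cap expression size to keep the optimal solution readable.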
Based on real-time information about software modifications, the invention determines which modifications are most likely to introduce defects and recommends them to developers for inspection. The method can greatly reduce the effort programmers spend on repairing defects.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (3)

1. An interpretable method of code modification real-time defect prediction, comprising the steps of:
1) collecting historical modification information of the software code; the historical modification information includes the following feature data: NS, the number of subsystems involved in the modification; NF, the total number of files modified; Entropy, the distribution of the modification across the files; LT, the number of lines in the file before the modification; LA, the number of lines added by the modification; LD, the number of lines deleted by the modification; FIX, whether the modification is intended to repair a defect; NEDV, the total number of developers who have taken part in modifying the file; AGE, the time interval between the current modification and the previous one; NUC, the number of files involved in the current modification but not involved in the previous one; EXP, the developer's experience; SEXP, the developer's development experience on the subsystem containing the modified file;
2) extracting, from the collected information, the features of each modification and the specific values of those features;
3) generating an initial population of feasible solutions by combining the extracted features, setting a fitness function, and evolving the population through repeated crossover and mutation while eliminating feasible solutions with low fitness, to obtain an optimal feasible solution;
the optimal feasible solution is obtained as follows:
3.1) from all the feature data and the four operators of addition, subtraction, multiplication and division, randomly selecting several features and combining them with the corresponding number of operators to form a well-formed mathematical expression as a feasible solution, thereby generating an initial population made up of a set of feasible solutions;
3.2) calculating the score of each modification from its feature values and the mathematical expression, sorting all modifications in ascending order of score, and summing the rank positions of the defect-introducing modifications; the larger the sum, the higher the fitness;
3.3) repeatedly applying crossover and mutation to the population, eliminating solutions with low fitness, and selecting the solution with the highest fitness in the last generation as the optimal feasible solution;
4) after iterating to the last generation and selecting the solution with the highest fitness as the optimal solution, calculating a score for each modification to be predicted from the optimal solution and that modification's feature values, dividing the score by the corresponding inspection cost, sorting the modifications in descending order of the quotient, and returning a set number of the top-ranked modifications to the developer as the modifications to be inspected.
2. The interpretable method of code modification real-time defect prediction according to claim 1, wherein the data obtained in step 2) is further processed as follows:
the data obtained in step 2), labeled as to whether each modification introduces a defect, is used as a training set, and modifications in the training set that do not introduce defects are randomly deleted until their number equals the number of modifications that introduce defects; for the features, LT is replaced by LT/NF, LA by LA/LT, and LD by LD/LT; finally, 1 is added to the resulting value of every feature except FIX and the base-2 logarithm is taken.
3. The interpretable method of code modification real-time defect prediction according to claim 1, wherein the set number of top-ranked modifications in step 4) is selected and returned to the developer as follows:
the scores of all modifications under the optimal solution are sorted, and the median score is taken as the threshold; the score of each modification in the training data is calculated from its feature values and the optimal solution; a modification whose score is greater than the threshold is given a predicted value of 1, and otherwise 0; the modifications are sorted in descending order of each modification's value divided by (LA + LD), and modifications are selected from the top of the list until the lines they modify account for 20% of all modified lines; the selected modifications are the ones to be inspected.
CN202010332906.9A 2020-04-24 2020-04-24 Interpretable method for code modification real-time defect prediction Active CN111611010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010332906.9A CN111611010B (en) 2020-04-24 2020-04-24 Interpretable method for code modification real-time defect prediction

Publications (2)

Publication Number Publication Date
CN111611010A CN111611010A (en) 2020-09-01
CN111611010B true CN111611010B (en) 2021-10-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant