WO2021093140A1 - Cross-project software defect prediction method and system - Google Patents

Cross-project software defect prediction method and system

Info

Publication number
WO2021093140A1
Authority
WO
WIPO (PCT)
Prior art keywords
project
instance
defects
test set
value
Prior art date
Application number
PCT/CN2020/070199
Other languages
English (en)
French (fr)
Inventor
徐小龙
封功业
Original Assignee
南京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学
Publication of WO2021093140A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging
    • G06F11/366 Software debugging using diagnostics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification

Definitions

  • The invention belongs to the field of software engineering and specifically relates to a cross-project software defect prediction method and system.
  • IEEE 24765-2017 gives a standard definition of defects: viewed from inside the product, defects are the errors, faults and other problems that arise during the development or maintenance of a software product; viewed from outside the product, a defect is the failure or violation of some function that the system is required to implement. Hidden defects in software may therefore cause unpredictable consequences at run time, ranging from a slight impact on software quality to a threat to people's lives and property. From the perspective of the software itself, teamwork, and technical issues, software defects arise mainly from the characteristics of the software product and of the development process, and their existence is unavoidable.
  • Software defect prediction is a technique for effectively mining the potential defects that may remain undiscovered in software, together with their distribution.
  • A software defect prediction method builds a defect prediction model by mining software history repositories and then uses it to predict defects in new program modules.
  • Program modules can be set to packages, files, classes or functions according to actual testing needs. When testing resources are sufficient, the technique can be used to check whether each program module still contains defects; when testing resources are insufficient, it can be used to allocate resources sensibly so as to find as many defects as possible. It is essential for improving software quality, reducing software development cost and risk, and improving the development process, and it has been one of the research hotspots in software engineering data mining in recent years.
  • The target project that needs defect prediction may be a newly started project, or the training data already available for it may be scarce.
  • Software metric tools (such as the Understand tool) can collect software metric information for the program modules in a project automatically and fairly easily.
  • Deciding whether those modules contain defects, however, requires domain experts to analyze in depth the defect reports in the project's defect tracking system and the code modification logs in the version control system.
  • Labeling module types is therefore expensive and error-prone.
  • A simple solution is to directly use high-quality data sets already collected from other projects (i.e., source projects) to build a defect prediction model for the target project.
  • The characteristics of different projects (such as the application domain, the development process, the programming language, or the experience of the developers) are not the same.
  • As a result, the metric values of the source-project and target-project data sets are distributed very differently, which makes it difficult to satisfy the assumption of independent and identically distributed data. How to transfer knowledge relevant to the target project out of the source projects when building the defect prediction model is therefore a research challenge; it has attracted the attention of researchers at home and abroad, who call it the cross-project defect prediction problem.
  • Researchers generally use transfer learning to mitigate the difference in the distribution of data values. Transfer learning transforms the source project data set, learns from it, and extracts the knowledge most relevant to the target project for use in model construction.
  • The present invention provides a multi-source cross-project software defect prediction method and system that marks the severity of defects. Building on the strengths of the cross-project approach, it combines the advantages of naive Bayes and nearest neighbor to predict defects in the target software project.
  • A cross-project software defect prediction method includes the following steps:
  • Step 1: Select from the software defect database all projects that are different from the target project T and merge them into one source project S; use the source project S as the training set and the target project T as the test set;
  • Step 2: Normalize each feature column of the training set and the test set with a transformation that applies min-max scaling followed by a natural logarithm, to obtain a new training set P and a new test set Q;
  • Step 3: Use the training set P to construct a naive Bayes classifier and predict the test set Q; the naive Bayes classifier outputs the possibility value a that each instance in the test set Q is defective. Use the training set P to construct a nearest neighbor classifier and predict the test set Q; the nearest neighbor classifier outputs the possibility value b that each instance in the test set Q is defective;
  • Step 4: Use the possibility value a and the possibility value b to mark all instances in the test set Q and obtain a marking result c. A marking value of 0 indicates that the instance has no defects, 0.5 indicates that the instance has ordinary defects, and 1 indicates that the instance has serious defects;
  • Step 5: Determine from the marking result c whether the instance is defective.
  • The source project must not contain data from the same project as the target project.
  • In step 2, formula (1) is used to normalize each feature column of the training set (the formula is an image and is not reproduced in this extraction).
  • The vector S_j is the j-th metric in the source project S; the symbol for the value of that metric in the i-th program module is missing from this extraction; max(S_j) and min(S_j) are the maximum and minimum values in the vector S_j, respectively.
  • Likewise, formula (1) is used to normalize each feature column of the test set to generate a new test set Q.
  • In step 3, formula (2) is used to calculate the possibility value a:
  • The input space χ is a set of n-dimensional vectors, and the output space is the class label set ψ = {0, 1}.
  • The input is a feature vector x ∈ χ, x = (x_1, x_2, ..., x_n), that is, an instance in the test set Q; the output is a class label c_k ∈ ψ.
  • X is a random vector defined on the input space χ, and Y is a random variable defined on the output space ψ.
  • P(X,Y) is the joint probability distribution of X and Y.
  • The possibility value b in step 3 is calculated as follows:
  • The nearest neighbor classifier measures distance with the Euclidean distance (the distance formula is an image and is not reproduced in this extraction).
  • I is an indicator function: I is 1 when a ≥ 0.5, and 0 otherwise.
  • The present invention also discloses a prediction system for the cross-project software defect prediction method, including:
  • a source project integration module, used to merge all projects selected from the software defect database that are different from the target project T, to obtain the source project;
  • a normalization module, used to normalize each feature column of the source project and the target project to obtain the training set P and the test set Q;
  • a naive Bayes classifier, used to predict the test set Q and output the possibility value a that each instance in the test set Q is defective;
  • a nearest neighbor classifier, used to predict the test set Q and output the possibility value b that each instance in the test set Q is defective;
  • a marking module, used to mark all instances in the test set Q with the possibility value a and the possibility value b to obtain the marking result;
  • a display module, used to display the defect level of each instance according to the marking result: no defects, ordinary defects, or serious defects.
  • The training set P is used to construct the naive Bayes classifier.
  • The training set P is used to construct the nearest neighbor classifier.
  • the present invention has the following advantages:
  • the cross-project software defect prediction method designed by the present invention has a simple algorithm structure and low time complexity.
  • Figure 1 is a schematic flow chart of a cross-project software defect prediction method designed by the present invention
  • Figure 2 is a schematic flow chart of the method for marking the defect severity of the target instance.
  • The multi-source cross-project software defect prediction method and system of the present invention, which marks the severity of defects, combines the strengths of the cross-project approach with the advantages of naive Bayes and nearest neighbor: the historical samples of all projects other than the target project are merged into the training set, defect severity is taken into account as an attribute, and a method for marking the defect severity of instances is proposed.
  • the cross-project software defect prediction method of this embodiment is used for defect prediction for a target software project.
  • the actual application process specifically includes the following steps:
  • Step 1: Select from the software defect database all projects that are different from the target project, merge them into one source project S, use the source project S as the training set and the target project T as the test set, and proceed to step 2. Here "different from" means that the training set must not contain data from the same project as the test set.
  • Take the PROMISE database as an example, whose statistics are shown in Table 1. If the target project is ant-1.3, the source project S must not contain the labels of any instance in ant-1.4, ant-1.5, ant-1.6, or ant-1.7.
  • the Dataset column is the name of each software project data set in the Promise software defect database
  • #Class column is the number of class files in the corresponding software project data set
  • #Defect column is the number of defect classes in the corresponding software project data set.
  • Step 2: Normalize each feature column of the training set S and the test set T as follows to obtain a new training set P and a new test set Q, then proceed to step 3;
  • The vector S_j is the j-th metric in the source project S; the symbol for the value of that metric in the i-th program module is missing from this extraction; max(S_j) and min(S_j) are the maximum and minimum values in the vector S_j, respectively.
  • Step 3: Use the training set P to construct a naive Bayes classifier and predict the test set Q.
  • The naive Bayes classifier outputs the possibility value a that each instance in the test set is defective, then proceed to step 5;
  • The input space χ is a set of n-dimensional vectors.
  • X is a random vector defined on the input space χ.
  • Y is a random variable defined on the output space ψ.
  • P(X,Y) is the joint probability distribution of X and Y.
  • Step 4: Use the training set P to construct the nearest neighbor classifier and predict the test set Q; the classifier outputs the possibility value b that each instance in the test set is defective, then proceed to step 5;
  • The nearest neighbor classifier measures distance with the Euclidean distance (the distance formula is an image and is not reproduced in this extraction).
  • Using the given distance metric, find in the training set P the instance vector x_t that is the nearest neighbor of each instance x in the test set Q.
  • The class of that instance is y_t, and the value b is given by formula (4): b = y_t.
  • Step 5: As shown in Figure 2, use the values of a and b to mark all instances in the test set.
  • The marking values are 0, 0.5, and 1.
  • The size of the marking value indicates the severity of the defect. The marking result c is obtained from formula (5) (an image that is not reproduced in this extraction).
  • I is an indicator function: I is 1 when a ≥ 0.5, and 0 otherwise.
  • Step 6: If the marking result c of an instance is 0, the instance is predicted to have no defects; otherwise, it is predicted to be defective.
  • a source project integration module, used to merge all projects selected from the software defect database that are different from the target project T, to obtain the source project;
  • a normalization module, used to normalize each feature column of the source project and the target project to obtain the training set P and the test set Q;
  • a naive Bayes classifier, used to predict the test set Q and output the possibility value a that each instance in the test set Q is defective;
  • a nearest neighbor classifier, used to predict the test set Q and output the possibility value b that each instance in the test set Q is defective;
  • a marking module, used to mark all instances in the test set Q with the possibility value a and the possibility value b to obtain the marking result;
  • a display module, used to display the defect level of each instance according to the marking result: no defects, ordinary defects, or serious defects.
  • The training set P is used to construct both the naive Bayes classifier and the nearest neighbor classifier.
  • The embodiments of this application can be provided as methods, systems, or computer program products. This application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device; the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

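A cross-project software defect prediction method and system. Building on the strengths of the cross-project approach, it combines the advantages of naive Bayes and nearest neighbor. The historical samples of all projects other than the target project are merged into a training set. Drawing on the characteristics of the naive Bayes model, all instances in the target project are first marked with 0 and 0.5, which divides the target project into two classes; the size of the value indicates the severity of the defect. Using the properties of the nearest neighbor model, the target instances are then marked a second time, so that every instance is marked 0, 0.5 or 1. Whether a target instance is defective is predicted from the marking result.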

Description

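Cross-project software defect prediction method and system
Technical Field
The invention belongs to the field of software engineering and specifically relates to a cross-project software defect prediction method and system.
Background
IEEE 24765-2017 gives a standard definition of defects: viewed from inside the product, defects are the errors, faults and other problems that arise during the development or maintenance of a software product; viewed from outside the product, a defect is the failure or violation of some function that the system is required to implement. Hidden defects in software may therefore cause unpredictable consequences at run time, ranging from a slight impact on software quality to a threat to people's lives and property. From the perspective of the software itself, teamwork, and technical issues, software defects arise mainly from the characteristics of the software product and of the development process, and their existence is unavoidable.
Although defects are difficult to eliminate entirely, they can be analyzed and monitored to keep them to a minimum. Software defect prediction is a technique for effectively mining the potential defects that may remain undiscovered in software, together with their distribution. A software defect prediction method builds a defect prediction model by mining software history repositories and then uses it to predict defects in new program modules. Program modules can be set to packages, files, classes or functions according to actual testing needs. When testing resources are sufficient, the technique can be used to check whether each program module still contains defects; when testing resources are insufficient, it can be used to allocate resources sensibly so as to find as many defects as possible. It is essential for improving software quality, reducing software development cost and risk, and improving the development process, and it has been one of the research hotspots in software engineering data mining in recent years.
Most current research focuses on within-project defect prediction: part of a project's data set is selected as the training set to build the model, and the remaining data serves as the test set to measure the model's predictive performance. In real software development, however, the target project that needs defect prediction may be newly started, or the training data already available for it may be scarce. When collecting training data for defect prediction, software metric tools (such as the Understand tool) make it fairly easy to collect the software metrics of a project's program modules automatically, but deciding afterwards whether those modules contain defects requires domain experts to analyze in depth the defect reports in the project's defect tracking system and the code modification logs in the version control system. Labeling module types is therefore expensive and error-prone.
A simple solution is to directly use high-quality data sets already collected from other projects (i.e., source projects) to build a defect prediction model for the target project. But the characteristics of different projects (such as the application domain, the development process, the programming language, or the experience of the developers) are not the same, so the metric values of the source-project and target-project data sets are distributed very differently, which makes it difficult to satisfy the assumption of independent and identically distributed data. How to transfer knowledge relevant to the target project out of the source projects when building the defect prediction model is therefore a research challenge; it has attracted the attention of researchers at home and abroad, who call it the cross-project defect prediction problem. To address it, researchers generally use transfer learning to mitigate the difference in the distribution of data values. Transfer learning transforms the source project data set, learns from it, and extracts the knowledge most relevant to the target project for use in model construction.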
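Summary of the Invention
Purpose of the invention: To solve the problems in the prior art, the present invention provides a multi-source cross-project software defect prediction method and system that marks the severity of defects. Building on the strengths of the cross-project approach, it combines the advantages of naive Bayes and nearest neighbor to predict defects in the target software project.
Technical solution: A cross-project software defect prediction method, including the following steps:
Step 1: Select from the software defect database all projects that are different from the target project T and merge them into one source project S; use the source project S as the training set and the target project T as the test set;
Step 2: Normalize each feature column of the training set and the test set with a transformation that applies min-max scaling followed by a natural logarithm, to obtain a new training set P and a new test set Q;
Step 3: Use the training set P to construct a naive Bayes classifier and predict the test set Q; the naive Bayes classifier outputs the possibility value a that each instance in the test set Q is defective. Use the training set P to construct a nearest neighbor classifier and predict the test set Q; the nearest neighbor classifier outputs the possibility value b that each instance in the test set Q is defective;
Step 4: Use the possibility value a and the possibility value b to mark all instances in the test set Q and obtain a marking result c. A marking value of 0 indicates that the instance has no defects, 0.5 indicates that the instance has ordinary defects, and 1 indicates that the instance has serious defects;
Step 5: Determine from the marking result c whether the instance is defective.
Further, the source project must not contain data from the same project as the target project.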
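Further, in step 2, formula (1) is used to normalize each feature column of the training set;
Figure PCTCN2020070199-appb-000001
where the vector S_j is the j-th metric in the source project S, and the value of that metric for the i-th program module is
Figure PCTCN2020070199-appb-000002
max(S_j) and min(S_j) are the maximum and minimum values in the vector S_j, respectively.
Likewise, formula (1) is used to normalize each feature column of the test set, generating the new test set Q.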
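Further, in step 3, formula (2) is used to calculate the possibility value a:
Figure PCTCN2020070199-appb-000003
where the input space
Figure PCTCN2020070199-appb-000004
is a set of n-dimensional vectors and the output space is the class label set ψ = {0, 1}. The input is a feature vector x ∈ χ, x = (x_1, x_2, ..., x_n), that is, an instance in the test set Q; the output is a class label c_k ∈ ψ, where c_k = 1 means the instance is defective and c_k = 0 means it is not. X is a random vector defined on the input space χ, Y is a random variable defined on the output space ψ, and P(X,Y) is the joint probability distribution of X and Y. The training set P = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is generated independently and identically distributed from P(X,Y).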
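Formula (2) survives only as an image reference in this extraction. Under the definitions just given, the textbook naive Bayes posterior for the defective class would read as follows. This is a sketch of the standard form, not a transcription of the patent's image; the component notation X^{(j)} and the conditional-independence factorization are assumptions.

```latex
a = P(Y = 1 \mid X = x)
  = \frac{P(Y = 1)\,\prod_{j=1}^{n} P\bigl(X^{(j)} = x_j \mid Y = 1\bigr)}
         {\sum_{c_k \in \psi} P(Y = c_k)\,\prod_{j=1}^{n} P\bigl(X^{(j)} = x_j \mid Y = c_k\bigr)}
```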
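Further, the possibility value b in step 3 is calculated as follows:
The nearest neighbor classifier measures distance with the Euclidean distance, which is computed as follows:
Figure PCTCN2020070199-appb-000005
where x_i, x_j ∈ χ,
Figure PCTCN2020070199-appb-000006
Using the given distance metric, find in the training set P the instance vector x_t that is the nearest neighbor of each instance x in the test set Q, and obtain the class y_t to which x_t belongs; the possibility value b is then given by:
b = y_t       (4).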
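The distance formula is likewise an image here. The usual Euclidean distance between two n-dimensional instances, together with the nearest neighbor choice that yields b, can be written as below. The component superscript (l) and the argmin notation are assumptions added for illustration.

```latex
d(x_i, x_j) = \sqrt{\sum_{l=1}^{n} \bigl(x_i^{(l)} - x_j^{(l)}\bigr)^{2}},
\qquad
(x_t, y_t) = \arg\min_{(x', y') \in P} d(x, x'),
\qquad
b = y_t
```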
Further, in step 5, formula (5) is used to calculate the marking result c:
Figure PCTCN2020070199-appb-000007
where I is an indicator function: I is 1 when a ≥ 0.5, and 0 otherwise.
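The present invention also discloses a prediction system for the cross-project software defect prediction method, including:
a source project integration module, used to merge all projects selected from the software defect database that are different from the target project T, to obtain the source project;
a normalization module, used to normalize each feature column of the source project and the target project to obtain the training set P and the test set Q;
a naive Bayes classifier, used to predict the test set Q and output the possibility value a that each instance in the test set Q is defective;
a nearest neighbor classifier, used to predict the test set Q and output the possibility value b that each instance in the test set Q is defective;
a marking module, used to mark all instances in the test set Q with the possibility value a and the possibility value b to obtain the marking result;
a display module, used to display the defect level of each instance according to the marking result: no defects, ordinary defects, or serious defects.
Further, the training set P is used to construct the naive Bayes classifier.
Further, the training set P is used to construct the nearest neighbor classifier.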
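Beneficial effects: the present invention has the following advantages:
(1) Building on the strengths of the cross-project approach and combining the advantages of naive Bayes and nearest neighbor, a multi-source cross-project software defect prediction method and system that marks the severity of defects is proposed. The method merges the historical samples of all projects other than the target project into the training set. Drawing on the characteristics of the naive Bayes model, it first marks all instances in the target project with 0 and 0.5, dividing the target project into two classes, with the size of the value indicating the severity of the defect. Using the properties of the nearest neighbor model, it then marks the target instances a second time, so that every instance is marked 0, 0.5 or 1, and predicts from the marking result whether each target instance is defective.
(2) The cross-project software defect prediction method designed by the present invention takes defect severity into account as an attribute and proposes a method for marking the defect severity of instances; defect prediction is based on this marking result;
(3) With the cross-project software defect prediction method designed by the present invention, when testing resources are fixed, the instances with more severe defects can be tested first according to the size of their defect severity marking values.
(4) The cross-project software defect prediction method designed by the present invention has a simple algorithm structure and low time complexity.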
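Advantage (3) amounts to ordering instances by their marking value. A minimal sketch, assuming the marking results are held in a dictionary keyed by module name; the function name and the file names are hypothetical, not from the patent.

```python
def prioritize(marks):
    """Return module names ordered from most to least severe marking value.

    `marks` maps a (hypothetical) module name to its marking value c in
    {0, 0.5, 1}. Ties keep their insertion order because Python's sort
    is stable.
    """
    return [name for name, c in sorted(marks.items(), key=lambda kv: -kv[1])]


marks = {"Parser.java": 0.5, "Lexer.java": 0.0, "Emitter.java": 1.0, "Util.java": 0.5}
order = prioritize(marks)
print(order)
```

With a fixed testing budget, a team could then work down `order` until the budget runs out.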
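Brief Description of the Drawings
Figure 1 is a schematic flow chart of the cross-project software defect prediction method designed by the present invention;
Figure 2 is a schematic flow chart of the method for marking the defect severity of a target instance.
Detailed Description
The technical solution of the present invention is further explained below with reference to the drawings and embodiments.
As shown in Figure 1, the multi-source cross-project software defect prediction method and system of the present invention, which marks the severity of defects, combines the strengths of the cross-project approach with the advantages of naive Bayes and nearest neighbor: the historical samples of all projects other than the target project are merged into the training set, defect severity is taken into account as an attribute, and a method for marking the defect severity of instances is proposed. Drawing on the characteristics of the naive Bayes model, all instances in the target project are first marked with 0 and 0.5, dividing the target project into two classes, with the size of the value indicating the severity of the defect. Using the properties of the nearest neighbor model, the target instances are then marked a second time, so that every instance is marked 0, 0.5 or 1, and whether each target instance is defective is predicted from the marking result. When testing resources are fixed, the method allows the instances with more severe defects to be tested first according to the size of their defect severity marking values.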
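Embodiment 1:
The cross-project software defect prediction method of this embodiment is used to predict defects for a target software project. In actual use it includes the following steps:
Step 1: Select from the software defect database all projects that are different from the target project, merge them into one source project S, use the source project S as the training set and the target project T as the test set, and proceed to step 2. Here "different from" means that the training set must not contain data from the same project as the test set. Take the PROMISE database as an example, whose statistics are shown in Table 1: if the target project is ant-1.3, the source project S must not contain the labels of any instance in ant-1.4, ant-1.5, ant-1.6, or ant-1.7.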
Table 1 Statistics of the Promise data sets
Dataset    #Class  #Defect    Dataset       #Class  #Defect
ant-1.3       125       20    lucene-2.0       195       91
ant-1.4       178       40    lucene-2.2       247      144
ant-1.5       293       32    lucene-2.4       340      203
ant-1.6       351       92    poi-1.5          237      141
ant-1.7       745      166    poi-2.0          314       37
camel-1.0     339       13    poi-2.5          385      248
camel-1.2     608      216    poi-3.0          442      281
camel-1.4     872      145    redaktor         176       27
camel-1.6     965      188    synapse-1.0      157       16
ckjm           10        5    synapse-1.1      222       60
ivy-1.1       111       63    synapse-1.2      256       86
ivy-1.4       241       16    tomcat           858       77
ivy-2.0       352       40    velocity-1.4     196      147
jedit-3.2     272       90    velocity-1.6     229       78
jedit-4.0     306       75    xalan-2.4        723      110
jedit-4.1     312       79    xalan-2.5        803      387
jedit-4.2     367       48    xalan-2.6        885      411
jedit-4.3     492       11    xalan-2.7        909      898
log4j-1.0     135       34    xerces-1.2       440       71
log4j-1.1     109       37    xerces-1.3       453       69
log4j-1.2     205      189    xerces-1.4       588      437
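In the table, the Dataset column gives the name of each software project data set in the Promise software defect database, the #Class column gives the number of class files in the corresponding data set, and the #Defect column gives the number of defective classes in it. In practice, by the definition of cross-project prediction, if software projects A.1 and A.2 are two versions of project A and the target project is A.1, then the source project must not contain any other version of project A, such as A.2.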
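The exclusion rule in step 1 can be sketched as a filter on data set names. The code below assumes the `<project>-<version>` naming seen in Table 1; the helper names `project_name` and `select_sources` are hypothetical and are not part of the patent.

```python
def project_name(dataset):
    """Strip a trailing '-<version>' from a data set name, e.g. 'ant-1.3' -> 'ant'."""
    name, sep, version = dataset.rpartition("-")
    # Data sets without a version suffix (e.g. 'ckjm', 'tomcat') are kept whole.
    if sep and version[:1].isdigit():
        return name
    return dataset


def select_sources(datasets, target):
    """Keep every data set that does not belong to the target's project."""
    target_project = project_name(target)
    return [d for d in datasets if project_name(d) != target_project]


datasets = ["ant-1.3", "ant-1.4", "ant-1.7", "camel-1.0", "ckjm", "ivy-2.0", "tomcat"]
sources = select_sources(datasets, "ant-1.3")
print(sources)
```

The instances of the remaining data sets would then be concatenated to form the source project S.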
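Step 2: Normalize each feature column of the training set S and the test set T as follows to obtain a new training set P and a new test set Q, then proceed to step 3;
A transformation that applies min-max scaling followed by a natural logarithm is used. The normalization formula is:
Figure PCTCN2020070199-appb-000008
where the vector S_j is the j-th metric in the source project S, and the value of that metric for the i-th program module is
Figure PCTCN2020070199-appb-000009
max(S_j) and min(S_j) are the maximum and minimum values in the vector S_j, respectively.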
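Because formula (1) is only an image reference here, its exact form cannot be recovered. The sketch below shows one plausible reading of "min-max, then natural logarithm": scale each column to [0, 1] and apply ln(1 + v). The `+ 1` offset is an assumption made so that the scaled minimum stays finite, and the function names are hypothetical.

```python
import math


def normalize_column(values):
    """Min-max scale a feature column to [0, 1], then apply ln(1 + v).

    The '+ 1' inside the logarithm is an assumption made here so that the
    scaled minimum (0) stays finite; the patent's formula (1) may differ.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [math.log(1.0 + (v - lo) / (hi - lo)) for v in values]


def normalize(rows):
    """Normalize each feature column of a list of equal-length rows."""
    columns = [normalize_column(col) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


S = [[10.0, 2.0], [20.0, 2.0], [30.0, 8.0]]
P = normalize(S)
print(P)
```

Following the text, the same function would be applied separately to the test set T to obtain Q.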
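Step 3: Use the training set P to construct a naive Bayes classifier as follows and predict the test set Q; the naive Bayes classifier outputs the possibility value a that each instance in the test set is defective, then proceed to step 5;
The input space
Figure PCTCN2020070199-appb-000010
is a set of n-dimensional vectors, and the output space is the class label set ψ = {0, 1}. The input is a feature vector x ∈ χ, x = (x_1, x_2, ..., x_n), that is, an instance in the test set Q; the output is a class label c_k ∈ ψ, where c_k = 1 means the instance is defective and c_k = 0 means it is not. X is a random vector defined on the input space χ, and Y is a random variable defined on the output space ψ. P(X,Y) is the joint probability distribution of X and Y. The training data set P = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is generated independently and identically distributed from P(X,Y). Under these conditions, the value a is calculated as follows:
Figure PCTCN2020070199-appb-000011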
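A minimal sketch of how the value a could be computed, assuming Gaussian per-feature likelihoods. The patent does not state which likelihood model its naive Bayes classifier uses, so the Gaussian choice, the variance floor, and all function names are assumptions for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            # Small floor keeps the variance positive for constant columns.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def defect_posterior(model, x):
    """Return a = P(Y = 1 | X = x), computed in log space for stability."""
    log_joint = {}
    for label, (prior, stats) in model.items():
        total = math.log(prior)
        for v, (mean, var) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        log_joint[label] = total
    top = max(log_joint.values())
    weights = {k: math.exp(v - top) for k, v in log_joint.items()}
    return weights.get(1, 0.0) / sum(weights.values())


P_X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
P_y = [0, 0, 1, 1]
model = fit_gaussian_nb(P_X, P_y)
a_clean = defect_posterior(model, [0.15, 0.15])
a_buggy = defect_posterior(model, [0.85, 0.8])
print(a_clean, a_buggy)
```

Working in log space avoids underflow when many metric columns are multiplied together.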
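Step 4: Use the training set P to construct the nearest neighbor classifier as follows and predict the test set Q; the classifier outputs the possibility value b that each instance in the test set is defective, then proceed to step 5;
The nearest neighbor classifier measures distance with the Euclidean distance, which is computed as follows:
Figure PCTCN2020070199-appb-000012
where x_i, x_j ∈ χ,
Figure PCTCN2020070199-appb-000013
Using the given distance metric, find in the training set P the instance vector x_t that is the nearest neighbor of each instance x in the test set Q; the class of that instance is y_t, and the value b is given by:
b = y_t        (4)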
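Step 4 can be sketched with the standard library alone: `math.dist` gives the Euclidean distance, and the label of the closest training instance is returned as b. The function name and the toy data are hypothetical.

```python
import math


def nearest_neighbor_label(train_X, train_y, x):
    """Return b = y_t, the class of the training instance closest to x."""
    # math.dist computes the Euclidean distance between two points.
    t = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[t]


P_X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
P_y = [0, 0, 1, 1]
b_values = [nearest_neighbor_label(P_X, P_y, q) for q in ([0.0, 0.0], [1.0, 1.0])]
print(b_values)
```

When two training instances are equally close, `min` keeps the first one; the patent does not say how ties should be broken.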
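Step 5: As shown in Figure 2, use the values of a and b to mark all instances in the test set. The marking values are 0, 0.5 and 1, and the size of the marking value indicates the severity of the defect. The marking result c is obtained from the following formula:
Figure PCTCN2020070199-appb-000014
where I is an indicator function: I is 1 when a ≥ 0.5, and 0 otherwise.
Step 6: If the marking result c of an instance is 0, the instance is predicted to have no defects; otherwise, it is predicted to be defective.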
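Formula (5) is not recoverable from this extraction. One form consistent with the description (a first marking of 0 or 0.5 from naive Bayes, a second marking from the nearest neighbor, and final values 0, 0.5 and 1) is c = (I(a ≥ 0.5) + b) / 2. The sketch below uses that assumed form; it should not be read as the patent's own formula.

```python
def mark(a, b):
    """Combine the naive Bayes value a and the nearest neighbor value b.

    Assumed form of formula (5): c = (I(a >= 0.5) + b) / 2, which gives
    0, 0.5 or 1. The patent's own formula is an image and may differ.
    """
    indicator = 1 if a >= 0.5 else 0
    return (indicator + b) / 2


def is_defective(c):
    """Step 6: an instance is predicted defective unless its mark is 0."""
    return c != 0


results = [(a, b, mark(a, b), is_defective(mark(a, b)))
           for a, b in [(0.1, 0), (0.1, 1), (0.7, 0), (0.7, 1)]]
for row in results:
    print(row)
```

Under this reading, an instance is marked as seriously defective only when both classifiers flag it.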
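Embodiment 2:
The prediction system for the cross-project software defect prediction method of this embodiment includes:
a source project integration module, used to merge all projects selected from the software defect database that are different from the target project T, to obtain the source project;
a normalization module, used to normalize each feature column of the source project and the target project to obtain the training set P and the test set Q;
a naive Bayes classifier, used to predict the test set Q and output the possibility value a that each instance in the test set Q is defective;
a nearest neighbor classifier, used to predict the test set Q and output the possibility value b that each instance in the test set Q is defective;
a marking module, used to mark all instances in the test set Q with the possibility value a and the possibility value b to obtain the marking result;
a display module, used to display the defect level of each instance according to the marking result: no defects, ordinary defects, or serious defects.
In this embodiment, the training set P is used to construct both the naive Bayes classifier and the nearest neighbor classifier.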
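The classifier, marking and display modules can be wired together as plug-in callables. The sketch below is one hypothetical arrangement: the class name, field names and stand-in lambdas are illustrative only, the source project integration and normalization modules are assumed to have run already, and the marker uses the assumed combination (I(a ≥ 0.5) + b) / 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

LEVELS = {0.0: "no defects", 0.5: "ordinary defects", 1.0: "serious defects"}


@dataclass
class DefectPredictionSystem:
    """Wires the modules of Embodiment 2; every field is a plug-in callable."""
    naive_bayes: Callable[[Sequence[float]], float]       # returns a
    nearest_neighbor: Callable[[Sequence[float]], int]    # returns b
    marker: Callable[[float, int], float]                 # returns c

    def display(self, Q: List[Sequence[float]]) -> List[str]:
        """Marking module followed by the display module."""
        marks = [self.marker(self.naive_bayes(x), self.nearest_neighbor(x)) for x in Q]
        return [LEVELS[c] for c in marks]


# Stand-in classifiers, used only to exercise the wiring.
system = DefectPredictionSystem(
    naive_bayes=lambda x: x[0],
    nearest_neighbor=lambda x: 1 if x[1] > 0.5 else 0,
    marker=lambda a, b: ((1 if a >= 0.5 else 0) + b) / 2,
)
report = system.display([[0.1, 0.1], [0.9, 0.1], [0.9, 0.9]])
print(report)
```

Keeping the classifiers as injected callables lets either model be swapped without touching the marking or display logic.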
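Those skilled in the art should understand that the embodiments of this application can be provided as methods, systems, or computer program products. This application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to its embodiments. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in them, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device; the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the present invention may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the protection scope of the claims of the present invention.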

Claims (9)

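  1. A cross-project software defect prediction method, characterized in that it includes the following steps:
    Step 1: Select from the software defect database all projects that are different from the target project T and merge them into one source project S; use the source project S as the training set and the target project T as the test set;
    Step 2: Normalize each feature column of the training set and the test set with a transformation that applies min-max scaling followed by a natural logarithm, to obtain a new training set P and a new test set Q;
    Step 3: Use the training set P to construct a naive Bayes classifier and predict the test set Q; the naive Bayes classifier outputs the possibility value a that each instance in the test set Q is defective. Use the training set P to construct a nearest neighbor classifier and predict the test set Q; the nearest neighbor classifier outputs the possibility value b that each instance in the test set Q is defective;
    Step 4: Use the possibility value a and the possibility value b to mark all instances in the test set Q and obtain a marking result c. A marking value of 0 indicates that the instance has no defects, 0.5 indicates that the instance has ordinary defects, and 1 indicates that the instance has serious defects;
    Step 5: Determine from the marking result c whether the instance is defective.
  2. The cross-project software defect prediction method according to claim 1, characterized in that the source project must not contain data from the same project as the target project.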
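  3. The cross-project software defect prediction method according to claim 1, characterized in that in step 2, formula (1) is used to normalize each feature column of the training set;
    Figure PCTCN2020070199-appb-100001
    where the vector S_j is the j-th metric in the source project S, and the value of that metric for the i-th program module is
    Figure PCTCN2020070199-appb-100002
    max(S_j) and min(S_j) are the maximum and minimum values in the vector S_j, respectively;
    formula (1) is used to normalize each feature column of the test set, generating the test set Q.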
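  4. The cross-project software defect prediction method according to claim 1, characterized in that in step 3, formula (2) is used to calculate the possibility value a:
    Figure PCTCN2020070199-appb-100003
    where the input space
    Figure PCTCN2020070199-appb-100004
    is a set of n-dimensional vectors and the output space is the class label set ψ = {0, 1}. The input is a feature vector x ∈ χ, x = (x_1, x_2, ..., x_n), that is, an instance in the test set Q; the output is a class label c_k ∈ ψ, where c_k = 1 means the instance is defective and c_k = 0 means it is not. X is a random vector defined on the input space χ, Y is a random variable defined on the output space ψ, and P(X,Y) is the joint probability distribution of X and Y. The training set P = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} is generated independently and identically distributed from P(X,Y).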
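  5. The cross-project software defect prediction method according to claim 4, characterized in that the possibility value b in step 3 is calculated as follows:
    The nearest neighbor classifier measures distance with the Euclidean distance, which is computed as follows:
    Figure PCTCN2020070199-appb-100005
    where x_i, x_j ∈ χ,
    Figure PCTCN2020070199-appb-100006
    Using the given distance metric, find in the training set P the instance vector x_t that is the nearest neighbor of each instance x in the test set Q, and obtain the class y_t to which x_t belongs; the possibility value b is then given by:
    b = y_t    (4).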
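  6. The cross-project software defect prediction method according to claim 1, characterized in that in step 5, formula (5) is used to calculate the marking result c:
    Figure PCTCN2020070199-appb-100007
    where I is an indicator function: I is 1 when a ≥ 0.5, and 0 otherwise.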
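  7. A prediction system based on the cross-project software defect prediction method according to any one of claims 1 to 6, characterized in that it includes:
    a source project integration module, used to merge all projects selected from the software defect database that are different from the target project T, to obtain the source project;
    a normalization module, used to normalize each feature column of the source project and the target project to obtain the training set P and the test set Q;
    a naive Bayes classifier, used to predict the test set Q and output the possibility value a that each instance in the test set Q is defective;
    a nearest neighbor classifier, used to predict the test set Q and output the possibility value b that each instance in the test set Q is defective;
    a marking module, used to mark all instances in the test set Q with the possibility value a and the possibility value b to obtain the marking result;
    a display module, used to display the defect level of each instance according to the marking result: no defects, ordinary defects, or serious defects.
  8. The prediction system according to claim 7, characterized in that the training set P is used to construct the naive Bayes classifier.
  9. The prediction system according to claim 7, characterized in that the training set P is used to construct the nearest neighbor classifier.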
PCT/CN2020/070199 2019-11-11 2020-01-03 Cross-project software defect prediction method and system WO2021093140A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911094169.7A CN110825644B (zh) 2019-11-11 2019-11-11 Cross-project software defect prediction method and system
CN201911094169.7 2019-11-11

Publications (1)

Publication Number Publication Date
WO2021093140A1 (zh) 2021-05-20

Family

ID=69553814

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070199 WO2021093140A1 (zh) 2019-11-11 2020-01-03 Cross-project software defect prediction method and system

Country Status (2)

Country Link
CN (1) CN110825644B (zh)
WO (1) WO2021093140A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
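CN111367801B (zh) * 2020-02-29 2024-07-12 杭州电子科技大学 Data transformation method for cross-company software defect prediction
CN111581116B (zh) * 2020-06-16 2023-12-29 江苏师范大学 Cross-project software defect prediction method based on hierarchical data filtering
CN111881048B (zh) * 2020-07-31 2022-06-03 武汉理工大学 Cross-project software aging defect prediction method
CN112214406B (zh) * 2020-10-10 2021-06-15 广东石油化工学院 Cross-project defect prediction method based on selective pseudo-label subspace learning
CN112199287B (zh) * 2020-10-13 2022-03-29 北京理工大学 Cross-project software defect prediction method based on an enhanced mixture-of-experts model
CN112306730B (zh) * 2020-11-12 2021-11-30 南通大学 Defect report severity prediction method based on pseudo-label generation from historical projects
CN112463640B (zh) * 2020-12-15 2022-06-03 武汉理工大学 Cross-project software aging defect prediction method based on joint probability domain adaptation
CN113157564B (zh) * 2021-03-17 2023-11-07 江苏师范大学 Cross-project defect prediction method based on feature distribution alignment and neighborhood instance selection
CN114328277A (zh) * 2022-03-11 2022-04-12 广东省科技基础条件平台中心 Software defect prediction and quality analysis method, apparatus, device and medium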

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
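CN107133176A (zh) * 2017-05-09 2017-09-05 武汉大学 Cross-project defect prediction method based on semi-supervised clustering data filtering
CN107391369A (zh) * 2017-07-13 2017-11-24 武汉大学 Cross-project defect prediction method based on data filtering and data oversampling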

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150742A1 (en) * 2016-11-28 2018-05-31 Microsoft Technology Licensing, Llc. Source code bug prediction
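CN107025503A (zh) * 2017-04-18 2017-08-08 武汉大学 Cross-company software defect prediction method based on transfer learning and defect count information
CN108304316B (zh) * 2017-12-25 2021-04-06 浙江工业大学 Software defect prediction method based on collaborative transfer
CN108763283A (zh) * 2018-04-13 2018-11-06 南京邮电大学 Oversampling method for imbalanced data sets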

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
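CN114418222A (zh) * 2022-01-21 2022-04-29 广东电网有限责任公司 Device security threat prediction method and apparatus using adaptive ensembles
CN114676298A (zh) * 2022-04-12 2022-06-28 南通大学 Automatic defect report title generation method based on a quality filter
CN114676298B (zh) * 2022-04-12 2024-04-19 南通大学 Automatic defect report title generation method based on a quality filter
CN114706780A (zh) * 2022-04-13 2022-07-05 北京理工大学 Software defect prediction method based on Stacking ensemble learning
CN114924962A (zh) * 2022-05-17 2022-08-19 北京航空航天大学 Data selection method for cross-project software defect prediction
CN114924962B (zh) * 2022-05-17 2024-05-31 北京航空航天大学 Data selection method for cross-project software defect prediction
CN115269378A (zh) * 2022-06-23 2022-11-01 南通大学 Cross-project software defect prediction method based on domain feature distribution
CN115269378B (zh) * 2022-06-23 2023-06-09 南通大学 Cross-project software defect prediction method based on domain feature distribution
CN115269377A (zh) * 2022-06-23 2022-11-01 南通大学 Cross-project software defect prediction method based on optimized instance selection
CN115033493A (zh) * 2022-07-06 2022-09-09 陕西师范大学 Effort-aware just-in-time software defect prediction method based on linear programming
CN116881172A (zh) * 2023-09-06 2023-10-13 南昌航空大学 Software defect prediction method based on graph convolutional networks
CN116881172B (zh) * 2023-09-06 2024-02-23 南昌航空大学 Software defect prediction method based on graph convolutional networks
CN118394664A (zh) * 2024-06-28 2024-07-26 华南理工大学 Effort-aware just-in-time software defect prediction method and apparatus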

Also Published As

Publication number Publication date
CN110825644B (zh) 2021-06-11
CN110825644A (zh) 2020-02-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20887099

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20887099

Country of ref document: EP

Kind code of ref document: A1