KR20230122370A

KR20230122370A - Method and system for predicting heterogeneous defect through correlation-based selection of multiple source projects and ensemble learning

Info

Publication number: KR20230122370A
Application number: KR1020220018945A
Authority: KR
Inventors: 백종문; 김은섭
Original assignee: 한국과학기술원
Priority date: 2022-02-14
Filing date: 2022-02-14
Publication date: 2023-08-22

Abstract

A method and system for heterogeneous defect prediction through correlation-based selection of multiple source projects and ensemble learning are disclosed. According to an embodiment, a heterogeneous defect prediction method performed by a defect prediction system may include: selecting at least one source project for ensemble learning by analyzing the correlation between a plurality of source projects and a target project; and predicting heterogeneous defects of the target project through ensemble learning on the selected at least one source project.

Description

Heterogeneous defect prediction method and system through correlation-based multiple learning project selection and ensemble learning

The following description relates to software defect prediction technology.

Software defect prediction (SDP) has been studied extensively for decades and has become one of the major topics in software engineering. SDP aims to predict defect-prone modules (functions, classes, or files, depending on the granularity) and helps software engineers reduce testing effort and improve software quality.

Recently, heterogeneous defect prediction (HDP) has been proposed to predict defects even when two projects have heterogeneous metric sets. Several HDP studies focus on matching metrics between projects and improving prediction performance. These HDP methods take one source project dataset and one target project dataset as input and return predicted target labels as output.

A large amount of open-source project data is available for model training. To apply an HDP method, a developer must therefore select an appropriate source project whose data distribution is similar to that of the target project. Nevertheless, most studies have not proposed guidelines for selecting source projects; they report the average of the evaluation measures under the assumption that the source project is chosen at random.

Furthermore, defect data is generally highly imbalanced, meaning that defective modules are far fewer than non-defective modules. The class imbalance problem has a significant negative impact on the performance of defect prediction models, yet some techniques do not take it into account.

A method and system may be provided for selecting at least one source project for ensemble learning by performing correlation analysis.

A method and system may be provided for predicting heterogeneous defects of a target project through ensemble learning that addresses the class imbalance problem.

A heterogeneous defect prediction method performed by a defect prediction system may include: selecting at least one source project for ensemble learning by analyzing the correlation between a plurality of source projects and a target project; and predicting heterogeneous defects of the target project through ensemble learning on the selected at least one source project.

The heterogeneous defect prediction method may further include performing metric matching between the plurality of source projects and the target project.

The performing of the metric matching may include measuring, using the Kolmogorov-Smirnov test, the similarity of metric pairs each consisting of a metric of the plurality of source projects and a metric of the target project, in order to integrate the heterogeneous metric sets of the plurality of source projects and the target project.

The performing of the metric matching may include selecting metrics of the source project that meet or exceed a preset criterion according to the gain ratio of the source project.

The performing of the metric matching may include calculating the p-value of the Kolmogorov-Smirnov test for each metric pair, and filtering out the metric pair when the calculated p-value is less than a cutoff threshold.

The performing of the metric matching may include applying maximum weight bipartite matching to the metric pairs remaining after the filtering, to select the metric pairs having the highest p-values.

The selecting may include calculating, once metric matching between the plurality of source projects and the target project has been performed, a suitability score between the source data of the plurality of source projects and the target data of the target project.

The selecting may include calculating, when matching between the plurality of source projects and the target project succeeds, a suitability score composed of a correlation score between the plurality of source projects and the target project and a ratio of the numbers of modules of the plurality of source projects and the target project.

The selecting may include deriving a suitability score of 0 when matching between the plurality of source projects and the target project fails.

The selecting may include randomly sampling modules from each of the source data of the plurality of source projects and the target data of the target project, and aligning the source data and the target data.

The selecting may include, once the sampling and alignment are complete: calculating a correlation score using the Spearman correlation coefficient between the source data of the plurality of source projects and the target data of the target project; setting the correlation score to 0 when its p-value is greater than a cutoff threshold; and deriving the average of the correlation scores calculated for each of the metrics matched between the plurality of source projects and the target project.

The predicting of the heterogeneous defects may include building a logistic regression-based prediction model for each of the selected at least one source project, and predicting defect labels of the target project using the built logistic regression-based prediction models.

The predicting of the heterogeneous defects may include returning, through soft voting, the probabilities predicted by each logistic regression-based prediction model, and determining the final predicted label from the average of the returned probabilities.

A computer program stored in a computer-readable storage medium to execute a heterogeneous defect prediction method performed by a defect prediction system may include: selecting at least one source project for ensemble learning by analyzing the correlation between a plurality of source projects and a target project; and predicting heterogeneous defects of the target project through ensemble learning on the selected at least one source project.

A computer program stored in a non-transitory computer-readable recording medium may be included to cause the defect prediction system to execute the heterogeneous defect prediction method.

A defect prediction system may include: a correlation analysis unit that selects at least one source project by analyzing the correlation between a plurality of source projects and a target project; and an ensemble learning unit that predicts heterogeneous defects of the target project through ensemble learning on the selected at least one source project.

Guidelines can be provided for selecting source projects in heterogeneous defect prediction, where no target data is available for training.

By selecting source projects suited to the target project through the correlation-based approach and applying ensemble learning, the class imbalance problem can be mitigated and heterogeneous defect prediction performance can be improved.

When applied to a newly developed project, defects can be predicted in advance, so that inspection and testing start with the modules expected to contain defects, which reduces cost. In addition, because more of the testing budget can be concentrated on the parts where defects are expected, software quality can be improved.

FIG. 1 is a block diagram illustrating the configuration of a defect prediction system according to an embodiment.
FIG. 2 is a flowchart illustrating a heterogeneous defect prediction method in the defect prediction system according to an embodiment.
FIG. 3 is a diagram illustrating a heterogeneous defect prediction operation according to an embodiment.
FIG. 4 is a diagram showing pseudocode of the heterogeneous defect prediction operation according to an embodiment.
FIG. 5 is a diagram showing pseudocode of the suitability score according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

The embodiments describe a novel heterogeneous defect prediction operation, correlation-based selection of multiple source projects, that further improves performance by selecting appropriate source projects for a given target project and by taking imbalanced data into account.

FIG. 1 is a block diagram illustrating the configuration of a defect prediction system according to an embodiment, and FIG. 2 is a flowchart illustrating a heterogeneous defect prediction method in the defect prediction system according to an embodiment.

The processor of the defect prediction system (100) may include a correlation analysis unit (110) and an ensemble learning unit (120). These components of the processor may be representations of different functions performed by the processor according to control instructions provided by program code stored in the defect prediction system. The processor and its components may control the defect prediction system to perform steps 210 to 220 of the heterogeneous defect prediction method of FIG. 2. The processor and its components may be implemented to execute instructions according to the code of an operating system and the code of at least one program held in memory.

The processor may load into memory the program code stored in the file of a program for the heterogeneous defect prediction method. For example, when the program is executed in the defect prediction system, the processor may control the system to load the program code from the program file into memory under the control of the operating system. The correlation analysis unit (110) and the ensemble learning unit (120) may each be a different functional representation of the processor that executes the corresponding portion of the loaded program code to carry out steps 210 to 220.

Beforehand, metric matching between the plurality of source projects and the target project may be performed by a metric matching unit (not shown). To integrate the heterogeneous metric sets of the source projects and the target project, the metric matching unit may use the Kolmogorov-Smirnov test to measure the similarity of metric pairs, each consisting of a source project metric and a target project metric. The metric matching unit may select the source project metrics that meet or exceed a preset criterion according to gain ratio. It may calculate the p-value of the Kolmogorov-Smirnov test for each metric pair and filter out the pair when the p-value is less than a cutoff threshold. It may then apply maximum weight bipartite matching to the remaining metric pairs to select the pairs with the highest p-values.

In step 210, the correlation analysis unit (110) may select at least one source project by analyzing the correlation between the plurality of source projects and the target project. Once metric matching between the source projects and the target project has been performed, the correlation analysis unit (110) may calculate a suitability score between the source data of each source project and the target data of the target project. When matching between a source project and the target project succeeds, the suitability score is composed of the correlation score between the two projects and the ratio of their numbers of modules. When matching fails, the suitability score is 0. To compute the correlation score, the correlation analysis unit (110) may randomly sample modules from the source data and the target data so that the two are aligned. Once sampling and alignment are complete, it may calculate a correlation score using the Spearman correlation coefficient between the source data and the target data, set the score to 0 when its p-value is greater than the cutoff threshold, and derive the average of the correlation scores calculated for each matched metric.

In step 220, the ensemble learning unit (120) may predict heterogeneous defects of the target project through ensemble learning on the selected at least one source project. The ensemble learning unit (120) may build a logistic regression-based prediction model for each selected source project and use the built models to predict the defect labels of the target project. Through soft voting, it may obtain the probability predicted by each logistic regression-based model and determine the final predicted label from the average of those probabilities.

FIG. 3 is a diagram illustrating a heterogeneous defect prediction operation according to an embodiment.

The defect prediction system may perform metric matching through the Kolmogorov-Smirnov test (310), which it uses to integrate the heterogeneous metric sets of the source project and the target project. For example, the top 15% of the source project's metrics may be selected by gain ratio score, and metrics may then be matched between the source and target projects based on the p-value of the Kolmogorov-Smirnov test (310) with a cutoff threshold (for example, 0.05). If the p-values of all metric matchings are below the cutoff threshold (for example, 0.05), the source project may be filtered out.

More specifically, if the source project contains 34 metrics, the top 6 metrics ranked by gain ratio may be selected. To integrate the metric sets of the two different projects (source and target), the similarity of every metric pair may be measured using the Kolmogorov-Smirnov test (310). For example, suppose the source project has 6 metrics (after metric selection) and the target project has 29. The p-value of the Kolmogorov-Smirnov test (310) must then be calculated for the 174 possible metric pairs. If a p-value is less than the cutoff threshold (for example, 0.05), that match is discarded and no longer considered. Maximum weight bipartite matching is then applied to select the set of matched metrics with the highest total score. For example, assume there are two source metrics (S₁ and S₂) and two target metrics (T₁ and T₂). If the p-values of the Kolmogorov-Smirnov test (310) for the metric pairs (S₁, T₁), (S₁, T₂), (S₂, T₁), and (S₂, T₂) are 0.9, 0.7, 0.6, and 0.1, respectively, the pairs (S₁, T₂) and (S₂, T₁) are selected because their total score of 1.3 is the highest; the alternative pairing, (S₁, T₁) with (S₂, T₂), totals only 1.0.
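The filtering and matching step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the Kolmogorov-Smirnov p-values have already been computed into a matrix, rounds the 15% metric count up (an assumption that reproduces the 6-of-34 example), and uses a brute-force search in place of a dedicated maximum weight bipartite matching algorithm. The function names `top_metric_count` and `match_metrics` are hypothetical.

```python
from math import ceil


def top_metric_count(n_metrics, ratio=0.15):
    """Number of source metrics kept after gain-ratio ranking.

    Rounding up is an assumption; it matches the 34 -> 6 example.
    """
    return ceil(n_metrics * ratio)


def match_metrics(p_values, cutoff=0.05):
    """Pick the one-to-one metric pairing with the highest total p-value.

    p_values[i][j] is the (precomputed) KS-test p-value for source
    metric i and target metric j. Pairs with p < cutoff are discarded.
    Brute-force search: only suitable for small illustrative inputs.
    Returns (pairs, total); pairs is empty when matching fails.
    """
    n_src = len(p_values)
    n_tgt = len(p_values[0]) if n_src else 0
    best = {"total": 0.0, "pairs": []}

    def search(i, used, total, pairs):
        if i == n_src:
            if total > best["total"]:
                best["total"], best["pairs"] = total, list(pairs)
            return
        # Option 1: leave source metric i unmatched.
        search(i + 1, used, total, pairs)
        # Option 2: match it with any free target metric above the cutoff.
        for j in range(n_tgt):
            p = p_values[i][j]
            if j not in used and p >= cutoff:
                used.add(j)
                pairs.append((i, j))
                search(i + 1, used, total + p, pairs)
                pairs.pop()
                used.remove(j)

    search(0, set(), 0.0, [])
    return best["pairs"], best["total"]


# The two-by-two example from the text: rows are S1, S2; columns are T1, T2.
pairs, total = match_metrics([[0.9, 0.7], [0.6, 0.1]])
print(pairs, round(total, 2))
```

For realistic sizes such as 6 by 29 metrics, a polynomial-time assignment solver (for example `scipy.optimize.linear_sum_assignment` with `maximize=True`) would be the more practical choice.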

The defect prediction system may select at least one source project from the plurality of source projects through correlation analysis (320). To select source projects for ensemble learning, the system may calculate a suitability score for each source project. A higher suitability score means the source project is more appropriate for training. Training may be performed on one or more source projects with the highest suitability scores. The number of source projects selected depends on the selection ratio, whose default value may be 0.6.
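The selection step can be sketched as below, assuming the suitability scores are already available. How the selection ratio is turned into a project count is not spelled out in the text, so applying the ratio to the projects with a non-zero score and rounding up are assumptions, as is the hypothetical function name `select_sources`.

```python
from math import ceil


def select_sources(suitability, selection_ratio=0.6):
    """Return the names of the source projects chosen for ensemble learning.

    suitability maps a source project name to its suitability score.
    Projects with score 0 (failed metric matching) are dropped first;
    the ratio is then applied to the remaining candidates (assumption).
    """
    candidates = [(name, s) for name, s in suitability.items() if s > 0]
    if not candidates:
        return []
    candidates.sort(key=lambda item: item[1], reverse=True)
    k = max(1, ceil(selection_ratio * len(candidates)))
    return [name for name, _ in candidates[:k]]


# Hypothetical scores for five candidate source projects.
print(select_sources({"A": 0.9, "B": 0.0, "C": 0.4, "D": 0.7, "E": 0.2}))
```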

More specifically, to calculate the suitability score between the source data of the plurality of source projects and the target data of the target project, the suitability score may be defined as follows. FIG. 4 shows pseudocode of the heterogeneous defect prediction operation, and FIG. 5 shows an example of pseudocode for the suitability score.

Equation 1:

Here, X_S and X_T denote the integrated source data and target data, respectively, and module(X) denotes the number of modules in data X. This formula applies only when metric matching succeeds (lines 3-5); if metric matching fails, the suitability score is 0 (line 6). The suitability score consists of two terms: the correlation score between the source project and the target project, and the ratio of the numbers of modules of the source project and the target project. Based on experiments, the parameter α of the suitability function may be set to 0.95 by default.
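Because the text names only the two terms and the parameter α, the following is a sketch under explicit assumptions rather than the published formula: it assumes the two terms are combined as a weighted sum with weights α and 1 - α, and that the module ratio is taken as smaller count over larger count so that it stays within [0, 1]. The function name `suitability_score` is hypothetical.

```python
def suitability_score(cor_score, n_source_modules, n_target_modules,
                      matched=True, alpha=0.95):
    """Illustrative suitability score.

    ASSUMPTION: weighted sum alpha * correlation + (1 - alpha) * module ratio.
    ASSUMPTION: module ratio = min(count) / max(count).
    Only the two terms, alpha = 0.95, and the zero score on failed
    matching come from the text.
    """
    if not matched:
        return 0.0  # metric matching failed
    module_ratio = (min(n_source_modules, n_target_modules)
                    / max(n_source_modules, n_target_modules))
    return alpha * cor_score + (1 - alpha) * module_ratio


# Hypothetical inputs: correlation score 0.8, 100 source and 50 target modules.
print(suitability_score(0.8, 100, 50))
```

With α close to 1, the correlation score dominates and the module ratio mainly separates source projects whose correlation scores are nearly equal.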

The correlation score between X_S and X_T, which are two-dimensional data (the integrated source data and target data), may be calculated by the corScore function given in the pseudocode of FIG. 5. A correlation coefficient can only be calculated between one-dimensional data with the same number of instances. Therefore, for each metric, corScore aligns the two datasets by randomly sampling modules from the dataset that has more modules (lines 1-9). Once sampling and alignment are complete, the correlation score can be obtained (line 10). The correlation score may be calculated using Spearman's correlation coefficient. If the p-value of the calculated correlation is greater than a preset value (for example, 0.05), the correlation is not statistically significant, and the correlation score may be set to 0 (lines 11-13). After the correlation scores for all metrics have been calculated, their average may be derived (lines 14-16). This entire process may be performed repeatedly (line 17); for example, it may be repeated 100 times.
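The corScore procedure described above can be sketched as follows. The pseudocode of FIG. 5 is not reproduced in this text, so this is an interpretation: the correlation routine is passed in as `corr_fn` (any callable returning a coefficient and a p-value), the same sampled modules are used for every metric within one repetition, and the results of the repetitions are averaged. Those three choices, and the name `cor_score`, are assumptions.

```python
import random


def cor_score(src_cols, tgt_cols, corr_fn, p_cutoff=0.05, repeats=100, seed=0):
    """Average per-metric correlation between aligned source and target data.

    src_cols[k] and tgt_cols[k] hold the values of the k-th matched metric
    pair, one value per module. corr_fn(a, b) must return (coefficient, p).
    """
    rng = random.Random(seed)
    n_s, n_t = len(src_cols[0]), len(tgt_cols[0])
    n = min(n_s, n_t)
    run_means = []
    for _ in range(repeats):
        # Align: subsample modules from whichever dataset has more of them.
        idx_s = rng.sample(range(n_s), n) if n_s > n else list(range(n_s))
        idx_t = rng.sample(range(n_t), n) if n_t > n else list(range(n_t))
        scores = []
        for s_col, t_col in zip(src_cols, tgt_cols):
            a = [s_col[i] for i in idx_s]
            b = [t_col[i] for i in idx_t]
            rho, p = corr_fn(a, b)
            # Not significant (or undefined, i.e. NaN): score 0.
            scores.append(0.0 if (p > p_cutoff or rho != rho) else rho)
        run_means.append(sum(scores) / len(scores))
    return sum(run_means) / len(run_means)
```

Passing `scipy.stats.spearmanr` as `corr_fn` gives the Spearman-based score, since its result unpacks into the coefficient and the p-value.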

The defect prediction system may predict heterogeneous defects of the target project through ensemble learning (330) on the selected at least one source project. Ensemble learning has been used to improve existing methods for class imbalance learning. An ensemble classifier combines the predictions of its base learners, which gives it stronger generalization ability and better insight into unseen data than the individual base learners. For example, integrating multiple kernel classifiers with ensemble learning can improve prediction performance by mitigating the effects of class imbalance. To mitigate the class imbalance problem, the defect prediction system may use ensemble learning: for each selected source project, it trains a prediction model with a logistic regression classifier and predicts the defect labels of the target project. The results (the predicted defect labels) may then be combined by a soft voting classifier to reduce the misclassification of defective modules.

More specifically, logistic regression is the most widely used classifier in heterogeneous defect prediction. Referring to FIG. 4, a logistic regression-based prediction model may be built for each selected source project and used to predict the defect labels of the target project (lines 13-15). Voting may be applied to aggregate the prediction results and mitigate the class imbalance problem. In general, there are two voting strategies: hard voting and soft voting. In hard voting, the labels predicted by the base classifiers are counted and the majority label becomes the final result. In soft voting, each base classifier returns its predicted probability and the average of those probabilities determines the final result. In the embodiment, the final predicted label L obtained through soft voting (line 16) may be defined as follows.

Here, N' is the number of selected source projects and Z_i is the predicted probability from the model built on the i-th selected source project, so that L is determined from the mean of Z_1, ..., Z_N'.
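A minimal sketch of the soft voting step, assuming each per-source model has already produced a defect probability for every target module. The 0.5 decision threshold and the function name `soft_vote` are assumptions; the text only states that the averaged probability determines the final label.

```python
def soft_vote(probabilities, threshold=0.5):
    """Combine per-model defect probabilities by averaging.

    probabilities[i][m] is Z_i for target module m, i.e. the defect
    probability predicted by the model trained on source project i.
    Returns (averaged probabilities, predicted labels), where label 1
    means defective. The threshold value is an assumption.
    """
    n_models = len(probabilities)
    n_modules = len(probabilities[0])
    averaged = [
        sum(model[m] for model in probabilities) / n_models
        for m in range(n_modules)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical outputs of three source models for three target modules.
averaged, labels = soft_vote([
    [0.9, 0.2, 0.4],
    [0.7, 0.1, 0.7],
    [0.8, 0.3, 0.3],
])
print(labels)
```

In practice the rows could come from a logistic regression model per source project, for example the positive-class column of scikit-learn's `LogisticRegression.predict_proba`.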

According to an embodiment, correlation-based selection of multiple source projects combined with ensemble learning for heterogeneous defect prediction makes it possible to select source projects suited to a given target project and to mitigate the class imbalance problem. As an example, experiments may be conducted on 37 projects, with the Wilcoxon signed-rank test and Cliff's effect size test applied. The experimental results show that the proposed method performs considerably better than conventional heterogeneous defect prediction methods that do not select appropriate source projects, that both the correlation-based selection and the ensemble learning approach improve performance, and that when multiple source projects are selected for training, a selection ratio of 0.6 yields the best results. In conclusion, even without target data for training, the method provides selection guidelines for source projects in heterogeneous defect prediction, which improves prediction performance and allows the method to be applied in the early stages of a project. In other words, selecting an appropriate number of source datasets is important for prediction performance.

More specifically, three questions may be set for the experiments. Question 1: Does the method proposed in the embodiment achieve better prediction performance than existing heterogeneous defect prediction methods? Question 2: How effective are the two key techniques of the proposed method (correlation-based multi-source selection and ensemble learning) for prediction performance? Question 3: What are appropriate parameter values for the proposed method (i.e., the alpha of the suitability function, the correlation coefficient, and the selection ratio)?

In addition, publicly available data sets that are widely used for heterogeneous defect prediction may be collected, including NASA, SOFTLAB, ReLink, AEEEM, and PROMISE. The NASA data set contains various code metrics of NASA software systems. NASA's static code metrics may include size and complexity. Accordingly, only the data of five projects, CM1, MW1, PC1, PC3, and PC4, which share 37 common metrics, are used. The SOFTLAB data set comes from controller software for white goods. It consists of 29 metrics, including Halstead metrics and McCabe's cyclomatic metrics. The ReLink data set was collected to obtain high-quality defect data and improve prediction performance, and contains 26 metrics, including file-level static code complexity measures such as lines of code and cyclomatic complexity. The AEEEM data set consists of change, bug, and version information for five systems. It contains 61 metrics, including source code metrics, entropy of change metrics, and churn of source code metrics. The PROMISE data set may consist of open source projects such as ant, camel, and ivy. It contains 20 metrics, including CK metrics and OO metrics. To mitigate threats to external validity, the JIRA data set, which has 65 metrics, is additionally used in the experiments. In total, 37 projects from six data sets were used, and these projects are described in Table 1. The number of metrics differs from one data set to another.

Table 1:

In addition, the method proposed in the embodiment may be compared with four other heterogeneous defect prediction methods: HDP-KS, EMKCA, CTKCCA, and MSMDA. An empirical study on heterogeneous defect prediction found that CTKCCA, which takes the linear inseparability and class imbalance problems into account, performs best, followed by EMKCA. Among the metric selection-based approaches, HDP-KS showed the highest performance. HDP-KS, EMKCA, and CTKCCA were evaluated by the mean or median of the evaluation measures under the assumption that source projects are selected at random. MSMDA, a mixed-project model that requires target data for training, is a multi-source selection method based on manifold discriminant alignment.

To evaluate these heterogeneous defect prediction techniques, two commonly used performance measures may be used: AUC as a non-effort-aware performance indicator (NPI) and ACC as an effort-aware performance indicator (EPI).

In software defect prediction, defective modules are generally treated as positive instances and non-defective modules as negative instances. Depending on a module's actual and predicted defect status, a prediction result falls into one of four types: true positive (tp), false positive (fp), false negative (fn), or true negative (tn). The confusion matrix for software defect prediction is shown in Table 2.

Table 2:

AUC denotes the area under the receiver operating characteristic (ROC) curve. The curve plots the false positive rate (the probability of false alarm, also called pf) on the x-axis and the true positive rate (the probability of detection, also called pd) on the y-axis. pf and pd are defined as fp/(fp+tn) and tp/(tp+fn), respectively. Threshold-dependent measures such as precision and recall can be affected by the choice of cutoff threshold and by the class imbalance problem. AUC, by contrast, is independent of the threshold and less sensitive to class imbalance, and is therefore widely used in defect prediction studies. AUC lies between 0 and 1, and a higher AUC indicates better performance. A perfect model has an AUC of 1.0, and random prediction yields an AUC of 0.5.
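The definitions above can be made concrete with a short sketch. It computes pd and pf from confusion-matrix counts and computes AUC through its rank interpretation: the probability that a randomly chosen defective module receives a higher score than a randomly chosen non-defective one, with ties counted as one half. The function names and the sample numbers are illustrative assumptions.

```python
def pd_pf(tp, fp, fn, tn):
    """Return (pd, pf) as defined in the text: tp/(tp+fn) and fp/(fp+tn)."""
    return tp / (tp + fn), fp / (fp + tn)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for defective (positive), 0 for non-defective (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Illustrative numbers only.
detection, false_alarm = pd_pf(tp=30, fp=20, fn=10, tn=140)
example_auc = auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of modules, which is acceptable for a sketch; for large data sets a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.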

Unlike NPIs such as AUC, EPIs take into account the effort required to inspect program modules. Most existing HDP studies have evaluated performance only with NPIs, which can lead to biased results. Therefore, ACC, an EPI used in just-in-time (JIT) defect prediction, is additionally considered.

ACC denotes the recall of defective modules when 20% of the total effort is spent. The number of inspected modules is used to measure effort. For example, suppose a project has 1,000 modules, 100 of which are defective. Spending 20% of the total effort allows the 200 modules with the highest prediction scores to be inspected. Suppose that 35 of the 200 inspected modules turn out to be defective. Then 35 of the 100 defective modules are found, so ACC is 35/100 = 0.35.
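The worked example can be reproduced with a small sketch that ranks modules by prediction score, inspects the top 20%, and reports the share of all defective modules found. The function name `acc_at_effort` and the synthetic score list are assumptions made for illustration.

```python
def acc_at_effort(scores, labels, effort=0.2):
    """Recall of defective modules among the top `effort` share of modules.

    Effort is measured as the number of inspected modules, as in the text.
    """
    total_defective = sum(labels)
    if total_defective == 0:
        raise ValueError("no defective modules to recall")
    n_inspect = int(len(scores) * effort)
    # Inspect modules in descending order of predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:n_inspect])
    return found / total_defective


# Synthetic data matching the example: 1,000 modules, 100 defective,
# 35 of which fall within the 200 highest-scoring modules.
labels = [1] * 35 + [0] * 165 + [1] * 65 + [0] * 735
scores = [1000 - i for i in range(1000)]
acc = acc_at_effort(scores, labels)
```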

To address Question 1, the prediction performance of the method proposed in the embodiment may be compared with that of the four baselines. HDP-KS, EMKCA, and CTKCCA are one-to-one HDP methods, meaning that they require one source project and one target project as input. Accordingly, one of the 37 projects is chosen as the target project, and each project from the other data sets is used as a source project. For example, when CM1 is the target project, there are 32 prediction combinations, excluding the five projects of the NASA data set (see Table 1). In total, EMKCA and CTKCCA have 1,092 possible prediction combinations. HDP-KS attempts the same 1,092 combinations, but only 797 are valid because 295 combinations fail metric matching. MSMDA and the method proposed in the embodiment are many-to-one heterogeneous defect prediction methods, in which multiple source projects may be selected and used for training for each target project. MSMDA and the proposed method therefore each have 37 possible prediction combinations. Unlike the other four methods, MSMDA is a mixed-project heterogeneous defect prediction method that uses 10% of the labeled target data for training, so the MSMDA procedure may be repeated 30 times and the results averaged. For a fair comparison, all methods may use the logistic regression classifier of LIBLINEAR with default parameters. The MATLAB code of the baselines may be used for this purpose. However, a serious bug was found in that code: the target data is preprocessed using label information, so that all defective modules are placed first and the non-defective modules follow. Such a process is clearly problematic, because the label information of the target data should be accessible only when the predicted results are evaluated. In the experiments, the code may be corrected by removing this sorting process, which can lower the performance of the baselines.

To address Question 2, two new methods may be created by omitting part of the process of the method proposed in the embodiment, and their results compared. First, HDP-KS-EL (HDP-KS with ensemble learning) applies ensemble learning to all prediction results without the correlation-based selection process. Second, CASS (correlation-based single source selection) selects one source project through the correlation-based selection process without applying ensemble learning; it selects only the single project with the highest suitability score and trains the model on that project. In the embodiment, the AUC and ACC of the three methods, HDP-KS-EL, CASS, and the proposed method (CAMEL), may be compared. Table 3 summarizes whether each method applies the selection approach and the ensemble learning approach.
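The difference between the three variants lies in which source projects reach the training step, and a minimal sketch can show this. The project names and suitability scores below are hypothetical, a score of 0 is taken to mean that metric matching failed, and rounding the selected count up to a whole project is an assumption, since the text does not state how fractional counts are handled.

```python
import math


def rank_matched(suitability):
    """Names of matched source projects, highest suitability score first."""
    matched = {name: s for name, s in suitability.items() if s > 0}
    return sorted(matched, key=matched.get, reverse=True)


def camel_sources(suitability, ratio=0.6):
    """Top `ratio` share of matched projects (count rounded up, an assumption)."""
    ranked = rank_matched(suitability)
    if not ranked:
        return []
    k = max(1, math.ceil(round(len(ranked) * ratio, 9)))
    return ranked[:k]


def cass_sources(suitability):
    """Only the single project with the highest suitability score."""
    return rank_matched(suitability)[:1]


def hdp_ks_el_sources(suitability):
    """All matched projects, with no correlation-based selection."""
    return rank_matched(suitability)


# Hypothetical suitability scores; project C failed metric matching.
scores = {"A": 0.8, "B": 0.5, "C": 0.0, "D": 0.3, "E": 0.6, "F": 0.1}
camel = camel_sources(scores)
cass = cass_sources(scores)
everything = hdp_ks_el_sources(scores)
```

In this sketch CAMEL and HDP-KS-EL would each train one model per returned project and combine them by voting, while CASS would train a single model.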

Table 3:

For Questions 1 and 2, statistical tests may be performed. To assess statistical significance, the non-parametric Wilcoxon signed-rank test may be carried out at a 95% confidence level. The Wilcoxon signed-rank test is used to compare two related samples, here the results of the method proposed in the embodiment (CAMEL) and of each baseline. In addition, Cliff's δ, a non-parametric effect size measure, may be calculated to quantify the magnitude of the performance difference between CAMEL and each baseline. Cliff's δ lies in [-1, 1], and Table 4 shows the effectiveness levels for various values of δ.
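Cliff's δ has a simple closed form: over all pairs drawn from the two samples, it is the share of pairs in which the first value is larger minus the share in which it is smaller. The following sketch computes it directly; the function name and the sample AUC values are illustrative assumptions, not experimental results. The signed-rank test itself is available as `scipy.stats.wilcoxon` and is not reimplemented here.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta between two samples; the result lies in [-1, 1]."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Illustrative per-target AUC values for two methods (not real results).
method_a = [0.70, 0.75, 0.80]
method_b = [0.60, 0.72, 0.65]
delta = cliffs_delta(method_a, method_b)
```

A value near 1 means the first method almost always scores higher, a value near -1 means the opposite, and 0 means neither sample tends to dominate.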

Table 4:

To address Question 3, the optimal alpha (α), correlation coefficient, and selection ratio are sought. The performance of most HDP approaches has been found to depend on parameter settings, so parameters (e.g., the Gaussian kernel parameter of EMKCA) need to be determined empirically. First, the optimal value of α in the suitability function is sought. Values of α from 0.05 to 1.00 are used at intervals of 0.05. Because the results differ considerably between 0.95 and 1.00, the value 0.99 is tried in addition. As a result, AUC values are calculated for 21 different values of α and the best value is identified. Note that when α is 1.00, the suitability function contains only the correlation score between the two projects and not the ratio of the numbers of modules. After the optimal α has been found, three types of correlation coefficient and nine selection ratios may be tried. To calculate the correlation score between matched metrics, the Pearson correlation coefficient, Spearman's rank correlation coefficient (Spearman's r), and Kendall's rank correlation coefficient (Kendall's τ) are used. The coefficients may be computed with Python's SciPy module. To select multiple source projects, selection ratios from 0.1 to 0.9 are used at intervals of 0.1. That is, depending on the selection ratio, the 10%, 20%, ..., 80%, or 90% of matched source projects with the highest suitability scores are selected. Thus, AUC values are calculated for 27 parameter settings and the best setting is identified.
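The parameter search described above amounts to two small grids, which the following sketch enumerates. The `suitability` function is an assumption: the passage only states that at α = 1.00 the module-count ratio drops out, and a weighted sum of the two terms is one form consistent with that statement. The coefficient names are plain labels; in practice they would map to `scipy.stats.pearsonr`, `scipy.stats.spearmanr`, and `scipy.stats.kendalltau`.

```python
from itertools import product

# Stage 1: alpha from 0.05 to 1.00 in steps of 0.05, plus the extra value 0.99.
alphas = sorted([round(0.05 * i, 2) for i in range(1, 21)] + [0.99])

# Stage 2: three correlation coefficients times nine selection ratios.
coefficients = ["pearson", "spearman", "kendall"]
ratios = [round(0.1 * i, 1) for i in range(1, 10)]
settings = list(product(coefficients, ratios))


def suitability(correlation, module_ratio, alpha):
    """Hypothetical weighted form of the suitability score (an assumption)."""
    return alpha * correlation + (1 - alpha) * module_ratio
```

Each entry of `alphas` and each pair in `settings` would be evaluated by running the full prediction pipeline and recording the resulting AUC.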

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but those of ordinary skill in the art will appreciate that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by a processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known to and usable by those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited number of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

1. A heterogeneous defect prediction method performed by a defect prediction system, the method comprising:
selecting at least one source project for ensemble learning by analyzing a correlation between a plurality of source projects and a target project; and
predicting heterogeneous defects of the target project through ensemble learning on the selected at least one source project.

2. The method of claim 1, further comprising:
performing metric matching between the plurality of source projects and the target project.

3. The method of claim 2, wherein performing the metric matching comprises:
measuring, using a Kolmogorov-Smirnov test, a similarity between metric pairs, each including a metric of the plurality of source projects and a metric of the target project, in order to integrate the heterogeneous metric sets of the plurality of source projects and the target project.

4. The method of claim 3, wherein performing the metric matching comprises:
selecting metrics of a source project that meet or exceed a preset criterion according to a gain ratio of the source project.

5. The method of claim 3, wherein performing the metric matching comprises:
calculating a p-value of the Kolmogorov-Smirnov test for each metric pair, and filtering out the metric pair if the calculated p-value is less than a cutoff threshold.

6. The method of claim 5, wherein performing the metric matching comprises:
selecting the metric pairs having the highest p-values by applying maximum weight bipartite matching to the metric pairs remaining after the filtering.

7. The method of claim 1, wherein the selecting comprises:
calculating a suitability score between source data of the plurality of source projects and target data of the target project once metric matching between the plurality of source projects and the target project has been performed.

8. The method of claim 7, wherein the selecting comprises:
when the matching between the plurality of source projects and the target project succeeds, calculating a suitability score composed of a correlation score between the plurality of source projects and the target project and a ratio of the number of modules between the plurality of source projects and the target project.

9. The method of claim 7, wherein the selecting comprises:
deriving a suitability score of 0 when the matching between the plurality of source projects and the target project fails.

10. The method of claim 7, wherein the selecting comprises:
randomly sampling modules from each of the source data of the plurality of source projects and the target data of the target project, and sorting the source data of the plurality of source projects and the target data of the target project.

11. The method of claim 10, wherein the selecting comprises:
upon completion of the sampling and the sorting, calculating a correlation score using a Spearman correlation coefficient between the source data of the plurality of source projects and the target data of the target project, setting the calculated correlation score to 0 if its p-value is greater than a cutoff threshold, and deriving an average of the correlation scores calculated for each of the matched metrics of the plurality of source projects and the target project.

12. The method of claim 1, wherein predicting the heterogeneous defects comprises:
building a logistic regression-based prediction model for each of the selected at least one source project, and predicting a defect label of the target project using the built logistic regression-based prediction model.

13. The method of claim 12, wherein predicting the heterogeneous defects comprises:
returning, through soft voting, the predicted probabilities from each logistic regression-based prediction model, and determining an average of the returned probabilities as a final predicted label.

14. A computer program stored in a computer-readable storage medium for executing a heterogeneous defect prediction method performed by a defect prediction system, the method comprising:
selecting at least one source project for ensemble learning by analyzing a correlation between a plurality of source projects and a target project; and
predicting heterogeneous defects of the target project through ensemble learning on the selected at least one source project.

15. A defect prediction system comprising:
a correlation analyzer configured to select at least one source project for ensemble learning by analyzing a correlation between a plurality of source projects and a target project; and
an ensemble learning unit configured to predict heterogeneous defects of the target project through ensemble learning on the selected at least one source project.