KR102144044B1

KR102144044B1 - Apparatus and method for classification of true and false positives of weapon system software static testing based on machine learning

Info

Publication number: KR102144044B1
Application number: KR1020200007947A
Authority: KR
Inventors: 이주현; 신영섭; 조동래; 조규태
Original assignee: 엘아이지넥스원 주식회사
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2020-08-12

Abstract

The present invention relates to an apparatus, and a corresponding method, that applies machine learning to automate false alarm classification when false alarms are determined in software static testing. The apparatus comprises: a false alarm product database in which a product template for each type of software static test violation is recorded and stored; a data integration unit that merges data extracted from a plurality of sources; a data preprocessing unit that performs predetermined preprocessing so that the data merged by the data integration unit can be input to a machine learning algorithm; a model selection unit that separates the preprocessed merged data into a training set and a test set, performs cross-validation, and trains and evaluates a plurality of machine learning algorithms and hyperparameters to select the optimal machine learning algorithm and hyperparameters; a data prediction unit that receives the preprocessed merged data and applies the optimal machine learning algorithm and hyperparameters selected by the model selection unit to determine whether each static test violation is a true alarm or a false alarm; and a false alarm product creation unit that generates a false alarm report, by referring to template information in the false alarm product database, for each alarm determined to be false by the data prediction unit.

Description

Machine learning-based software static test false alarm classification apparatus and method {APPARATUS AND METHOD FOR CLASSIFICATION OF TRUE AND FALSE POSITIVES OF WEAPON SYSTEM SOFTWARE STATIC TESTING BASED ON MACHINE LEARNING}

The present invention relates to an apparatus and method that apply machine learning to false alarm determination in software static testing, automating false alarm classification so that static test results can be classified quickly into true and false alarms and the related work products can be prepared more efficiently.

As software accounts for a growing share of weapon system development, interest in software quality management is increasing both inside and outside the organization, and software static testing is performed as one of the activities supporting it.

Software static testing refers to the activity of correcting and resolving violations detected by an automated tool without executing the software. Because the test relies on the automated tool's own analysis model, it can report many violations to developers quickly so that they can act on them. However, because the results come from analysis based on that model rather than from actually executing the software, false alarms (false positives) occur frequently.

In other words, software static testing can identify meaningful potential defects in advance with relatively little effort compared with dynamic testing, but it produces many false positives, so developers spend excessive time and effort during test and evaluation deciding whether each alarm is false and writing reports on the false alarms.

Conventionally, false alarm determination after a software static test has relied on learning from a single kind of information, such as the source code AST (Abstract Syntax Tree), or on applying a per-rule false alarm checklist to the classification work.

However, when only source code AST information is used to determine false alarms, AST differences arising from the development environment and from developers' coding styles cannot be learned properly, so classification accuracy may drop; the approach is also tied to a specific automated tool and cannot be used generally. In addition, applying a per-rule false alarm checklist is not automatic and therefore takes a long time.

An object of one embodiment of the present invention is to provide an apparatus and method for automatically generating work products for each false alarm type, for automatic classification of false alarms in software static testing.

Another object of one embodiment of the present invention is to provide an apparatus and method for automatically generating work products for each false alarm type that shorten the time required for false alarm classification and improve efficiency by automating false alarm feature extraction.

However, the objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

A machine learning-based software static test false alarm classification apparatus according to one embodiment of the present specification includes: a false alarm product database in which a product template for each type of software static test violation is recorded and stored; a data integration unit that merges data extracted from a plurality of sources; a data preprocessing unit that performs predetermined preprocessing so that the data merged by the data integration unit can be input to a machine learning algorithm; a model selection unit that separates the preprocessed merged data into a training set and a test set, performs cross-validation, and trains and evaluates a plurality of machine learning algorithms and hyperparameters to select the optimal machine learning algorithm and hyperparameters; a data prediction unit that receives the preprocessed merged data and applies the optimal machine learning algorithm and hyperparameters selected by the model selection unit to determine whether each static test violation is a true alarm or a false alarm; and a false alarm product creation unit that generates a false alarm report, by referring to template information in the false alarm product database, for each alarm determined to be false by the data prediction unit.

Preferably, the data integration unit merges automation tool data extracted from a software static test automation tool, source code metric data extracted for source code structure analysis, source code general information data obtained from the source code AST (Abstract Syntax Tree), and development environment general information data entered by a user.

In addition, the data integration unit performs a first merge of the automation tool data and the source code metric data on the static test target file path and target function name; performs a second merge of the source code general information data with the first merged data on the file path, violation variable name, function name, and violation source code line; and then adds the development environment general information data to the second merged data.

Preferably, the data preprocessing unit performs: a first step of deleting data that is not used as a feature for machine learning and converting categorical data into preset character strings; a second step of applying a one-hot encoder to the categorical data and normalizing the numeric data; and a third step of performing vectorization.

Preferably, when the template information in the false alarm product database includes a manufacturer-supplied confirmation letter, the false alarm product creation unit adds the confirmation letter when generating the false alarm report.

Preferably, the false alarm report may include summary information on the violation determined to be a false alarm together with information on the author or verifier, mandatory entries that depend on the false alarm type, or a snippet of the source code in which the same false alarm type occurred.

Preferably, the model selection unit separates the training set and the test set at a predetermined ratio and applies 5-fold cross-validation (Cross Validation 5).

In addition, a machine learning-based software static test false alarm classification method according to one embodiment of the present specification includes: a first step of extracting data from a plurality of sources and merging it; a second step of performing predetermined preprocessing so that the merged data can be input to a machine learning algorithm; a third step of separating the preprocessed merged data into a training set and a test set, performing cross-validation, and training and evaluating a plurality of machine learning algorithms and hyperparameters to select the optimal machine learning algorithm and hyperparameters; a fourth step of receiving the preprocessed merged data and applying the optimal machine learning algorithm and hyperparameters selected as the optimal model to determine whether each static test violation is a true alarm or a false alarm; and a fifth step of generating a false alarm report, by referring to template information in a false alarm product database, for each alarm determined to be false.

Preferably, in the fifth step, when the template information in the false alarm product database includes a manufacturer-supplied confirmation letter, the confirmation letter is added when generating the false alarm report.

According to one embodiment of the present invention, an apparatus and method are provided for automatically generating work products for each false alarm type, for automatic classification of false alarms in software static testing.

In addition, according to one embodiment of the present invention, automating false alarm feature extraction shortens the time required for false alarm classification and improves efficiency.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a block diagram schematically showing the configuration of a machine learning-based software static test false alarm classification apparatus according to one embodiment of the present invention.
FIG. 2 is a reference diagram for explaining the data merging process in the machine learning-based software static test false alarm classification apparatus according to one embodiment of the present invention.
FIG. 3 is a reference diagram for explaining the configuration of the data preprocessing unit in the machine learning-based software static test false alarm classification apparatus according to one embodiment of the present invention.
FIG. 4 is a diagram showing a false alarm report generated by referring to template information in the false alarm product database in the machine learning-based software static test false alarm classification apparatus according to one embodiment of the present invention.
FIG. 5 is a reference diagram showing the structure of data obtained from a data extraction tool.
FIG. 6 is a flowchart showing, in sequence, a machine learning-based software static test false alarm classification method according to one embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the elements of each drawing, the same elements are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the subject matter of the invention. Although preferred embodiments are described below, the technical idea of the present invention is not limited to them and may be modified and implemented in various ways by those skilled in the art.

When weapon system software is developed, software (SW) reliability tests must be performed to ensure software reliability. Among these, the SW static test refers to the activity of correcting and resolving violations detected by an automated tool without executing the software. Because the SW static test relies on the automated tool's own analysis model, it can report many violations to developers quickly so that they can act on them. However, because the results come from analysis based on that model rather than from actually executing the software, false alarms (false positives) occur frequently, and the time and effort developers spend handling them has recently grown considerably. Moreover, the time and quality of false alarm determination and work product preparation vary widely with each developer's experience and skill, which strongly affects the overall SW reliability test.

Accordingly, for the machine learning-based software static test false alarm classification apparatus, an automated tool was taken as the target and the three violation types that occurred most frequently were selected; a method of designing and extracting features for false alarm classification and a method of applying machine learning algorithms are proposed. In addition, the software was implemented directly, so that much of the work product for violations classified as false alarms can be generated automatically.

FIG. 1 is a block diagram schematically showing the configuration of a machine learning-based software static test false alarm classification apparatus according to one embodiment of the present invention.

The machine learning-based software static test false alarm classification apparatus according to one embodiment of the present invention includes a data integration unit 10, a data preprocessing unit 20, a model selection unit 30, a data prediction unit 40, a false alarm product creation unit 50, and a false alarm product database 60.

The false alarm product database 60 records and stores a product template for each type of software static test violation.

The data integration unit 10 merges data extracted from a plurality of sources.

According to one embodiment of the present invention, the data integration unit 10 may extract and merge automation tool data extracted from a software static test automation tool, source code metric data extracted for source code structure analysis, source code general information data obtained from the source code AST (Abstract Syntax Tree), and development environment general information data entered by a user.

The data integration unit 10 may perform a first merge of the automation tool data and the source code metric data on the static test target file path and target function name, perform a second merge of the source code general information data with the first merged data on the file path, violation variable name, function name, and violation source code line, and then add the development environment general information data to the second merged data.

The data preprocessing unit 20 performs predetermined preprocessing so that the data merged by the data integration unit 10 can be input to a machine learning algorithm.

The data preprocessing unit 20 may perform a first step of deleting data that is not used as a feature for machine learning and converting categorical data into preset character strings, a second step of applying a one-hot encoder to the categorical data and normalizing the numeric data, and a third step of performing vectorization.

The model selection unit 30 separates the preprocessed merged data into a training set and a test set, performs cross-validation, and trains and evaluates a plurality of machine learning algorithms and hyperparameters to select the optimal machine learning algorithm and hyperparameters.

The model selection unit 30 may separate the training set and the test set at a predetermined ratio and apply 5-fold cross-validation (Cross Validation 5).

The data prediction unit 40 receives the preprocessed merged data and determines whether each static test violation is a true alarm or a false alarm; in doing so, it may apply the optimal machine learning algorithm and hyperparameters selected by the model selection unit 30.

For each alarm determined to be false by the data prediction unit 40, the false alarm product creation unit 50 generates a false alarm report by referring to template information in the false alarm product database 60.

When the template information in the false alarm product database 60 includes a manufacturer-supplied confirmation letter, the false alarm product creation unit 50 may add the confirmation letter when generating the false alarm report.

The false alarm report may include, but is not limited to, summary information on the violation determined to be a false alarm and information on the author or verifier.

The false alarm report may include mandatory entries that depend on the false alarm type.

The false alarm report may include a snippet of the source code in which the same false alarm type occurred.

FIG. 2 is a reference diagram for explaining the data merging process in the machine learning-based software static test false alarm classification apparatus according to one embodiment of the present invention. To obtain the features used for machine learning according to the present invention, data is extracted from four different sources. The features used for machine learning may be extracted according to data source as shown in Table 1 below.

Data source | Feature information | Purpose
General information on development environment | Compiler information, version information, etc. | Assess how general development environment information affects false alarm determination
Software static test automation tool results | Development language, violation name, false alarm status, etc. | Assess how software static test automation tool information affects false alarm determination
Source code metrics | Cyclomatic Complexity, Max Nesting Depth, etc. | Assess how source code structure information affects false alarm determination
Source code general information (llvm / clang) | Whether the variable is const, whether it is a pointer, variable scope, Predicate/Computational Usage information, etc. | Assess how information on the variable that caused the defect, its usage pattern, whether its validity is checked, etc. affects false alarm determination

The data obtained from the four sources comprises development environment general information data (D4) entered directly by the user, data extracted from the software static test automation tool (D1), general information data obtained from the source code AST (Abstract Syntax Tree) (D3), and source code metric data extracted for source code structure analysis (D2). As shown in FIG. 2, the data extracted from the different sources (D1, D2, D3, D4) becomes the final data set through three merge steps. First, the automation tool data (D1) and the source code metric data (D2) are merged on the static test target file path and target function name to give temporary merged data (TD). Next, the source code general information data (D3) and the temporary merged data (TD) are merged on the file path, violation variable name, function name, and violation source code line. Finally, the development environment general information data (D4) is added in bulk to the merged data, which completes the final merged data (UD).
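The three merge steps can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the implementation described in the patent: the column names (`file`, `function`, `variable`, `line`, `rule`, `cyclomatic`, `is_pointer`, `compiler`), the sample records, and the helper `merge_on` are hypothetical, and an inner join is assumed because the text does not say how unmatched records are handled.

```python
def merge_on(left, right, keys):
    """Inner-join two lists of dict records on the given key columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    merged = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged.append({**row, **match})
    return merged


# Hypothetical sample records standing in for the four sources D1..D4.
d1_tool = [{"file": "a.c", "function": "f", "variable": "p", "line": 10,
            "rule": "NULL_DEREF"}]
d2_metrics = [{"file": "a.c", "function": "f", "cyclomatic": 4}]
d3_ast = [{"file": "a.c", "function": "f", "variable": "p", "line": 10,
           "is_pointer": True}]
d4_env = {"compiler": "gcc"}

# Merge 1: D1 + D2 on file path and function name -> TD.
td = merge_on(d1_tool, d2_metrics, ["file", "function"])
# Merge 2: TD + D3 on file path, variable name, function name, source line.
td2 = merge_on(td, d3_ast, ["file", "variable", "function", "line"])
# Merge 3: add the environment information D4 to every record -> UD.
ud = [{**row, **d4_env} for row in td2]
print(ud)
```

Each resulting record in `ud` carries the tool, metric, AST, and environment columns side by side, which is the shape the preprocessing stage expects.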

FIG. 3 is a reference diagram for explaining the configuration of the data preprocessing unit in the machine learning-based software static test false alarm classification apparatus according to one embodiment of the present invention.

As described above, the data preprocessing unit 20 preprocesses the final merged data for input to a machine learning algorithm. First, Preprocess #1 deletes data that is not used as a feature and converts categorical data into human-readable strings. Next, Preprocess #2 applies a one-hot encoder to the categorical data and normalizes the numeric data. Finally, Preprocess #3 performs vectorization so that the data is suitable as input to the machine learning algorithm.
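The three preprocessing stages can be illustrated with a short standard-library sketch. It rests on several assumptions that the text does not state: min-max scaling is used as the normalization, categories are ordered alphabetically, and the function name `preprocess`, the column names, and the sample rows are hypothetical.

```python
def preprocess(rows, drop, categorical, numeric):
    """Return one numeric feature vector per input record."""
    # Preprocess #1: drop unused columns (categorical values become strings below).
    rows = [{k: v for k, v in r.items() if k not in drop} for r in rows]

    # Preprocess #2a: collect the category values of each categorical column.
    categories = {c: sorted({str(r[c]) for r in rows}) for c in categorical}
    # Preprocess #2b: collect min and max for min-max normalization (assumed).
    bounds = {n: (min(r[n] for r in rows), max(r[n] for r in rows))
              for n in numeric}

    vectors = []
    for r in rows:
        vec = []
        for c in categorical:
            # One-hot: 1.0 at the position of this record's category, else 0.0.
            vec += [1.0 if str(r[c]) == v else 0.0 for v in categories[c]]
        for n in numeric:
            lo, hi = bounds[n]
            vec.append((r[n] - lo) / (hi - lo) if hi > lo else 0.0)
        # Preprocess #3: the concatenated list is the vectorized sample.
        vectors.append(vec)
    return vectors


rows = [
    {"id": 1, "rule": "NULL_DEREF", "is_pointer": True, "cyclomatic": 2},
    {"id": 2, "rule": "DIV_ZERO", "is_pointer": False, "cyclomatic": 10},
    {"id": 3, "rule": "NULL_DEREF", "is_pointer": True, "cyclomatic": 6},
]
vectors = preprocess(rows, drop={"id"}, categorical=["rule", "is_pointer"],
                     numeric=["cyclomatic"])
for v in vectors:
    print(v)
```

In this sketch the statistics (category lists and min/max) are computed on the rows passed in; in a real pipeline they would normally be computed on the training set only and reused for new data.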

To select the optimal model, the preprocessed data set is divided into a training set and a test set by a train/test set separation device, and training and evaluation are performed with the prepared machine learning algorithms through a device that selects hyperparameters for each algorithm. The model with the best performance is then stored separately and may be used by the data prediction unit 40 to classify data.

According to one embodiment of the present invention, for accurate training, the input data set may be separated into a training set and a test set at a predetermined ratio, and 5-fold cross-validation (Cross Validation 5) may be applied to select the hyperparameters.

Here, the predetermined ratio may be set to 8:2, but it is not limited to this value and may be adjusted as needed.
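An 8:2 separation can be sketched as follows. The shuffling step, the fixed seed, and the function name `train_test_split` are assumptions made for illustration; the text specifies only the ratio.

```python
import random


def train_test_split(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and split it by train_ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(10))
train, test = train_test_split(samples)
print(len(train), len(test))
```

Changing `train_ratio` adjusts the split, which corresponds to the statement that the ratio may be tuned as needed.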

Here, cross-validation is used to tune the hyperparameters and select the optimal model. For example, k-fold cross-validation divides the training data set evenly into k groups (called folds) and designates (k - 1) of them as training folds and the remaining one as the validation fold. Validation is then performed k times in total, with a different fold held out for validation each time, and performance is measured in each round. When the k rounds are complete, the validation results for each hyperparameter setting are averaged, and the hyperparameters are tuned on that basis.
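The fold rotation can be made concrete with a small index generator. This sketch follows the usual formulation of k-fold cross-validation, in which the k - 1 folds that are not held out are used for fitting; the function name `k_fold_indices` and the contiguous, unshuffled fold layout are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is held out exactly once."""
    # Distribute the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


splits = list(k_fold_indices(10, 5))
for train_idx, val_idx in splits:
    print("validate on", val_idx, "train on", train_idx)
```

With 10 samples and k = 5, every round validates on 2 samples and fits on the other 8, and every sample is used for validation exactly once.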

The machine learning algorithms selected through cross-validation and their hyperparameter information are configured, training and evaluation are performed for each algorithm and hyperparameter setting, and the algorithm with the best evaluation metric (Accuracy) is selected and stored as the optimal model. The algorithm selected as the optimal model may be applied when identifying False Alarms. That is, the optimal machine learning algorithm and hyperparameters selected by the model selection unit 30 are used when the data prediction unit 40 receives the preprocessed integrated data and determines whether each static test violation is a True Alarm or a False Alarm.
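As a rough illustration of choosing the candidate with the best Accuracy, the sketch below compares toy classifiers on a small labelled set. The two "algorithms" (a threshold rule and a constant predictor), the single numeric feature, and the label encoding (1 for False Alarm, 0 for True Alarm) are all assumptions for illustration; the patent does not name the algorithms it evaluates.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def threshold_classifier(threshold):
    """Toy algorithm: predict False Alarm (1) when the feature exceeds threshold."""
    return lambda x: 1 if x > threshold else 0

def select_best_model(candidates, X_val, y_val):
    """Return the candidate (as a dict) with the highest accuracy on the data."""
    best = None
    for name, params, predict in candidates:
        acc = accuracy(y_val, [predict(x) for x in X_val])
        if best is None or acc > best["accuracy"]:
            best = {"name": name, "params": params,
                    "accuracy": acc, "predict": predict}
    return best

# Hypothetical evaluation data: one numeric feature per violation.
X_val = [0.1, 0.4, 0.6, 0.9]
y_val = [0, 0, 1, 1]
candidates = [("constant", {}, lambda x: 0)]
candidates += [("threshold", {"threshold": t}, threshold_classifier(t))
               for t in (0.2, 0.5, 0.8)]
best = select_best_model(candidates, X_val, y_val)
```

The returned dictionary keeps the `predict` callable, which stands in for the stored optimal model that the data prediction unit would later apply.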

Newly analyzed software static test violations are given as input to the data prediction unit 40, which determines whether each is a True Alarm or a False Alarm. For violations determined to be False Alarms, the product template information is loaded from the product database 60 and a report is generated by the prediction-result-based false alarm product creation unit 50. In addition, if a manufacturer-provided confirmation exists for the false alarm type, it may be added to the generated product.

FIG. 4 is a diagram showing a false alarm report generated by referring to the template information of the false alarm product database in the machine learning-based software static test false alarm classification apparatus according to an embodiment of the present invention.

As shown, the product is drafted by the prediction-result-based product creation unit 50 using the product template information loaded from the product database 60, and is organized into roughly three parts: summary information on the violation determined to be a False Alarm together with author/verifier information (410); content that must be written according to the false alarm type (420); and the source code snippets for every occurrence of the same false alarm type (430). A snippet is a programming term for a small, reusable piece of source code, machine code, or text that helps users avoid repetitive typing during routine editing.

Meanwhile, if a manufacturer-provided confirmation exists for the false alarm type, the generated product may be written to additionally include it. Here, a manufacturer-provided confirmation means a justification report for the false alarm that the manufacturer has officially acknowledged.

As described above, using the machine learning-based software static test false alarm classification apparatus according to an embodiment of the present invention saves time in classifying software static test results as True or False Alarms, saves time in preparing False Alarm-related products, and can also improve the quality of those products. Furthermore, continued machine learning and updates to the product database make it possible to reflect the latest trends in False Alarm handling.

FIG. 5 is a reference diagram showing the structure of the data obtained from the data extraction tools. The extracted data and feature information consist of the source of each data item (510), the data name (520), the data description (530), the data type (540), whether the item is used as a learning feature (550), and so on. Categorizing the extracted data in this way allows features to be extracted quickly and easily even when the data come from different sources.

FIG. 6 is a flowchart sequentially showing a machine learning-based software static test false alarm classification method according to an embodiment of the present invention.

First, data is extracted from a plurality of sources (S601), and the extracted data is merged (S603).

Here, the data obtained from the plurality of data sources include development environment general information data entered directly by the user (D4), data extracted from a software static test automation tool (D1), general information data obtained from the source code AST (Abstract Syntax Tree) (D3), and source code metric data extracted for source code structure analysis (D2). The data extracted from the different sources (D1, D2, D3, D4) are combined into the final data through three merge procedures.
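The three merge procedures can be sketched in plain Python using the merge keys named in the claims: file path and function name for D1 and D2, then file path, violation variable name, function name, and violation line for D3, with D4 attached last. The column names, the sample records, and the choice of an inner join are assumptions for illustration; the patent does not specify the join type.

```python
def merge_on(left_rows, right_rows, keys):
    """Inner-join two lists of dicts on the given key columns."""
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    merged = []
    for row in left_rows:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged.append({**match, **row})
    return merged

# Hypothetical records for each source.
d1 = [{"file": "nav.c", "function": "update", "variable": "buf",
       "line": 42, "rule": "R1"}]                      # static test tool output
d2 = [{"file": "nav.c", "function": "update", "cyclomatic": 7}]  # code metrics
d3 = [{"file": "nav.c", "function": "update", "variable": "buf",
       "line": 42, "var_type": "array"}]               # AST general information
d4 = {"compiler": "gcc", "language": "C"}              # user-entered environment

step1 = merge_on(d1, d2, ["file", "function"])
step2 = merge_on(step1, d3, ["file", "variable", "function", "line"])
final = [{**row, **d4} for row in step2]  # D4 applies to every record
```

Because D4 describes the development environment as a whole, the sketch simply copies it onto every merged record rather than joining on a key.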

Thereafter, predetermined data preprocessing is performed so that the merged data can be input to a machine learning algorithm (S605). The data preprocessing may include: deleting data that is not used as a feature for machine learning and converting categorical data into preset strings; applying a One-hot Encoder to categorical data and performing normalization on numeric data; and finally performing vectorization.
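A minimal sketch of these preprocessing steps is shown below, assuming the categorical values are already readable strings. Min-max scaling is used as the normalization here purely as an assumption, since the patent does not say which normalization is applied; the column names and records are hypothetical.

```python
def one_hot(value, categories):
    """Encode value as a 0/1 list with one position per known category."""
    return [1.0 if value == c else 0.0 for c in categories]

def min_max(value, lo, hi):
    """Scale value into [0, 1]; constant columns map to 0.0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def preprocess(rows, categorical, numeric):
    """Turn dict records into numeric feature vectors.

    Columns not listed in categorical or numeric are dropped, which
    corresponds to deleting data that is not used as a feature.
    """
    cats = {c: sorted({r[c] for r in rows}) for c in categorical}
    bounds = {n: (min(r[n] for r in rows), max(r[n] for r in rows))
              for n in numeric}
    vectors = []
    for r in rows:
        vec = []
        for c in categorical:
            vec.extend(one_hot(r[c], cats[c]))
        for n in numeric:
            vec.append(min_max(r[n], *bounds[n]))
        vectors.append(vec)  # one flat vector per record (vectorization)
    return vectors

# Hypothetical merged records; "id" is not a feature and is dropped.
rows = [
    {"id": 1, "var_type": "array", "cyclomatic": 2},
    {"id": 2, "var_type": "pointer", "cyclomatic": 10},
    {"id": 3, "var_type": "array", "cyclomatic": 6},
]
vectors = preprocess(rows, categorical=["var_type"], numeric=["cyclomatic"])
```

In practice the category lists and the min/max bounds would be computed on the training data and reused at prediction time, so that new violations are encoded consistently.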

Here, the preset strings are human-readable strings, that is, strings that a user of the false alarm classification apparatus can read, but they are not limited thereto.

The integrated data preprocessed in step S605 is split into a Train Set and a Test Set (S607), and cross-validation is performed.

According to an embodiment of the present invention, for accurate learning, the input data set may be split into a Train Set and a Test Set at a predetermined ratio, and 5-fold cross-validation (Cross Validation 5) may be applied as the cross-validation for selecting hyperparameters. Here, the predetermined ratio may be set to 8:2, but it is not limited thereto and may be adjusted as needed.

Thereafter, the selected machine learning algorithms and their hyperparameter information are configured (S609), and machine learning evaluation is performed for each algorithm and hyperparameter setting (S611).

Thereafter, the algorithm with the best evaluation metric (Accuracy) is selected and stored as the optimal model (S613). The algorithm selected as the optimal model may be applied when identifying False Alarms. That is, the optimal machine learning algorithm and hyperparameters selected by the model selection unit 30 are used when the data prediction unit 40 receives the preprocessed integrated data and determines whether each static test violation is a True Alarm or a False Alarm.

Meanwhile, the integrated data preprocessed in step S605 is received, and the optimal machine learning algorithm and hyperparameters selected as the optimal model through steps S607 to S613 are applied to determine whether each static test violation is a True Alarm or a False Alarm (S621).

For violations determined to be False Alarms in step S621, a false alarm report is generated by referring to the template information of the false alarm product database (S623).

According to an embodiment of the present invention, the template information is the information entered into the report template. It is defined separately, and the shape, style, and so on of the template can be reconfigured as needed.

At this time, it is checked whether the template information of the false alarm product database includes a manufacturer-provided confirmation (S627), and if one exists, it is added and the false alarm report is finalized (S627).

The report generated in step S623 may include summary information on the violation determined to be a false alarm, author or verifier information, required content according to the false alarm type, and snippets of the source code in which the same false alarm type occurred. The generated report is not limited to including the summary information and the author or verifier information, and may further include additional information.
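The report assembly can be sketched with Python's standard `string.Template`. The template text, the `template_db` layout, the field names, and the identifier `CONF-001` are hypothetical; the sketch only shows the three report parts (summary with author/verifier, required content per false alarm type, and snippets) plus the optional manufacturer-provided confirmation.

```python
from string import Template

# Hypothetical report layout; the patent does not specify the template text.
REPORT_TEMPLATE = Template(
    "False Alarm Report\n"
    "Type: $alarm_type\n"
    "Author: $author / Verifier: $verifier\n"
    "Required content: $required\n"
    "Snippets:\n"
    "$snippets\n"
)

def build_report(alarm_type, violations, template_db, author, verifier):
    """Fill the template for one false alarm type and its violations."""
    entry = template_db[alarm_type]
    snippets = "\n".join(
        f"  {v['file']}:{v['line']}: {v['snippet']}" for v in violations
    )
    report = REPORT_TEMPLATE.substitute(
        alarm_type=alarm_type, author=author, verifier=verifier,
        required=entry["required"], snippets=snippets,
    )
    # Append the manufacturer-provided confirmation only when one exists.
    confirmation = entry.get("manufacturer_confirmation")
    if confirmation:
        report += f"Manufacturer confirmation: {confirmation}\n"
    return report

# Hypothetical template database entry and violation list.
template_db = {
    "unused-variable": {
        "required": "Explain why the variable is intentionally unused.",
        "manufacturer_confirmation": "CONF-001",
    }
}
violations = [{"file": "nav.c", "line": 42, "snippet": "int buf[8];"}]
report = build_report("unused-variable", violations, template_db,
                      author="A. Author", verifier="V. Verifier")
```

Grouping all violations of one false alarm type into a single report mirrors the statement that the report contains the snippets for every occurrence of the same type.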

Even though all the components of the embodiments described above are described as being combined into one or operating in combination, the present invention is not necessarily limited to these embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated. In addition, although each component may be implemented as a single independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. Such a computer program may be stored on computer-readable media such as a USB memory, a CD, or a flash memory, and read and executed by a computer to implement an embodiment of the present invention. The recording media for the computer program may include magnetic recording media, optical recording media, and the like.

In addition, all terms, including technical and scientific terms, have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs, unless otherwise defined in the detailed description. Commonly used terms, such as those defined in dictionaries, should be interpreted as consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present invention.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention belongs will be able to make various modifications, changes, and substitutions without departing from its essential characteristics. Accordingly, the embodiments disclosed herein and the accompanying drawings are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments or drawings. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

10: data integration unit
20: data preprocessing unit
30: model selection unit
40: data prediction unit
50: false alarm product creation unit
60: product database

Claims

A machine learning-based software static test false alarm classification apparatus comprising:
a false alarm product database in which product templates for each type of software static test violation are recorded and stored;
a data integration unit that merges data extracted from a plurality of sources;
a data preprocessing unit that performs predetermined data preprocessing so that the data merged by the data integration unit can be input to a machine learning algorithm;
a model selection unit that separates the preprocessed integrated data into a training set and a test set, performs cross-validation, and performs learning evaluation for a plurality of machine learning algorithms and hyperparameters to select a machine learning algorithm and hyperparameters;
a data prediction unit that receives the preprocessed integrated data and applies the machine learning algorithm and hyperparameters selected by the model selection unit to determine whether each static test violation is a true alarm or a false alarm; and
a false alarm product creation unit that, when the data prediction unit determines a violation to be a false alarm, generates a false alarm report for that false alarm by referring to the template information of the false alarm product database.

The apparatus of claim 1,
wherein the data integration unit merges automation tool data extracted from a software static test automation tool, source code metric data extracted for source code structure analysis, source code general information data obtained from the source code AST (Abstract Syntax Tree), and development environment general information data input by a user.

The apparatus of claim 2,
wherein the data integration unit:
first merges the automation tool data and the source code metric data based on the static test target file path and target function name;
secondly merges the source code general information data with the first merged data based on the file path, violation variable name, function name, and violation source code line; and
additionally merges the development environment general information data with the secondly merged data.

The apparatus of claim 1,
wherein the data preprocessing unit performs:
a first step of deleting data that is not used as a feature for machine learning and converting categorical data into preset strings;
a second step of applying a one-hot encoder to categorical data and performing normalization on numeric data; and
a third step of performing vectorization.

The apparatus of claim 1,
wherein the false alarm product creation unit generates the false alarm report with a manufacturer-provided confirmation added when the template information of the false alarm product database includes the confirmation.

The apparatus of claim 1,
wherein the false alarm report includes summary information on the violation determined to be a false alarm and author or verifier information.

The apparatus of claim 1,
wherein the false alarm report includes required content according to the false alarm type.

The apparatus of claim 1,
wherein the false alarm report includes a snippet of the source code in which the same false alarm type occurred.

The apparatus of claim 1,
wherein the model selection unit separates the training set and the test set at a predetermined ratio and applies the cross-validation.

A machine learning-based software static test false alarm classification method comprising:
a first step of extracting data from a plurality of sources and merging the data;
a second step of performing predetermined data preprocessing so that the merged data can be input to a machine learning algorithm;
a third step of separating the preprocessed integrated data into a training set and a test set, performing cross-validation, and performing learning evaluation for a plurality of machine learning algorithms and hyperparameters to select a machine learning algorithm and hyperparameters;
a fourth step of receiving the preprocessed integrated data and applying the selected machine learning algorithm and hyperparameters to determine whether each static test violation is a true alarm or a false alarm; and
a fifth step of, when a violation is determined to be a false alarm, generating a false alarm report for that false alarm by referring to template information of a false alarm product database.

The method of claim 10,
wherein the first step merges automation tool data extracted from a software static test automation tool, source code metric data extracted for source code structure analysis, source code general information data obtained from the source code AST (Abstract Syntax Tree), and development environment general information data input by a user.

The method of claim 11,
wherein the automation tool data and the source code metric data are first merged based on the static test target file path and target function name,
the source code general information data and the first merged data are secondly merged based on the file path, violation variable name, function name, and violation source code line, and
the development environment general information data is additionally merged with the secondly merged data.

The method of claim 10,
wherein the second step comprises:
deleting data that is not used as a feature for machine learning and converting categorical data into preset strings;
applying a one-hot encoder to categorical data and performing normalization on numeric data; and
performing vectorization.

The method of claim 10,
wherein, in the fifth step, the false alarm report is generated with a manufacturer-provided confirmation added when the template information of the false alarm product database includes the confirmation.

The method of claim 10,
wherein the false alarm report of the fifth step includes summary information on the violation determined to be a false alarm and author or verifier information.

The method of claim 10,
wherein the false alarm report of the fifth step includes required content according to the false alarm type.

The method of claim 10,
wherein the false alarm report of the fifth step includes a snippet of the source code in which the same false alarm type occurred.

The method of claim 10,
wherein, in the second step, the training set and the test set are separated at a predetermined ratio and the cross-validation is applied.

A computer-readable recording medium storing a program for implementing the method according to any one of claims 10 to 18.