KR102500395B1

KR102500395B1 - Apparatus and method for repairing bug source code for program

Info

Publication number: KR102500395B1
Application number: KR1020210041052A
Authority: KR
Inventors: 이병정; 호혜민; 양근석
Original assignee: 서울시립대학교 산학협력단
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2023-02-16
Also published as: KR20220135414A

Abstract

An apparatus for correcting bugs in program source code may include: a preprocessing unit that preprocesses the program source code by adding a marker token to each of a plurality of buggy lines included in the program source code; a code conversion unit that converts the preprocessed program source code into code block-based source code; a learning unit that trains a bug correction model based on the converted code block-based source code; and a bug correction unit that corrects bugs in new program source code using the trained bug correction model.

Description

Apparatus and method for correcting bugs in program source code {APPARATUS AND METHOD FOR REPAIRING BUG SOURCE CODE FOR PROGRAM}

The present invention relates to an apparatus and method for correcting bugs in program source code.

Conventionally, when a bug occurs in a program, a bug report is generated, and a developer debugs the program based on the bug report and corrects the bug manually. A quality assurance engineer then verifies the corrected program, for example with test cases, and the verified program is finally released.

Recently, as programs have grown in size and complexity, program bugs have become more frequent. For this reason, research on program bug correction using machine learning is being actively conducted in the field of intelligent software.

Existing machine learning-based technologies related to program bug correction include DeepFix, GenProg, and jGenProg.

Here, DeepFix is an attention-based bug correction technology for compilation errors. Because DeepFix corrects only compilation errors, it cannot correct the various errors that occur at program execution.

GenProg is a program correction technology for the C language that uses genetic programming (GP). GenProg heuristically applies genetic operations to the abstract syntax tree (AST) of an individual program and uses a fitness function to evaluate the fitness of each generated program.

However, existing machine learning-based technologies still suffer from poor verification performance and poor program bug correction performance across diverse projects.

In this regard, FIGS. 2A and 2B are diagrams for explaining a bug correction result for DeepFix.

Referring to FIG. 2A, an inappropriate variable is used in "character % 64" on line 11, which causes a program bug.

Meanwhile, as shown in FIG. 2B, when the test case ("O Brother Where Art Thou?") is applied to DeepFix, "Check sum is F" should be output. However, because of the program bug on line 11 in FIG. 2A, an incorrect result ("Check sum is ") is output.

Korean Patent Laid-Open Publication No. 10-2019-0089615 (published July 31, 2019)

The present invention is intended to solve the above-described problems of the prior art. It converts program source code into code block-based source code and corrects bugs in new program source code using a bug correction model trained on the converted code block-based source code.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, an apparatus for correcting bugs in program source code according to a first aspect of the present invention may include: a preprocessing unit that preprocesses the program source code by adding a marker token to each of a plurality of buggy lines included in the program source code; a code conversion unit that converts the preprocessed program source code into code block-based source code; a learning unit that trains a bug correction model based on the converted code block-based source code; and a bug correction unit that corrects bugs in new program source code using the trained bug correction model.

A method for correcting bugs in program source code according to a second aspect of the present invention may include: preprocessing the program source code by adding a marker token to each of a plurality of buggy lines included in the program source code; converting the preprocessed program source code into code block-based source code; training a bug correction model based on the converted code block-based source code; and correcting bugs in new program source code using the trained bug correction model.

The above-described means for solving the problems are merely illustrative and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description.

According to any one of the above-described means for solving the problems, the present invention can convert program source code into code block-based source code and correct bugs in new program source code using a bug correction model trained on the converted code block-based source code.

Accordingly, the automated bug correction technology of the present invention can improve developer productivity and software quality.

In addition, because the bug correction technology of the present invention allows bug correction to proceed with less manpower and effort, software personnel can be managed flexibly, for example through the additional deployment of software development personnel.

FIG. 1 is a block diagram of a program bug correction apparatus according to an embodiment of the present invention.
FIGS. 2A and 2B are diagrams for explaining a conventional method of automatically correcting program bugs.
FIGS. 3A to 3C are diagrams for explaining a code block processing method according to an embodiment of the present invention.
FIGS. 4A to 4C are diagrams for explaining a method of correcting bugs in new program source code according to an embodiment of the present invention.
FIGS. 5A and 5B are diagrams for comparing a conventional baseline model with the bug correction model of the present invention.
FIG. 6 is a flowchart illustrating a method of correcting bugs in program source code according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar reference numerals denote similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, specific details for implementing the present invention are described with reference to the accompanying configuration diagrams and process flowcharts.

FIG. 1 is a block diagram of a program bug correction apparatus 10 according to an embodiment of the present invention.

Referring to FIG. 1, the program bug correction apparatus 10 may include a preprocessing unit 100, a code conversion unit 110, a learning unit 120, and a bug correction unit 130. Here, the bug correction unit 130 may include a candidate correction code derivation unit 132 and a final correction code derivation unit 134. However, the program bug correction apparatus 10 of FIG. 1 is only one implementation example of the present invention, and various modifications are possible based on the components shown in FIG. 1.

The preprocessing unit 100 may preprocess the program source code by adding a marker token to each of a plurality of buggy lines included in the program source code. Specifically, the preprocessing unit 100 may add marker tokens before and after each of the buggy lines.

For example, the preprocessing unit 100 places a <START> token, which signals the start, before the buggy line containing 'private static final double DEFAULT_EPSILON = 10e-p;' and appends an <END> token, which signals the end, after that buggy line. The line 'private static final double DEFAULT_EPSILON = 10e-p;' is thus preprocessed into '<START> private static final double DEFAULT_EPSILON = 10e-p; <END>'. Preprocessing the program source code in this way lets the bug correction model recognize bugs in the program source code easily and reduces the time spent searching for them.
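The marking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent does not say how buggy lines are located, so the sketch takes their 1-based line numbers as input, and the function name `mark_buggy_lines` and the surrounding `Precision` class are hypothetical.

```python
START, END = "<START>", "<END>"


def mark_buggy_lines(lines, buggy_line_numbers):
    """Wrap each buggy line (1-based numbering) in <START> ... <END> tokens.

    Assumption: the buggy line numbers are already known, for example from
    a bug report or from the diff of a fixing commit.
    """
    buggy = set(buggy_line_numbers)
    marked = []
    for number, line in enumerate(lines, start=1):
        if number in buggy:
            # Keep the original indentation and wrap only the statement text.
            indent = line[: len(line) - len(line.lstrip())]
            marked.append(f"{indent}{START} {line.strip()} {END}")
        else:
            marked.append(line)
    return marked


# Hypothetical input: only the DEFAULT_EPSILON line comes from the description.
source = [
    "public class Precision {",
    "    private static final double DEFAULT_EPSILON = 10e-p;",
    "}",
]
for line in mark_buggy_lines(source, [2]):
    print(line)
```

Lines that are not marked pass through unchanged, so the model still sees the full context around each marked line.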

In addition, since program source code is composed of natural language, adding marker tokens to the program source code allows the encoder and decoder of the bug correction model, described later, to omit natural language processing.

The code conversion unit 110 may convert the preprocessed program source code into code block-based source code. Here, code block-based source code is a code representation derived from the preprocessed program source code, in which each method, together with at least one member variable, constitutes one code block.

The code conversion unit 110 may extract the member variables and methods included in the preprocessed program source code.

For example, FIG. 3A illustrates an example of processing program source code into code blocks. The program source code shown in FIG. 3A copies every Java file in a target directory to a configured path. Referring to FIG. 3A, the program source code contains two member variables (the fromPath member variable on line 5 and the toPath member variable on line 6) and three methods (the main function on line 7, the filter function on line 10, and the copyFile function on line 13). The code conversion unit 110 may extract the two member variables and the three methods from the program source code.

The code conversion unit 110 may generate a plurality of code blocks by combining the extracted member variables and methods, and may convert the preprocessed program source code into code block-based source code based on the generated code blocks.

For example, referring to FIGS. 3A and 3B together, the code conversion unit 110 may generate a first code block by combining the fromPath member variable, the toPath member variable, and the copyFile function extracted from the program source code. The code conversion unit 110 may also generate a second code block by combining the extracted fromPath member variable, toPath member variable, and main function, and a third code block by combining the extracted fromPath member variable, toPath member variable, and filter function.

Subsequently, the code conversion unit 110 may convert the preprocessed program source code into code block-based source code based on the generated first, second, and third code blocks.

FIG. 3C illustrates the process of converting program source code into code block-based source code.

Referring to FIG. 3C, the code conversion unit 110 may identify the member variables (member variable 1, ..., member variable N) and the methods (method 1, method 2, ..., method M) included in the program source code (Frequency.java).

Subsequently, the code conversion unit 110 may combine the identified member variables and methods into code blocks, with one code block per method. That is, the number of code blocks may equal the number of methods.

A member variable, in contrast, may be included in two or more code blocks. For example, the code conversion unit 110 may generate a first code block including member variable 1, member variable N, and method 1, a second code block including member variable 1, member variable N, and method 2, and an M-th code block including member variable 1, member variable N, and method M. The code block-based source code generated from the first code block, the second code block, and the M-th code block is then fed to the bug correction model 30 as input and used to train the bug correction model 30.
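The one-block-per-method rule can be sketched as follows. This is a minimal illustration that assumes the member variables and methods have already been extracted as strings (the patent does not specify the parser), and the names `CodeBlock` and `build_code_blocks` as well as the placeholder declarations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeBlock:
    """One method together with the member variables of its class."""
    member_variables: tuple
    method: str

    def as_text(self):
        # Member variables first, then the method, one entry per line.
        return "\n".join((*self.member_variables, self.method))


def build_code_blocks(member_variables, methods):
    """Create exactly one code block per method.

    Every block carries the same member variables, so a member variable
    appears in as many blocks as there are methods.
    """
    shared = tuple(member_variables)
    return [CodeBlock(shared, method) for method in methods]


# Placeholder declarations; only the names follow the FIG. 3A example.
members = ["String fromPath;", "String toPath;"]
methods = [
    "void main() { /* body omitted */ }",
    "void filter() { /* body omitted */ }",
    "void copyFile() { /* body omitted */ }",
]
blocks = build_code_blocks(members, methods)
print(len(blocks))
print(blocks[2].as_text())
```

Duplicating the member variables costs some input length per block, but each block stays far shorter than the whole file, which is the property the training step relies on.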

The learning unit 120 may train the bug correction model based on the converted code block-based source code. Here, the bug correction model may be, for example, a model based on long short-term memory (LSTM). LSTM is a special type of recurrent neural network (RNN) designed to address the vanishing gradient problem that arises when learning long sequences.

When a bug correction model is trained on the entire program source code, the input sequence is very long, which makes training difficult. Conventional bug correction models therefore struggle to learn from such long input sequences.

According to the present invention, the bug correction model is trained on source code in code block units, which are shorter than the entire program source code, so that strong relational information is extracted between the marker tokens placed in each code block. Bugs are thereby recognized easily and the time spent searching for them is reduced, which can improve the performance of the bug correction model.

In addition, according to the present invention, when the bug correction model learns code block-based source code, it concentrates only on the code block and pays no attention to source code lines outside that code block, which minimizes noise from other source code lines.

In the present invention, the bug correction model may be configured as a recurrent neural network including an encoder and a decoder. Here, the encoder of the bug correction model may be a bidirectional encoder that integrates information on the marker tokens included in the code block-based source code. In the present invention, the encoder uses LSTM to process its input as a recurrent neural network.

The learning unit 120 may train the bug correction model so that it corrects bugs in the code block-based source code.

The bug correction unit 130 may correct bugs in new program source code using the trained bug correction model.

The candidate correction code derivation unit 132 may derive, from the trained bug correction model, at least one candidate correction code for a specific code block in the new program source code.

For example, referring to FIGS. 2A, 4A, and 4B together, when the program source code of FIG. 2A is input to the trained bug correction model 30, the candidate correction code derivation unit 132 may derive a first candidate correction code (sum = sum + (int) character;) and a second candidate correction code (character = (char) (sum + (int) characer);) for correcting the bug on line 9 of the program source code of FIG. 2A, and a third candidate correction code (remainder=(char) ((sum%64) + 22);) for correcting the bug on line 11 of the program source code of FIG. 2A.

The final correction code derivation unit 134 may input the at least one candidate correction code into a fitness measurement model to derive a correction fitness for each candidate correction code. Here, the fitness function used in the fitness measurement model may be expressed as [Equation 1].

[Equation 1]

$$\text{Value} = \frac{\text{Passed Testcases}}{\text{All Testcases}}$$

Here, Passed Testcases is the number of test cases that the candidate correction code passes, All Testcases is the total number of test cases, and Value is the correction fitness.

According to [Equation 1], the more test cases a candidate passes, the closer its correction fitness comes to a preset value (e.g., 1). When all test cases are passed, the correction fitness equals the preset value.
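As a small worked sketch, the fitness can be computed as below. It assumes that [Equation 1] is the plain ratio of passed test cases to all test cases, which is consistent with the description above (the value approaches 1 as more test cases pass and equals 1 when all pass). The function name `correction_fitness` is hypothetical.

```python
def correction_fitness(passed_testcases, all_testcases):
    """Correction fitness, assumed to be passed test cases / all test cases."""
    if all_testcases <= 0:
        raise ValueError("at least one test case is required")
    if not 0 <= passed_testcases <= all_testcases:
        raise ValueError("passed_testcases must lie between 0 and all_testcases")
    return passed_testcases / all_testcases


print(correction_fitness(3, 4))  # three of four test cases passed
print(correction_fitness(4, 4))  # every test case passed
```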

The final correction code derivation unit 134 may derive a final correction code from among the at least one candidate correction code based on the correction fitness of each candidate correction code.

For example, referring to FIGS. 4A and 4C, the final correction code derivation unit 134 may input the first, second, and third candidate correction codes into the fitness measurement model 32 to calculate the correction fitness of each of them.

Here, a candidate correction code whose correction fitness is close to the preset value (e.g., 1) may be judged to be a correction code suitable for correcting the program bug. If none of the candidate correction codes reaches the preset value, the candidate correction code derivation unit 132 may derive at least one candidate correction code for the specific code block from the new program source code again.

For example, referring to FIGS. 2B and 4B together, if "Check sum is ㅁ" is output when the first candidate correction code (sum = sum + (int) character;) and the second candidate correction code (character = (char) (sum + (int) characer);) are input to the fitness measurement model 32, and "Check sum is F" is output when the third candidate correction code (remainder=(char) ((sum%64) + 22);) is input to the fitness measurement model 32, the final correction code derivation unit 134 may derive the third candidate correction code as the final correction code.
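The selection and re-derivation logic described above can be sketched as follows. This is a hedged illustration rather than the patented implementation: `generate_candidates` and `run_tests` are hypothetical callables standing in for the bug correction model and the test harness, the fitness is assumed to be the ratio of passed to total test cases, and the `max_rounds` bound is an assumption, since the description does not say how often candidates may be re-derived.

```python
def select_final_correction(generate_candidates, run_tests, target=1.0, max_rounds=3):
    """Return the first best-scoring candidate that reaches the target fitness.

    generate_candidates() -> list of candidate correction code strings
    run_tests(candidate)  -> (passed, total) test case counts
    Returns None if no candidate reaches the target within max_rounds.
    """
    for _ in range(max_rounds):
        scored = []
        for candidate in generate_candidates():
            passed, total = run_tests(candidate)
            fitness = passed / total if total else 0.0
            scored.append((fitness, candidate))
        if scored:
            best_fitness, best = max(scored, key=lambda pair: pair[0])
            if best_fitness >= target:
                return best
        # No candidate was good enough: derive candidates again.
    return None


candidates = [
    "sum = sum + (int) character;",
    "character = (char) (sum + (int) characer);",
    "remainder=(char) ((sum%64) + 22);",
]
# Illustrative outcomes mirroring the example: only the third candidate
# produces the expected output for the single test case.
outcomes = {candidates[0]: (0, 1), candidates[1]: (0, 1), candidates[2]: (1, 1)}
final = select_final_correction(lambda: candidates, lambda c: outcomes[c])
print(final)
```

Bounding the number of rounds keeps the loop from running forever when the model cannot produce a passing candidate; a real system might instead use a time budget.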

The bug correction unit 130 may correct the bug in the new program source code by replacing the specific code block with the final correction code.

Those skilled in the art will readily understand that the preprocessing unit 100, the code conversion unit 110, the learning unit 120, the bug correction unit 130, the candidate correction code derivation unit 132, and the final correction code derivation unit 134 may each be implemented separately, or that one or more of them may be implemented in integrated form.

FIGS. 5A and 5B are diagrams for comparing a conventional bug correction model with the bug correction model of the present invention.

Referring to FIG. 5A, the conventional bug correction model is the model proposed in Z. Chen, S. J. Kommrusch, M. Tufano, L. N. Pouchet, D. Poshyvanyk, & M. Monperrus, "Sequencer: Sequence-to-sequence learning for end-to-end program repair", IEEE Transactions on Software Engineering, pp.1-19, 2019, which corrects program bugs using open source commit information on the basis of deep learning. The prediction accuracy of the conventional bug correction model and that of the bug correction model of the present invention were compared when correcting bugs in specific program source code after both models had been trained on the same data set. Here, the data set is CodRep, proposed in Z. Chen and M. Monperrus, "The CodRep Machine Learning on Source Code Competition," ArXiv eprints, Jul. pp.1-6, 2018. arXiv: 1807.03200 [cs.SE], and the specific program source code was also taken from part of CodRep.

As the graph in FIG. 5A shows (the X-axis denotes the number of training iterations and the Y-axis denotes accuracy), the code block-based bug correction model of the present invention shows an accuracy of about 85% in the comparison with the conventional bug correction model and corrects program bugs appropriately.

Referring to the graph of FIG. 5B (the X-axis denotes the type of program source code and the Y-axis denotes the number of corrected pieces of code), the bug correction model of the present invention corrected 898 bugs in the specific program source code. In contrast, the conventional bug correction model corrected 728 bugs in the same program source code.

These experimental results confirm that the code block-based bug correction model of the present invention performs considerably better than the conventional bug correction model.

FIG. 6 is a flowchart illustrating a method of correcting bugs in program source code according to an embodiment of the present invention.

Referring to FIG. 6, in step S601 the program bug correction apparatus 10 may preprocess the program source code by adding a marker token to each of a plurality of buggy lines included in the program source code.

In step S603, the program bug correction device 10 may convert the preprocessed program source code into code block-based source code.
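The conversion of step S603 can be illustrated with a short sketch. According to the claims, the code conversion unit extracts member variables and methods and combines them into code blocks, with as many blocks as there are extracted methods. The sketch assumes the extraction has already been done (a real implementation would need a Java parser) and assumes that every block carries all member variables; `CodeBlock`, `build_code_blocks`, and `render` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeBlock:
    """One unit of code block-based source: member variables plus one method."""
    member_variables: tuple[str, ...]
    method: str


def build_code_blocks(member_variables: list[str], methods: list[str]) -> list[CodeBlock]:
    """Combine the member variables with each method, yielding one block per method."""
    shared = tuple(member_variables)
    return [CodeBlock(shared, method) for method in methods]


def render(block: CodeBlock) -> str:
    """Flatten a block back into text, member variables first."""
    return "\n".join([*block.member_variables, block.method])


members = ["int count;"]
methods = ["void inc() { <BUG> count--; }", "int get() { return count; }"]

blocks = build_code_blocks(members, methods)
print(len(blocks) == len(methods))  # one block per extracted method
print(render(blocks[0]))
```

Pairing each method with the fields it may touch keeps every training example self-contained, which is one plausible reason for preferring blocks over whole files; the patent text here does not state the motivation.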

In step S605, the program bug correction device 10 may train a bug correction model based on the converted code block-based source code.

In step S607, the program bug correction device 10 may correct bugs in new program source code using the trained bug correction model.
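Step S607 can be sketched using the two sub-units named in the claims: a candidate correction code derivation unit (132) proposes candidates for a specific code block, and a final correction code derivation unit (134) scores each candidate with a fitness measurement model and derives a final correction code, which then replaces the block. Picking the highest-scoring candidate is an assumption about how the correction fitness is used; the function names and the toy scores are illustrative stand-ins for the trained models.

```python
from typing import Callable, Sequence


def select_final_correction(
    candidates: Sequence[str], fitness: Callable[[str], float]
) -> str:
    """Return the candidate with the highest fitness (the first one wins ties)."""
    if not candidates:
        raise ValueError("at least one candidate correction code is required")
    return max(candidates, key=fitness)


def replace_block(blocks: Sequence[str], index: int, final_code: str) -> list[str]:
    """Return a copy of `blocks` with the block at `index` replaced."""
    patched = list(blocks)
    patched[index] = final_code
    return patched


# Toy stand-ins: a real system would take candidates from the trained bug
# correction model and scores from the fitness measurement model.
toy_scores = {
    "void inc() { count++; }": 0.9,
    "void inc() { count += 2; }": 0.4,
    "void inc() { count = 0; }": 0.1,
}
code_blocks = ["void inc() { count--; }", "int get() { return count; }"]

final_code = select_final_correction(list(toy_scores), lambda c: toy_scores[c])
print(replace_block(code_blocks, 0, final_code))
```

Keeping candidate generation and final selection separate means the scoring model can be swapped without retraining the generator.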

In the foregoing description, steps S601 to S607 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable medium may also include any computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is provided for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

10: program bug correction device
100: preprocessing unit
110: code conversion unit
120: learning unit
130: bug correction unit
132: candidate correction code derivation unit
134: final correction code derivation unit

Claims

A program bug correction device for correcting bugs in program source code, comprising:
a preprocessing unit that preprocesses the program source code by adding a display token to each of a plurality of buggy lines included in the program source code;
a code conversion unit that converts the preprocessed program source code into code block-based source code;
a learning unit that trains a bug correction model based on the converted code block-based source code; and
a bug correction unit that corrects bugs in new program source code using the trained bug correction model.

The program bug correction device of claim 1, wherein the code conversion unit:
extracts member variables and methods included in the preprocessed program source code;
combines the extracted member variables and methods to generate a plurality of code blocks; and
converts the preprocessed program source code into the code block-based source code based on the generated plurality of code blocks.

The program bug correction device of claim 2, wherein the number of the plurality of code blocks is the same as the number of the extracted methods.

The program bug correction device of claim 1, wherein the bug correction model is a long short-term memory (LSTM) based model.

The program bug correction device of claim 1, wherein the learning unit performs learning, through the bug correction model, so as to correct bugs in the converted code block-based source code.

The program bug correction device of claim 1, wherein the bug correction unit comprises a candidate correction code derivation unit that derives, from the trained bug correction model, at least one candidate correction code for a specific code block of the new program source code.

The program bug correction device of claim 6, wherein the bug correction unit further comprises a final correction code derivation unit that inputs the at least one candidate correction code into a fitness measurement model to derive a correction fitness for each of the at least one candidate correction code, and derives a final correction code from among the at least one candidate correction code based on the correction fitness of each candidate correction code.

The program bug correction device of claim 7, wherein the bug correction unit corrects the bug in the new program source code by replacing the specific code block with the final correction code.

A program bug correction method for correcting bugs in program source code, comprising:
preprocessing the program source code by adding a display token to each of a plurality of buggy lines included in the program source code;
converting the preprocessed program source code into code block-based source code;
training a bug correction model based on the converted code block-based source code; and
correcting bugs in new program source code using the trained bug correction model.