KR20210084873A

KR20210084873A - Method for constructing patched source code from assembly code and apparatus therefor

Info

Publication number: KR20210084873A
Application number: KR1020190177356A
Authority: KR
Inventors: 오희국; 사미울라
Original assignee: 한양대학교 에리카산학협력단
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2021-07-08
Also published as: KR102282705B1

Abstract

Disclosed are a method and an apparatus for constructing patched source code from assembly code. A method according to an embodiment of the present invention includes: when a first function is changed to a second function, assembling the source code of the first function to obtain the assembly code of the first function; extracting changes based on the difference between the assembly code of the first function and the assembly code of the second function; converting the changes in the assembly code into pseudocode by mapping the extracted changes to pseudocode; and constructing the source code of the second function by patching the source code of the first function using the converted pseudocode.

Description

{Method for constructing patched source code from assembly code and apparatus therefor}

The present invention relates to techniques for constructing patched source code from assembly code, and more particularly to a method and apparatus that can produce patched source code by extracting meaningful changes from assembly instructions, mapping them to their pseudocode, and stitching those changes into the source code of the previous version.

Vulnerabilities are continuously discovered and quickly patched as part of the software development process. Knowing the cause of a vulnerability is essential for the many open source projects that are likely to be used in other projects. One such project is the Apple XNU kernel. There is a large time gap between the release of a binary patch and the release of the corresponding source code patch, and the cause of a vulnerability can be learned only through binary analysis. However, interpreting code changes at the assembly level is error-prone because of compiler optimizations, and assembly code is not human-friendly.

After a binary patch is released for a vulnerability, the updated binary is available but the updated source code is not. The research problem is how to better represent the updated binary at the source code level. A binary can be disassembled into assembly code, but analyzing changes at the assembly level is error-prone because of compiler optimizations (applied while translating source code into assembly code) and the unfriendly nature of assembly. To address this representation problem, all previous studies have proposed using a decompiler to convert the assembly code into pseudocode.

A decompiler focuses on generating compilable pseudocode. A decompiler such as Hex-Rays can transform the assembly code of a function into pseudocode that is functionally equivalent to the original source code. Assembly code loses variable names and the definitions of structures, typedefs, and the like, so the generated pseudocode has the same deficiency. To cope with this limitation, the decompiler randomly generates variable names and type information. For a large project such as the XNU kernel, which has many structures and typedefs, the pseudocode generated by the decompiler looks structurally different from the original source code. Although the pseudocode is 65-75% similar to the source code, it is not a good enough representation for generating source code patches. To date, pseudocode has nevertheless been regarded as the best representation of assembly code after conversion.

Embodiments of the present invention provide a method and apparatus that can produce patched source code by extracting meaningful changes from assembly instructions, mapping them to their pseudocode, and stitching those changes into the source code of the previous version.

Embodiments of the present invention provide a method and apparatus for expressing binary updates at the source code level.

A method for constructing patched source code from assembly code according to an embodiment of the present invention includes: when a first function is changed to a second function, assembling the source code of the first function to obtain the assembly code of the first function; extracting changes based on the difference between the assembly code of the first function and the assembly code of the second function; converting the changes in the assembly code into pseudocode by mapping the extracted changes to pseudocode; and constructing the source code of the second function by patching the source code of the first function using the converted pseudocode.

The extracting of the changes may filter out optimization-based changes so that only the actual changes are extracted, and the converting into pseudocode may convert only those actual changes into pseudocode.

The extracting of the changes may extract line-by-line changes in the assembly code by applying text-based diffing to the assembly code of the first function and the assembly code of the second function.

The constructing of the source code of the second function may search for the converted pseudocode in an abstract syntax tree (AST) of the source code of the first function to identify the index of the patch location, and may stitch the pseudocode in at the identified index to patch the source code of the first function, thereby constructing the source code of the second function.

The constructing of the source code of the second function may identify the index of the patch location based on two searches: a structural matching based search, which finds the structural similarity between the converted pseudocode and the abstract syntax tree and selects the index with the highest code similarity ratio; and an aggressive search, which slides a preset window over the abstract syntax tree in a depth-first search, computes the similarity ratio of the pseudocode, and selects the node of the abstract syntax tree with the highest ratio as the match.

An apparatus for constructing patched source code from assembly code according to an embodiment of the present invention includes: an acquisition unit that, when a first function is changed to a second function, assembles the source code of the first function to obtain the assembly code of the first function; an extraction unit that extracts changes based on the difference between the assembly code of the first function and the assembly code of the second function; a conversion unit that converts the changes in the assembly code into pseudocode by mapping the extracted changes to pseudocode; and a construction unit that constructs the source code of the second function by patching the source code of the first function using the converted pseudocode.

The extraction unit may filter out optimization-based changes so that only the actual changes are extracted, and the conversion unit may convert only those actual changes into pseudocode.

The extraction unit may extract line-by-line changes in the assembly code by applying text-based diffing to the assembly code of the first function and the assembly code of the second function.

The construction unit may search for the converted pseudocode in an abstract syntax tree (AST) of the source code of the first function to identify the index of the patch location, and may stitch the pseudocode in at the identified index to patch the source code of the first function, thereby constructing the source code of the second function.

The construction unit may identify the index of the patch location based on two searches: a structural matching based search, which finds the structural similarity between the converted pseudocode and the abstract syntax tree and selects the index with the highest code similarity ratio; and an aggressive search, which slides a preset window over the abstract syntax tree in a depth-first search, computes the similarity ratio of the pseudocode, and selects the node of the abstract syntax tree with the highest ratio as the match.

According to embodiments of the present invention, patched source code can be produced by extracting meaningful changes from assembly instructions, mapping them to their pseudocode, and stitching those changes into the source code of the previous version. In addition, the present invention can better represent binary updates at the source code level.

Stitching a single function is not straightforward and involves many challenges; the code stitching technique of the present invention handles them and constructs the patched source code efficiently.

The present invention is useful for open source software whose source code is widely used as a dependency by other software products. Consider a scenario in which a vulnerability is found in such software and a binary update is released to patch it, but the source code release is delayed, so there is no updated source code with which to patch the dependent software. For rapid security patching, the source code of the updated version must be constructed from the pre-update and post-update binaries and the source code of the previous version. Otherwise, the dependent software remains exposed to the discovered vulnerability. The code stitching technique of the present invention can handle this scenario and efficiently construct the source code of the updated version, which the dependent software can then use.

The present invention can be applied to the Apple XNU kernel. XNU is an open source kernel that is widely used in many academic and military research projects. When a vulnerability is discovered, Apple releases a binary patch quickly but deliberately delays the source code release. When both the binary and the source code of XNU from before the update are available, the technique of the present invention can construct the source code of the updated version of XNU.

FIG. 1 shows a code stitching method according to an embodiment of the present invention.
FIG. 2 shows an example of an extracted change object and an example of the pseudocode version of the change object.
FIG. 3 shows a flowchart of an embodiment of the structural matching based search algorithm.
FIG. 4 shows a flowchart of an embodiment of the aggressive search algorithm.
FIG. 5 shows an example of the result of the method of the present invention for a case in which code is deleted.
FIG. 6 shows an example of the result of the method of the present invention for a case in which new code is added.
FIG. 7 shows an example of the result of the method of the present invention for a case in which deletion and addition are combined.

Advantages and features of the present invention, and methods of achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be embodied in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention. The present invention is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. As used herein, the singular also includes the plural unless specifically stated otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more other components, steps, operations, and/or elements besides those mentioned.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and repeated descriptions of the same components are omitted.

The gist of the embodiments of the present invention is to produce patched source code by extracting meaningful changes from assembly instructions, mapping them to their pseudocode, and stitching those changes into the source code of the previous version.

In doing so, the present invention can better represent binary updates at the source code level.

An object of the present invention is to better represent binary updates at the source code level. Pseudocode does not meet this requirement for the following reasons.

Structural difference

The source code and the pseudocode of a function are functionally similar but structurally different. The control flow graph (CFG) often differs because of compiler optimizations. For example, an "if" block is often converted into an "if ... else" block, which changes the CFG structure. Beyond the CFG, variable names and type definitions are missing, so the statements of the source code and the pseudocode differ considerably. These structural differences make analysis difficult and error-prone.

Macro functions and function inlining

During the preprocessing stage of compilation, macro functions and inline function calls are replaced by their definitions. The compiler does this to optimize the assembly code and reduce function call overhead. As a result, the pseudocode generated from the assembly code is structurally equivalent to the source code after compiler preprocessing. The pseudocode therefore looks different from the original source code, and it is difficult to distinguish the actual changes made by a security patch from optimization-based changes.

The present invention is described in detail as follows.

Suppose there is a function A that is modified by some update, and let A' denote the modified function. Only the source code and assembly code of function A and the assembly code of function A' are available. The goal of the present invention is to represent function A' at the source code level, as a modified version of the source code of function A.

As noted above, pseudocode is not the best representation of function A at the source code level, so a better representation is needed. Observation shows that the changes made to a function by an update are usually very small compared to the size of the function. Rather than adopting the pseudocode of the complete function, it is therefore better to focus on the small changes and stitch the pseudocode of those changes into the source code of function A to obtain the source code of A'. This is a feasible solution, and the present invention calls it code stitching.

Code stitching in the present invention means patching the source code using the pseudocode of the changes extracted from the assembly code difference. Code stitching is not as simple as it sounds and must satisfy three main requirements.

First, the source code of function A must be in an editable form that can be modified precisely at an arbitrary location (index). Direct text search is error-prone and imprecise, so it cannot be used. An abstract syntax tree (AST) is a representation of source code that meets this requirement exactly.

Second, the changes made to a function by an update are often small compared to the whole function. The purpose of code stitching is to stitch in only the changed parts, but the binary diffing output is assembly code, which cannot be stitched into source code. The assembly changes therefore need to be converted into pseudocode by mapping assembly code to pseudocode.

Third, once the changes are available as pseudocode, the pseudocode is searched for in the source code AST to find the patch location. The pseudocode is not an exact copy of the source code and is structurally different, so an efficient algorithm is needed that finds the best match in the AST and returns the index of the matching node.
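The third requirement can be illustrated with a minimal Python sketch. It loosely follows the sliding-window idea described for the aggressive search, under several assumptions that are not part of the description: the AST is represented by a flat list of statement strings standing in for its nodes in depth-first order, the similarity ratio is computed with `difflib.SequenceMatcher`, and the names `similarity` and `best_match_index` are hypothetical.

```python
import difflib

def similarity(a, b):
    """Similarity ratio in [0, 1] between two code fragments (assumed measure)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def best_match_index(statements, pseudo_lines):
    """Slide a window over the statements and return (index, ratio) of the best match.

    `statements` stands in for AST nodes visited in depth-first order.
    Returns (-1, 0.0) when no window has any similarity to the pseudocode.
    """
    window = len(pseudo_lines)
    target = "\n".join(pseudo_lines)
    best_index, best_ratio = -1, 0.0
    for i in range(len(statements) - window + 1):
        candidate = "\n".join(statements[i:i + window])
        ratio = similarity(candidate, target)
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index, best_ratio

# Hypothetical source statements of function A and decompiler-style context lines.
statements = ["a = 1;", "if (n > 0)", "copy(dst, src, n);", "return n;"]
pseudo = ["if (v3 > 0)", "copy(a1, a2, v3);"]
index, ratio = best_match_index(statements, pseudo)
```

Because the decompiler invents variable names, an exact text match is not expected; selecting the window with the highest ratio tolerates such differences, at the cost of possible false matches when several regions of a function look alike.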

Each of these requirements poses particular challenges that must be handled properly. With those challenges in mind, the present invention provides a code stitching method as shown in FIG. 1. Depending on the input domain, the code stitching method of the present invention can be divided into an assembly processing module, a source code module, and a stitching step.

Each component is modular, and the details of each module are as follows.

Editable AST (source code):

To modify the source code of function A, it must be in an editable form such as an AST. A popular utility that can parse source code and generate an AST is clang (clang.cindex), but the clang parser does not generate a complete AST. For nested loops or nested if blocks, clang does not include the nested blocks in the AST, so the generated AST is incomplete. clang can also analyze source files that cannot be compiled. This seems like an advantage, but it is insufficient for the present invention. The present invention must preprocess the function source code, including macro functions and inline function definitions, to prepare a source file equivalent to the assembly code. What the present invention requires is a parser that takes source code preprocessed by the compiler and then generates a complete AST.

1) Challenges

For a parser, completeness is guaranteed when the source code compiles. The binary diffing output is a list of functions, so only the function names are available when the source code is to be parsed. A single function, taken on its own, may use macro functions, typedefs, structures, and other user-defined data types that are not defined within the function source code. This missing information makes the single function uncompilable, and parsing such a function yields an incomplete AST. clang can parse functions that do not compile, so it does not guarantee completeness. Another widely used parser, "pycparser", guarantees completeness and parses only compilable functions.

To make a single function compilable when much information is missing, all missing definitions must be included. The present invention provides a framework that automatically prepares the missing definitions and makes the single function compilable.

Parsing Framework

The present invention parses a single function and makes it compilable. Both tasks require efficient and fast access to the definitions of XNU kernel functions, structures, typedefs, and so on. An efficient solution is to parse and extract all definitions into a database and later access them through SQL queries. The present invention may include a "source code parsing" module that uses clang to extract all definitions of the XNU kernel into a single database.

FIG. 1 shows a code stitching method according to an embodiment of the present invention. It can be divided into three main blocks: 1) operations on assembly code, 2) source code parsing of a single function and AST generation, and 3) final stitching of the change objects in pseudocode form. The final stitching produces the modified version. The procedure is as follows.

Given a function name, its source code is fetched from the database (DB). The source code is split into tokens, and a list of keywords for data types and macro calls is extracted. The definitions of the missing keywords are fetched recursively from the database, and all of them are written to a separate header file. Once all missing definitions are included, the function is compilable. Finally, the function source is parsed by a parser, for example "pycparser". The output is a clean, editable AST that can be modified and written back out as source code.
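The recursive lookup of missing definitions can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the definitions database, the table `definitions(name, body)` and the sample C snippets are hypothetical, and identifiers are found with a simple regular expression rather than a real tokenizer.

```python
import re
import sqlite3

# Stand-in for the definitions database; the schema and rows are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE definitions (name TEXT PRIMARY KEY, body TEXT)")
db.executemany(
    "INSERT INTO definitions VALUES (?, ?)",
    [
        ("buf_t", "typedef struct buf { size_t len; char *data; } buf_t;"),
        ("size_t", "typedef unsigned long size_t;"),
        ("MAX_LEN", "#define MAX_LEN 64"),
        ("check_len", "int check_len(buf_t *b) { return b->len <= MAX_LEN; }"),
    ],
)

IDENT = re.compile(r"[A-Za-z_]\w*")

def lookup(name):
    """Fetch the stored definition for a name, or None if it is not in the DB."""
    row = db.execute(
        "SELECT body FROM definitions WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

def collect_definitions(func_name):
    """Return the definitions a function depends on, dependencies first."""
    source = lookup(func_name)
    if source is None:
        raise KeyError(func_name)
    ordered, seen = [], {func_name}

    def visit(body):
        for ident in IDENT.findall(body):
            if ident in seen:
                continue
            seen.add(ident)
            definition = lookup(ident)
            if definition is not None:
                visit(definition)           # resolve nested dependencies first
                ordered.append(definition)

    visit(source)
    return ordered

# Text of the separate header that would make check_len compilable on its own.
header_text = "\n".join(collect_definitions("check_len"))
```

Emitting dependencies before the definitions that use them keeps the generated header valid C. The resulting header plus the function source would then be handed to the parser; that step is omitted here because it depends on a third-party package.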

Assembly diffing

The binary diffing output is a list of partially changed functions, and the present invention can "diff" their assembly code to extract the details of the modification. Given the assembly code of function A and function A', text-based diffing extracts the line-by-line changes in the assembly code. The present invention may use the "difflib" Python module to obtain a unified diff output. This text-based diffing process is error-prone: some changes are detected in the assembly code but are not reflected at the source code level. Such minor changes are usually optimization-based, and the assembly instructions differ only in register names and the like. From the unified diff output, the present invention performs the following operations.

(1) Extract the changes together with their context

(2) Filter/classify the changes as actual or optimization-based

(3) Process only the actual changes
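As a minimal sketch of the text-based diffing step, the following uses `difflib.unified_diff` from the Python standard library. The two assembly listings and the helper name `unified_asm_diff` are made up for illustration, and the context size of 12 lines is passed through difflib's `n` parameter.

```python
import difflib

# Hypothetical assembly listings for function A and its patched version A'.
asm_a = [
    "push rbp",
    "mov rbp, rsp",
    "mov eax, [rdi+8]",
    "add eax, 1",
    "pop rbp",
    "ret",
]
asm_a_prime = [
    "push rbp",
    "mov rbp, rsp",
    "mov eax, [rdi+8]",
    "cmp eax, 64",
    "ja fail",
    "add eax, 1",
    "pop rbp",
    "ret",
]

def unified_asm_diff(old, new, context=12):
    """Return the body lines of a unified diff without file and hunk headers."""
    diff = list(difflib.unified_diff(old, new, lineterm="", n=context))
    body = diff[2:]  # drop the '---' and '+++' file header lines
    return [line for line in body if not line.startswith("@@")]

diff_lines = unified_asm_diff(asm_a, asm_a_prime)
for line in diff_lines:
    print(line)
```

Each remaining line starts with '+', '-', or a space, which is the form the later extraction steps operate on.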

1) What are the changes?

The unified diff output contains three types of lines: 1) lines starting with '+', 2) lines starting with '-', and 3) lines starting with ' ' (a space). The negative ('-') and positive ('+') lines are the lines that were actually modified, and the lines starting with a space are common to function A and function A'. The present invention uses the common lines as context; these lines can be used to locate the modification in the final stitching step. A change object therefore consists of, in order: the type (+, -, !), the upper context, the negative assembly lines (if any), the positive assembly lines (if any), and the lower context.

Diffing Framework

The diffing framework extracts changes from the unified diff output regardless of whether a change is an actual change or an optimization-based one.

Given the assembly of two functions:

(1) Compute the unified diff to obtain the line-by-line differences.

(2) Iterate over the lines of the unified diff.

1) When a change line (+, -) is detected, extract the upper context (up to 12 lines). The context should contain at least 10 lines, because these assembly lines are mapped to pseudocode and usually three or more assembly lines make up one pseudocode line. Searching the AST requires three to five pseudocode lines of context. The present invention may select 12 assembly lines as the context.

2) 상부 컨텍스트 후 다음 라인이 공백으로 시작될 때까지 모든 변경 라인을 추출한다.2) After the parent context, extract all change lines until the next line starts with a blank.

3) 변경 라인을 추출한 후 하부 컨텍스트를 추출한다. 여기서, 하부 컨텍스트의 라인 수는 12개일 수 있다.3) After extracting the change line, extract the sub-context. Here, the number of lines of the lower context may be 12.

4) 컨텍스트와 변경 라인을 추출한 후 다음 기준에 따라 변경 오브젝트의 유형을 결정한다. 여기서, 조건은 양의 목록이 비어 있고 음의 목록만 있는 경우 유형은 '-'이고, 양의 목록만 있고 음의 목록이 비어 있는 경우 유형은 '+'이며, 두 개의 목록이 모두 비어 있지 않는 경우 유형은 '~'일 수 있고, 추가와 삭제 케이스가 함께 있을 수도 있다.4) After extracting the context and change line, determine the type of change object according to the following criteria. Here, the condition is that the type is '-' if the positive list is empty and only the negative list is empty, the type is '+' if there is only a positive list and the negative list is empty, and both lists are not empty. The case type can be '~', and there can be both add and delete cases.

통합된 diff 출력에 대해 상술한 과정을 반복 수행한다.Repeat the above process for the combined diff output.
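The extraction loop above can be sketched in Python with the standard `difflib` module. This is a minimal illustration under stated assumptions, not the implementation of the present invention: the names `ChangeObject`, `extract_changes`, and `_split_hunk` are hypothetical, each function's assembly is assumed to be a list of instruction strings, and the type symbol '~' follows step 4) above.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class ChangeObject:
    kind: str                                   # '+', '-' or '~'
    upper: list = field(default_factory=list)   # upper context lines
    minus: list = field(default_factory=list)   # deleted assembly lines
    plus: list = field(default_factory=list)    # added assembly lines
    lower: list = field(default_factory=list)   # lower context lines


def _split_hunk(hunk, context):
    """Turn one diff hunk into change objects, one per run of +/- lines."""
    changes, i = [], 0
    while i < len(hunk):
        if hunk[i][0] not in "+-":
            i += 1
            continue
        start = i
        while i < len(hunk) and hunk[i][0] in "+-":
            i += 1
        run = hunk[start:i]
        minus = [l[1:] for l in run if l[0] == "-"]
        plus = [l[1:] for l in run if l[0] == "+"]
        # Only common (space-prefixed) lines count as context.
        upper = [l[1:] for l in hunk[max(0, start - context):start] if l[0] == " "]
        lower = [l[1:] for l in hunk[i:i + context] if l[0] == " "]
        kind = "~" if minus and plus else ("-" if minus else "+")
        changes.append(ChangeObject(kind, upper, minus, plus, lower))
    return changes


def extract_changes(asm_a, asm_b, context=12):
    """Extract change objects from the unified diff of two instruction lists."""
    diff = list(difflib.unified_diff(asm_a, asm_b, n=context, lineterm=""))
    changes, hunk = [], []
    # Skip the two file-header lines; the trailing "@@" flushes the last hunk.
    for line in diff[2:] + ["@@"]:
        if line.startswith("@@"):
            changes.extend(_split_hunk(hunk, context))
            hunk = []
        else:
            hunk.append(line)
    return changes
```

Calling `extract_changes(asm_a, asm_b)` on the assembly of function A and function A' returns one object per contiguous run of changed lines, each carrying up to 12 lines of context on either side.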

FIG. 2 shows an example of an extracted change object, in its pseudocode version. As shown in FIG. 2, a change object has a structure comprising the positive change, the negative change, the upper context, and the lower context. Change objects are first prepared from the diff file, and a single diff file may contain many of them, depending on the number of changes. Each change object can later be converted to the corresponding pseudocode using the map.

Filtration

By default, the unified diff output contains many irrelevant changes caused by optimization, compiler transformations, and the like. In many change lines the mnemonics are identical but the register names differ, and in some cases the offsets differ as well. Using a different data register does not make an instruction semantically different. Such instructions are semantically identical, and to handle them the present invention may use the word2vec-based inst2vec similarity check described under binary diffing. inst2vec is used to establish semantic similarity even when instructions look structurally different.

The filtration framework of the present invention is as follows.

(1) The input is a change object.

(2) Check the semantic similarity between the positive and the negative changes.

(3) If a pair is semantically identical, delete the corresponding assembly lines.

(4) If, after deletion, both the positive list and the negative list of the change object are empty, delete the entire change object.

This filtration process is intended to remove all irrelevant change objects, and its success depends on the ML-based semantic similarity check. The final result of filtration is therefore a clean list of relevant change objects that correspond to actual changes in the source code.
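The filtration steps can be illustrated with a short Python sketch. Note the substitution: the text relies on an inst2vec similarity model, which is not reproduced here. As a loudly labeled stand-in, the sketch treats two x86-64 instructions as equivalent when they are identical after register names and numeric offsets are masked. The function names and the regular expressions are assumptions made for illustration only.

```python
import re

# Stand-in for the inst2vec check (assumption): mask x86-64 register names
# and numeric literals, then compare the remaining instruction "shape".
REG = re.compile(
    r"\b(?:[re]?[abcd]x|[re]?[sd]i|[re]?[sb]p|r(?:[89]|1[0-5])[dwb]?|[abcd][lh])\b"
)
NUM = re.compile(r"\b(?:0x[0-9a-f]+|\d[0-9a-f]*h|\d+)\b")


def normalize(inst):
    """Replace registers with REG and numeric literals with IMM."""
    inst = REG.sub("REG", inst.lower())
    return NUM.sub("IMM", inst)


def filter_change(minus, plus):
    """Drop -/+ pairs that normalize to the same instruction.

    Returns the remaining (minus, plus) lists; if both are empty the
    whole change object can be discarded, as in step (4).
    """
    remaining_plus = list(plus)
    kept_minus = []
    for m in minus:
        key = normalize(m)
        match = next((p for p in remaining_plus if normalize(p) == key), None)
        if match is None:
            kept_minus.append(m)
        else:
            remaining_plus.remove(match)
    return kept_minus, remaining_plus
```

A learned similarity model would replace `normalize` with a vector comparison; the surrounding pairing and deletion logic would stay the same.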

A change object consists of assembly lines, whereas the goal of the present invention is to stitch the changes into the AST of the source code. Assembly instructions are not suitable for stitching at all; the change object must be expressed as source code or in a representation close to it. Pseudocode is a good candidate and an available option.

Assembly to Pseudocode Mapping

The output of the diffing framework is a list of change objects, and a single change object consists of the context (upper and lower) and the changes (additions and deletions), all in the form of assembly instructions. To stitch, the context must be searched for in the source code of function A, and the positive and negative changes must be stitched at the index found. This cannot be done with assembly code, so the present invention expresses the assembly instructions as their equivalent pseudocode lines.

A change object holds only a small portion of a function's assembly instructions, so a mapping from assembly instructions to pseudocode is needed. The default functionality of IDA Pro (7.0) does not meet this requirement: it can only generate pseudocode for a whole function. The present invention therefore builds a custom map using the IDA API, in the following steps.

(1) Decompile the given function and obtain the returned pointer to the decompilation object.

(2) The IDA Pro decompilation object contains "eamap", a map between instruction addresses and pseudocode line addresses.

(3) Obtain the assembly instruction using GetDisasm().

(4) Similarly, obtain the pseudocode lines using idaapi.qstring_printer_t().

(5) Finally, build the custom map from the assembly instructions and the pseudocode lines.

These steps are repeated for every function in the database, and the map is stored in "JSON" format in an SQLite database so that it can be accessed easily in the later code stitching step. In short, there is a mapping for function A and a mapping for function A'.
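The storage step can be sketched with Python's standard `sqlite3` and `json` modules. The IDA-specific extraction is deliberately left out, since it only runs inside IDA Pro. The table name `asm_map`, its columns, and the helper names `save_map` and `load_map` are hypothetical; the text only states that the map is kept as JSON in SQLite.

```python
import json
import sqlite3


def save_map(conn, func_name, asm_to_pseudo):
    """Store one function's assembly-to-pseudocode map as a JSON string."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS asm_map (func TEXT PRIMARY KEY, map TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO asm_map VALUES (?, ?)",
        (func_name, json.dumps(asm_to_pseudo)),
    )
    conn.commit()


def load_map(conn, func_name):
    """Load a function's map, or an empty dict if none was stored."""
    row = conn.execute(
        "SELECT map FROM asm_map WHERE func = ?", (func_name,)
    ).fetchone()
    return json.loads(row[0]) if row else {}
```

One row per function keeps lookups in the stitching stage to a single indexed query; the maps for function A and function A' would live in separate databases or under distinct keys.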

Once the assembly-instruction-to-pseudocode map is available, the change objects produced by the diffing framework can be converted directly from assembly to pseudocode. An example of a converted change object is shown in FIG. 2. Because the upper and lower contexts exist in both function A and function A', either the function A map or the function A' map can be used to convert them. Deleted assembly instructions, however, are available only in the function A map, and added assembly instructions exist only in the function A' map. During conversion the present invention therefore uses whichever map contains the instruction. The final result is a pseudocode representation of the change object.
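The map selection described above can be sketched as a small lookup helper. This is a simplification made for illustration: the maps are assumed to be plain dictionaries keyed by instruction text, whereas a real map would more plausibly be keyed by instruction address, and `to_pseudocode` is a hypothetical name.

```python
def to_pseudocode(asm_lines, *maps):
    """Convert assembly lines to pseudocode using the first map that knows each line.

    Consecutive assembly lines that map to the same pseudocode line are
    collapsed, since several instructions usually form one pseudocode line.
    Instructions found in no map are skipped.
    """
    out = []
    for inst in asm_lines:
        for m in maps:
            if inst in m:
                line = m[inst]
                if not out or out[-1] != line:
                    out.append(line)
                break
    return out
```

For deleted lines one would pass only the function A map, for added lines only the function A' map, and for context lines either map or both in order of preference.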

Context Search in the AST

Once the pseudocode version of a change object is available, the pseudocode context can be searched for in the AST (source code). This search is not straightforward, because pseudocode and source code never match 100%; the similarity is usually between 60% and 80%. The goal is to find the portion of the AST (source code) that resembles the pseudocode, and thereby the exact index, or position, of the context in the AST.

1) Challenges

Pseudocode is not an exact copy of the source code, because type information, the real names of variables, and similar details are missing.

Calls to macro functions are replaced by their definitions in the pseudocode.

Similarly, calls to inline functions are replaced by their definitions in the pseudocode.

For structural matching, the source code of the function has to be preprocessed.

To solve this search problem, the present invention provides two algorithms: structural matching based search and aggressive search.

Structural Matching Based Search

Structural matching finds the structural similarity between the context pseudocode and the source code AST, for example with a sliding window method, and finally selects the index with the highest code similarity ratio. This second-stage similarity check is essential because the same structure can occur at several indices in the source code of a function.

1) Tokenization of pseudocode

The context pseudocode lines are determined by the location of the change and are therefore arbitrary: they may be part of an "if" block or a line that opens a block, and the closing brace of the block may be absent. The chance that these lines compile is very low, practically nil. If the context were compilable, it could simply be converted to an AST and the problem would reduce to a plain AST-to-AST search. To approximate this, the present invention provides a parsing framework that extracts AST-like features from pseudocode lines. A single pseudocode line can be converted into the following structure.

(1) Type: if, else, assign, loop, etc.

(2) Spelling: the string form of the type or variable name

(3) Src: the pseudocode line itself

(4) Children: if the pseudocode line has child lines, they are tokenized recursively.

The tokenization framework of the present invention prepares a token object for each pseudocode line in the context.
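A token object with the four fields listed above might look like the following Python sketch. The classification rules and the use of indentation to infer child lines are assumptions made for this illustration; the text does not specify how the parsing framework decides either, and the names `Token`, `classify`, and `tokenize` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    type: str                      # if, else, assign, loop, call, ...
    spelling: str                  # keyword, variable name or callee name
    src: str                       # the pseudocode line itself
    children: list = field(default_factory=list)


def classify(line):
    """Return (type, spelling) for one stripped pseudocode line."""
    first = re.match(r"[A-Za-z_]\w*", line)
    word = first.group(0) if first else ""
    if word in ("if", "else", "return", "goto", "break", "continue"):
        return word, word
    if word in ("for", "while", "do"):
        return "loop", word
    if re.search(r"(?<![=!<>])=(?!=)", line):          # '=' but not '==', '<=', ...
        return "assign", line.split("=", 1)[0].rstrip("+-*/%&|^ ").strip()
    if re.match(r"[A-Za-z_]\w*\s*\(", line):
        return "call", word
    return "other", word


def tokenize(lines):
    """Build a token tree; deeper indentation is treated as a child line."""
    root = Token("root", "", "")
    stack = [(-1, root)]                               # (indent, token)
    for line in lines:
        s = line.strip()
        if not s or s in "{}":                         # skip blanks and lone braces
            continue
        indent = len(line) - len(line.lstrip())
        while stack[-1][0] >= indent:
            stack.pop()
        ttype, spelling = classify(s)
        tok = Token(ttype, spelling, s)
        stack[-1][1].children.append(tok)
        stack.append((indent, tok))
    return root.children
```

Because unmatched braces are simply skipped, the sketch tolerates context fragments that start or end in the middle of a block, which is the situation described above.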

2) Algorithm

The structural matching based search algorithm searches the AST for the pseudocode context on the basis of matching structural features. In the first stage it matches structural features; if that succeeds, it calculates the similarity ratio between the pseudocode context and the source code at the index found in the AST. If the similarity ratio exceeds 0.9, the node is declared a match. A detailed flowchart of the algorithm is shown in FIG. 3.

(1) Given the AST of the function, build a list of node types. The AST is traversed with a depth-first search, and the node type is obtained and returned at each iteration (node). The result is a list of node types in depth-first order.

(2) Repeat (1) for the pseudocode context. The difference is that a properly parsed AST is used for the source code, whereas the output of the tokenization framework is used for the pseudocode.

(3) The source code list will be much larger than the context pseudocode list. The goal is to search the source code for the structural information of the pseudocode.

(4) For this purpose the present invention may adopt a sliding window process: the context pseudocode list is slid over the source code list one step at a time. The result of this process is the index positions of the matching points. Because the same structure can be repeated within a function, there may be a single match, multiple matches, or none.

(5) Once the indices are available, the text-based similarity between the source code at each index and the context pseudocode is calculated. The "difflib" sequence matching algorithm may be used to compute the text-based similarity.

(6) If the similarity ratio at some index is greater than 0.9, that index is taken as the final node; otherwise the context is considered not found.
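Steps (4) to (6) can be sketched as follows, assuming the node-type lists from steps (1) and (2) have already been built and that `src_lines` holds the source text of each node in the same depth-first order as `src_types`. The function names are hypothetical; `difflib.SequenceMatcher` is the standard-library matcher named in step (5).

```python
import difflib


def find_structural_matches(src_types, ctx_types):
    """Slide the context type list over the source type list; return matching indices."""
    n = len(ctx_types)
    if n == 0:
        return []
    return [
        i for i in range(len(src_types) - n + 1)
        if src_types[i:i + n] == ctx_types
    ]


def best_match(src_types, src_lines, ctx_types, ctx_lines, threshold=0.9):
    """Return (index, ratio) of the best structural match above the threshold.

    The index is None when no structural match reaches the threshold.
    """
    ctx_text = "\n".join(ctx_lines)
    best_index, best_ratio = None, 0.0
    for i in find_structural_matches(src_types, ctx_types):
        window = "\n".join(src_lines[i:i + len(ctx_types)])
        ratio = difflib.SequenceMatcher(None, window, ctx_text).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    if best_ratio > threshold:
        return best_index, best_ratio
    return None, best_ratio
```

The text-based ratio is what disambiguates repeated structures: two `if` followed by `call` sequences match structurally, but only the one whose text resembles the context passes the 0.9 threshold.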

3) Limitations

Pseudocode and source code are often not structurally similar. This search algorithm works for small functions and for functions whose pseudocode is close to the source code, but in most cases it fails. A fallback algorithm is therefore needed to search for the context in the source code.

Aggressive Search

For complex functions, the pseudocode differs structurally from the source code, and structure-based matching is unlikely to work. To overcome this limitation, the present invention provides a new algorithm called aggressive search, a name derived from the way the algorithm behaves. The aggressive algorithm is shown in FIG. 4. It slides a window over the AST in depth-first order, visiting every node down to the leaf nodes, and calculates the similarity ratio against the context pseudocode at each position. The result is a list of similarity ratios, from which the node with the maximum ratio can later be selected as the match.

1) Algorithm

This algorithm is entirely different from the structure-based search algorithm and relies on the text-based similarity ratio. It visits each node in depth-first order and calculates the similarity between the pseudocode context and the sliding window. The step-by-step approach is as follows.

(1) Define the length of the queue that holds source code. The queue length depends on the number of pseudocode lines in the context. Because the approach is based on text similarity, the queue length may be set equal to the length of the pseudocode so that the comparison is fair.

(2) The search starts at the first child of the AST root node and iterates in depth-first order until the end. In short, the search passes over every source code line by sliding the window. At each iteration, the source code of the current node is pushed onto the queue.

(3) After each push, the similarity ratio between the source code in the queue and the fixed-length pseudocode (context) is calculated and stored in the output list.

(4) Step (3) is repeated at every iteration, finally yielding a list of ratios and their corresponding nodes.

(5) Steps (1) to (4) are repeated for both the upper and the lower context, giving two lists of similarity ratios.

(6) The maximum-ratio nodes for the upper and the lower context are then found.

(7) If the distance between the two maximum-ratio nodes is within 10 indices, a match has been found. Otherwise, the next maximum-ratio node is found and the same rule is applied. This maximum-ratio search is run for at most 4 iterations, which guarantees that the recursion terminates.
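The seven steps can be sketched in Python using a fixed-length `collections.deque` as the queue. Several points are assumptions made for illustration: `src_lines` stands for the source text of the AST nodes in depth-first order, step (7) is read as "try the k-th best upper and lower candidates together, for k up to 4", and the distance is measured from the end of the upper context to the start of the lower context. The function names are hypothetical.

```python
import difflib
from collections import deque


def similarity_list(src_lines, ctx_lines):
    """Steps (1)-(4): slide a fixed-length window and record (ratio, end_index)."""
    window = deque(maxlen=len(ctx_lines))       # queue as long as the context
    ctx_text = "\n".join(ctx_lines)
    ratios = []
    for idx, line in enumerate(src_lines):
        window.append(line)                     # oldest line drops out automatically
        ratio = difflib.SequenceMatcher(None, "\n".join(window), ctx_text).ratio()
        ratios.append((ratio, idx))
    return ratios


def aggressive_search(src_lines, upper_ctx, lower_ctx, max_gap=10, max_tries=4):
    """Steps (5)-(7): return (start, end) of the span between the contexts, or None.

    src_lines[start:end] are the nodes lying between the matched upper
    and lower contexts; start == end means nothing lies between them.
    """
    ups = sorted(similarity_list(src_lines, upper_ctx), reverse=True)[:max_tries]
    lows = sorted(similarity_list(src_lines, lower_ctx), reverse=True)[:max_tries]
    for (_, up_end), (_, low_end) in zip(ups, lows):
        low_start = low_end - len(lower_ctx) + 1
        if 0 < low_start - up_end <= max_gap:
            return up_end + 1, low_start
    return None
```

Because the window always has the same number of lines as the context, the ratios are comparable across positions, which is the fairness argument made in step (1).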

Search Strategy

The CFG of the pseudocode often differs from that of the original source code, which introduces structural differences between pseudocode and source code. In such scenarios the structure-based search algorithm fails and the indices it finds are unreliable. The aggressive search algorithm, in contrast, is based on text similarity and looks for the maximum similarity ratio to find the stitching index. In terms of text similarity, pseudocode is 70% to 80% similar to the source code, so the maximum-similarity node is likely to be found even when the CFGs differ. In the test environment of the present invention, the aggressive search algorithm performed better than the structure-based search algorithm. Nevertheless, given the complexity and nature of pseudocode relative to source code, a combination of the two algorithms may be used.

The purpose of the search algorithms is to locate the context and thereby the index at which changes can be added or deleted. The next and final step is to patch the source code of function A to produce the source code of function A'. Before going into the details of the final stitching, consider the possible types of change, of which there are generally three.

(1) Deletion only from the code of function A (-)

(2) Addition only to the code of function A (+)

(3) Both addition and deletion in the code of function A (!)

The strategy for stitching is nearly the same for every type of change and may be as follows.

(1) Given a change object (in its pseudocode representation), pass the upper context and the lower context to the search algorithm and find the patch index.

(2) Deletion only: step (1) yields the index of the changed node (between the upper and the lower context). Deleting a node is delicate and exactly the right node must be deleted, so first calculate the code similarity ratio between the node at the index from step (1) and the pseudocode of the deletion. If the similarity ratio exceeds 0.9, it is treated as a perfect match: the node is safely deleted from the AST, and finally the modified AST is written out as source code to obtain the source code of function A'.

(3) Addition only: addition is somewhat simpler than deletion. It suffices to find the index after the upper context and before the lower context. The index found in step (1) is therefore reliable, and the pseudocode of the positive change can be added at that index. Pseudocode cannot be inserted into the AST directly; it must first be converted into AST nodes, which are then inserted at the index found. To convert pseudocode into AST nodes, "pycparser" may be used, as in the source code AST step.

(4) Combined deletion and addition: this is somewhat more complex than either case alone. In the general case, step (1) is used to find the position after the upper context and before the lower context. The change comprises a deletion and an addition at the same place: first the node is deleted as in the deletion case, then the new code is added at the same index. This strategy works well in the regular case where the deletion and the addition are independent of each other. In some cases, however, the deleted code is added back. In that case, the deleted code is matched within the positive pseudocode and substituted for it, after which the change is added as in an ordinary addition. The aim is to minimize the amount of pseudocode used.

A single function may contain all three types of change, so to patch a whole function the entire process is repeated for every change object, and the final AST is written to a source code file. The final result is the pre-update source code with the modifications applied.
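The three stitching cases can be illustrated on a flat list of source lines. This is a deliberate simplification: the method described above operates on AST nodes (with "pycparser" used to turn pseudocode into nodes), whereas the sketch below, with the hypothetical function `stitch`, only shows the index arithmetic and the 0.9 similarity guard for deletions.

```python
import difflib


def stitch(src_lines, start, end, minus, plus, threshold=0.9):
    """Return a patched copy of src_lines.

    src_lines[start:end] is the span between the matched upper and lower
    contexts; for an addition-only change start == end.
    """
    patched = list(src_lines)
    if minus:
        # Deletion is delicate: verify the span really is the deleted code.
        found = "\n".join(patched[start:end])
        ratio = difflib.SequenceMatcher(None, found, "\n".join(minus)).ratio()
        if ratio <= threshold:
            raise ValueError("deleted pseudocode does not match the source closely enough")
        del patched[start:end]
    # Additions (possibly none) go in at the same index.
    patched[start:start] = plus
    return patched
```

Applying `stitch` once per change object, working from the last change towards the first so that earlier indices stay valid, would yield the patched function in this simplified line-based setting.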

Examples showing the effect of the solution of the present invention in various attack scenarios are described below.

Case 1: Deleted code only

Consider an example function from High Sierra in which some lines of code are deleted in the updated version. FIG. 5 shows an example of the result of the method of the present invention for the deletion case; the left side shows the original source and the right side shows the stitched version. As FIG. 5 shows, code stitching for the deletion case is considerably simpler and more accurate than for the other cases. The code stitching process begins with binary diffing of the function assembly: change objects are extracted together with their contexts from the unified diff file, and these change objects are then converted into their pseudocode equivalents. For stitching, the search algorithm of the present invention locates the upper and lower contexts of the pseudocode in the AST; once matching contexts are found, the deleted pseudocode is also matched against the AST nodes between the context nodes found. If a node matches the negative-change pseudocode, that node is simply deleted. The final output is a new AST that can be converted to source code. In FIG. 5 the output of code stitching for the deletion case is shown on the right, and it is 100% source code. This case provides strong motivation for why the present invention prefers code stitching over pure pseudocode.

Case 2: Added code only

Consider another example, a function whose change is the addition of new code. The goal of code stitching is to find the change at the binary level, find the stitching index in the AST of function A, and add new nodes to the AST to produce function A'. The detailed procedure is described under the search algorithms above. FIG. 6 shows an example of the result of the method of the present invention for the addition case; the left side shows the original source code of function A and the right side shows the stitched version, that is, function A'. As shown in FIG. 6, the green lines on the right are the modified (added) portion, stitched between the block and the call to IOCPURunPlatformQuiesceActions(). The stitched code is structurally similar to the function A source code of xnu-4570.71.2, with small differences highlighted in the green section: in the original version the green section is pure C code, whereas in the stitched version it is pseudocode. This is only an example; depending on the situation, the stitched version may also be C code.

Case 3: Combined deletion and addition

Consider another function, in which code is deleted and new code is added. There are two main possibilities: the deleted code and the added code are independent of each other, or the deleted code is partially contained in the added code. In the former case, simply deleting and then adding at the same index in the AST is sufficient to obtain the new AST. The latter case could be stitched with the same strategy, deletion followed by addition, but that would delete original source code and add new pseudocode in its place. The goal of the present invention is to stitch as little pseudocode as possible and produce a new version close to the equivalent C. Therefore, if the deleted source code is partially present in the added code, the matching part of the added pseudocode is identified and replaced with the deleted source code. With this approach, the added code becomes a combination of source code and pseudocode.

An example of the latter case is shown in FIG. 7, which is the source-to-source diff of a function and shows the original change in the source code. FIG. 7 illustrates a complex case involving both deletion and addition: the while block is moved into a new if block, and an else block is added. From the binary point of view, the code is first deleted and then added again. If deleted code is present in the added pseudocode, that pseudocode can be replaced with the source code. For example, the call to clock_continuoustime_interval_to_deadline() is moved inside the new if block, and another function call is added in the new block. From a text-based diffing point of view, the clock_continuoustime_interval_to_deadline() call is first deleted and then added inside the if block. The present invention has to reconstruct the new code with this diffing point of view in mind.

이와 같이, 본 발명의 실시예에 따른 방법은 어셈블리 명령에서 의미 있는 변경 사항을 추출하고, 그것들을 그들의 의사 코드에 매핑하며, 이전 버전의 소스 코드 변경 사항을 스티칭함으로써, 새로운 소스 코드로 패치할 수 있다. 또한, 본 발명은 소스 코드 레벨에서 이진 업데이트를 더 잘 표현할 수 있다.As such, the method according to an embodiment of the present invention extracts meaningful changes from assembly instructions, maps them to their pseudocode, and stitches the changes to the source code of the previous version so that it can be patched to the new source code. have. Also, the present invention can better represent binary updates at the source code level.

또한, 본 발명의 실시예에 따른 방법은 패치된 소스 코드를 효율적으로 처리하고 구성할 수 있으며, 업데이트 버전 소스 코드를 효율적으로 구성할 수 있고, 이는 종속성 소프트웨어에 의해 사용될 수 있다.In addition, the method according to the embodiment of the present invention can efficiently process and configure the patched source code, and efficiently configure the updated version source code, which can be used by dependency software.

즉, 본 발명의 실시예에 따른 방법은 상술한 내용을 토대로, 제1 함수가 제2 함수로 변경되면 상기 제1 함수의 소스 코드를 어셈블리 처리하여 상기 제1 함수의 어셈블리 코드를 획득하는 과정, 상기 제1 함수의 어셈블리 코드와 상기 제2 함수의 어셈블리 코드 차이에 기초하여 변경 사항을 추출하는 과정, 상기 추출된 변경 사항을 의사 코드에 매핑하여 어셈블리 코드의 변경 사항을 의사 코드로 변환하는 과정과 상기 변환된 의사 코드를 이용하여 상기 제1 함수의 소스 코드를 패치함으로써, 상기 제2 함수의 소스 코드를 구성하는 과정을 포함하도록 구성될 수 있다. 물론, 각각의 과정은 상술한 내용을 모두 포함할 수 있으며, 이러한 방법은 기능적인 구성 수단을 포함하는 장치를 통해 구현될 수 있다.That is, in the method according to an embodiment of the present invention, when the first function is changed to the second function based on the above description, the process of assembling the source code of the first function to obtain the assembly code of the first function; extracting a change based on a difference between the assembly code of the first function and the assembly code of the second function, mapping the extracted change to pseudo code and converting the change of the assembly code into pseudo code; and constructing the source code of the second function by patching the source code of the first function using the converted pseudo code. Of course, each process may include all of the above contents, and this method may be implemented through an apparatus including functional configuration means.

For example, the apparatus according to an embodiment of the present invention may be configured to include: an acquisition unit that, when a first function is changed to a second function, assembles the source code of the first function to obtain the assembly code of the first function; an extraction unit that extracts changes based on the difference between the assembly code of the first function and the assembly code of the second function; a conversion unit that converts the changes in the assembly code into pseudo code by mapping the extracted changes to pseudo code; and a construction unit that constructs the source code of the second function by patching the source code of the first function using the converted pseudo code. Each component of the apparatus may include everything described for the code stitching method above, as will be apparent to those skilled in the art.
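The stitching performed by the construction unit can be sketched as follows, assuming the patch position is already known as a line index and the converted pseudo code is a list of text lines. The function name `stitch` and the sample inputs are hypothetical. The sketch only inserts lines and copies the indentation of the line at the patch position; it does not handle deletions or replacements.

```python
from typing import List


def stitch(source_v1: str, index: int, pseudo_lines: List[str]) -> str:
    """Insert pseudo-code lines before line `index` of the old source.

    Hypothetical helper: a line-level stand-in for the stitching step.
    """
    lines = source_v1.splitlines()
    if not 0 <= index <= len(lines):
        raise IndexError("patch index outside the function body")
    # Copy the indentation of the line currently at the patch position.
    anchor = lines[index] if index < len(lines) else ""
    indent = anchor[: len(anchor) - len(anchor.lstrip())]
    patched = lines[:index] + [indent + p for p in pseudo_lines] + lines[index:]
    return "\n".join(patched)


# Illustrative inputs (assumed, not from the patent).
SOURCE_V1 = """int bump(int *p) {
    int v = *p;
    return v + 1;
}"""
PSEUDO_PATCH = ["if (v >= 100)", "    return -1;"]

source_v2 = stitch(SOURCE_V1, 2, PSEUDO_PATCH)
print(source_v2)
```

The pseudo code here already uses the source-level variable name `v`. In practice the names produced by a decompiler would first have to be mapped back to the identifiers of the first function.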

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded in a computer-readable medium. The medium may store a computer-executable program permanently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and other devices configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of the system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

1. A method for constructing patched source code from assembly code, the method comprising:
when a first function is changed to a second function, assembling the source code of the first function to obtain the assembly code of the first function;
extracting changes based on a difference between the assembly code of the first function and the assembly code of the second function;
converting the changes in the assembly code into pseudo code by mapping the extracted changes to pseudo code; and
constructing the source code of the second function by patching the source code of the first function using the converted pseudo code.

2. The method of claim 1,
wherein the extracting of the changes filters the extracted changes based on optimization so that only the actual changes are extracted, and
the converting into the pseudo code converts only the actual changes into pseudo code.

3. The method of claim 1,
wherein the extracting of the changes extracts the changes in each line of the assembly code by text-based diffing of the assembly code of the first function and the assembly code of the second function.

4. The method of claim 1,
wherein the constructing of the source code of the second function identifies an index of a patch position by searching for the converted pseudo code in an abstract syntax tree of the source code of the first function, and constructs the source code of the second function by stitching the converted pseudo code into the source code of the first function at the identified index.

5. The method of claim 4,
wherein the constructing of the source code of the second function identifies the index of the patch position based on a structural-matching-based search, which finds structural similarity between the converted pseudo code and the abstract syntax tree and selects the index with the largest code similarity ratio, and an aggressive search, which slides a preset window over the abstract syntax tree in a depth-first search, calculates the similarity ratio of the pseudo code, and then selects the node of the abstract syntax tree with the maximum ratio as the match.

6. An apparatus for constructing patched source code from assembly code, the apparatus comprising:
an acquisition unit configured to, when a first function is changed to a second function, assemble the source code of the first function to obtain the assembly code of the first function;
an extraction unit configured to extract changes based on a difference between the assembly code of the first function and the assembly code of the second function;
a conversion unit configured to convert the changes in the assembly code into pseudo code by mapping the extracted changes to pseudo code; and
a construction unit configured to construct the source code of the second function by patching the source code of the first function using the converted pseudo code.

7. The apparatus of claim 6,
wherein the extraction unit filters the extracted changes based on optimization so that only the actual changes are extracted, and
the conversion unit converts only the actual changes into pseudo code.

8. The apparatus of claim 6,
wherein the extraction unit extracts the changes in each line of the assembly code by text-based diffing of the assembly code of the first function and the assembly code of the second function.

9. The apparatus of claim 6,
wherein the construction unit identifies an index of a patch position by searching for the converted pseudo code in an abstract syntax tree of the source code of the first function, and constructs the source code of the second function by stitching the converted pseudo code into the source code of the first function at the identified index.

10. The apparatus of claim 9,
wherein the construction unit identifies the index of the patch position based on a structural-matching-based search, which finds structural similarity between the converted pseudo code and the abstract syntax tree and selects the index with the largest code similarity ratio, and an aggressive search, which slides a preset window over the abstract syntax tree in a depth-first search, calculates the similarity ratio of the pseudo code, and then selects the node of the abstract syntax tree with the maximum ratio as the match.