KR101579347B1

KR101579347B1 - Method of detecting software similarity using feature information of executable files and apparatus therefor

Info

Publication number: KR101579347B1
Application number: KR1020130000212A
Authority: KR
Inventors: 조성제; 임을규; 윤주환; 김태근
Original assignee: 단국대학교 산학협력단
Priority date: 2013-01-02
Filing date: 2013-01-02
Publication date: 2015-12-22
Also published as: KR20140089044A

Abstract

A method and apparatus for detecting software similarity using feature information of executable files are disclosed. The similarity detection method according to the present invention comprises disassembling the executable files of first and second software to be compared (a first executable file and a second executable file), dividing the disassembled results of the first and second executable files into a plurality of blocks each containing a function call instruction, and calculating the similarity between the blocks of the first executable file and the blocks of the second executable file. The method therefore makes it possible to detect illegal copying or plagiarism efficiently from the binary executable files that make up the software, without using the source code.

Description

The present invention relates to a method and apparatus for detecting similarity between software, and more particularly to a method and apparatus that detect similarity between software by extracting unique information from binary executable files, in order to prevent copyright infringement that may arise from the theft of software source code.

With the development of communications such as the Internet, software piracy and plagiarism have increased sharply and have become an obstacle to the growth of the software industry.

Software piracy means copying particular software as it is and distributing or using the copy, whereas software plagiarism or theft means misappropriating all or part of the code of software, for example through reverse engineering, and using it.

Conventional software plagiarism detection techniques fall broadly into two approaches. The first compares the similarity between the source code of the software under comparison, and the second compares the similarity between the executable files of the software under comparison.

The first approach compares token sequences or syntax trees extracted from the source code. It fundamentally requires access to the source code of the software being compared, so it is difficult to apply when the source code cannot readily be obtained.

The second approach compares feature information extracted from program executable files. However, most existing research on feature information focuses on defining and extracting feature information of Java programs, and research on defining or extracting feature information of Windows programs remains insufficient.

To solve the above problems, an object of the present invention is to provide a method of detecting the similarity of the software under comparison by extracting feature information from executable files, so that software plagiarism or theft can be detected using only the executable files, without reference to the source code that makes up the software.

Another object of the present invention is to provide an apparatus for detecting the similarity of the software under comparison by extracting feature information from executable files, so that software plagiarism or theft can be detected using only the executable files, without reference to the source code that makes up the software.

To achieve the above object, the present invention provides a software similarity detection method comprising: (a) disassembling the executable files of first and second software to be compared (a first executable file and a second executable file); (b) dividing the disassembled results of the first and second executable files into a plurality of blocks each containing a function call instruction; and (c) calculating the similarity between the blocks of the first executable file and the blocks of the second executable file.

Here, in step (b), each block may be configured to include a predetermined number of instructions before and after the function call instruction.

The number of instructions before and after the function call instruction included in a block may be determined based on the arguments of the function. It may also be determined based on the return value of the function.

Here, step (c) may be configured so that, for m blocks (m being a natural number) of the first executable file and n blocks (n being a natural number) of the second executable file, a similarity value with respect to the n blocks of the second executable file is calculated for each of the m blocks of the first executable file.

The similarity value may be calculated as a Jaccard similarity or a cosine similarity based on the set or the order of the assembly codes that make up the blocks being compared. Step (c) may further be configured to determine, for each of the m blocks of the first executable file, the highest of its similarity values with respect to the n blocks of the second executable file, and to calculate the average of these highest similarity values as the similarity value between the first executable file and the second executable file.
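The aggregation described above can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the original text: a_1, ..., a_m are the blocks of the first executable file F_1, b_1, ..., b_n are the blocks of the second executable file F_2, and s(·,·) is the block-level similarity. The second line gives the standard Jaccard similarity for the case where blocks are treated as sets A and B of assembly mnemonics.

```latex
\[
\mathrm{Sim}(F_1, F_2) \;=\; \frac{1}{m} \sum_{i=1}^{m} \; \max_{1 \le j \le n} s(a_i, b_j)
\]
\[
s_{\mathrm{Jaccard}}(A, B) \;=\; \frac{|A \cap B|}{|A \cup B|}
\]
```

Written this way, the measure is taken from the point of view of the first file, so swapping the two files does not in general give the same value.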

To achieve the other object, the present invention provides a software similarity detection apparatus comprising: a storage unit that stores the executable files of first and second software to be compared (a first executable file and a second executable file); a disassembly unit that disassembles the first and second executable files; a block division unit that divides the disassembled results of the first and second executable files into a plurality of blocks each containing a function call instruction; and a similarity calculation unit that calculates the similarity between the blocks of the first executable file and the blocks of the second executable file.

Here, the block division unit may be configured to divide the blocks so that each includes a predetermined number of instructions before and after the function call instruction.

Here, the similarity calculation unit may be configured so that, for m blocks (m being a natural number) of the first executable file and n blocks (n being a natural number) of the second executable file, a similarity value with respect to the n blocks of the second executable file is calculated for each of the m blocks of the first executable file.

The similarity value may be calculated as a Jaccard similarity or a cosine similarity based on the set or the order of the assembly codes that make up the blocks being compared. The similarity calculation unit may further be configured to determine, for each of the m blocks of the first executable file, the highest of its similarity values with respect to the n blocks of the second executable file, and to calculate the average of these highest similarity values as the similarity value between the first executable file and the second executable file.

When the software similarity detection method and apparatus according to the present invention are used, unique information can be extracted from binary files in order to detect source code theft, and the binary files themselves can be compared using the extracted unique information.

That is, with the method and apparatus according to the present invention, illegal copying or plagiarism can be detected efficiently on the basis of the binary executable files that make up the software, without using the source code.

FIG. 1 is a flowchart illustrating an example of the software similarity detection method according to the present invention.
FIG. 2 is a conceptual diagram illustrating the process of dividing an executable file that makes up software into blocks in the software similarity detection method according to the present invention.
FIG. 3 is a conceptual diagram illustrating the process of comparing similarity block by block between executable files in the software similarity detection method according to the present invention.
FIG. 4 is a block diagram illustrating an example configuration of a software similarity detection apparatus that performs the software similarity detection method according to the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope. Like reference numerals are used for like elements in describing each drawing.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be referred to as a second element, and similarly a second element may be referred to as a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, but it should be understood that other elements may be present in between. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no other element is present in between.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

Configuration of the software similarity detection method according to the present invention

FIG. 1 is a flowchart illustrating an example of the software similarity detection method according to the present invention.

Referring to FIG. 1, an example of the software similarity detection method according to the present invention may comprise disassembling the executable files of first and second software to be compared, that is, a first executable file and a second executable file (S110); dividing the disassembled results of the first and second executable files into a plurality of blocks each containing a function call instruction (S120); and calculating the similarity between the blocks of the first executable file and the blocks of the second executable file (S130).

First, in step S110, the executable files that are the direct objects of comparison are extracted from the software whose similarity is to be calculated (that is, the software under comparison), and the executable files are disassembled.

Hereinafter, the executable files extracted from the software under comparison and directly compared are referred to as the first executable file and the second executable file, respectively.

The executable file extracted from each piece of software may be its main executable file, or it may be the executable file that implements the core functionality among the several executable files that make up the software. However, the first and second executable files should be chosen so that they perform mutually corresponding functions.

Although the following description refers only to two executable files under comparison, named the first executable file and the second executable file, in some embodiments a plurality of executable files may be extracted from the software under comparison and the similarity calculated for each executable file.

The IDA Pro disassembler is a representative disassembler that may be used in the disassembly step, but various other disassemblers suited to the purpose may be used depending on the processor and hardware platform on which the executable file runs.
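Whatever disassembler is used, the later steps only need the sequence of instruction mnemonics. The sketch below shows one way to reduce a textual listing to that sequence. The `address: mnemonic operands` line format, the regular expression, and the function name `mnemonics_from_listing` are assumptions made for illustration; they do not describe the output format of IDA Pro or of any particular tool.

```python
import re
from typing import List

# Assumed line format: "<hex address>: <mnemonic> <operands>".
LINE = re.compile(r"^\s*[0-9A-Fa-f]+:\s+([a-z][a-z0-9]*)\b")


def mnemonics_from_listing(text: str) -> List[str]:
    """Return the mnemonic of every line that matches the assumed format."""
    ops = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:  # Lines that do not match (blank lines, labels) are skipped.
            ops.append(match.group(1))
    return ops


# Hypothetical listing, written by hand for this example.
listing = """
00401000: mov   eax, [ebp+8]
00401003: push  eax
00401004: call  sub_401200
00401009: neg   eax
0040100B: sbb   eax, eax
"""

ops = mnemonics_from_listing(listing)
```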

Next, in step S120, the disassembled results of the first executable file and the second executable file are divided into a plurality of blocks.

The characteristics of an executable file may be determined by, among other things, the APIs (Application Programming Interfaces) provided by the operating system that the executable file mainly calls and the user-defined functions written directly by the developer and included in the executable file.

Based on this observation, the software similarity detection method according to the present invention divides an executable file into blocks, each of which contains a function call instruction (for example, the call instruction in assembly code). In addition, each block is configured to include the instructions before and after the function call instruction so that it represents the characteristics of the block well.
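The block division can be sketched as follows for the fixed-size case. This is a minimal illustration, not the patented implementation: the function name `split_into_blocks`, the `window` parameter, and the choice to clip blocks at the ends of the listing are assumptions, and the source does not say whether neighbouring blocks may overlap.

```python
from typing import List


def split_into_blocks(mnemonics: List[str], window: int = 2) -> List[List[str]]:
    """Build one block per `call`, with up to `window` instructions on each side."""
    blocks = []
    for i, op in enumerate(mnemonics):
        if op == "call":
            start = max(0, i - window)                  # clip at the start of the listing
            end = min(len(mnemonics), i + window + 1)   # clip at the end of the listing
            blocks.append(mnemonics[start:end])
    return blocks


# Hand-written mnemonic sequence containing two call instructions.
listing = ["mov", "push", "call", "neg", "sbb",
           "xor", "xor", "call", "mov", "pop"]
blocks = split_into_blocks(listing)
```

With `window=2` each block holds five mnemonics when enough instructions surround the call.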

The number of instructions before and after the function call instruction included in each block may be a fixed value, or it may be determined based on the arguments of the function at the time of the call. Alternatively, it may be determined based on the return value of the function, or based on both the arguments and the return value of the function.

For example, when the function call has many arguments (or return values), the number of instructions before and after the function call instruction included in the block may be increased, and when it has few arguments (or return values), that number may be decreased. Of course, the relationship between the function arguments or return values and the number of instructions included in the block may also be configured the other way round.
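One possible sizing rule is sketched below. The source only states that the count may depend on the arguments and the return value; the linear rule, the `base` and `max_window` parameters, and the function name `window_for_call` are all assumptions chosen for illustration, and the argument count is taken as already known.

```python
def window_for_call(arg_count: int, returns_value: bool,
                    base: int = 2, max_window: int = 8) -> int:
    """Hypothetical rule: widen the window by one instruction per argument,
    plus one if the function returns a value, capped at `max_window`."""
    size = base + arg_count + (1 if returns_value else 0)
    return min(size, max_window)


small = window_for_call(arg_count=0, returns_value=False)
medium = window_for_call(arg_count=3, returns_value=True)
capped = window_for_call(arg_count=10, returns_value=True)
```

The intuition behind this particular rule is that each argument is typically set up by at least one instruction before the call, but the inverse relationship mentioned in the text would be an equally valid design.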

The number of instructions included in a block may also be adjusted variably based on the data types or other characteristics of the function's arguments or return value.

FIG. 2 is a conceptual diagram illustrating the process of dividing an executable file that makes up software into blocks in the software similarity detection method according to the present invention.

Referring to FIG. 2, the executable file (the first executable file) is divided, by way of example, into a total of three blocks (210, 220, 230).

Although FIG. 2 illustrates the block division of the first executable file, the second executable file may be divided into blocks through a similar process.

The first block (210) consists of the function call instruction (211) with two instructions before it and two after it, the second block (220) consists of the function call instruction (221) with two instructions before it and two after it, and the third block (230) consists of the function call instruction (231) with two instructions before it and two after it.

As a result, the set of assembly codes (261) making up the first block is {mov, push, call, neg, sbb}, the set of assembly codes (262) making up the second block is {xor, xor, call, mov, pop}, and the set of assembly codes (263) making up the third block is {push, push, call, pop, pop}.
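As an illustration of a set-based block comparison, the sketch below computes the Jaccard similarity between the three example blocks of FIG. 2. Treating each block as a set of distinct mnemonics (so repeated `xor`, `push`, and `pop` count once) and returning 0.0 for two empty blocks are assumptions made here; the source does not say how duplicates or empty blocks are handled, and in FIG. 2 these blocks all belong to the same file, so the comparison is purely illustrative.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over distinct mnemonics."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:          # convention assumed here for two empty blocks
        return 0.0
    return len(sa & sb) / len(union)


block1 = ["mov", "push", "call", "neg", "sbb"]
block2 = ["xor", "xor", "call", "mov", "pop"]
block3 = ["push", "push", "call", "pop", "pop"]

pairs = {
    "1-2": jaccard(block1, block2),  # 2 shared out of 7 distinct mnemonics
    "1-3": jaccard(block1, block3),  # 2 shared out of 6 distinct mnemonics
    "2-3": jaccard(block2, block3),  # 2 shared out of 5 distinct mnemonics
}
```

Because every block contains `call` by construction, the set-based score between two non-empty blocks is never zero; an implementation might choose to exclude or down-weight that mnemonic, although the source does not discuss this.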

Although FIG. 2 shows the first, second, and third blocks each including two instructions before and two after the function call instruction, in an actual implementation each block may include a different number of instructions.

Next, in step S130, the similarity between the blocks of the first executable file and the blocks of the second executable file is calculated.

The similarity calculation may comprise taking m blocks (m being a natural number) from the first executable file and n blocks (n being a natural number) from the second executable file, and calculating, for each of the m blocks of the first executable file, its similarity to the n blocks of the second executable file.

The m blocks of the first executable file and the n blocks of the second executable file used for the similarity calculation in step S130 do not necessarily mean all the blocks extracted from the first and second executable files. That is, m blocks selected from the blocks of the first executable file and n blocks selected from the blocks of the second executable file may be used for the similarity calculation.

FIG. 3 is a conceptual diagram illustrating the process of comparing similarity block by block between executable files in the software similarity detection method according to the present invention.

Referring to FIG. 3, a similarity is calculated for each of the m blocks (311, 312, 313; m = 3 in FIG. 3) of the first executable file (310) against each of the n blocks (321, 322, 323; n = 3 in FIG. 3) of the second executable file (320).

That is, a total of m × n similarity computations may be performed in the above example. The similarity may be computed as a Jaccard similarity or a cosine similarity based on the set or the order (sequence) of the assembly codes that make up each block.
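The m × n comparison can be sketched with a cosine similarity as the block-level measure. Representing each block as a vector of mnemonic counts is an assumption made for this example; the source only says the measure is based on the set or order of the assembly codes. The names `cosine` and `similarity_matrix` are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def cosine(a: Sequence[str], b: Sequence[str]) -> float:
    """Cosine similarity between mnemonic-count vectors of two blocks."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)   # a Counter returns 0 for missing keys
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def similarity_matrix(blocks_a: List[Sequence[str]],
                      blocks_b: List[Sequence[str]],
                      sim: Callable[[Sequence[str], Sequence[str]], float] = cosine
                      ) -> List[List[float]]:
    """Row i holds the similarity of block i of file A to every block of file B."""
    return [[sim(a, b) for b in blocks_b] for a in blocks_a]


file_a = [["mov", "push", "call", "neg", "sbb"],
          ["push", "push", "call", "pop", "pop"]]
file_b = [["push", "push", "call", "pop", "pop"],
          ["xor", "xor", "call", "mov", "pop"],
          ["mov", "push", "call", "neg", "sbb"]]

matrix = similarity_matrix(file_a, file_b)   # 2 rows (m) by 3 columns (n)
```

Passing the similarity function as a parameter keeps the m × n loop independent of whether a Jaccard or a cosine measure is chosen.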

Finally, step S130 may be configured to take, for each block of the first executable file, the highest of its similarity values with respect to the blocks of the second executable file, and to calculate the average of all these maximum similarity values as the similarity value between the first executable file and the second executable file.
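The final aggregation can be sketched as follows, given a precomputed m × n matrix of block similarities. The function name `file_similarity`, the example values, and the choice to return 0.0 when either file has no blocks are assumptions for illustration.

```python
from typing import List


def file_similarity(matrix: List[List[float]]) -> float:
    """Average, over the rows, of each row's maximum block similarity."""
    if not matrix or not matrix[0]:
        return 0.0                      # assumed convention when a file has no blocks
    best = [max(row) for row in matrix]  # best match in file 2 for each block of file 1
    return sum(best) / len(best)


# Hypothetical 3 x 3 block similarity values, not taken from the source.
example = [
    [0.2, 0.9, 0.1],
    [0.5, 0.4, 0.3],
    [1.0, 0.0, 0.0],
]
score = file_similarity(example)   # averages the row maxima 0.9, 0.5 and 1.0
```

Because the maxima are taken row by row, the result is computed from the point of view of the first file; transposing the matrix would in general give a different score.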

Configuration of the software similarity detection apparatus according to the present invention

FIG. 4 is a block diagram illustrating an example configuration of a software similarity detection apparatus that performs the software similarity detection method according to the present invention.

Referring to FIG. 4, an example configuration (400) of the software similarity detection apparatus according to the present invention may comprise a storage unit (410) that stores the executable files of first and second software to be compared, that is, a first executable file and a second executable file; a disassembly unit (420) that disassembles the first and second executable files; a block division unit (430) that divides the disassembled results of the first and second executable files into a plurality of blocks each containing a function call instruction; and a similarity calculation unit (440) that calculates the similarity between the blocks of the first executable file and the blocks of the second executable file.

First, the storage unit (410) is the component that stores the first executable file (411) and the second executable file (412), which are extracted from the first software and the second software under comparison and are the direct objects of the similarity comparison.

In the example of FIG. 4, the storage unit (410) is shown as storing only the first and second executable files, but it may also be connected to the disassembly unit (420), the block division unit (430), and the similarity calculation unit (440) described below, storing the output of each component so that the other components can refer to it. That is, the storage unit (410) may be configured to store the result of the disassembly unit (420) disassembling the first and second executable files, and may also be configured to store the block division result produced by the block division unit (430). The similarity calculation unit (440) may be configured to calculate the similarity by referring to the block division result and the disassembly result stored in the storage unit (410).
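The way the four units cooperate through the storage unit can be sketched in software as below. This is an illustrative arrangement only: the class name `SimilarityDetector`, the dictionary used as storage, the injected `disassemble` callable, and the toy disassembler are all assumptions, and a real system would plug in an actual disassembler.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Mnemonics = List[str]


@dataclass
class SimilarityDetector:
    # Stand-in for the disassembly unit (420): any callable mapping raw
    # bytes to a mnemonic list can be supplied.
    disassemble: Callable[[bytes], Mnemonics]
    window: int = 2
    # Stand-in for the storage unit (410): keeps intermediate results.
    store: Dict[str, object] = field(default_factory=dict)

    def split(self, ops: Mnemonics) -> List[Mnemonics]:
        """Block division unit (430): one block per call instruction."""
        return [ops[max(0, i - self.window): i + self.window + 1]
                for i, op in enumerate(ops) if op == "call"]

    @staticmethod
    def block_similarity(a: Mnemonics, b: Mnemonics) -> float:
        """Jaccard similarity over distinct mnemonics."""
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    def compare(self, exe1: bytes, exe2: bytes) -> float:
        """Similarity calculation unit (440): average of per-block maxima."""
        for name, exe in (("first", exe1), ("second", exe2)):
            ops = self.disassemble(exe)
            self.store[name + "_disassembly"] = ops
            self.store[name + "_blocks"] = self.split(ops)
        blocks1 = self.store["first_blocks"]
        blocks2 = self.store["second_blocks"]
        if not blocks1 or not blocks2:
            return 0.0
        best = [max(self.block_similarity(a, b) for b in blocks2)
                for a in blocks1]
        return sum(best) / len(best)


def toy_disassembler(data: bytes) -> Mnemonics:
    """Demonstration only: reads whitespace-separated mnemonics, not machine code."""
    return data.decode("ascii").split()


detector = SimilarityDetector(disassemble=toy_disassembler)
score = detector.compare(b"mov push call neg sbb", b"mov push call neg sbb")
```

Keeping the intermediate disassembly and block results in `store` mirrors the description of each unit writing its output to the storage unit for the next unit to read.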

Next, the disassembly unit (420) may be configured to read the first and second executable files from the storage unit (410), disassemble them, and store the disassembled results in the storage unit (410).

Rather than being an independent physical device, the disassembly unit (420) may be implemented as program code together with the block division unit (430) and the similarity calculation unit (440) described below, and may be configured as a combination of a processor and program code executed by that processor (central processing unit). The disassembly unit (420) may typically consist of the IDA Pro disassembler together with a processor and memory device capable of running the disassembler.

다음으로, 블록 분할부(430)은 저장부(410)로부터 제 1 실행 파일과 제 2 실행 파일의 디스어셈블된 결과를 독출하여, 디스어셈블된 결과를 복수의 블록으로 분할하고, 분할된 블록들을 저장부(410)에 저장하는 역할을 수행하도록 구성될 수 있다.Next, the block dividing unit 430 reads the disassembled result of the first executable file and the second executable file from the storage unit 410, divides the disassembled result into a plurality of blocks, And store it in the storage unit 410.

As mentioned above, the characteristics of an executable file are determined mainly by the APIs (Application Programming Interfaces) provided by the operating system and by user-defined functions written by the developer. With this in mind, the block dividing unit of the software similarity detection apparatus according to the present invention divides the executable file into blocks, each of which includes a function call instruction (for example, the call instruction in assembly code).

In addition, each block also includes the instructions immediately before and after its function call instruction, so that the block better represents its own characteristics.

The number of instructions before and after the function call instruction included in each block may be a fixed value, or it may be determined based on the arguments passed to the function at the time of the call. Alternatively, it may be determined based on the return value of the function, or based on both the arguments and the return value of the function.

For example, when the called function has many arguments (or return values), the number of instructions before and after the function call instruction included in the block may be increased; when it has few, that number may be decreased. The number of instructions included in a block may also be adjusted according to the data types or other characteristics of the arguments or the return value.
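The block division described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the disassembly is assumed to be a list of text lines whose first token is the mnemonic, the helper names (`mnemonic`, `window_from_pushes`, `split_into_blocks`) and the sample listing are hypothetical, and the argument count is approximated by counting consecutive `push` instructions right before the call, which is a heuristic that only fits stack-based x86 calling conventions.

```python
from typing import Callable, List, Optional


def mnemonic(instruction: str) -> str:
    """Return the lower-cased first token of a disassembly line ('' if empty)."""
    parts = instruction.split()
    return parts[0].lower() if parts else ""


def window_from_pushes(instructions: List[str], call_index: int, base: int = 1) -> int:
    """Hypothetical heuristic: widen the window by one per consecutive
    `push` directly preceding the call (a rough proxy for argument count)."""
    pushes = 0
    j = call_index - 1
    while j >= 0 and mnemonic(instructions[j]) == "push":
        pushes += 1
        j -= 1
    return base + pushes


def split_into_blocks(
    instructions: List[str],
    window: int = 1,
    window_for_call: Optional[Callable[[List[str], int], int]] = None,
) -> List[List[str]]:
    """Build one block per call instruction: the call itself plus `w`
    instructions before and after it (clipped at the listing boundaries)."""
    blocks = []
    for i, ins in enumerate(instructions):
        if mnemonic(ins) != "call":
            continue
        # Fixed window unless a per-call policy is supplied.
        w = window_for_call(instructions, i) if window_for_call else window
        blocks.append(instructions[max(0, i - w): i + w + 1])
    return blocks


# Illustrative listing (not taken from any real binary).
LISTING = [
    "mov eax, [ebp+8]",
    "push eax",
    "push 0",
    "call CreateFileA",
    "mov [ebp-4], eax",
    "cmp eax, -1",
    "je fail",
    "push ebx",
    "call CloseHandle",
    "ret",
]

fixed = split_into_blocks(LISTING, window=1)
variable = split_into_blocks(LISTING, window_for_call=window_from_pushes)
```

With the fixed window of 1, each block holds three instructions. With the push-counting policy, the block around `call CreateFileA` grows to a window of 3 and the block around `call CloseHandle` to a window of 2. A wide window can reach a neighbouring call, so a stricter variant would clip each block at the adjacent call instructions.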

Since an example of the block division process has already been described with reference to FIG. 2, further description is omitted here.

Finally, the similarity calculation unit 440 is the component that calculates the similarity between the blocks of the first executable file and the blocks of the second executable file.

The similarity calculation unit 440 may be configured to take m blocks (m is a natural number) from the first executable file and n blocks (n is a natural number) from the second executable file and, for each of the m blocks of the first executable file, calculate its similarity to each of the n blocks of the second executable file.

Note that the m blocks of the first executable file and the n blocks of the second executable file used in the similarity calculation do not necessarily mean all of the blocks extracted from the respective files. That is, m blocks selected from the blocks constituting the first executable file and n blocks selected from the blocks constituting the second executable file may be used for the similarity calculation. The similarity may be calculated by a Jaccard similarity operation or a cosine similarity operation based on the set or the order (sequence) of the assembly codes constituting each block.
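As a rough illustration of the two measures named above, the following Python sketch computes a Jaccard similarity over the set of mnemonics in each block and a cosine similarity over their frequency vectors. The text does not specify how the order of instructions is encoded; using n-grams of consecutive mnemonics for that purpose is an assumption made here, and the function names (`jaccard`, `cosine`, `ngrams`) and sample blocks are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def jaccard(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the distinct items of a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # convention chosen here: two empty blocks are identical
    return len(sa & sb) / len(sa | sb)


def cosine(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cosine similarity between the item-frequency vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ngrams(seq: Sequence[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Consecutive n-grams, used here (as an assumption) to capture order."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Two illustrative blocks, reduced to their mnemonics.
block_a = ["push", "push", "call", "mov"]
block_b = ["push", "call", "mov", "cmp"]

set_based = jaccard(block_a, block_b)                    # 3 shared of 4 distinct
order_based = jaccard(ngrams(block_a), ngrams(block_b))  # 2 shared bigrams of 4
freq_based = cosine(block_a, block_b)
```

The set-based score ignores both repetition and order, the cosine score accounts for how often each mnemonic occurs, and the bigram variant penalises blocks that use the same instructions in a different order.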

The concept of block-by-block similarity calculation is the same as described above with reference to FIG. 3, so a redundant description is omitted.

Finally, the similarity calculation unit 440 may be configured to take, for each block of the first executable file, the highest of its similarity values against the blocks of the second executable file, and to output the average of all these maximum similarity values as the similarity value between the first executable file and the second executable file.
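The aggregation step can be sketched as follows: for each block of the first file, keep the best match among the blocks of the second file, then average those maxima. This is a minimal sketch, assuming blocks are lists of mnemonics; the function names and the set-based Jaccard used as the per-block measure are illustrative choices, and returning 0.0 for an empty block list is a convention chosen here rather than something the text specifies.

```python
from typing import Callable, List

Block = List[str]


def jaccard_blocks(a: Block, b: Block) -> float:
    """Set-based Jaccard similarity between two blocks of mnemonics."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def file_similarity(
    blocks_a: List[Block],
    blocks_b: List[Block],
    block_sim: Callable[[Block, Block], float] = jaccard_blocks,
) -> float:
    """Average, over the m blocks of the first file, of the maximum
    similarity to any of the n blocks of the second file."""
    if not blocks_a or not blocks_b:
        return 0.0  # convention chosen for this sketch
    best_matches = [
        max(block_sim(a, b) for b in blocks_b)  # best of n comparisons
        for a in blocks_a                       # repeated for each of m blocks
    ]
    return sum(best_matches) / len(best_matches)


# Illustrative input: m = 2 blocks versus n = 3 blocks.
first_file = [["push", "call", "mov"], ["xor", "call", "ret"]]
second_file = [
    ["push", "call", "mov"],
    ["lea", "call", "test"],
    ["xor", "call", "pop"],
]

score = file_similarity(first_file, second_file)  # (1.0 + 0.5) / 2 = 0.75
```

Because the maxima are taken from the perspective of the first file, the score is directional: swapping the two arguments generally gives a different value.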

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that various modifications and changes can be made to the invention without departing from its spirit and scope as set forth in the following claims.

400: Software similarity detection apparatus
410: Storage unit
420: Disassembly unit 430: Block dividing unit
440: Similarity calculation unit

Claims

A software similarity detection method comprising:
(A) disassembling the executable files of first and second software to be compared, namely a first executable file and a second executable file;
(B) dividing the disassembled result of each of the first and second executable files into a plurality of blocks each including one function call instruction; and
(C) calculating a degree of similarity between the blocks of the first executable file and the blocks of the second executable file,
wherein each of the plurality of blocks includes its one function call instruction and a predetermined number of other instructions before and after the one function call instruction.

(Deleted)

The method according to claim 1,
wherein the number of instructions before and after the one function call instruction included in each block is determined based on an argument of the function.

The method according to claim 1,
wherein the number of instructions before and after the one function call instruction included in each block is determined based on a return value of the function.

The method according to claim 1,
wherein step (C) comprises calculating, for each of m blocks (m is a natural number) of the first executable file, similarity values with respect to n blocks (n is a natural number) of the second executable file.

The method of claim 5,
wherein the similarity value is calculated using a Jaccard similarity or a cosine similarity based on the set or the order of the assembly codes constituting the blocks being compared.

The method of claim 5,
further comprising determining, for each of the m blocks of the first executable file, the highest of the similarity values with respect to the n blocks of the second executable file, and calculating the average of the highest similarity values as the similarity value between the first executable file and the second executable file.

A software similarity detection apparatus comprising:
a storage unit for storing the executable files of first and second software to be compared, namely a first executable file and a second executable file;
a disassembly unit for disassembling the first and second executable files;
a block dividing unit for dividing the disassembled result of each of the first and second executable files into a plurality of blocks each including one function call instruction; and
a similarity calculation unit for calculating a degree of similarity between the blocks of the first executable file and the blocks of the second executable file,
wherein each of the plurality of blocks includes its one function call instruction and a predetermined number of other instructions before and after the one function call instruction.

(Deleted)

The apparatus of claim 8,
wherein the number of instructions before and after the one function call instruction included in each block is determined based on an argument of the function.

The apparatus of claim 8,
wherein the number of instructions before and after the one function call instruction included in each block is determined based on a return value of the function.

The apparatus of claim 8,
wherein the similarity calculation unit calculates, for each of m blocks (m is a natural number) of the first executable file, similarity values with respect to n blocks (n is a natural number) of the second executable file.

The apparatus of claim 12,
wherein the similarity value is calculated using a Jaccard similarity or a cosine similarity based on the set or the order of the assembly codes constituting the blocks being compared.

The apparatus of claim 12,
wherein the similarity calculation unit determines, for each of the m blocks of the first executable file, the highest of the similarity values with respect to the n blocks of the second executable file, and calculates the average of the highest similarity values as the similarity value between the first executable file and the second executable file.