KR101544253B1

KR101544253B1 - Method for detecting software plagiarism based upon analysis on call frequency of application programming interfaces

Info

Publication number: KR101544253B1
Application number: KR1020120130545A
Authority: KR
Inventors: 김상욱; 조성제; 이상철; 하지운; 채동규
Original assignee: 단국대학교 산학협력단
Priority date: 2012-11-16
Filing date: 2012-11-16
Publication date: 2015-08-12
Also published as: KR20140063322A

Abstract

A method for detecting software plagiarism through analysis of application programming interface (API) call frequency is disclosed. The software plagiarism detection method according to the present invention may comprise an API call characteristic information extraction step of extracting API call characteristic information of a first program and a second program, and a similarity determination step of comparing the API call characteristic information of the first program with that of the second program to determine the similarity between the two programs. Accordingly, software plagiarism can be detected effectively using only software birthmarks extracted by statically analyzing the executable files of the programs under comparison, without obtaining their source code or dynamically analyzing their executable files.

Description

Method for detecting software plagiarism based on application programming interface call frequency analysis

The present invention relates to a software plagiarism detection method and, more particularly, to a method of determining whether software has been plagiarized by using a birthmark that characterizes the software.

Software plagiarism is on the rise in the software industry. Software plagiarism means developing software by using the source code of software developed by others, or open source code, without permission or any license.

Conventional software plagiarism detection techniques fall broadly into two approaches. The first compares the similarity between the source code of the programs under comparison, and the second compares the similarity between their executable files.

The first approach compares token sequences or syntax trees extracted from the source code. Because it fundamentally requires the source code of the software under comparison, it is difficult to apply when the source code is not readily available.

The second approach compares feature information (hereinafter, birthmarks) extracted from program executable files. However, most existing research on such feature information focuses on defining and extracting the features of Java programs, and there has been no research on defining or extracting the features of Windows programs.

An object of the present invention, devised to solve the above problems, is to provide a method of detecting software plagiarism by comparing static software birthmarks extracted through static analysis of the program executable files that make up the software.

To achieve the above object, the present invention provides a software plagiarism detection method comprising an API call characteristic information extraction step of extracting API (Application Programming Interface) call characteristic information of a first program and a second program, and a similarity determination step of comparing the API call characteristic information of the first program with the API call characteristic information of the second program to determine the similarity between the first program and the second program.

Here, the API call characteristic information extraction step may comprise a step of creating an API function call graph for each of the first program and the second program, and a step of generating the API call characteristic information of the first program and the second program from the respective API function call graphs.

In this case, the step of creating the API function call graph may be configured to disassemble the first program and the second program and to create the API function call graph from the disassembled result.

In this case, the API function call graph may be configured to define the call relationships among the functions that make up the first or second program, and the call relationships from at least some of those functions to the APIs used in the first or second program.

Here, the API call characteristic information generated in the generating step may be expressed as a vector whose elements are the per-API call counts of the APIs used in the first or second program.

In this case, the API call characteristic information may be constructed by multiplying each element by a per-API weight determined by the TF-IDF (Term Frequency-Inverse Document Frequency) method. The per-API weight may be generated based on the value obtained by multiplying the number of calls to the API in the target program by the reciprocal of the number of programs that use the API.

Here, the similarity determination step may be configured to calculate a similarity value indicating the similarity between the first program and the second program through a cosine similarity operation on the API call characteristic information of the first program and that of the second program.

In this case, the similarity determination step may be configured to determine whether the first program and the second program are in a plagiarism relationship by determining which interval, as defined by a predetermined threshold, the similarity value between the two programs falls into.

The predetermined threshold may be determined experimentally.

The software plagiarism detection method according to the present invention as described above can be applied to the field of software copyright protection.

The detection method according to the present invention can effectively detect software plagiarism using only software birthmarks extracted by statically analyzing the executable files of the programs under comparison, without obtaining their source code or dynamically analyzing their executable files.

FIG. 1 is a flowchart illustrating an embodiment of the software plagiarism detection method according to the present invention.
FIG. 2 is a conceptual diagram illustrating the function call graph used to extract API call characteristic information according to the present invention.
FIG. 3 is a table listing the 40 programs used as experimental subjects for the plagiarism detection method according to the present invention.
FIG. 4 is a graph illustrating an experimental example in which the software plagiarism detection method according to the present invention is applied.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope. Like reference numerals are used for like elements in describing each drawing.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of those items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

First, the terms used in the present invention are summarized.

A software birthmark is feature information unique to a program, and can be extracted from the program's executable file (the EXE file of a Windows program or the bytecode of a Java program).

Let f(p) be the set of features extracted from a program p. Then f(p) can be called the birthmark of program p if the following two conditions are satisfied.

First, f(p) must be obtained from program p itself.

Second, when a program q has been created based on the source code of program p, this is defined as software plagiarism, and in that case f(p) = f(q) holds. Alternatively, programs q and p may also be defined as being in a plagiarism relationship when f(p) and f(q) are similar to a considerable degree.

Software birthmarks can be classified into two types: static birthmarks and dynamic birthmarks. A static birthmark is based on features extracted from the program's executable file itself, whereas a dynamic birthmark is based on features that appear during execution of the executable file.

A static birthmark is robust against plagiarism techniques that apply various source code transformations, but it has the drawback of a high probability of false alarms, in which software of a similar kind is mistakenly detected as plagiarized.

A dynamic birthmark, on the other hand, depends on the program's input and execution environment and can therefore change easily. It also fails to capture the global characteristics of the program, providing only the local characteristics of the portion of code that is actually executed.

API call characteristic information is the information used as the static software birthmark in the software plagiarism detection method according to the present invention. It may refer to a vector in which the value of each element is the number of calls the target program makes to the API corresponding to that element.

The API function call graph may refer to the analysis tool used to generate the API call characteristic information in the software plagiarism detection method according to the present invention.

Software plagiarism detection method according to the present invention

The present invention relates to a plagiarism detection technique that uses the static birthmark among the software birthmarks described above.

FIG. 1 is a flowchart illustrating an embodiment of the software plagiarism detection method according to the present invention.

Referring to FIG. 1, the software plagiarism detection method according to the present invention may comprise an API call characteristic information extraction step (S110) and a similarity determination step (S120) that determines the similarity between two programs using the extracted API call characteristic information.

In a more detailed embodiment, the API call characteristic information extraction step (S110) may comprise a step (S112) of creating API function call graphs for the programs under comparison and a step (S113) of generating the API call characteristic information of those programs from the created API function call graphs.

Hereinafter, the programs under comparison are defined as a first program (p) and a second program (q). The API call characteristic information may be the software birthmark of each program, as described above.

The step of creating the API function call graph (S112) may further include a step (S111) of disassembling each program (the first and second programs) in order to create their API function call graphs. The IDA Pro disassembler is a typical choice for disassembly, but various other disassemblers may be used.

The API function call graph may be created by first disassembling the entire executable code of the program and then building the graph from the disassembled output, or disassembly and graph construction may be carried out simultaneously.

The concept of the API function call graph is described below. In the present invention, the API function call graph is used as a means of extracting the API call characteristic information that serves as the software birthmark, and it can be created for each of the programs under comparison (the first program and the second program). However, the means of extracting the API call characteristic information according to the present invention is not necessarily limited to the API function call graph.

FIG. 2 is a conceptual diagram illustrating the function call graph used to extract API call characteristic information according to the present invention.

Referring to FIG. 2, the API function call graph (200) may comprise nodes (211 to 216) representing the functions (f₁, f₂, ..., f₆) that make up the program under comparison (the first or second program) and nodes (221, 222) representing the APIs (a₁, a₂) used by the program. The graph expresses the call relationships among these functions and APIs as edges, that is, links between nodes.
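The graph structure described above can be sketched as a plain adjacency map. This is only an illustration: the text does not list the edges of FIG. 2, so the edge list `CALL_EDGES` below is hypothetical, as are the names `build_call_graph` and `API_NODES`.

```python
# Hypothetical (caller, callee) edges; the actual edges of FIG. 2 are not given in the text.
CALL_EDGES = [
    ("f1", "f4"), ("f4", "a1"),
    ("f2", "f5"), ("f3", "f5"), ("f5", "a2"),
    ("f6", "a1"), ("f6", "a2"),
]
API_NODES = {"a1", "a2"}  # nodes that stand for APIs rather than program functions


def build_call_graph(edges):
    """Return an adjacency map: node -> set of nodes it calls directly."""
    graph = {}
    for caller, callee in edges:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())  # make sure leaf nodes also appear as keys
    return graph


graph = build_call_graph(CALL_EDGES)
function_nodes = sorted(node for node in graph if node not in API_NODES)
print(function_nodes)
```

Keeping API nodes and function nodes in the same map mirrors the description, where both kinds of node live in one graph and differ only in what they represent.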

The API function call graph makes it possible to determine the per-API call counts of the program it describes. In step S113, the per-API call counts are obtained from the created API function call graph and the API call characteristic information is generated from them.

In the API function call graph, the call count of an API a can be defined as the number of nodes that satisfy the following two conditions.

Condition 1) The node has no incoming edge.

Condition 2) The node can reach the node corresponding to API a, directly or indirectly.

Returning to FIG. 2 with conditions 1 and 2 in mind, the target program uses two APIs (a₁ and a₂); the call count of a₁ is 2 and the call count of a₂ is 3. For a₁, two function nodes satisfy both condition 1 and condition 2: node 211 for function f₁ and node 216 for function f₆. For a₂, three function nodes satisfy both conditions: node 212 for function f₂, node 213 for function f₃, and node 216 for function f₆.
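The counting rule can be sketched as follows. The edge list is hypothetical (FIG. 2's edges are not given in the text) and was chosen only so that it reproduces the counts stated above; `count_api_calls` and `reachable` are illustrative names, not part of the patent.

```python
def reachable(graph, start):
    """Return every node reachable from start by following one or more call edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


def count_api_calls(edges, api_nodes):
    """Count, per API, the function nodes with no incoming edge that can reach it."""
    api_nodes = set(api_nodes)
    graph, has_incoming = {}, set()
    for caller, callee in edges:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())
        has_incoming.add(callee)
    # Condition 1: function nodes without an incoming edge.
    roots = [n for n in graph if n not in has_incoming and n not in api_nodes]
    counts = {api: 0 for api in api_nodes}
    for root in roots:
        # Condition 2: the API node is reachable directly or indirectly.
        for api in reachable(graph, root) & api_nodes:
            counts[api] += 1
    return counts


# Hypothetical edges consistent with the counts described for FIG. 2.
edges = [
    ("f1", "f4"), ("f4", "a1"),
    ("f2", "f5"), ("f3", "f5"), ("f5", "a2"),
    ("f6", "a1"), ("f6", "a2"),
]
counts = count_api_calls(edges, {"a1", "a2"})
print(counts["a1"], counts["a2"])  # 2 3
```

In this sketch f₄ and f₅ are skipped because they have incoming edges, while f₆ contributes to both counts because it reaches both APIs.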

Once step S113 has used the API function call graph to obtain the APIs the program uses and the call count of each API, the API call characteristic information is generated as the software birthmark.

In the present invention, the API call characteristic information may be expressed as a vector whose elements are the call counts of each API available to the program. For example, if 1000 APIs are available to the target program, the API call characteristic information can be expressed as a vector with 1000 elements corresponding to APIs a₁ to a₁₀₀₀. As one example, if the target program is a Windows program, the total number (n) of APIs defined in MSDN (Microsoft Developer Network) may be the number of elements (n) of the target program's API call characteristic information vector.
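As a small illustration of this vector form, the sketch below maps per-API call counts onto a fixed, ordered API vocabulary. The four-entry vocabulary and the function name `to_birthmark_vector` are hypothetical; a real vocabulary would list all n available APIs.

```python
def to_birthmark_vector(call_counts, api_vocabulary):
    """Return one element per API in the vocabulary; APIs never called get 0."""
    return [call_counts.get(api, 0) for api in api_vocabulary]


# Hypothetical vocabulary of n = 4 APIs; the order fixes the meaning of each element.
API_VOCABULARY = ["a1", "a2", "a3", "a4"]

vector = to_birthmark_vector({"a1": 2, "a2": 3}, API_VOCABULARY)
print(vector)  # [2, 3, 0, 0]
```

Using one shared vocabulary for every program is what makes two vectors comparable element by element.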

In principle, the API call characteristic information should be implemented as a vector whose number of elements equals the number (n) of all existing APIs (an n-dimensional vector). However, APIs that perform essential tasks such as exception handling, thread generation, process termination, and memory allocation are used by most programs, so including their call counts in the API call characteristic information can make it difficult to distinguish between programs.

Accordingly, step S113 of the present invention may generate the API call characteristic information by applying per-API weights obtained with the TF-IDF (term frequency-inverse document frequency) technique.

TF-IDF is a technique widely used in the field of information retrieval; it is a numerical statistic that can be used to evaluate the importance of the terms in a document. TF (term frequency) is the number of times a term occurs in a document, and DF (document frequency) is the number of documents that contain the term. IDF is the reciprocal of DF, and TF-IDF is the product of TF and IDF.

When the TF-IDF technique is applied to extract the API call characteristic information of the present invention, TF corresponds to the number of calls to an API within a program, and DF corresponds to the number of programs that call that API.

Accordingly, step S113 of the present invention may include multiplying the TF and IDF values, obtained experimentally for each API, as a weight into the corresponding per-API element of the vector representing the API call characteristic information.

As a result, TF acts as a weight that is high for APIs used frequently within a program, and IDF (the reciprocal of DF) acts as a weight that is low for APIs commonly used across many programs. This means that an API that is not widely used in other programs but is used heavily in a particular program can be an API that characterizes that program.
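The weighting can be sketched as below. This is a minimal reading of the description, not a definitive implementation: it assumes each weighted element equals TF × IDF with IDF = 1/DF as the text defines it (many TF-IDF variants use a logarithm instead), and the three count vectors and the function names are hypothetical.

```python
def document_frequency(count_vectors):
    """For each API position, count how many programs call that API at least once."""
    n = len(count_vectors[0])
    return [sum(1 for vec in count_vectors if vec[i] > 0) for i in range(n)]


def tf_idf_vectors(count_vectors):
    """Weight every call count (TF) by the reciprocal of its document frequency (IDF)."""
    df = document_frequency(count_vectors)
    return [
        [tf / df[i] if df[i] else 0.0 for i, tf in enumerate(vec)]
        for vec in count_vectors
    ]


# Hypothetical call-count vectors for three programs over the same three APIs.
counts = [
    [4, 1, 0],
    [2, 1, 0],
    [0, 1, 5],
]
weighted = tf_idf_vectors(counts)
print(document_frequency(counts))  # [2, 3, 1]
print(weighted[2])
```

In this example the second API is called by all three programs, so its weight shrinks to a third of its count, while the third API is called by only one program and keeps its full count.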

Next, the similarity determination step (S120) calculates a similarity by comparing the API call characteristic information of the target programs extracted in step S110. The calculated similarity can be compared with a predetermined threshold to determine whether the target programs are in a plagiarism relationship with each other. The predetermined threshold may be determined experimentally.

Because the API call characteristic information extracted in step S110 is expressed as a vector, the similarity determination in step S120 can be made using a similarity value derived from a cosine similarity calculation, which measures the similarity between vectors.

Equation 1 below is an example of a formula for calculating the similarity value between the first program and the second program.
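The equation itself is not reproduced in this text. Assuming Equation 1 is the standard cosine similarity that the description names, and using the symbols the description defines, with n assumed to be the number of elements in each vector, it can be written as:

```latex
\mathrm{SIM}(V_p, V_q) =
  \frac{\sum_{i=1}^{n} \{V_p\}_i \, \{V_q\}_i}
       {\sqrt{\sum_{i=1}^{n} \{V_p\}_i^{2}} \; \sqrt{\sum_{i=1}^{n} \{V_q\}_i^{2}}}
```

Because every element is a non-negative call count (or a non-negative weighted count), the value lies between 0 and 1.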

Here, V_p and V_q are the API call characteristic information (that is, the software birthmarks) of the first program and the second program, respectively, each expressed as a vector.

{V_p}_i and {V_q}_i denote the i-th elements of the vectors V_p and V_q, respectively, and the result of Equation 1 takes a value from 0 to 1.
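A minimal sketch of the cosine similarity computation on two birthmark vectors follows. The handling of an all-zero vector (returning 0.0) is an assumption, since the text does not cover a program that calls no APIs; the function name is illustrative.

```python
import math


def cosine_similarity(v_p, v_q):
    """Cosine of the angle between two equal-length birthmark vectors."""
    if len(v_p) != len(v_q):
        raise ValueError("vectors must have the same number of elements")
    dot = sum(a * b for a, b in zip(v_p, v_q))
    norm_p = math.sqrt(sum(a * a for a in v_p))
    norm_q = math.sqrt(sum(b * b for b in v_q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumption: an empty birthmark is treated as similar to nothing
    return dot / (norm_p * norm_q)


print(cosine_similarity([1, 0], [0, 1]))  # 0.0
print(cosine_similarity([2, 3, 0], [4, 6, 0]))
```

The second call compares two vectors that differ only in scale, so the similarity is at its maximum: cosine similarity looks at the proportions of API usage, not at the absolute call counts.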

Step S120 may additionally be configured to compare the similarity value calculated by Equation 1 with predetermined thresholds to determine whether the target programs are plagiarized or not.

Here, the predetermined threshold (ε) determines three intervals against which the similarity value from Equation 1 is compared. For example, SIM(V_p, V_q) calculated by Equation 1 can be used to decide whether the target programs are plagiarized according to the following relationships with the threshold (ε). The threshold (ε) may be determined experimentally; an example of such an experiment is described later.

1) If SIM(V_p, V_q) >= 1 - ε, the first program and the second program are judged to be in a plagiarism relationship.

2) If SIM(V_p, V_q) <= ε, the first program and the second program are judged to be in a non-plagiarism relationship (that is, each is an independent program).

3) If ε < SIM(V_p, V_q) < 1 - ε, the conclusion on whether the first program and the second program are plagiarized is withheld.
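The three cases above can be sketched as a small decision function. The function name and the returned labels are illustrative, and the restriction of ε to at most 0.5 is an assumption made so that the "plagiarism" and "non-plagiarism" intervals cannot overlap.

```python
def classify_pair(similarity, epsilon):
    """Map a similarity value to one of the three intervals defined by epsilon."""
    if not 0.0 <= epsilon <= 0.5:
        raise ValueError("epsilon is assumed to lie in [0, 0.5]")
    if similarity >= 1.0 - epsilon:
        return "plagiarism"
    if similarity <= epsilon:
        return "non-plagiarism"
    return "undecided"


# Example with epsilon = 0.4, the value reported as best in the experiment.
for sim in (0.9, 0.5, 0.1):
    print(sim, classify_pair(sim, 0.4))
```

A larger ε widens both decided intervals and narrows the undecided band between them, so fewer pairs are left without a verdict.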

Experimental example (experimental study)

An experimental example involving 40 programs from 10 categories is described below to verify the accuracy of the software plagiarism detection method according to the present invention.

FIG. 3 is a table listing the 40 programs used as experimental subjects for the plagiarism detection method according to the present invention.

Referring to FIG. 3, four programs were selected for each of 10 categories (text editors, FTP clients, and so on).

Within each category, the four programs form two pairs (for example, programs 1 and 2, and programs 3 and 4). The two programs in a pair are different versions of the same program, whereas the two pairs (the pair of 1 and 2 versus the pair of 3 and 4) are entirely different programs. It can therefore be assumed that the programs within a pair will be highly similar and should be judged 'plagiarism', while programs from different pairs will have low similarity and should be judged 'non-plagiarism'.

To verify the accuracy of the detection method of the present invention, three accuracy measures (precision, recall, and F-measure) are defined in Equations 2 to 4 below.

In Equations 2 to 4, CC denotes the set of program pairs classified as 'plagiarism' by a human expert, and CI denotes the set of program pairs classified as 'non-plagiarism' by a human expert. For the 40 programs shown in FIG. 3, the paired programs in each category (for example, 1 and 2, or 3 and 4) can be classified as 'plagiarism', and programs taken from different pairs (one from the pair of 1 and 2, the other from the pair of 3 and 4) can be classified as 'non-plagiarism'.

In addition, PC denotes the set of program pairs classified as 'plagiarism' by the method according to the present invention, and PI denotes the set of program pairs classified as 'non-plagiarism' by the method according to the present invention.
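Equations 2 to 4 themselves are not reproduced in this text. As a rough illustration only, the sketch below uses the conventional definitions of precision, recall, and F-measure for the 'plagiarism' class, computed from CC and PC. This is an assumption: the patent's equations may differ, for example by also involving CI and PI. The function name `accuracy_metrics` is hypothetical.

```python
def accuracy_metrics(cc: set, pc: set) -> tuple:
    """Return (precision, recall, F-measure) for the 'plagiarism' class.

    cc: pairs labeled 'plagiarism' by the human expert.
    pc: pairs labeled 'plagiarism' by the detection method.
    Assumed conventional definitions, not the patent's Equations 2 to 4.
    """
    correct = len(cc & pc)                       # pairs both sides call plagiarism
    precision = correct / len(pc) if pc else 0.0
    recall = correct / len(cc) if cc else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure
```

Pairs for which the method withholds a conclusion belong to neither PC nor PI, which is one reason precision and recall can differ at a given threshold.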

FIG. 4 is a graph illustrating an experimental example in which the software plagiarism detection method according to the present invention is applied.

In FIG. 4, the x-axis represents the threshold value (ε), and the y-axis represents the accuracy measured at that threshold value.

Referring to FIG. 4, the detection method according to the present invention achieves its highest accuracy (precision: 0.97, recall: 0.95, F-measure: 0.96) when the threshold value (ε) is set to 0.4.

As mentioned above, when the threshold value (ε) is set to 0.4, the compared programs are judged as non-plagiarism if the similarity value falls in the range [0, 0.4], and as plagiarism if the similarity value falls in the range [0.6, 1.0].

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the following claims.

211 to 216: function nodes
221, 222: API nodes

Claims

A software plagiarism detection method comprising:
an API call characteristic information extraction step of extracting API (Application Programming Interface) call characteristic information of a first program and a second program;
a step of assigning a preset per-API weight to the API call characteristic information of the first program and the API call characteristic information of the second program; and
a similarity determination step of comparing the API call characteristic information of the first program with the API call characteristic information of the second program to determine a similarity between the first program and the second program,
wherein, when nodes that have no incoming edge and that directly or indirectly reach an API are connected to the APIs by the API function calls of each of the first program and the second program, the API call characteristic information extraction step obtains the number of nodes connected to each API for each of the first program and the second program, and generates, based on the number of nodes connected to each API, an API call frequency birthmark of each of the first program and the second program as the API call characteristic information.

The method according to claim 1,
Wherein the API call characteristic information extraction step generates an API function call graph for each of the first program and the second program.

delete

The method according to claim 1,
Wherein the API call characteristic information is expressed as a vector whose elements are the call count values of each API used in the first program or the second program.

The method according to claim 1,
Wherein the weight for each API is determined by a TF-IDF (Term Frequency-Inverse Document Frequency) method.

The method of claim 6,
Wherein the weight for each API is generated based on a value obtained by multiplying the number of calls of each API in each of the first program and the second program by the reciprocal of the number of programs that call that API.

The method according to claim 1,
Wherein the similarity determination step calculates a similarity value indicating the degree of similarity between the first program and the second program through a cosine similarity calculation on the API call characteristic information of the first program and the API call characteristic information of the second program.

The method of claim 8,
Wherein the similarity determination step determines whether the first program and the second program are plagiarized according to whether the similarity value between the first program and the second program belongs to an interval set by a predetermined threshold value.

The method of claim 9,
Wherein the predetermined threshold is determined experimentally.

A software plagiarism detection method comprising:
(a) obtaining the number of nodes connected to each API when nodes that have no incoming edge and that directly reach one or more APIs (application programming interfaces) are connected to the one or more APIs by the API function calls of a program;
(b) generating an API call frequency birthmark of the program based on the number of nodes connected to each of the APIs;
(c) assigning a preset per-API weight to each API of the API call frequency birthmark; and
(d) calculating a similarity between the API call frequency birthmark of a first program, which corresponds to the program and to which the per-API weights are applied, and the API call frequency birthmark of a second program to which the per-API weights are applied.

The method of claim 11,
Wherein the weight for each API is determined by multiplying the frequency of each API in the program by the reciprocal of the number of programs, among the first program and the second program, that call that API.
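The claimed steps can be illustrated end to end with a small sketch. It is not the patented implementation: the call graph is assumed to be given as an adjacency dictionary (how it is extracted from an executable is not shown), reachability is taken to include indirect calls, cosine similarity is used as the comparison, and all function names and the API names in the usage note are hypothetical.

```python
import math
from collections import Counter


def call_frequency_birthmark(call_graph: dict, apis: set) -> Counter:
    """Count, for each API, the root nodes (no incoming edge) that reach it."""
    callees = {t for targets in call_graph.values() for t in targets}
    roots = [n for n in call_graph if n not in callees and n not in apis]
    birthmark = Counter()
    for root in roots:
        seen, stack = set(), [root]
        while stack:                      # depth-first walk from this root
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(call_graph.get(node, ()))
        for api in seen & apis:           # each root counts once per API
            birthmark[api] += 1
    return birthmark


def apply_weights(birthmarks: list) -> list:
    """Multiply each count by 1 / (number of programs that call the API)."""
    callers = Counter(api for b in birthmarks for api in b)
    return [{api: n / callers[api] for api, n in b.items()} for b in birthmarks]


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(api, 0.0) for api, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For example, with the graph `{"main": ["f", "g"], "f": ["CreateFile"], "g": ["f", "ReadFile"], "init": ["ReadFile"]}` and the API set `{"CreateFile", "ReadFile"}`, the roots are `main` and `init`, so the birthmark counts `CreateFile` once and `ReadFile` twice. The resulting similarity value can then be compared against the threshold ε.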