KR100863943B1 - Plagiarism detecting method and plagiarism detecting apparatus

Info

Publication number: KR100863943B1
Application number: KR1020070099837A
Authority: KR
Inventors: 조환규; 류창건
Original assignee: 부산대학교 산학협력단
Priority date: 2007-10-04
Filing date: 2007-10-04
Publication date: 2008-10-16

Abstract

A method and an apparatus for detecting plagiarism are provided that reflect the characteristics of free word order languages such as Korean or Japanese, so that plagiarism in a document can be detected quickly without loss of precision. A document is received through an input unit, the document to be compared is loaded from a storage unit, the paired documents are divided into division units, and a partial similarity is calculated for each division unit (S400). Partial similarities that are higher than a predetermined threshold are accumulated to calculate a document similarity between the paired documents, and plagiarism is determined on the basis of the overall plagiarism level (S500). The partial similarity is calculated by first checking a preliminary similarity of the compared portions, based on the matching ratio, for all division units without overlap, and then, when the preliminary similarity is higher than the threshold, checking an in-depth similarity that weights matches according to their position within the division unit.

Description

Plagiarism detecting method and plagiarism detecting apparatus

The present invention relates to a plagiarism detection system and a plagiarism detection method, and more particularly to a system and method that can locate plagiarized passages in documents written in a language with free word order, such as Korean or Japanese, calculate the overall degree of plagiarism, and output the direction and path of plagiarism.

In general, unlike languages with a fixed word order such as English or Chinese (hereinafter "fixed word order languages"), Korean and Japanese have the important characteristic of free word order (hereinafter "free word order languages").

For example, in English or Chinese the word order is fixed, and a sentence that departs from the prescribed order is incorrect, so a plagiarism search can take that fixed order into account. In a free word order language, however, the same sentence can easily be plagiarized merely by changing the word order, or by inserting, deleting, or substituting words.

The following examples show how a sentence changes as its word order changes.

1. 오늘 전국적으로 강풍을 동반한 비가 내릴 것이다. (Today, nationwide, rain accompanied by strong winds will fall.)

2. 전국적으로 오늘 비가 강풍을 동반하여 내릴 것이다. (Nationwide, today, rain will fall accompanied by strong winds.)

3. 비가 오늘 전국적으로 강풍을 동반하여 내릴 것이다. (Rain will fall today, nationwide, accompanied by strong winds.)

4. 강풍을 동반한 비가 전국적으로 오늘 내릴 것이다. (Rain accompanied by strong winds will fall nationwide today.)

5. 오늘 전국적으로 내릴 비는 강풍을 동반할 것이다. (The rain that will fall nationwide today will be accompanied by strong winds.)

6. 전국적으로 오늘 내릴 비는 강풍을 동반할 것이다. (The rain that will fall today, nationwide, will be accompanied by strong winds.)

The sentences above all differ in form but share the same meaning, so a correct plagiarism search should treat them as plagiarism of one another. Word order in a free word order language is so free that one could almost say there is no rule fixing the position of the subject, verb, object, or modifiers, and a wide variety of expressions is therefore possible.

Consequently, a method recognized as effective for fixed word order languages cannot accurately detect plagiarism in documents written in a free word order language, and in this respect free word order languages call for a different algorithm from the one used for fixed word order languages.

Because of these demanding characteristics, plagiarism detection for free word order languages has not been easy to carry out, and detection methods adapted from approaches recognized as effective for fixed word order languages such as English have not performed well on free word order languages.

In general, to detect plagiarism among sentences such as those above, one must either compare the similarity between all words or cut the text into single-syllable units, align them, and compare the similarity. Applying such a method to every sentence, however, takes a long time merely to measure the similarity between two documents.

For example, a Korean or Japanese document consists of a great many words and particles. Moreover, even a single word is divided into a stem and an ending, and the ending may be inflected. Separating out all of these words, particles, and stems and comparing them one-to-one would yield very accurate plagiarism results. With such an approach, however, even checking the reports submitted by a single class of students for plagiarism would take considerable time.

For example, suppose a class has 60 students and detecting plagiarism between two reports takes 1 minute. Checking the reports for plagiarism requires comparing every pair of reports one-to-one, so it takes 60 × 59 × 1 = 3,540 minutes, that is, 59 hours. A program that produces results only after running for 59 hours on the reports of a single class would be of little use.
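The arithmetic above can be reproduced with a short sketch. The function name is hypothetical, and the count follows the text in treating every report as being checked against every other report (60 × 59 ordered pairs).

```python
def total_comparison_minutes(num_reports: int, minutes_per_pair: float) -> float:
    """Time to compare every report against every other report.

    Follows the count used in the text: num_reports * (num_reports - 1)
    one-to-one comparisons, each taking minutes_per_pair minutes.
    """
    return num_reports * (num_reports - 1) * minutes_per_pair


minutes = total_comparison_minutes(60, 1)
hours = minutes / 60
print(minutes, hours)  # 3540 minutes, i.e. 59.0 hours
```

The cost grows quadratically with class size, which is why the per-pair comparison time dominates the usefulness of the program.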

For this reason, time is one of the important factors in a plagiarism detection program.

Furthermore, even where conventional techniques for judging plagiarism existed, in most cases the criteria for the judgment were unclear. Judgments therefore differed depending on the checking method, and objectivity could not be guaranteed.

Also, conventional techniques judged only whether plagiarism had occurred; none presented the degree of plagiarism as a definite value. Minor plagiarism could therefore not be distinguished from serious plagiarism.

Conventional techniques also judged only the presence of plagiarism from similarity, and could not present a result for the direction of plagiarism, that is, from which document to which document the plagiarism took place. Consequently, the order of plagiarism among multiple documents, such as the original, a first-generation plagiarized text, and a second-generation plagiarized text, could not be presented.

Nor had plagiarism detection technology for free word order languages been applied and offered as a service over a network such as the web. This was an obstacle to the wider adoption of plagiarism checking.

The present invention has been devised to solve the conventional problems described above, and aims to provide a plagiarism detection method and apparatus that can perform accurate plagiarism detection by reflecting the characteristics of free word order languages such as Korean and Japanese.

It also aims to provide a plagiarism detection method and apparatus that can complete the search quickly without sacrificing accuracy.

Furthermore, some similarity checking methods are fast but less accurate, while others are accurate but slow; the invention aims to provide a plagiarism detection method and apparatus that adopts both kinds and applies them so as to exploit only their respective strengths.

It further aims to provide a plagiarism detection method and apparatus that indicates not only whether plagiarism has occurred but also a quantified degree of plagiarism.

It also aims to provide a plagiarism detection method and apparatus that indicates the direction and path of plagiarism.

Finally, it aims to provide a plagiarism detection apparatus in which the service is provided by a server equipped with plagiarism detection means, and a terminal connected to the server can output the detection results remotely.

To achieve the above objects, the present invention provides a plagiarism detection method and a plagiarism detection apparatus.

The plagiarism detection method of the present invention is a method for detecting plagiarism among a plurality of documents input as data, using an apparatus having at least storage means, input means, output means, and control means. The method comprises: a partial similarity calculation step in which each pair of documents to be compared, input through the input means and systematically stored in the storage means, is divided into predetermined division units and compared without overlap, and a similarity is calculated for the comparison portion centered on each division unit; and a document plagiarism determination step in which the partial similarities that are equal to or greater than a predetermined threshold are accumulated to calculate a document similarity for each pair of documents, from which it is determined whether plagiarism has occurred. The comparison in the partial similarity calculation step is performed in two parts: a preliminary check step, carried out for all division units without overlap, which checks the preliminary similarity of the comparison portion by a method that prioritizes speed over accuracy; and an in-depth check step, carried out only when the similarity from the preliminary check is equal to or greater than a predetermined threshold, which checks the in-depth similarity of the comparison portion by a method that prioritizes accuracy over speed.

Here, the division unit is preferably a word (eojeol), or a k-mer segment: the maximum number of syllables is limited to a predetermined number k, and any word exceeding that maximum is split by repeatedly cutting off k consecutive syllables, starting from the first syllable.
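The k-mer splitting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name is hypothetical, each character of the string is assumed to be one syllable (true for precomposed Hangul syllables), and a final remainder shorter than k is assumed to be kept as its own segment.

```python
def split_into_kmers(word: str, k: int) -> list[str]:
    """Split a word into consecutive segments of at most k syllables.

    Assumption: one character equals one syllable. A word with k or fewer
    syllables is returned whole; the last segment may be shorter than k.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [word[i:i + k] for i in range(0, len(word), k)]


print(split_into_kmers("전국적으로", 2))  # ['전국', '적으', '로']
print(split_into_kmers("오늘", 3))        # ['오늘']
```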

More preferably, each document divided into division units is converted into a dictionary structure in which the anchor of each division unit is the key and the positions at which that division unit appears in the document are the references, and the comparison used to calculate the similarity of the comparison portion centered on a division unit is performed for all common anchors of the dictionary structures of the two documents.
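One way to picture the dictionary structure is a mapping from each division unit to the list of positions where it occurs. The sketch below assumes, for illustration only, that the anchor is the division unit string itself; the function names are hypothetical.

```python
from collections import defaultdict


def build_position_index(units: list[str]) -> dict[str, list[int]]:
    """Map each division unit (key) to the positions where it appears (references)."""
    index: dict[str, list[int]] = defaultdict(list)
    for position, unit in enumerate(units):
        index[unit].append(position)
    return dict(index)


def common_anchors(index_a: dict[str, list[int]], index_b: dict[str, list[int]]) -> list[str]:
    """Return the keys present in both documents' indexes."""
    return sorted(index_a.keys() & index_b.keys())


index_a = build_position_index(["오늘", "비가", "내릴", "것이다", "비가"])
index_b = build_position_index(["비가", "어제", "내릴"])
print(index_a["비가"])                   # [1, 4]
print(common_anchors(index_a, index_b))  # the units shared by both documents
```

Only the common keys need to be compared, and each key is visited once regardless of how often the unit repeats, which matches the stated aim of avoiding duplicate checks.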

In addition, before the comparison used to calculate the similarity of the comparison portion centered on a division unit, it is preferable to first carry out a process that retains meaningful words, which are division units that affect the plagiarism judgment, and removes stopwords, which are division units that do not.

In the preliminary check step, the similarity may be calculated as the proportion of matching syllables among the total number of syllables in the comparison portion.
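A rough sketch of such a ratio is shown below. The text does not specify how matching syllables are counted, so this example makes two assumptions: syllables are matched regardless of order (a multiset intersection), and the total is the number of syllables in the reference portion. The function name is hypothetical.

```python
from collections import Counter


def preliminary_similarity(reference: str, candidate: str) -> float:
    """Share of the reference portion's syllables that also occur in the candidate.

    Assumptions: one character is one syllable, spaces are ignored, and
    matching is order-independent (multiset intersection).
    """
    ref_counts = Counter(reference.replace(" ", ""))
    cand_counts = Counter(candidate.replace(" ", ""))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    matched = sum((ref_counts & cand_counts).values())
    return matched / total


print(preliminary_similarity("오늘 비가", "비가 오늘"))  # 1.0
print(preliminary_similarity("오늘 비가", "내일 눈이"))  # 0.0
```

Because the count ignores order, reordered sentences score as fully matching, which suits a fast first pass over a free word order language.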

The comparison portion in the preliminary check step may be the portion obtained by extending, around the division unit, a predetermined number a of words before and after it in the document.

In this case, the extension of words is preferably performed within a range in which division units do not overlap.

The preliminary check step may also consist of a predetermined number of stages, proceeding to a later stage only when the similarity in the preceding stage is judged to be equal to or greater than a predetermined threshold; the comparison portion is the portion obtained by extending a predetermined number of words before and after the division unit in the document, and the number of syllables by which it is extended increases in later stages.

In the in-depth check step, the similarity may be calculated by locally aligning the comparison portion word by word and, for each syllable making up a word, adding a weight that depends on the position of that syllable within the word.
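The position-dependent weighting can be illustrated for a single pair of already aligned words. This sketch omits the local alignment itself, and the weight function 1/(i+1) is purely hypothetical; the description states only that the front of a word is weighted more heavily than the back.

```python
def position_weight(index: int) -> float:
    """Hypothetical weight for the syllable at a zero-based position: 1.0, 0.5, 0.33, ..."""
    return 1.0 / (index + 1)


def weighted_word_similarity(word_a: str, word_b: str) -> float:
    """Sum the weights of positions where two aligned words share the same syllable."""
    return sum(
        position_weight(i)
        for i, (syllable_a, syllable_b) in enumerate(zip(word_a, word_b))
        if syllable_a == syllable_b
    )


print(weighted_word_similarity("동반한", "동반하여"))  # 1.5 (first two syllables match)
print(weighted_word_similarity("가비", "나비"))        # 0.5 (only the second matches)
```

A match near the front of the word, where the stem usually sits, contributes more than a match in the ending, without any explicit stem and ending analysis.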

The comparison portion in the in-depth check step may be the portion obtained by extending, around the division unit, a predetermined number b of words before and after it in the document.

The in-depth check step may also consist of a predetermined number of stages, proceeding to a later stage only when the similarity in the preceding stage is judged to be equal to or greater than a predetermined threshold; the comparison portion is the portion obtained by extending a predetermined number of words before and after the division unit in the document, and the number of syllables by which it is extended increases in later stages.

For the partial similarity of each comparison portion, let the similarity calculated in the in-depth check step be called the absolute similarity, and let the similarity obtained when the comparison portions match completely be called the perfect-match similarity; the ratio of the absolute similarity to the perfect-match similarity is then called the relative similarity. In the document plagiarism determination step, the partial similarity accumulated to calculate the document similarity may be the absolute similarity, while the partial similarity used as the criterion for deciding whether to accumulate may be the relative similarity.

In this case, for the document similarity of the whole document, let the accumulated absolute similarities of the comparison portions be called the document absolute similarity, and let the similarity obtained when the documents match completely be called the document perfect-match similarity; the ratio of the document absolute similarity to the document perfect-match similarity is then called the document relative similarity, and whether the document is plagiarized may be determined from the document relative similarity.
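The relationship between the absolute, perfect-match, and relative similarities can be sketched as below. Function names and the sample numbers are hypothetical, and the document perfect-match similarity is approximated here as the sum of the per-portion perfect-match values, which is an assumption rather than something the text states.

```python
def relative_similarity(absolute: float, perfect_match: float) -> float:
    """Ratio of a portion's absolute similarity to its perfect-match similarity."""
    if perfect_match <= 0:
        raise ValueError("perfect_match must be positive")
    return absolute / perfect_match


def document_relative_similarity(
    portions: list[tuple[float, float]], threshold: float
) -> float:
    """Accumulate absolute similarities of portions whose relative similarity
    reaches the threshold, then normalise by the document perfect-match value.

    portions: (absolute, perfect_match) per comparison portion.
    Assumption: the document perfect-match similarity is the sum of the
    per-portion perfect-match similarities.
    """
    document_absolute = sum(
        absolute
        for absolute, perfect in portions
        if relative_similarity(absolute, perfect) >= threshold
    )
    document_perfect = sum(perfect for _, perfect in portions)
    return document_absolute / document_perfect


# The middle portion has relative similarity 0.25 and is not accumulated.
sample = [(1.5, 1.5), (0.5, 2.0), (1.0, 1.0)]
print(document_relative_similarity(sample, threshold=0.5))
```

Using the relative value as the gate and the absolute value as the accumulated quantity keeps short, weakly matching portions from inflating the document score.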

Whether a document is plagiarized is preferably determined by mapping the document similarity onto a predetermined probability model.

In this case, the probability model is preferably a function derived by comparing many independent, non-plagiarized documents with one another and statistically summarizing the probability that expressions similar enough to raise suspicion of plagiarism appear even though no plagiarism actually occurred.

The probability model is preferably generated and prepared in advance from experiments on a variety of independent documents, and when plagiarism detection is performed using the probability model, a specific probability value indicating the degree of plagiarism is obtained from the value of the probability model function corresponding to the document similarity and is output.

When similarity is calculated using weights, and the pair of documents to be compared is called document A and document B, the similarity of the comparison portion is preferably an asymmetric similarity, in which the similarity of document B with document A as the reference differs from the similarity of document A with document B as the reference; the weights are set so that the weight for an inserted portion, added in the compared document relative to the reference document, differs from the weight for a deleted portion, removed from the compared document relative to the reference document.
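The effect of giving insertions and deletions different weights can be seen in a toy scoring function. Everything here is an illustrative assumption: the function name, the set-based matching of words, the penalty form, and the weight values 0.25 and 0.5 are not taken from the text.

```python
def asymmetric_similarity(
    reference: list[str],
    candidate: list[str],
    insert_weight: float = 0.25,
    delete_weight: float = 0.5,
) -> float:
    """Score the candidate against the reference with unequal edit penalties.

    Words only in the candidate count as insertions; words only in the
    reference count as deletions. Unequal weights make the score depend on
    which document is taken as the reference.
    """
    ref_words, cand_words = set(reference), set(candidate)
    matched = len(ref_words & cand_words)
    inserted = len(cand_words - ref_words)
    deleted = len(ref_words - cand_words)
    return matched - insert_weight * inserted - delete_weight * deleted


doc_a = ["오늘", "비가", "내릴", "것이다"]
doc_b = ["오늘", "전국적으로", "강한", "비가", "내릴", "것이다"]
print(asymmetric_similarity(doc_a, doc_b))  # 3.5: B adds two words to A
print(asymmetric_similarity(doc_b, doc_a))  # 3.0: A drops two words from B
```

Swapping the reference and compared documents changes the result, which is the property the two-directional comparison relies on.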

In this case, in the in-depth check, the similarity of document B with document A as the reference and the similarity of document A with document B as the reference are each calculated; in the document plagiarism determination step, the document similarity of B with respect to A and the document similarity of A with respect to B are calculated from them; and the direction of plagiarism may be determined by comparing these two document similarity values.

After the plagiarism direction has been calculated for all document pairs to be compared, it is preferable to display a plagiarism path diagram in which the documents are connected by arrows according to the plagiarism direction.

Meanwhile, the plagiarism detection apparatus of the present invention is an apparatus that has at least storage means, input means, output means, and control means and detects plagiarism among a plurality of documents input as data. It comprises at least a document database, provided in the storage means, that systematically stores the input documents, and control means that reads a pair of documents to be compared from the document database, compares them, and determines whether plagiarism has occurred; the control means is configured to read the pair of documents to be compared from the document database and carry out any one of the plagiarism detection methods described above.

In this case, the plagiarism detection apparatus may be a server further provided with communication means. A terminal, which includes communication means connectable to the server's communication means over a network and has at least storage means, input means, output means, and control means, connects to the server. When the documents to be compared are entered through the terminal's input means and transmitted through the terminal's communication means to the server's communication means, the server's control means controls the server's input means to receive the documents as data from the server's communication means. When the plagiarism determination result for the documents has been calculated, the server's control means controls the server's output means to output the result as data to the server's communication means, so that the result is output on the terminal's output means by way of the terminal's communication means connected to the server's communication means.

By dividing the plagiarism search into two stages, or into more stages, as described above, with the earlier stage performing a speed-oriented search and the later stage an accuracy-oriented search, the time-consuming in-depth check is skipped for portions with little likelihood of plagiarism and concentrated on portions with a high likelihood, so that accurate plagiarism detection can be performed quickly.

In addition, converting the documents to be compared into a dictionary structure eliminates the possibility of duplicate checks and reduces the portions to be compared, thereby improving speed.

Further, by raising the weight toward the front of each word in the comparison portion and lowering it toward the back, the search concentrates, by a simple method and without separating stems from endings, on the part that carries the main meaning, which enables accurate document search suited to the characteristics of free word order languages.

In addition, by adjusting the thresholds (T1 and T2), the user can perform either a finer-grained plagiarism check or a faster one. A user who wants a quick, rough check of a large volume of documents can lower the T1 value and raise the T2 value, while a user who wants to know in detail exactly which parts of a document are plagiarized, and in how many places, can raise the T1 value and lower the T2 value.

Moreover, by mapping against the probability model, not only a judgment of whether plagiarism occurred but also a quantitative value for the degree of plagiarism can be output.

Also, the bidirectional asymmetric similarity calculation, which differentiates the weights for inserted and deleted portions, makes it possible to determine the direction in which plagiarism took place and to output it visually as a graph.

When these functions are implemented on a server, a terminal connected to the server can transmit the document data to be checked remotely and receive and output the result remotely.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 10 is an overall block diagram illustrating an apparatus configuration to which the plagiarism detection method of the present invention is applied.

As shown in FIG. 10, the plagiarism detection method of the present invention uses an apparatus having at least storage means 120, input means 140, output means 180, and control means.

The storage means 120 is a broad concept that covers not only volatile storage such as semiconductor memory but also nonvolatile storage such as a hard disk or flash memory.

The input means 140 naturally includes not only typing devices such as a keyboard but also pointing devices such as a mouse; in some cases, data loaded directly into memory from data storage means 120 such as a hard disk, or through communication means 130 that sends and receives data over a network, can also be regarded as input means in the broad sense.

The output means 180 naturally includes devices such as a display, a printer, or a speaker; in some cases, data transferred from memory to data storage means 120 such as a hard disk, or through communication means 130 that sends and receives data over a network, can also be regarded as output means in the broad sense.

The control means may be configured to use the processing functions of a microprocessor such as the CPU 110, or to use a separate, independent coprocessor. The control means is controlled so that the plagiarism detection method of the present invention is executed, and is preferably implemented by a computer program loaded into the storage means 120 and executed.

The apparatus on which the plagiarism detection method of the present invention is implemented may be a computer 100, as shown in FIG. 10. When the plagiarism detection method of the present invention is applied to the computer 100, the computer 100 becomes the plagiarism detection apparatus of the present invention.

The plagiarism detection apparatus of the present invention detects plagiarism among a plurality of documents input as data. The data are normally input through the input means 140.

The plagiarism detection apparatus of the present invention preferably includes a document database 145 together with the control means.

The document database 145 is provided in the storage means 120. The storage means 120 is preferably nonvolatile for semi-permanent retention of documents, but may be volatile when the documents need not be retained. The document database 145 systematically stores the input documents. It may take the form of file-explorer folders as long as documents can be stored systematically, but it is preferably organized so that documents can be stored and searched by, for example, subject, client, or source.

상기 제어수단은, 상기 문서 데이터베이스(145)로부터 비교대상 문서 쌍을 메모리로 읽어내어, 대비 검사하여, 문서표절 여부를 판단하는 수단이다. 즉, 상기 제어수단은, 입력된 문서들 중에서, 비교하고자 하는 대상이 되는 2개의 문서, 즉 문서 쌍을 읽어낸다. 그 후, 이 2개의 문서를 서로 대비하여 검사한다. 그 결과로서 산출되는 유사도에 대하여, 그만큼의 유사도가 우연히 발생할 확률값과 비교하여, 문서의 표절여부 및 표절의 정량적 정도, 그리고 표절방향을 판단하는 것이다. 상기 문서 쌍을 읽어내어 대비 검사하고, 결과를 도출하는 과정은, 모든 비교대상 문서들에 대하여 반복 수행된다.The control means is a means for determining whether a document is plagiarized by reading a comparison document pair from the document database 145 into a memory, checking for contrast. That is, the control means reads out two documents, namely, document pairs, to be compared from among the input documents. Then, these two documents are checked against each other. As for the similarity calculated as a result, the similarity degree by chance is compared with the probability value by chance, and the quantitative degree of plagiarism and plagiarism, and the plagiarism direction of the document are judged. The process of reading the document pairs, checking them for comparison, and deriving the results is repeated for all the documents to be compared.

The control means broadly performs a partial similarity calculation step and a document plagiarism determination step.

In the partial similarity calculation step, each pair of documents to be compared is divided into predetermined division units and checked without duplication, and a similarity is calculated for the comparison portion centered on each division unit. That is, the whole document is divided into portions of a predetermined size to set the comparison portions, and the comparison portions are compared with each other to calculate the similarity of each.

In the document plagiarism determination step, the partial similarities that are equal to or greater than a predetermined threshold are accumulated to calculate the document similarity for each document pair, and plagiarism is determined from it. That is, the partial similarities of the comparison portions are aggregated into a document similarity, plagiarism is determined on the basis of this document similarity, and, further, the degree, direction, and path of plagiarism are also calculated.
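The accumulation described above can be sketched as follows. This is a minimal illustration only: the text does not fix the accumulation rule, so a plain sum is assumed here, and the names `document_similarity`, `is_plagiarized`, and `decision_level` are hypothetical.

```python
def document_similarity(partial_similarities, threshold):
    """Accumulate only the partial similarities that reach the threshold.

    Assumption: "accumulate" is read as a plain sum.
    """
    return sum(s for s in partial_similarities if s >= threshold)


def is_plagiarized(partial_similarities, threshold, decision_level):
    """Hypothetical decision rule: compare the accumulated value
    with an assumed decision level."""
    return document_similarity(partial_similarities, threshold) >= decision_level


# Partial similarities of three comparison portions; 0.25 is below the threshold.
scores = [0.5, 0.25, 0.75]
total = document_similarity(scores, threshold=0.5)
```

In this sketch the portion scoring 0.25 contributes nothing, so only the two portions at or above the threshold enter the document similarity.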

In particular, the comparison check in the partial similarity calculation step is preferably divided into a preliminary check step and an in-depth check step.

The preliminary check step is performed for all division units and checks the preliminary similarity of the comparison portion by a method that prioritizes speed over accuracy. The in-depth check step is performed only when the similarity from the preliminary check is equal to or greater than a predetermined threshold, and checks the in-depth similarity of the comparison portion by a method that prioritizes accuracy over speed.

There may be various specific methods for calculating the similarity, and judging their accuracy and speed may be somewhat subjective. To avoid this subjectivity, the various methods can be applied in simulation to a variety of deliberately constructed plagiarized documents, so that the ratio or ranking of their accuracy and speed can be calculated and quantified.

Details of the preliminary check step and the in-depth check step are described later.

Meanwhile, the plagiarism detection method of the present invention is not limited to execution on a standalone computer. It may also be installed on a server so as to provide a plagiarism detection service to terminals connected to the server. Such a server is therefore another embodiment of the plagiarism detection apparatus of the present invention.

In detail, in this case the plagiarism detection apparatus is a server 100 further provided with communication means 130. Needless to say, the server 100 is also a kind of computer.

A terminal 200 connects to the server 100. The terminal 200 includes communication means 230 that can connect to the communication means 130 of the server through a network. The terminal 200 also has at least storage means, input means, output means, and control means.

In this case, the control means of the server performs the same control as in the standalone computer, and additionally performs control related to receiving documents through the communication means and outputting results to the communication means.

First, for document input: when the plurality of documents to be compared are entered through the input means of the terminal and transmitted via the communication means of the terminal to the communication means of the server, the control means controls the input means of the server to receive the plurality of documents as data input from the communication means of the server.

For result output: when the plagiarism determination result for the plurality of documents has been calculated, the control means controls the output means of the server to output the result as data to the communication means of the server, so that the result passes through the communication means of the terminal connected to it and is output on the output means of the terminal.

Hereinafter, embodiments and operation of the present invention are described in more detail.

FIG. 1 is an overall flowchart illustrating the plagiarism detection method of the present invention. According to this flowchart, the method may include a data input step (S100), a dictionary structuring step (S200), a preliminary check step (S300), an in-depth check step (S400), a document plagiarism determination step (S500), and a result output step (S600). These steps are basically performed in sequence, but, to the extent that plagiarism detection remains possible, each step or some of its sub-steps may be executed multiple times in a loop, and some of the sub-steps within each step may be omitted.

<Input of documents to be compared>

FIG. 2 is a detailed flowchart of the data input step (S100).

A document to be compared for plagiarism is input to the system (S120) by ordinary input means 140. It may be typed in through a keyboard, input by opening a previously stored document file, or input through a data network.

The documents to be compared may be input in various formats, for example the data formats of the Hangul word processor (아래아 한글, registered trademark) or MS Word. In this case, these varied input data need to be converted into a single unified format in the present invention, and a data conversion step (S110) for this purpose may exist as a preprocessing step for input.

<Dividing a document into small units>

FIG. 3 is a detailed flowchart of the dictionary structuring step (S200).

Each input document is first divided into division units of a predetermined size (S210).

The division unit may be a word (eojeol, a space-delimited unit). This is because words are easy to separate when they are delimited by whitespace characters. If they are not delimited by whitespace, they may be separated using techniques employed in known translation programs and the like, that is, techniques that rely on databases of particles, nouns, verbs, and so on.

Alternatively, the division unit is preferably a "k-mer split phrase," in which the maximum number of syllables is limited to a predetermined number k, and a word exceeding that maximum is split by sequentially repeating the process of cutting off k consecutive syllables starting from the front. This splitting operation is called k-mer anchoring (S220). Here k may be, for example, 3. An example of the k-mer split phrase is given later.
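One way to picture k-mer anchoring is the short sketch below. It assumes that the chunks do not overlap, that a trailing remainder shorter than k is kept as its own unit, and that the text is in precomposed (NFC) form so that each Hangul syllable is one character. The function names are hypothetical.

```python
def kmer_split(word: str, k: int = 3) -> list[str]:
    """Cut a word into consecutive chunks of at most k syllables,
    starting from the front (assumed: chunks do not overlap)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [word[i:i + k] for i in range(0, len(word), k)]


def kmer_anchors(text: str, k: int = 3) -> list[str]:
    """Split text on whitespace into words, then apply kmer_split to each."""
    anchors = []
    for word in text.split():
        anchors.extend(kmer_split(word, k))
    return anchors


# A six-syllable word yields two 3-syllable units; shorter words are unchanged.
units = kmer_anchors("표절탐색장치 물결에", k=3)
```

With k = 3, words of up to three syllables pass through unchanged, so short particles and nouns stay intact while long compounds are broken up.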

<Dictionary structure for fast analysis and duplicate avoidance>

Each document divided into division units is preferably converted into a dictionary structure (S230). The dictionary structure may use the anchor of each division unit as a key and the positions at which that division unit appears in the document as references. That is, in the present invention, for fast and duplicate-free analysis of the two documents being compared, the text can be divided into division units, such as words or k-mer split phrases, and organized into a dictionary-like structure.

For example, suppose that analyzing a document 1 word by word yields the anchors

「갔다, 그, 나는, 이다」 (roughly "went," "that," "I," "is")

and that analyzing a document 2 word by word yields the anchors

「오늘, 에서, 이다, 하루」 (roughly "today," "at," "is," "a day").

With these anchors as keys and the positions at which each anchor occurs in the original document as references (ref.), a dictionary like that of FIG. 11 can be built. The numbers to the right of an anchor, such as 3, 5, and 9, denote those positions.

Among these anchors, the anchor

「이다」 ("is")

is common to both documents. Such an anchor is called a common anchor, and the two documents are preferably compared around the common anchors.

Once the documents have been converted into dictionary structures as above, the comparison check for calculating the similarity of the comparison portion centered on a division unit is preferably performed for every common anchor of the dictionary structures of the respective documents. Because the dictionary structure consolidates division units that occur many times so that they are not duplicated, it allows a fast, duplicate-free check. An example of forming the dictionary structure, and a detailed example of the comparison check using it, are given later.
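The dictionary structure and the extraction of common anchors might look like the following sketch. The word sequences are made up for illustration (they are not the contents of FIG. 11), positions are assumed to be 0-based indices into the sequence of division units, and the function names are hypothetical.

```python
from collections import defaultdict


def build_dictionary(units):
    """Map each division unit (key) to the positions where it occurs (references)."""
    index = defaultdict(list)
    for position, unit in enumerate(units):
        index[unit].append(position)
    return dict(index)


def common_anchors(dict_a, dict_b):
    """Return the keys present in both dictionaries, each listed once."""
    return sorted(dict_a.keys() & dict_b.keys())


# Hypothetical word sequences for two documents.
doc1 = ["나는", "그", "갔다", "그", "이다"]
doc2 = ["오늘", "하루", "에서", "이다"]

dict1 = build_dictionary(doc1)
dict2 = build_dictionary(doc2)
shared = common_anchors(dict1, dict2)
```

Because each unit appears only once as a key, a unit that occurs many times in a document is visited once during the comparison, which is the duplicate avoidance the text describes.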

<Stopword removal to shorten plagiarism detection time>

Before the comparison check for calculating the similarity of the comparison portion centered on a division unit such as a word or k-mer split phrase, it is preferable first to perform a process (S240) in which meaningful words, that is, division units that affect the plagiarism determination, are retained, and stopwords, that is, division units that do not affect the determination, are removed. If dictionary structures have been formed, this is convenient because the stopwords need only be deleted from each of the sequentially ordered dictionary structures.

The reasons for deleting stopwords are as follows.

To shorten the time taken by plagiarism detection, the size of the documents and the number of words to be compared must be reduced. Therefore, among the many words in a document, particularly content words and particles, all stopwords should be found and removed so that the search is carried out only on valuable information. "Stopword" is a term from the field of Internet search, where it denotes a word that carries no meaning as an index term.

Adapted to plagiarism detection, the term refers to conjunctions, pronouns, meaningless nouns, and the like that do not affect the meaning of a sentence, such as '있다' ("there is"), '이' ("this"), and '그리고' ("and"). The following are the most frequent words in an arbitrary 20-page document, listed in order with their frequencies.

Table 1
Word      Gloss (approx.)   Occurrences
있는      existing          55
이        this              51
것이다    it is that        50
있다      there is          48
새로운    new               46
제2의     the second        44
그        that              41
수        way, means        30

In the example of Table 1, the word '제2의' ("the second") is unusual in that it rarely appears in other documents but appears frequently here because it is tied to the content of this document. The document is written on the subject of "The Third Wave" (제3의 물결), so the expressions "제2의" and "제3의" ("the third") are used often. Likewise, in a book report on "Yi Sun-sin," the name "Yi Sun-sin" will be used often, yet the mere fact that the word "Yi Sun-sin" coincides cannot be called plagiarism. Such a frequently used word, that is, a topic word, may therefore be regarded as a stopword. When text is plagiarized, similar words usually appear consecutively, and the present invention detects plagiarism by capturing this characteristic, so removing topic words does not degrade performance.

In Table 1, the words other than '제2의' are ones routinely used in ordinary sentences, and moreover they are not key words carrying important meaning. They should therefore naturally be treated as stopwords.

Even if the document to be compared contains only 40 occurrences of the word "있는", 55 × 40 = 2,200 word comparisons occur. When a word common to the two documents being searched is a stopword, the search time increases by the square of the stopword count n, that is, by n².

On this principle, in the present invention a word whose frequency of occurrence in a document is at or above a certain proportion of the total number of words is regarded as a stopword and removed before plagiarism detection is performed. When words that commonly appear at or above a certain threshold in each document or dictionary structure are identified as stopwords, the majority of the words found are meaningless pronouns, and the rest are recurring topic words of the document.
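The frequency rule can be sketched as below, assuming the ratio (occurrences of a word) / (total number of words) is compared with a threshold named `t1` here, and that a word is dropped when its ratio exceeds that threshold. The function name and the toy data are hypothetical; the toy threshold is far larger than would suit a real 20-page document.

```python
from collections import Counter


def remove_stopwords(words, t1):
    """Drop every word whose relative frequency exceeds t1.

    Returns the remaining words (original order) and the set of removed words.
    """
    total = len(words)
    if total == 0:
        return [], set()
    counts = Counter(words)
    stopwords = {w for w, c in counts.items() if c / total > t1}
    kept = [w for w in words if w not in stopwords]
    return kept, stopwords


# Toy document of 10 words: "a" makes up half of it, every other word a tenth.
toy = ["a"] * 5 + ["b", "c", "d", "e", "f"]
kept, removed = remove_stopwords(toy, t1=0.25)
```

With a threshold of 1.0 nothing is removed in this sketch, since no ratio can exceed 1, which corresponds to running the search with stopword removal switched off.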

Table 2
Threshold (T1)   Time (sec)   Detection performance (%)
1.000            123          100
0.010            111          100
0.007            115          100
0.006            115          100
0.005            104          93
0.001            23           43

Table 2 shows the plagiarism detection performance when every word whose ratio (frequency of the word) / (total number of words) exceeds the threshold (T1) is treated as a stopword and removed. It gives the results of running the method and apparatus of the present invention after inserting six arbitrarily plagiarized passages (paragraphs) into six 20-page documents. Each of the six documents was checked against all the others, so a total of 30 comparisons were made. The time taken is recorded in the "Time" column, and the "Detection performance" column shows, as a percentage, how many of the 30 plagiarized paragraphs were found.
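The count of 30 comparisons follows from pairing each of the six documents with each of the five others. A small check, with placeholder document names that are purely illustrative:

```python
from itertools import permutations

# Six placeholder documents; each is compared against the other five.
documents = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]

# Ordered pairs (source, target), excluding a document paired with itself.
ordered_pairs = list(permutations(documents, 2))
comparison_count = len(ordered_pairs)  # 6 * 5
```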

It can be seen that as the threshold is lowered and more words are removed, the time drops sharply while the accuracy of plagiarism detection falls. The threshold that allows plagiarism to be detected in the shortest time without affecting detection accuracy is 0.006.

If all such stopwords are removed and only phrases meaningful to the plagiarism determination are used, plagiarism detection becomes much faster, and plagiarized passages can still be located accurately.

Stopword removal need only be performed before the first stage of the multi-stage check of the present invention, that is, before the first preliminary check, and is preferably performed after the dictionary structures are formed. In the following description, it is assumed that dictionary structures are formed unless otherwise stated.

<Multi-stage search>

Once the dictionary structures of the documents have been compared with each other and identical division units, for example words, have been found, the next task is to check for plagiarism between sentences.

However, if plagiarism were searched word against word over the entire documents, the search time would be proportional to the product of the numbers of words in the two documents, and for plagiarism detection between large documents the time required becomes a factor that cannot be ignored.

The present invention therefore reduces the time required by using a multi-stage search with at least two stages. Performing plagiarism detection in multiple stages means carrying out the search several times, stage by stage, according to different criteria.

For example, in a two-stage configuration, the first search compares the similarity of the two sides around an identical word found through the dictionary structure comparison. Because it quickly and simply checks whether a portion warrants a full similarity check, it may be called a preliminary check.

The second search examines similarity by comparing the two sides closely, much as gene sequences are compared. Because only the portions that passed the preliminary check are examined, a sufficiently deep check is carried out on a limited set of targets, so it may be called an in-depth check.

Only the two-stage case has been described above, but it is evident that, by varying the weights given to speed and accuracy, the search may be configured in multiple stages, for example three stages (preliminary, intermediate, and in-depth checks) or, subdivided further, four stages (first preliminary, second preliminary, first in-depth, and second in-depth checks). The following description assumes a two-stage configuration.

Accordingly, the multi-stage search is carried out as the preliminary check step (S300) and the in-depth check step (S400) of FIG. 1. Each of the preliminary and in-depth checks may itself be subdivided and performed internally in multiple sub-stages.

The preliminary and in-depth checks may be configured in either of two ways: (1) the preliminary check alone is looped over the entire search area, for example the whole document, and only afterwards is the in-depth check performed (that is, separate preliminary and in-depth loops); or (2) for each division unit into which the document has been divided, the preliminary check is performed and, if passed, the in-depth check follows immediately, and this process is repeated (that is, the preliminary and in-depth checks are bundled as one unit and performed in sequence within a single loop). Only configuration (1) is described below. Configuration (2) differs from configuration (1) only in that the preliminary check results need not be stored separately, whereas the in-depth check results must be. The technique for implementing it will be readily understood by those skilled in the art.
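The two loop configurations can be contrasted in a short sketch. The callables `prelim` and `deep` and the threshold `t2` are placeholders for whatever preliminary and in-depth checks are used; the toy scoring functions at the bottom exist only to make the example runnable.

```python
def separate_loops(candidates, prelim, deep, t2):
    """Configuration (1): finish the preliminary loop first, storing which
    candidates passed, then run the in-depth check in a second loop."""
    passed = [c for c in candidates if prelim(c) >= t2]
    return [(c, deep(c)) for c in passed]


def single_loop(candidates, prelim, deep, t2):
    """Configuration (2): run the in-depth check immediately after each
    preliminary check that passes; only in-depth results are stored."""
    results = []
    for c in candidates:
        if prelim(c) >= t2:
            results.append((c, deep(c)))
    return results


# Toy stand-ins for the two checks (illustrative only).
toy_prelim = lambda c: len(c) / 10
toy_deep = lambda c: c.upper()
items = ["ab", "abcdef", "abcdefgh"]

result_1 = separate_loops(items, toy_prelim, toy_deep, t2=0.5)
result_2 = single_loop(items, toy_prelim, toy_deep, t2=0.5)
```

Both functions return the same in-depth results here. They differ in what has to be held in memory between stages, which matches the distinction drawn in the text.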

<Fast preliminary check>

FIG. 4 is a detailed flowchart of the preliminary check step (S300).

The preliminary check step uses a comparison method that emphasizes speed over accuracy. As one example, the similarity is preferably calculated as the proportion of matching syllables among all the syllables of the comparison portion. For this purpose, as shown in FIG. 8, a syllable-unit division step (S332), a sorting step (S334), and a match-ratio calculation step (S336) may be performed. Specific examples of each of these steps are given later.

In other words, even a method that does not give an exact similarity around the division unit is fully adequate for the preliminary check, provided that it reliably indicates where the similarity is likely to be high. This is because its purpose is to reduce effectively the number of targets for the subsequent in-depth check.

The comparison check of the preliminary check step is performed, without duplication, for all the division units divided in advance in the dictionary structuring step (S200). As noted above, the division unit may be a word, a k-mer split phrase (anchor), or some other syllable-based unit. The case in which the division unit is an anchor is described below, but the check can be performed similarly when it is a word. Because the preliminary check must be performed on the division units common to the two documents being compared, that is, on the common anchors, the step of extracting them is called the common anchor extraction step (S310).

However, because an anchor by itself is a very small unit, the unit on which the comparison is performed, that is, the comparison portion, is preferably larger. For example, a comparison portion of five words allows richer ways of judging plagiarism, based on the order of its constituent words and the like, and because several words are compared at once the comparison is also faster.

Accordingly, the comparison portion is preferably the span obtained by extending (S320) from the division unit by a predetermined number a of words before and after it in the document (when the division unit is an anchor, the word containing the anchor may be taken as the center). This extension is preferably made within a range in which the division units do not overlap.
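The extension step (S320) amounts to taking a window of words around the center. The sketch below assumes that the window is simply truncated at the start or end of the document, which the text does not specify, and it does not model the non-overlap condition. The function name is hypothetical.

```python
def comparison_portion(words, center, a=3):
    """Return the word at index `center` together with up to `a` words on
    each side (assumed: the window is cut short at document boundaries)."""
    if not 0 <= center < len(words):
        raise IndexError("center is outside the document")
    start = max(0, center - a)
    return words[start:center + a + 1]


# Ten placeholder words w0..w9.
words = [f"w{i}" for i in range(10)]
middle = comparison_portion(words, center=4, a=3)   # full window, 2a + 1 words
at_start = comparison_portion(words, center=0, a=3)  # truncated on the left
```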

In the preliminary check, taking as the center the identical word that contains a common anchor found by comparing the dictionary structures of the two documents, a words are taken before and after it on each side, and the similarity of the resulting (2a + 1) words on each side is compared (S330). To detect plagiarism that exploits the properties of free-word-order languages, the similarity may be calculated by splitting all the words of the two comparison portions into syllables, sorting them, and then comparing them. That is, the number of identical syllables is counted and divided by the total number of syllables to obtain the similarity. This is preferable because substitutions, insertions, and deletions of words can also be detected.

For example, let the identical word in the dictionary structures be "제2의". The two sets of (2a + 1) words to be compared (the comparison portions), formed by taking a words before and after it on each side, are as follows (for a = 3).

Table 3
Text A   하는 3개의 조직이 제2의 물결에 의해서 태어난
Text B   강조되어 갔다 즉 제2의 물결이 지구상을 휩쓸기

Splitting these into syllables and sorting them gives the following.

Table 4
Text A   2, 3, 결, 개, 난, 는, 물, 서, 어, 에, 의, 의, 의, 이, 제, 조, 직, 태, 하, 해
Text B   2, 강, 갔, 구, 결, 기, 다, 되, 물, 상, 쓸, 어, 을, 의, 이, 조, 제, 즉, 지, 휩

The syllables that match between the two comparison portions in Table 4 are the following seven:

"2, 결, 물, 어, 의, 이, 조"

Since 7 of the 20 syllables match, the similarity of the preliminary check is

7 / 20 = 0.35, that is, 35%.
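The three sub-steps named earlier (syllable division S332, sorting S334, match-ratio calculation S336) can be sketched as follows. Two points are assumptions rather than statements from the text: matching syllables are counted by a merge over the two sorted lists, so a repeated syllable counts only as often as it appears on both sides, and the denominator is the larger of the two syllable counts. The text is assumed to be in precomposed (NFC) form so that each Hangul syllable is one character. The function name and the sample words are hypothetical.

```python
def preliminary_similarity(words_a, words_b):
    """Ratio of matching syllables between two comparison portions."""
    # S332: split every word into syllables; S334: sort them.
    a = sorted("".join(words_a))
    b = sorted("".join(words_b))

    # S336: walk both sorted lists in step and count equal syllables.
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1

    total = max(len(a), len(b))  # assumed denominator
    return matches / total if total else 0.0


# Two made-up portions of four syllables each, sharing "가" and "다".
score = preliminary_similarity(["가나", "다라"], ["다가", "마바"])
```

Because the syllables are sorted before comparison, the result does not depend on word order, which is what makes this check suited to free-word-order text.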

상기와 같이, 사전구조의 각 공통앵커에 해당되는 양측의 대비 부분에 대한 예비검사의 유사도를 구한 후, 이 유사도가 미리 정해져 있는 임계값(T2) 이상인지를 판단(S340)한다. As described above, after obtaining the similarity degree of the preliminary inspection for the contrast portions of both sides corresponding to the respective common anchors in the dictionary structure, it is determined whether the similarity degree is equal to or greater than a predetermined threshold value T2 (S340).

예비검사와 심층검사가 별개 루프에서 수행되는 구성 하에서는, 예비검사 유사도가 임계값 이상인 경우에는, 그 구간을 후술하는 심층검사의 대상(S350)으로 분류하여 두고, 예비검사 유사도가 임계값 미만인 경우에는, 심층검사의 대상이 아니므로(S360), 즉시 다음 비교대상에 대한 예비검사의 비교동작으로 이행한다. In the configuration in which the preliminary inspection and the deep inspection are performed in separate loops, when the preliminary inspection similarity is greater than or equal to the threshold value, the section is classified as an object of the in-depth examination (S350), which will be described later. , Since it is not the subject of an in-depth examination (S360), it immediately proceeds to the comparison operation of the preliminary examination for the next comparison target.

If nothing at all has been classified as a target of the in-depth check (S380), the search may be terminated entirely without performing the subsequent in-depth check, and the compared document may be judged 'not plagiarized' (S390).

In a configuration where the preliminary check and the in-depth check run in the same loop, a section whose preliminary similarity is at or above the threshold is classified as a target of the in-depth check described below (S350) and the in-depth check is performed on it immediately. A section whose preliminary similarity is below the threshold is not a target of the in-depth check (S360), so processing moves immediately to the preliminary comparison of the next comparison target.

The preliminary check is repeated until every common anchor in the dictionary structure has been processed once, without duplication (S370).
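The same-loop control flow described above can be sketched as follows. This is an illustrative sketch only: `run_checks` and its parameters are hypothetical names, and the two similarity functions are passed in as callables rather than taken from the patent.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Section = TypeVar("Section")


def run_checks(
    sections: Iterable[Section],
    preliminary: Callable[[Section], float],
    in_depth: Callable[[Section], float],
    t2: float,
) -> List[Tuple[Section, float]]:
    """Same-loop variant: run the costly check only on sections that pass T2."""
    results = []
    for section in sections:
        if preliminary(section) < t2:
            continue  # not an in-depth target (S360); go to the next section
        # In-depth target (S350): run the in-depth check immediately.
        results.append((section, in_depth(section)))
    return results
```

An empty result corresponds to the case where nothing is classified as an in-depth target, in which the compared document may be judged 'not plagiarized'. The separate-loop variant would instead collect the passing sections first and run `in_depth` over them afterwards.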

The threshold (T2) set here is also an important value in the present invention, because the purpose of the preliminary check is to eliminate low-similarity sections in advance so that the time-consuming in-depth check based on local alignment is performed less often. The effect of the threshold (T2) on the time and the result (accuracy) of the final in-depth check was measured experimentally and is shown below.

Table 5
Threshold (T2)   Time (sec)   Search performance (%)
0.000            123          100
0.015            112          100
0.030            108          100
0.050            114          100
0.060            104          100
0.070             98          100
0.080             96          100
0.090             67           73

Table 5 shows that as the threshold (T2) increases, the preliminary check becomes stricter, fewer sections pass it, the in-depth check is carried out less often, and the time decreases. It also shows that when the number of words subjected to the in-depth check decreases far enough, the accuracy of the plagiarism search decreases with it: the search performance stays at 100% up to T2 = 0.080 and falls to 73% at T2 = 0.090. Consequently, the threshold (T2) that takes the least search time while still guaranteeing the accuracy of the plagiarism search is 0.080.
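The selection rule applied to Table 5 (the fastest setting that still reaches full search performance) can be written as a short Python sketch. The data are the Table 5 values; the name `fastest_accurate_threshold` and the 100% requirement as a parameter are assumptions made for illustration.

```python
# (T2, time in seconds, search performance in %) as listed in Table 5.
TABLE_5 = [
    (0.000, 123, 100),
    (0.015, 112, 100),
    (0.030, 108, 100),
    (0.050, 114, 100),
    (0.060, 104, 100),
    (0.070, 98, 100),
    (0.080, 96, 100),
    (0.090, 67, 73),
]


def fastest_accurate_threshold(rows, required_performance=100):
    """Return the T2 with the lowest time among rows meeting the performance bar."""
    accurate = [row for row in rows if row[2] >= required_performance]
    if not accurate:
        raise ValueError("no threshold reaches the required performance")
    return min(accurate, key=lambda row: row[1])[0]


print(fastest_accurate_threshold(TABLE_5))  # prints 0.08
```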

The preliminary check step may be configured to consist of predetermined sub-stages. In that case, it is preferable to proceed to a later sub-stage only when the earlier sub-stage found a similarity at or above a predetermined threshold.

When the check is configured in sub-stages, the compared portion may be the portion obtained by extending a predetermined number of words before and after the division unit in the document, with the extension growing at later sub-stages. For example, if the first sub-stage compares with an extension of 3 words, the next may compare with an extension of 7 words and the one after that with an extension of 11 words, so that the extension increases step by step.

Because the preliminary check itself is carried out in sub-stages, when an early sub-stage already finds the similarity to be below the threshold, the remaining sub-stages and the subsequent in-depth check are all skipped, which further improves speed.
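The sub-staged preliminary check with a growing window can be sketched as below. This is a hedged illustration: the names `window`, `overlap_ratio`, and `staged_preliminary` are hypothetical, the extension is assumed to apply equally on both sides of the anchor position, the same threshold is assumed at every sub-stage, and the similarity is a simple matching-syllable ratio.

```python
from collections import Counter
from typing import Sequence, Tuple


def window(words: Sequence[str], center: int, radius: int) -> list:
    """Words within `radius` positions of `center`, clipped to the document."""
    start = max(0, center - radius)
    return list(words[start:center + radius + 1])


def overlap_ratio(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Matching-syllable ratio, each syllable occurrence counted at most once."""
    a = Counter("".join(words_a))
    b = Counter("".join(words_b))
    total = sum(a.values())
    return sum((a & b).values()) / total if total else 0.0


def staged_preliminary(
    words_a: Sequence[str],
    pos_a: int,
    words_b: Sequence[str],
    pos_b: int,
    radii: Tuple[int, ...] = (3, 7, 11),
    threshold: float = 0.08,
) -> Tuple[bool, int]:
    """Return (passed, number of sub-stages actually run)."""
    stages_run = 0
    for radius in radii:
        stages_run += 1
        similarity = overlap_ratio(
            window(words_a, pos_a, radius), window(words_b, pos_b, radius)
        )
        if similarity < threshold:
            return False, stages_run  # later sub-stages are skipped
    return True, stages_run
```

The second element of the result makes the saving visible: a pair rejected at the first sub-stage never pays for the wider windows.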

<In-depth check that improves accuracy on a limited set of targets>

FIG. 5 is a detailed flowchart of the in-depth check step (S400).

The in-depth check limits its targets to the compared portions that passed the preliminary check and performs a sufficiently thorough examination on them. When the preliminary check and the in-depth check are configured to run in separate loops, the portions classified as in-depth check targets are extracted (S410) and then checked in sequence. When they are configured to run in the same loop, the in-depth check is performed as soon as a portion is classified as a target.

The in-depth check step needs a comparison method that emphasizes accuracy over speed. For example, it may be configured to locally align the compared portions word by word and, for the constituent syllables of each word, to add a weight that depends on the position of the syllable within the word, thereby calculating the similarity. To this end, as shown in FIG. 9, a word-by-word local alignment step (S432), a syllable-unit comparison step (S434), a step of adding weights by syllable position (S436), an absolute similarity calculation step (S438), and a relative similarity calculation step (S439) may be performed. A specific example is given later.

In other words, any method recognized as yielding an accurate similarity for the compared portions is applicable to the in-depth check, even if it does not yield that similarity quickly.

Further, the compared portion in the in-depth check step may be the portion obtained by extending a predetermined number b of words before and after the division unit in the document (S420). The number of extended words b in the in-depth check is preferably larger than the number of extended words a in the preliminary check. The extension is preferably made within a range in which the division units do not overlap.

The in-depth check calculates the similarity of the compared portions on both sides (S430). As described above, the in-depth check applies to a compared portion wider than that of the preliminary check, and it searches for plagiarism with a precise method rather than a simple one such as the matching-syllable ratio of the preliminary check. The search time per compared portion is therefore long, but because the number of compared portions is small, this is not a problem.

After the in-depth similarity has been obtained, as above, for the pair of compared portions corresponding to each anchor, it is determined whether this similarity is at or above a predetermined threshold (S440).

In a configuration where the preliminary check and the in-depth check run in separate loops, when the in-depth similarity is at or above the threshold, the similarity of that compared portion is passed immediately to the document plagiarism determination step (S500) described below and the next compared portion is processed. When the in-depth similarity is below the threshold, the compared portion is classified as a section excluded from the document plagiarism calculation described below (S460), and processing moves immediately to the in-depth comparison of the next comparison target (S470).

If no section at all has been judged to be anchor plagiarism (S480), the search may be terminated entirely without performing the document plagiarism determination, and the compared document may be judged not plagiarized (S490).

In a configuration where the preliminary check and the in-depth check run in the same loop, when the in-depth similarity is at or above the threshold, the compared portion is classified as a section judged to be anchor plagiarism (S450), described below. When the in-depth similarity is below the threshold, the compared portion is classified as a section excluded from the document plagiarism calculation described below (S460), and processing moves immediately to the comparison of the next comparison target (S470).

The in-depth check is repeated until every common anchor that is a target of the in-depth check has been processed once, without duplication (S470).

The in-depth check step may itself be subdivided into predetermined sub-stages. In that case, processing proceeds to a later sub-stage only when the earlier sub-stage found a similarity at or above a predetermined threshold.

The compared portion may be the portion obtained by extending a predetermined number of words before and after the division unit in the document, with the extension growing at later sub-stages.

With this configuration, if the first sub-stage uses an extension of 20 words, the next uses 40 words and the one after that 60 words, so that the compared portion grows step by step. When the simple comparison of an earlier sub-stage finds little similarity, all later sub-stages can be skipped, which secures speed.

For compared portions A and B, which are the comparison sections of the two documents, the in-depth check can obtain both the similarity of section B with section A as the reference (A→B section similarity) and the similarity of section A with section B as the reference (B→A section similarity). When the weighted stem-emphasis method described below is used (a method that assigns differential penalties according to syllable position and to insertions and deletions), the two similarities do not coincide. The similarity is therefore asymmetric, and this result reveals the direction of document plagiarism described later.

<Similarity calculation by local alignment>

In the in-depth check of the present invention, similarity calculation by local alignment can be used as one of the precise comparison methods.

Local alignment is a method used to analyze the similarity of genes when comparing gene sequences; it quickly finds similar parts in large volumes of information. The present invention applies this method: it takes a section of a document and uses a scoring matrix to find the part with the highest similarity within the section.

However, because the lengths of the sections taken from the two documents become the numbers of rows and columns of the scoring matrix, this method inherently requires a large amount of computation and takes a long time. If local alignment were performed without exception on every comparison section corresponding to each anchor in the dictionary structure, the plagiarism search time for the whole document would naturally become long as well. The number of local alignments, that is, the number of targets, therefore needs to be reduced, and this is what the preliminary check achieves.
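To make the cost concrete, here is a minimal local alignment sketch in Python. The text does not name a specific algorithm or scoring values; this sketch assumes the standard Smith-Waterman recurrence with hypothetical scores (match +1, mismatch -1, gap -1) and a hypothetical function name. The nested loops visit every cell of a matrix whose dimensions are the two section lengths, which is the source of the running time discussed above.

```python
from typing import Sequence


def local_alignment_score(
    a: Sequence,
    b: Sequence,
    match: float = 1.0,
    mismatch: float = -1.0,
    gap: float = -1.0,
) -> float:
    """Best local alignment score between a and b (Smith-Waterman recurrence).

    Scores are illustrative assumptions, not values from the patent.
    Time is O(len(a) * len(b)); only two matrix rows are kept in memory.
    """
    best = 0.0
    prev = [0.0] * (len(b) + 1)
    for x in a:
        curr = [0.0]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            cell = max(0.0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The arguments can be strings (aligned syllable by syllable) or lists of words (aligned word by word), since only equality between elements is used.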

<Similarity calculation that emphasizes the stem without separating stem and ending>

In the in-depth check of the present invention, a weighted stem-emphasis method can also be used as a similarity calculation method that pursues speed and simplicity without greatly sacrificing accuracy.

In general, an accurate plagiarism search (similarity calculation) between documents requires dividing each word into stem and ending, treating all endings as stop words and removing them, and then comparing the stems with each other. This process, however, demands a great deal of time and long-accumulated know-how, so a different method is needed to search for plagiarism in Korean text quickly.

In the present invention, each syllable is given a different weight according to its position within the word. To aid understanding of the reason for the weighting, refer to the distribution of syllable counts per word in FIG. 12.

FIG. 12 is a graph showing, as percentages, how words in ordinary documents are distributed by syllable count. Words of two syllables account for 30% and words of three syllables for 35%, so most words consist of two or three syllables.

As the distribution shows, words of two or three syllables account for 65%. Since the ending is in many cases the final syllable, the stem is one or two syllables long. The syllables carrying the important information are therefore the first and second syllables.

Accordingly, instead of dividing each word into stem and ending and removing the ending, giving high weights to the first two syllables and a low weight to the last syllable achieves, in effect, the removal of the ending.

For example, as in Table 6, suppose there are the sentence

"철수는 영희의 절친한 친구이다" ("Cheolsu is Younghee's close friend")

and the sentence

"철수가 영희를 친한 친구사이로 여긴다" ("Cheolsu regards Younghee as a close friend").

Comparing the first words of the two sentences, "철수는" and "철수가" share their first two syllables and differ in the last. The first two syllables are the stem and the last syllable is a particle. The important part thus sits at the front of the word.

As a specific example of position-dependent syllable weights based on this principle: within a word, a match on the first syllable scores 0.9 points, a match on the second syllable scores 0.8 points, and a match on the n-th syllable, from the third syllable onward, scores 0.3^(n-2) points.

Applying this to the example in Table 6: as in Table 6(a), the absolute similarity of sentence B with sentence A as the reference (A→B sentence similarity) is 5.1 points, as given in the 'score' column.

To judge how large this score is, a reference for comparison is needed. If sentence B were identical to sentence A (a 100% perfect match), no higher similarity could occur. The maximum similarity in that case, that is, the perfect-match similarity (A→A sentence similarity), is 8.09 points, as given in the 'match' column.

Sentence B therefore has a relative similarity of 5.1 / 8.09 ≈ 0.63, or 63%, with respect to sentence A.

The absolute similarity of sentence A with sentence B as the reference (B→A sentence similarity) can also be considered. As in Table 6(b), it is 5.1 points, as given in the 'score' column. The maximum similarity when sentence A is identical to sentence B (a 100% perfect match), that is, the perfect-match similarity (B→B sentence similarity), is 9.817 points, as given in the 'match' column.

Sentence A therefore has a relative similarity of 5.1 / 9.817 ≈ 0.52, or 52%, with respect to sentence B.
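The figures quoted for the two example sentences can be reproduced with a short Python sketch. Several things here are assumptions made for illustration and are not stated in the text: the function names, the pairing of words in `ALIGNED_PAIRS` (supplied by hand instead of being produced by local alignment), and the position-by-position syllable comparison without insertion or deletion penalties.

```python
def position_weight(n: int) -> float:
    """Weight of a match at the n-th syllable of a word (n starts at 1)."""
    if n == 1:
        return 0.9
    if n == 2:
        return 0.8
    return 0.3 ** (n - 2)


def perfect_match_score(sentence: str) -> float:
    """Score of a sentence compared with itself: every syllable matches."""
    return sum(
        position_weight(n)
        for word in sentence.split()
        for n in range(1, len(word) + 1)
    )


def pair_score(reference_word: str, other_word: str) -> float:
    """Sum of weights at positions where both words have the same syllable."""
    return sum(
        position_weight(n)
        for n, (r, o) in enumerate(zip(reference_word, other_word), start=1)
        if r == o
    )


SENTENCE_A = "철수는 영희의 절친한 친구이다"
SENTENCE_B = "철수가 영희를 친한 친구사이로 여긴다"
# Hypothetical hand-made word alignment (reference word, compared word).
ALIGNED_PAIRS = [
    ("철수는", "철수가"),
    ("영희의", "영희를"),
    ("절친한", "친한"),
    ("친구이다", "친구사이로"),
]

absolute_ab = sum(pair_score(a, b) for a, b in ALIGNED_PAIRS)
absolute_ba = sum(pair_score(b, a) for a, b in ALIGNED_PAIRS)
relative_ab = absolute_ab / perfect_match_score(SENTENCE_A)
relative_ba = absolute_ba / perfect_match_score(SENTENCE_B)
```

Under these assumptions the absolute score is about 5.1 in both directions, the perfect-match scores are about 8.09 and 9.817, and the two relative similarities differ only because the denominators differ.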

As described later, these relative similarities are compared with a predetermined threshold to decide whether plagiarism has occurred, and the fact that the (A→B similarity) and the (B→A similarity) differ makes it possible to determine the direction of plagiarism.

<Relationship between thresholds T1 and T2>

Above, the optimum of the stop-word threshold (T1) and the optimum of the preliminary-check pass threshold (T2) were each obtained through separate experiments. However, both thresholds affect the accuracy of the in-depth check, and when the individually obtained optima were applied together, the accuracy of the plagiarism search was found not to be guaranteed. The optimum values of the thresholds (T1, T2) that guarantee the accuracy of the plagiarism search were therefore determined by a further experiment, whose results are shown below.

Table 7
Threshold (T1)   Threshold (T2)   Time (sec)   Search performance (%)
0.001            0.070             12           17
0.004            0.070             46           73
0.005            0.070             52           73
0.006            0.070             56           73
0.006            0.060             79           87
0.006            0.050             84           87
0.007            0.070             65           73
0.008            0.070             69           73
0.008            0.050            217           93
0.009            0.070             81           93
0.009            0.060             97          100
0.009            0.050            108          100
0.010            0.050            233          100
1.000            0.000            123          100

For Table 7, the experiment varied the thresholds (T1, T2) and examined the time taken by the plagiarism search and whether the inserted plagiarized passages were found correctly.

Because the thresholds (T1, T2) affect the system jointly, a smaller T1 and a larger T2 did not always reduce the time and the accuracy of the plagiarism search. In general, however, the time increased with T1 and decreased as T2 increased, and the accuracy of the plagiarism search likewise increased with T1 and decreased as T2 increased. The optimum values when T1 and T2 are applied together were 0.009 and 0.05, respectively, unlike the optima found when each was applied alone.

<Effect of multi-stage search>

An experiment examined how effective it is to set the plagiarism search sections in multiple stages, for example two stages. Out of the number of words in the dictionary structure built from six documents, it measured how many words pass the preliminary check and, of those, how many pass the in-depth check. The results are shown below.

The <two-stage search results> graph plots the ratio of words passing the in-depth check section, as shown in Table 8 above. It is apparent at a glance that the number is considerably reduced compared with the number of words that originally had to be compared.

That is, the number of words passing the preliminary check averages 0.873%, and the number of words passing the in-depth check averages 0.727%. After the two-stage plagiarism search sections, the number of words that must be examined for plagiarism therefore falls to an average of 0.634%. In this way, the two-stage plagiarism search sections can cut the time nearly in half.

<Similarity of the whole document>

FIG. 6 is a detailed flowchart of the document plagiarism determination step (S500).

As described above, the present invention splits each document to be compared into its words, removes the stop words from them, and forms anchors of a predetermined size from the remaining words to build a dictionary structure. It then performs the preliminary check on all anchors without duplication and performs the in-depth check on sections whose similarity is at or above a predetermined level. The following describes how the similarity between the compared documents is calculated from the similarities resulting from the in-depth check.

The document similarity can basically be calculated from the sum of the similarities of the sections, that is, the compared portions, that have gone through the in-depth check (S520). When two documents A and B are compared, the (A→B document similarity) is calculated from the sum of the (A→B section similarities) of the compared portions examined in the in-depth check, and the (B→A document similarity) is calculated from the sum of the (B→A section similarities) of the compared portions.

For this, the in-depth similarity for the case where the two documents match completely, that is, the similarity of document A with document A as the reference and the similarity of document B with document B as the reference, needs to be calculated in advance (S510). This perfect-match similarity is the maximum similarity and therefore serves as the reference for the similarity of a partial match.

Considering the partial similarity of each compared portion: if the similarity calculated in the in-depth check step is called the absolute similarity, and the similarity when the compared portions match completely is called the perfect-match similarity, then the ratio of the absolute similarity to the perfect-match similarity can be called the relative similarity.

Once the in-depth check has finished for all common anchors that are its targets, the document plagiarism determination step begins. Here the document similarity is calculated by accumulating the absolute similarities among the partial similarities, and the relative similarity may be used as the criterion for deciding whether each partial similarity is included in the accumulation.

For the document similarity of the whole document, the accumulation of the absolute similarities among the partial similarities of the compared portions can be called the document absolute similarity, and the similarity when the documents match completely can be called the document perfect-match similarity.

The ratio of the document absolute similarity to the document perfect-match similarity can be called the document relative similarity. The document absolute similarity and the document relative similarity are collectively called the document similarity (S530). The document similarity is thus calculated from the sum of the similarities resulting from the in-depth check.

Whether a document is plagiarized is preferably determined from the document relative similarity. That is, if the document relative similarity is at or above a predetermined threshold, the result is judged to be 'document plagiarism'.
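The accumulation and the resulting decision can be sketched in Python as follows. The names `SectionScore`, `document_similarity`, and `is_document_plagiarism` are hypothetical, and the threshold values are placeholders. One point is an explicit assumption: the document perfect-match similarity is taken here as the sum of the per-section perfect-match scores, whereas the text only says it is computed in advance for fully matching documents.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SectionScore:
    absolute: float  # absolute similarity from the in-depth check
    perfect: float   # perfect-match similarity of the same section


def document_similarity(
    sections: Iterable[SectionScore], section_threshold: float
) -> Tuple[float, float]:
    """Return (document absolute similarity, document relative similarity).

    A section contributes its absolute similarity only when its relative
    similarity (absolute / perfect) reaches section_threshold.
    """
    sections = list(sections)
    doc_perfect = sum(s.perfect for s in sections)  # assumption, see lead-in
    doc_absolute = sum(
        s.absolute
        for s in sections
        if s.perfect > 0 and s.absolute / s.perfect >= section_threshold
    )
    doc_relative = doc_absolute / doc_perfect if doc_perfect else 0.0
    return doc_absolute, doc_relative


def is_document_plagiarism(doc_relative: float, document_threshold: float) -> bool:
    """Judge 'document plagiarism' when the relative similarity reaches the bar."""
    return doc_relative >= document_threshold
```

Running the function once with the A→B section scores and once with the B→A section scores gives the two directional document similarities.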

As another example, determining document plagiarism from the document absolute similarity on the basis of a probability model (S540) allows objective quantification; a specific example is described later.

<Application of a probability model>

FIG. 7 is a detailed flowchart of the result output step (S600).

Whether a document is plagiarized is preferably determined by mapping the document similarity onto a predetermined probability model, thereby calculating the degree of document plagiarism as an objective number, and comparing that number with a threshold. The probability model is preferably a function derived by comparing many independent, non-plagiarized documents with one another and statistically summarizing the probability that expressions similar enough to be suspected of plagiarism appear by chance even though no plagiarism took place. This function can be called the plagiarism probability function or the plagiarism degree function, and its graph can be called the plagiarism degree graph.

In other words, for ordinary independent documents, that is, documents that differ in field or terminology and are therefore accepted as not having plagiarized one another, it is possible to investigate statistically the probability that identical expressions or content that could be mistaken for plagiarism appear by chance.

Studies show that when the frequency of occurrence of similar portions is plotted on the vertical axis and the absolute similarity on the horizontal axis, the graph of this probability function differs from a normal distribution. Starting from the origin, it peaks while the absolute similarity is still fairly small, and as the absolute similarity increases further, the frequency of such similar portions converges to zero. Such a function can be represented (see FIG. 14) or approximated (see FIG. 15) by, for example, a Gumbel function.

Accordingly, this probability model, the probability function that expresses it, or the plagiarism degree graph that plots it is taken as the reference. The (A→B document absolute similarity) or (B→A document absolute similarity) obtained for the whole document after the in-depth inspection (S610) is mapped onto and compared with the probability model, probability function, or plagiarism degree graph, which makes it possible to judge whether that absolute similarity amounts to plagiarism. That is, the probability model is generated and prepared in advance from experiments on a variety of independent documents. When a plagiarism search is performed with the model, the system is preferably configured to obtain and output a concrete probability value expressing the degree of plagiarism, taken from the value of the model's function that corresponds to the absolute similarity or to the document similarity calculated from it.

For example, in the Gumbel function graph for independent, that is, non-plagiarized, documents of 2000 words (see FIG. 13), the probability of obtaining an absolute similarity of 27 or more is less than 5%. In other words, even when no plagiarism has occurred, an absolute similarity of 27 or more arises only with a very small probability of less than 5%. Therefore, if the absolute similarity obtained from the whole-document similarity calculation after the in-depth inspection of the present invention is higher than 27, the document can be judged to have a probability of being non-plagiarized of less than 5%, that is, a probability of being plagiarized of 95% or more. If the threshold is a non-plagiarism probability of 15%, the document can therefore be concluded to be plagiarized.
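The judgment above can be sketched as a tail-probability lookup, assuming the fitted model is a Gumbel (type I extreme value) distribution with location `mu` and scale `beta`. The function names are hypothetical, and the parameter values in the usage note are made up for illustration; they are not the values fitted in FIG. 13.

```python
import math


def gumbel_tail_probability(x: float, mu: float, beta: float) -> float:
    """P(X >= x) for a Gumbel distribution with location mu and scale beta."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    z = (x - mu) / beta
    if z < -700.0:  # exp(-z) would overflow; the tail probability is 1 here
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 for accuracy in the far tail
    return -math.expm1(-math.exp(-z))


def judge_by_model(absolute_similarity: float, mu: float, beta: float,
                   threshold: float) -> tuple[float, bool]:
    """Return (chance probability, plagiarism flag).

    The chance probability is how likely independent documents are to reach
    this absolute similarity. The pair is flagged when that probability
    falls below the threshold.
    """
    p_chance = gumbel_tail_probability(absolute_similarity, mu, beta)
    return p_chance, p_chance < threshold
```

With the illustrative parameters `mu=10.0`, `beta=5.0` and a threshold of 0.15, calling `judge_by_model(27.0, 10.0, 5.0, 0.15)` returns a chance probability below the threshold, so the pair would be flagged. In practice `mu` and `beta` would have to be fitted to similarity scores measured on independent documents of comparable length.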

<Path of Plagiarism>

In the weighted similarity calculation of the in-depth inspection, let the pair of documents being compared be document A and document B. The similarity of the compared portions is then an asymmetric similarity: the similarity of document B measured against document A (A→B document similarity) differs from the similarity of document A measured against document B (B→A document similarity). This is because the weights are configured so that the weight for an inserted portion, which is added in the compared document relative to the reference document (insertion penalty), differs from the weight for a deleted portion, which is missing from the compared document relative to the reference document (deletion penalty).
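The source of the asymmetry can be illustrated with a plain local alignment (Smith-Waterman style) over word tokens in which insertion and deletion carry different penalties. This is a sketch under stated assumptions, not the in-depth inspection itself: the function name, all score values, and the token-level granularity are illustrative, and the weighting by syllable position within a word is not reproduced.

```python
def directional_similarity(base, target, match=2.0, mismatch=-1.0,
                           insert_penalty=1.0, delete_penalty=2.0):
    """Best local-alignment score of `target` measured against `base`.

    An insertion is a token present in `target` but not in `base`;
    a deletion is a token of `base` that is missing from `target`.
    """
    prev = [0.0] * (len(target) + 1)
    best = 0.0
    for i in range(1, len(base) + 1):
        cur = [0.0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            step = match if base[i - 1] == target[j - 1] else mismatch
            diagonal = prev[j - 1] + step            # align the two tokens
            deleted = prev[j] - delete_penalty       # skip a base token
            inserted = cur[j - 1] - insert_penalty   # skip a target token
            cur[j] = max(0.0, diagonal, deleted, inserted)
            best = max(best, cur[j])
        prev = cur
    return best


doc_a = "a b c d".split()
doc_b = "a b x c d".split()   # doc_b has one extra token relative to doc_a
a_to_b = directional_similarity(doc_a, doc_b)   # the extra token is an insertion
b_to_a = directional_similarity(doc_b, doc_a)   # the same token is a deletion
```

Because the extra token is charged the insertion penalty in one direction and the deletion penalty in the other, the two scores differ (7.0 and 6.0 with the illustrative values above).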

To determine the direction of plagiarism, that is, which document is the original and which is the plagiarized one, the weighted in-depth inspection needs to calculate both the (A→B section similarity) and the (B→A section similarity).

In the document plagiarism determination step, the (A→B document similarity) and the (B→A document similarity) then need to be calculated from the (A→B section similarity) and the (B→A section similarity), respectively.

The direction in which plagiarism took place can be determined by comparing the values of the (A→B document similarity) and the (B→A document similarity), because the similarities in the two directions are asymmetric, that is, they come out different. Depending on how the penalties are assigned, plagiarism may be judged to have taken place in the direction for which the larger similarity was calculated, or in the direction for which the smaller similarity was calculated. In the present embodiment, plagiarism is judged to have taken place in the direction for which the larger document similarity was calculated. That is, for example, the plagiarism direction is determined to be the direction of higher similarity (S630).
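The comparison rule can be sketched as follows. The function name and the `larger_wins` switch are assumptions introduced to cover both penalty conventions mentioned above; the present embodiment corresponds to `larger_wins=True`. How a tie should be handled is not stated in the text, so the sketch simply reports no direction.

```python
def plagiarism_direction(sim_a_to_b: float, sim_b_to_a: float,
                         larger_wins: bool = True):
    """Return ("A", "B") for direction A→B, ("B", "A") for B→A, or None on a tie."""
    if sim_a_to_b == sim_b_to_a:
        return None  # equal similarities give no directional evidence
    a_to_b = sim_a_to_b > sim_b_to_a
    if not larger_wins:
        a_to_b = not a_to_b  # alternative convention: the smaller value decides
    return ("A", "B") if a_to_b else ("B", "A")
```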

In addition, the (A→B document similarity) and the (B→A document similarity) may be used, as described above, to output a plagiarism degree graph or a probability value expressing the degree of plagiarism (S630).

Preferably, the plagiarism direction is calculated for every pair of documents being compared, and the result is then displayed as a plagiarism path diagram (graph) in which the documents are connected by arrows according to the plagiarism direction and the degree of plagiarism (probability value) (S640). Examples of calculating the plagiarism direction and path and of displaying the graph are described later.
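Collecting the directed arrows for all pairs can be sketched as below, assuming the asymmetric document similarities are already available in a dictionary keyed by `(base, target)`. The function name, the data layout, and the choice of the larger similarity as the arrow label are illustrative assumptions; the text labels arrows with a plagiarism degree derived from the probability model, which is not reproduced here.

```python
from itertools import combinations


def build_plagiarism_paths(doc_ids, similarity):
    """Return a list of (source, destination, value) arrows.

    `similarity[(x, y)]` is the x→y document similarity. For each pair the
    arrow follows the direction with the larger similarity. Pairs with equal
    similarities produce no arrow.
    """
    edges = []
    for a, b in combinations(doc_ids, 2):
        ab = similarity[(a, b)]
        ba = similarity[(b, a)]
        if ab == ba:
            continue
        edges.append((a, b, ab) if ab > ba else (b, a, ba))
    return edges
```

The resulting edge list can be passed to any graph drawing tool to render the path diagram.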

<Experimental Example 1>

The present invention was applied to the top 16 documents returned by a Naver News search for "핸드볼 편파판정" (biased officiating in handball). Ten of these documents are reproduced below.

<Document 1> Nate.com

Bitter semifinal loss to Qatar... pushed into the third-place match (Doha = Yonhap News) Special Reporting Team = The Korean men's handball team's dream of a sixth consecutive title at the 2006 Doha Asian Games was dashed by the biased officiating of Middle Eastern referees.

On the morning of the 12th (Korean time), Korea faced host nation Qatar in the men's handball semifinal at Al-Gharafa Stadium in Doha, Qatar, and, caught up in the blatantly biased officiating of the Middle Eastern referees, lost 28-40, a 12-point margin. The men's handball team, which had won five consecutive Asian Games titles from the 1986 Seoul Games through the 2002 Busan Games, was robbed of its place at the top of Asia by the referees' blatant abuse of power. The referees assigned to this match were Ali Abdul Hussein and Sami Kalaf of Kuwait. They made biased calls from the very start. Thirty seconds into the match Lee Tae-young scored from the left side, but the goal was disallowed on the grounds that he had stepped on the line, and on the ensuing fast break Baek Won-chul was called for oversteps. From then on the referees' biased calls were concentrated on Yoon Kyung-shin, the 203 cm attacker at the core of the offense. In the 4th minute of the first half Yoon had taken only a couple of steps on a fast break when the referee immediately called oversteps, and when Yoon looked dumbfounded he was given a 2-minute suspension. The score widened to 0-4, and Korea's first goal did not come until the 5th minute of the first half, from pivot Park Jung-gyu.

From then on the referees resorted to turnovers (awarding possession to the other side). In handball, a sport with plenty of body contact, they whistled a turnover whenever a Korean player so much as touched an opponent. When Korea defended, they invariably gave a 2-minute suspension plus a 7 m throw. Throughout the match Korea had no choice but to attack with five players, not counting the goalkeeper. Korea could rely only on long-range shots, and Yoon Kyung-shin, Baek Won-chul, and Lee Jun-hee steadily added points with long shots after feints from 15 m out, but Korea finished the first half trailing 13-19, a 6-point gap. In the second half the biased officiating grew worse, and in the 8th minute of the second half Kim Jang-moon was even disqualified with a red card. The score went to 16-28, a 12-point gap.

Korea did not give up, but whenever it tried to rally the referees unfailingly blew the whistle. Pivot Park Jung-gyu, whose position involves the most physical contact, was constantly called for turnovers and also received an absurd 2-minute suspension. In the 16th minute of the second half the score reached 17-32, a full 15-point gap, and further pursuit became meaningless. It was a match in which Korea fought with five or six players while Qatar fought with nine, counting the two referees. Over the whole match Korea received ten 2-minute suspensions and Qatar three.

<Document 2> Raran Blog

Korea lost 26-32 in the final Group F match of the Asian Games men's handball preliminary league, held on the 9th (Korean time) at the Al-Gharafa Indoor Hall in Doha, Qatar, dragged down by the biased officiating of referees from Qatar. With 1 win, 1 draw, and 1 loss for 3 points, Korea now has to contend with Japan for second place, and its goal of a sixth consecutive Asian Games title has become uncertain. If Japan, which still has its final match against Bahrain to play, beats Bahrain by 18 goals, Korea cannot advance to the semifinals. The current goal differences are Korea 8 and Japan -11. Throughout the match the two referees broke up Korea's attacking flow or sided with Kuwait. In the 22nd minute of the first half they even sent off Kim Tae-wan, saying he had made an insulting gesture at the referee. In the second half Lee Jae-woo was sent off for 4 minutes with a red card and Park Jung-gyu received a 2-minute suspension, so that Korea at one point played with three attackers.

<Document 3> losbaby blog

Park Do-heon, head coach of the Korean men's national handball team, looked up at the scoreboard when the match ended. It was a 26-32 defeat, a 6-point margin. Coach Park was furious. The reason was the two Qatari referees.

The final Group F match of the handball preliminary league between Korea and Kuwait, held on the 9th (Korean time) at the Al-Gharafa Indoor Hall in Doha, Qatar. Whenever there was a scramble, the referees invariably awarded the ball to Kuwait. Whenever Korea got a fast-break chance, they blew the whistle and told Korea to restart the attack from the sideline.

The most absurd incident came in the 22nd minute of the first half. The referee threw Kim Tae-wan off the court, claiming Kim had insulted him with a hand gesture. Around the 9th minute of the second half Lee Jae-woo received a red card and was sent off for 4 minutes, and Park Jung-gyu too received a 2-minute suspension. It was a match that simply could not be won.

Korea, now at 1 win, 1 draw, and 1 loss, fortunately edged Japan on goal difference and narrowly advanced to the semifinals. In accordance with the rules, Korea lodged an objection to the officiating within one hour of the end of the match. Of the 12 referee pairs (24 referees) taking part in this tournament, as many as 9 pairs are from the Middle East. Coach Park said, "Kuwait, which holds the presidency of the Asian Handball Federation, is playing games with Middle Eastern referees out in front in order to knock Korea out and take the gold medal."

The Middle East's move to 'kill Korean men's handball' had already been detected before the opening. When ace Yoon Kyung-shin, who plays in the German handball Bundesliga, was late joining the national team because of his club's schedule, the tournament organizing committee tried to strip him of the chance to play. It gave notice that he could not compete unless his passport was submitted to the committee by the 1st, the day of the opening ceremony. This was contrary to custom. Fortunately the Korea Sports Council stepped in to mediate and the problem was resolved.

Korean men's handball is a 'dutiful son' to Korea but a 'public enemy' to other countries. Having stayed at the top for 16 years, from the 1986 Seoul Games to the 2002 Busan Games, it is under heavy pressure from its rivals. The unpopular sport, nicknamed '한데볼' (handeball), has a hard time abroad as well. The only way to ease the sorrow is to achieve a sixth consecutive Asian Games title.

<Document 4> Hankyoreh 21

The Korean delegation has formally protested the biased officiating by Middle Eastern referees in a men's handball match.

The delegation said it had "sent a letter of protest to the Olympic Council of Asia (OCA), the tournament organizing committee, and the Asian Handball Federation (AHF) over the biased officiating of the two referees from Qatar in the final men's handball main-round league match between Korea and Kuwait on the 9th." In that match Korea struggled against home-favoring calls throughout and lost by 6 points.

<Document 5> Yahoo Media

Special Reporting Team = The Korean men's handball team's dream of a sixth consecutive title at the 2006 Doha Asian Games was dashed by the biased officiating of Middle Eastern referees. On the morning of the 12th (Korean time), Korea faced host nation Qatar in the men's handball semifinal at Al-Gharafa Stadium in Doha, Qatar, and, caught up in the blatantly biased officiating of the Middle Eastern referees, lost 28-40, a 12-point margin. The men's handball team, which had won five consecutive Asian Games titles from the 1986 Seoul Games through the 2002 Busan Games, was robbed of its place at the top of Asia by the referees' blatant abuse of power.

The referees assigned to this match were Ali Abdul Hussein and Sami Kalaf of Kuwait. They made biased calls from the very start. Thirty seconds into the match Lee Tae-young scored from the left side, but the goal was disallowed on the grounds that he had stepped on the line, and on the ensuing fast break Baek Won-chul was called for oversteps.

From then on the referees' biased calls were concentrated on Yoon Kyung-shin, the 203 cm attacker at the core of the offense. In the 4th minute of the first half Yoon had taken only a couple of steps on a fast break when the referee immediately called oversteps, and when Yoon looked dumbfounded he was given a 2-minute suspension. The score widened to 0-4, and Korea's first goal did not come until the 5th minute of the first half, from pivot Park Jung-gyu. From then on the referees resorted to turnovers (awarding possession to the other side). In handball, a sport with plenty of body contact, they whistled a turnover whenever a Korean player so much as touched an opponent.

When Korea defended, they invariably gave a 2-minute suspension plus a 7 m throw. Throughout the match Korea had no choice but to attack with five players, not counting the goalkeeper. Korea could rely only on long-range shots, and Yoon Kyung-shin, Baek Won-chul, and Lee Jun-hee steadily added points with long shots after feints from 15 m out, but Korea finished the first half trailing 13-19, a 6-point gap.

In the second half the biased officiating grew worse, and in the 8th minute of the second half Kim Jang-moon was even disqualified with a red card. The score went to 16-28, a 12-point gap. Korea did not give up, but whenever it tried to rally the referees unfailingly blew the whistle. Pivot Park Jung-gyu, whose position involves the most physical contact, was constantly called for turnovers and also received an absurd 2-minute suspension.

In the 16th minute of the second half the score reached 17-32, a full 15-point gap, and further pursuit became meaningless. It was a match in which Korea fought with five or six players while Qatar fought with nine, counting the two referees. Over the whole match Korea received ten 2-minute suspensions and Qatar three.

<Document 6> KBS News

The Korean national delegation taking part in the 2006 Doha Asian Games has formally protested the severely biased officiating by Middle Eastern referees in a men's handball match.

On the 11th (Korean time, here and below) the delegation said it had "sent a letter of protest to the Olympic Council of Asia (OCA), the tournament organizing committee, and the Asian Handball Federation (AHF) over the biased officiating of the two referees from Qatar in the final men's handball main-round league match between Korea and Kuwait held on the 9th." In that match Korea struggled against home-favoring calls throughout and ended up losing by 6 points.

In the letter the delegation said that "such biased officiating runs counter to the spirit of sport" and asked that "this never be allowed to happen again."

Korea plays the semifinal against host nation Qatar at 2 a.m. on the 12th at the Al-Gharafa Indoor Hall in Doha, Qatar. With the arrival of power shooter Yoon Kyung-shin (33, Hamburg), who joined late because of his league schedule, the team is expected to be stronger. But it is obvious that Middle Eastern referees will be used to play games once again. In Qatar handball is the most popular sport after football. Even having received the Korean delegation's letter of protest, Qatar looks set to disregard the spirit of sport entirely for the sake of reaching the final.

Men's handball head coach Park Do-heon (Chosun University) vowed to win, saying, "Right now there is no way to stop the Middle Eastern referees' home-favoring calls. We have no choice but to throw ourselves at it as a matter of life and death."

<Document 7> Chosun Ilbo Sports

The Korean national delegation taking part in the 2006 Doha Asian Games has formally protested the severely biased officiating by Middle Eastern referees in a men's handball match.

In the letter the delegation said that "such biased officiating runs counter to the spirit of sport" and asked that "this never be allowed to happen again." Korea plays the semifinal against host nation Qatar at 2 a.m. on the 12th at the Al-Gharafa Indoor Hall in Doha, Qatar.

With the arrival of power shooter Yoon Kyung-shin (33, Hamburg), who joined late because of his league schedule, the team is expected to be stronger. But it is obvious that Middle Eastern referees will be used to play games once again. In Qatar handball is the most popular sport after football. Even having received the Korean delegation's letter of protest, Qatar looks set to disregard the spirit of sport entirely for the sake of reaching the final.

<Document 8> Media Daum Sports

Watching the handball match yesterday... I was truly speechless... Honestly, even in the Japan match... the problem of the referees' competence had been noticeable... but yesterday's match against Kuwait... I have no words... Red cards handed out left and right... the other side committing offensive fouls freely... their goals scored with oversteps all allowed;;; while even our well-taken goals... were wiped out by forced overstep calls;;;

Even in that absurd situation... (at times our side had only three field players... the rest all sent off for 2 or 4 minutes...) with just three players they scored goals... and in the one-on-one situations that kept arising from the huge numerical disadvantage... they made save after save... Our men's national handball team, who achieved the amazing result of losing by only 6 points... I hope their four years of sweat and effort... do not end in vain, in tears... because of the trickery of referees who ought to have their eyesight and their conscience checked first... ㅠ.ㅠ

I have never once posted in a place like this before... but watching yesterday's match... I was really... so angry I could not sleep at night, so I am writing this... Our men's national handball team, who show truly great play... Believing that our attention... will give them strength... I write this post..

<Document 9> Sports Khan

The Korean delegation has formally protested the severely biased officiating of Middle Eastern handball referees. On the 11th Korea said it had "sent a letter of protest to the Olympic Council of Asia (OCA), the tournament organizing committee, and the Asian Handball Federation (AHF) over the biased officiating of the two referees from Qatar in the final men's handball main-round league match between Korea and Kuwait held on the 9th."

Korea struggled against biased officiating in the Kuwait match and lost by 6 points. Park Do-heon, head coach of the men's national handball team, which plays Qatar in the semifinal at 2 a.m. on the 12th, affirmed his will to win, saying, "We have no choice but to throw ourselves at it as a matter of life and death."

<Document 10> ygclan

It was bad in the Japan match too, and just as I feared, the Kuwait match as well ㅡㅡ

Korean men's handball lost to Kuwait because of the referees' biased officiating but advanced to the semifinals as group runner-up, ahead of Japan on goal difference.

Korea, aiming for a sixth consecutive Asian Games title, went down 26-32 to Kuwait in the final Group F preliminary league match held on the 9th (Korean time) at the Al-Gharafa Indoor Hall in Doha, Qatar, because of the biased officiating of two referees from Qatar. The biased calls of Qatari referees Yousef Al-Hail and Abdulnasser Al-Hamad continued throughout the match.

In ambiguous situations they invariably awarded the ball to Kuwait. Because of the referees' excessive interference Korea could not make proper use of its fast-break chances. In the 22nd minute of the first half Kim Tae-wan was sent off on suspicion of having insulted the referee with a hand gesture. After finishing the first half trailing 10-15, Korea had Lee Jae-woo sent off for 4 minutes with a red card in the 9th minute of the second half and Park Jung-gyu given a 2-minute suspension as well, leaving 5 of its 7 players on court. It was a match that simply could not be won.

Korea, which had beaten Bahrain 43-29 and then drawn with Japan in its second match, ended at 1 win, 1 draw, and 1 loss (3 points). Korea advanced to the semifinals as group runner-up on goal difference over Japan, which also finished with 1 win, 1 draw, and 1 loss. On goal difference Korea stood at +8 and Japan at -11, so Korea was 17 goals ahead. Had Japan beaten Bahrain by 18 goals or more, Korea would have had to hand the semifinal ticket to Japan. But Japan beat Bahrain 25-24, and Korea advanced to the semifinals. After the match, men's handball coach Jang In-ik said, "The referees' biased officiating went too far. In accordance with the rules we lodged an objection to the officiating within one hour."

According to women's handball head coach Kang Tae-gu, Kuwait is behind the biased officiating of the two Qatari referees. "Kuwait, which is after the gold medal, seems to have played games with Qatari coaches out in front in order to knock Korea out," he said.

Coach Kang said this was not the first instance of biased officiating. "At the 12th Asian Men's Handball Championship held in Bangkok, Thailand, this past February, Qatari referees also made biased calls in Korea's match, and we had them permanently expelled through a formal complaint," he said, adding indignantly, "Countries suspected of biased officiating use cunning methods such as putting forward Qatari referees who have almost reached retirement age."

The data of the remaining documents 11 to 16 are omitted.

For all of the above document data, a dictionary structure was built for each document and stop words were removed. Common anchors were then extracted for each combination of documents, and for each combination a preliminary inspection was performed on every common anchor, checking similarity by the degree of syllable matching. Where the similarity was equal to or greater than a predetermined threshold, an in-depth inspection was performed, checking similarity by local alignment with weights assigned according to syllable position within the word.

After the preliminary and in-depth inspections of all common anchors were completed, the asymmetric similarity was calculated for each document combination. From the results, the plagiarism paths were drawn as a graph, as shown in FIG. 17.

In the graph, the labels inside the circles, such as T12 and T3, denote the documents, the arrows denote the direction of plagiarism, and the numbers beside the arrows denote the degree of non-plagiarism (the distance between documents).

For example, document T12 at the top has not plagiarized from any document and is therefore an original, while T17, T3, and T7 are plagiarized copies of T12.
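Reading originals off such a graph can be sketched as finding the documents that no arrow points to. This assumes each arrow is stored as a `(source, copy)` pair pointing from the copied-from document to the copy; the function name and this orientation are assumptions, and the edge list in the usage lines contains only the relations stated in the text, not the full content of FIG. 17.

```python
def find_originals(doc_ids, edges):
    """Return the documents that are not the destination of any arrow."""
    copies = {copy for _source, copy in edges}
    return [doc for doc in doc_ids if doc not in copies]


stated_edges = [("T12", "T17"), ("T12", "T3"), ("T12", "T7")]
originals = find_originals(["T12", "T17", "T3", "T7"], stated_edges)
```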

<Experimental Example 2>

The method of the present invention was applied to the lyrics of a popular song. Note that the original and the revised version compared here are not actually a case of plagiarism; the revision was simply made by the lyricist.

<Document A - Original>

창문을 열고 흠 내다봐요 / 저 높은 곳에 우뚝 걸린 깃발 펄럭이며 / 당신의 텅 빈 가슴으로 불어오는 / 더운 열기의 세찬 바람
Open the window and, hm, look out / The flag hung high up there fluttering / Blowing into your empty heart / A fierce wind of hot air

살며시 눈 감고 들어봐요 / 먼 대지 위를 달리는 사나운 말처럼 / 당신의 고요한 가슴으로 닥쳐오는 / 숨 가쁜 벗들의 말발굽 소리
Gently close your eyes and listen / Like wild horses running across a distant land / Rushing toward your quiet heart / The breathless hoofbeats of friends

누가 내게 손수건 한 장 던져 주리오 / 내 작은 가슴에 얹어 주리오 / 누가 내게 탈춤의 장단을 쳐 주리오 / 그 장단에 춤추게 하리오
Who will throw me a handkerchief / And lay it on my small chest / Who will beat the rhythm of the mask dance for me / And let me dance to that rhythm

나는 고독의 친구 방황의 친구 상념 끊이지 않는 / 번민의 시인이라도 좋겠오 / 나는 일몰의 고갯길을 넘어가는 / 고행의 수도승처럼 / 하늘에 비낀 노을 바라보며 / 시인의 마을에 밤이 오는 소릴 / 들을 테요
I am a friend of solitude, a friend of wandering, of unending thoughts / I would gladly be even a poet of anguish / I, crossing the mountain pass at sunset / Like an ascetic monk / Gazing at the sunset glow slanting across the sky / The sound of night coming to the poet's village / I will hear

우산을 접고 비 맞아 봐요 / 하늘은 더욱 가까운 곳으로 다가와서 / 당신의 그늘진 마음에 비 뿌리는 / 젖은 대기의 애틋한 우수
Fold your umbrella and feel the rain / The sky draws ever closer / Sprinkling rain on your shadowed heart / The tender melancholy of the wet air

누가 내게 다가와서 말 건네 주리오 / 내 작은 손 잡아 주리오 / 누가 내 운명의 길동무 돼 주리오 / 어린 시인의 벗 돼 주리오
Who will come to me and speak to me / And hold my small hand / Who will be the companion of my fate / And be a friend to the young poet

<Document B - Revision>

창문을 열고 흠 내다봐요 / 저 높은 곳에 푸른 하늘 구름 흘러가며 / 당신의 부푼 가슴으로 불어오는 / 맑은 한줄기 산들 바람Open the window and look out / Blue sky clouds flowing high up / Blowing through your swollen heart / A clear breeze

살며시 눈 감고 들어봐요 / 먼 대지 위를 달리는 사나운 말처럼 / 당신의 고요한 가슴으로 닥쳐오는 / 숨 가쁜 자연의 생명의 소리Close your eyes and listen / Like wild horses running on distant lands / Coming to your serene heart / The sound of breathless nature's life

누가 내게 따뜻한 사랑 건네 주리오 / 내 작은 가슴을 달래 주리오 / 누가 내게 생명의 장단을 쳐 주리오 / 그 장단에 춤추게 하리오Whoever handed me a warm love, Jurio / soothe my little breasts, / Who would give me the beat of life / Let me dance to it

나는 자연의 친구 생명의 친구 상념 끊기지 않는 / 사색의 시인이라면 좋겠오 / 나는 일몰의 고갯길을 넘어가는 / 고행의 수도승처럼 / 하늘에 비낀 노을 바라보며 / 시인의 마을에 밤이 오는 소릴 / 들을 테요I'm a friend of nature Life's friend Thoughtless / I wish I was a poet in thought / I'm going over the path of sunset / Like a monk of asceticism / Looking at the oar in the sky / I'll hear the night at the poet's village /

우산을 접고 비 맞아 봐요 / 하늘은 더욱 가까운 곳으로 다가와서 / 당신의 울적한 마음에 비 뿌리는 / 젖은 대기의 애틋한 우수Fold your umbrella and hit the rain / The sky comes closer to you / Raining down on your crying heart / The warmth of the wet atmosphere

누가 내게 다가와서 말 건네 주리오 / 내 작은 손 잡아 주리오 / 누가 내 마음에 위안 돼 주리오 / 어린 시인의 벗 돼 주리오Who comes to me and tells me Zurio / Hold my little hand Zurio / Who's comforting my heart Zurio / You're my young poet Zurio

For each of the two documents, a dictionary structure was built in which the anchors are 3-mer phrases, obtained by splitting each word (eojeol) into pieces of at most three syllables, and the references are the positions at which each anchor occurs. A reference is the serial number of the word, counted from zero at the first word of the document. Records are separated by '/'.
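The dictionary construction described above can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the function names `word_anchors` and `build_anchor_dictionary` are hypothetical, and the sliding-window reading of "3-mer" is an assumption inferred from the dictionary entries (for example, 가슴으 and 슴으로 share the same word positions). The '/' line-break marks in the lyrics are assumed not to count as words.

```python
from collections import defaultdict


def word_anchors(word, k=3):
    """Return the k-syllable anchors of one word (assumed sliding window)."""
    if len(word) <= k:
        return [word]  # short words are used whole
    return [word[i:i + k] for i in range(len(word) - k + 1)]


def build_anchor_dictionary(text, k=3):
    """Map each anchor to the zero-based word positions where it occurs."""
    dictionary = defaultdict(list)
    # Assumption: '/' only marks line breaks in the lyrics and is not a word.
    words = [w for w in text.split() if w != "/"]
    for position, word in enumerate(words):
        # dict.fromkeys removes repeated anchors within one word, keeping order
        for anchor in dict.fromkeys(word_anchors(word, k)):
            dictionary[anchor].append(position)
    return dict(dictionary)


first_lines = "창문을 열고 흠 내다봐요 / 저 높은 곳에 우뚝 걸린 깃발 펄럭이며"
anchors = build_anchor_dictionary(first_lines)
```

Applied to the first two lines of Document A, this yields references such as 창문을 at position 0 and both 펄럭이 and 럭이며 at position 10, in line with the corresponding entries of Dictionary A.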

<Dictionary A - Original>

가까운 - 96 / 가는 - 145 / 가쁜 - 35 / 가슴에 - 48 / 가슴으 - 14 32 / 가와서 - 98 110 / 감고 - 22 / 갯길을 - 74 143 / 건네 - 112 / 걸린 - 8 / 고갯길 - 74 143 / 고독의 - 62 131 / 고요한 - 31 / 고행의 - 76 146 / 곳에 - 6 / 곳으로 - 97 / 그 - 57 / 그늘진 - 100 / 길동무 - 122 / 깃발 - 9 / 끊이지 - 67 136 / 나는 - 61 72 130 141 / 내 - 46 114 120 / 내게 - 40 52 109 / 내다봐 - 3 / 넘어 - 144 / 넘어가 - 75 / 노을 - 80 150 / 높은 - 5 / 누가 - 39 51 108 119 / 눈 - 21 / 다가와 - 98 110 / 다봐요 - 3 / 닥쳐오 - 33 / 달리는 - 27 / 당신의 - 11 30 99 / 대기의 - 105 / 대지 - 25 / 더욱 - 95 / 더운 - 16 / 던져 - 44 / 도승처 - 77 147 / 돼 - 123 128 / 들어봐 - 23 / 들을 - 87 157 / 라보며 - 81 151 / 럭이며 - 10 / 마을에 - 83 153 / 마음에 - 101 / 말 - 111 / 말발굽 - 37 / 말처럼 - 29 / 맞아 - 92 / 먼 - 24 / 바라보 - 81 151 / 바람 - 19 / 밤이 - 84 154 / 방황의 - 64 133 / 번민의 - 69 138 / 벗 - 127 / 벗들의 - 36 / 봐요 - 93 / 불어오 - 15 / 비 - 91 102 / 비낀 - 79 149 / 빈 - 13 / 뿌리는 - 103 / 사나운 - 28 / 살며시 - 20 / 상념 - 66 135 / 세찬 - 18 / 소리 - 38 / 소릴 - 86 156 / 손 - 116 / 손수건 - 41 / 수도승 - 77 147 / 숨 - 34 / 슴으로 - 14 32 / 승처럼 - 77 147 / 시인의 - 82 126 152 / 시인이 - 70 139 / 않는 - 68 137 / 애틋한 - 106 / 어가는 - 75 / 어린 - 125 / 어봐요 - 23 / 어오는 - 15 / 얹어 - 49 / 열고 - 1 / 열기의 - 17 / 오는 - 85 155 / 우뚝 - 7 / 우산을 - 89 / 우수 - 107 / 운명의 - 121 / 위를 - 26 / 이라도 - 70 139 / 인이라 - 70 139 / 일몰의 - 73 142 / 작은 - 47 115 / 잡아 - 117 / 장 - 43 / 장단에 - 58 / 장단을 - 54 / 저 - 4 / 접고 - 90 / 젖은 - 104 / 좋겠오 - 71 140 / 주리오 - 45 50 56 113 118 124 129 / 창문을 - 0 / 쳐 - 55 / 쳐오는 - 33 / 춤추게 - 59 / 친구 - 63 65 132 134 / 탈춤의 - 53 / 텅 - 12 / 테요 - 88 158 / 펄럭이 - 10 / 하늘에 - 78 148 / 하늘은 - 94 / 하리오 - 60 / 한 - 42 / 흠 - 2

<Dictionary B - Revision>

가까운 - 94 / 가는 - 143 / 가쁜 - 34 / 가슴으 - 13 31 / 가슴을 - 46 / 가와서 - 96 108 / 감고 - 21 / 갯길을 - 72 141 / 건네 - 42 110 / 고갯길 - 72 141 / 고요한 - 30 / 고행의 - 74 144 / 곳에 - 6 / 곳으로 - 95 / 구름 - 9 / 그 - 55 / 끊기지 - 65 134 / 나는 - 59 70 128 139 / 내 - 44 112 118 / 내게 - 39 50 107 / 내다봐 - 3 / 넘어 - 142 / 넘어가 - 73 / 노을 - 78 148 / 높은 - 5 / 누가 - 38 49 106 117 / 눈 - 20 / 다가와 - 96 108 / 다봐요 - 3 / 닥쳐오 - 32 / 달래 - 47 / 달리는 - 26 / 당신의 - 11 29 97 / 대기의 - 103 / 대지 - 24 / 더욱 - 93 / 도승처 - 75 145 / 돼 - 121 126 / 들어봐 - 22 / 들을 - 85 155 / 따뜻한 - 40 / 라보며 - 79 149 / 러가며 - 10 / 마을에 - 81 151 / 마음에 - 99 119 / 말 - 109 / 말처럼 - 28 / 맑은 - 15 / 맞아 - 90 / 먼 - 23 / 바라보 - 79 149 / 바람 - 18 / 밤이 - 82 152 / 벗 - 125 / 봐요 - 91 / 부푼 - 12 / 불어오 - 14 / 비 - 89 100 / 비낀 - 77 147 / 뿌리는 - 101 / 사나운 - 27 / 사랑 - 41 / 사색의 - 67 136 / 산들 - 17 / 살며시 - 19 / 상념 - 64 133 / 생명의 - 36 51 62 131 / 소리 - 37 / 소릴 - 84 154 / 손 - 114 / 수도승 - 75 145 / 숨 - 33 / 슴으로 - 13 31 / 승처럼 - 75 145 / 시인의 - 80 124 150 / 시인이 - 68 137 / 않는 - 66 135 / 애틋한 - 104 / 어가는 - 73 / 어린 - 123 / 어봐요 - 22 / 어오는 - 14 / 열고 - 1 / 오는 - 83 153 / 우산을 - 87 / 우수 - 105 / 울적한 - 98 / 위를 - 25 / 위안 - 120 / 이라면 - 68 137 / 인이라 - 68 137 / 일몰의 - 71 140 / 자연의 - 35 60 129 / 작은 - 45 113 / 잡아 - 115 / 장단에 - 56 / 장단을 - 52 / 저 - 4 / 접고 - 88 / 젖은 - 102 / 좋겠오 - 69 138 / 주리오 - 43 48 54 111 116 122 127 / 창문을 - 0 / 쳐 - 53 / 쳐오는 - 32 / 춤추게 - 57 / 친구 - 61 63 130 132 / 테요 - 86 156 / 푸른 - 7 / 하늘 - 8 / 하늘에 - 76 146 / 하늘은 - 92 / 하리오 - 58 / 한줄기 - 16 / 흘러가 - 10 / 흠 - 2

Common anchors were extracted from the two dictionaries. The common anchors determine the portions to be compared in the preliminary and in-depth inspections.

<Common Anchors>

가까운 / 가는 / 가쁜 / 가슴으 / 가와서 / 감고 / 갯길을 / 건네 / 고갯길 / 고요한 / 고행의 / 곳에 / 곳으로 / 그 / 나는 / 내 / 내게 / 내다봐 / 넘어 / 넘어가 / 노을 / 높은 / 누가 / 눈 / 다가와 / 다봐요 / 닥쳐오 / 달리는 / 당신의 / 대기의 / 대지 / 더욱 / 도승처 / 돼 / 들어봐 / 들을 / 라보며 / 마을에 / 마음에 / 말 / 말처럼 / 맞아 / 먼 / 바라보 / 바람 / 밤이 / 벗 / 봐요 / 불어오 / 비 / 비낀 / 뿌리는 / 사나운 / 살며시 / 상념 / 소리 / 소릴 / 손 / 수도승 / 숨 / 슴으로 / 승처럼 / 시인의 / 시인이 / 않는 / 애틋한 / 어가는 / 어린 / 어봐요 / 어오는 / 열고 / 오는 / 우산을 / 우수 / 위를 / 인이라 / 일몰의 / 작은 / 잡아 / 장단에 / 장단을 / 저 / 접고 / 젖은 / 좋겠오 / 주리오 / 창문을 / 쳐 / 쳐오는 / 춤추게 / 친구 / 테요 / 하늘에 / 하늘은 / 하리오 / 흠

<Stopword Removal>

The common anchors also include stopwords. Stopwords do not help in detecting plagiarism and should therefore be removed. Examples of stopwords are pronouns that carry no meaning and frequently repeated topic words.

Among the common anchors, the following are preferably treated as stopwords. The preliminary and in-depth inspections are preferably not performed on them.

나는 / 내 / 내게 / 누가 / 당신의 / 시인의 / 주리오 / 친구 (I / my / to me / who / your / the poet's / will give / friend)
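The extraction of common anchors and the removal of stopwords amount to a set intersection followed by a set difference. The sketch below assumes the dictionaries are plain Python dicts mapping anchors to position lists; the function name `common_anchors` is hypothetical, and the small dictionaries are excerpts of Dictionary A and Dictionary B above.

```python
def common_anchors(dictionary_a, dictionary_b, stopwords):
    """Anchors present in both dictionaries, minus the stopwords, sorted."""
    shared = dictionary_a.keys() & dictionary_b.keys()
    return sorted(shared - set(stopwords))


# Excerpts of Dictionary A and Dictionary B
excerpt_a = {"가쁜": [35], "가슴에": [48], "고요한": [31], "나는": [61, 72, 130, 141]}
excerpt_b = {"가쁜": [34], "가슴을": [46], "고요한": [30], "나는": [59, 70, 128, 139]}

candidates = common_anchors(excerpt_a, excerpt_b, stopwords={"나는"})
```

Here 가슴에 and 가슴을 drop out because each occurs in only one document, and 나는 drops out as a stopword, leaving 가쁜 and 고요한 as candidates for the preliminary inspection.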

<Preliminary Inspection>

All common anchors other than the stopwords are subject to the preliminary inspection. When a common anchor is inspected, the portions to be compared are preferably the character strings obtained by extending a predetermined number of words around the word position at which the anchor appears in each document. To avoid duplicate inspections, however, if the extension has already covered another common anchor, the extension range needs to be adjusted so that the inspection of that anchor can be skipped when its turn comes.

For example, if the anchor under comparison is '가쁜' ('breathless'), it is extended by a predetermined number of words before and after it, for example three words. The resulting portions to be compared are:

Original: 가슴으로 닥쳐오는 숨 가쁜 벗들의 말발굽 소리 (the breathless sound of friends' hoofbeats rushing into the heart)

Revision: 가슴으로 닥쳐오는 숨 가쁜 자연의 생명의 소리 (the breathless sound of nature's life rushing into the heart)

The preliminary inspection is performed on these portions. If, for example, the similarity of the two sides is obtained by separating the syllables, sorting them, and computing the syllable match ratio, the state after separation and sorting is as follows.

Here, the matched syllables, placed in the same upper and lower cells, are

'가 가 는 닥 로 리 쁜 소 숨 슴 오 으 의 쳐'

14 syllables in total. The match ratio relative to the original is therefore 14 / 19 ≈ 0.737, or about 74%, and the match ratio relative to the revision is likewise 14 / 19 ≈ 0.737, or about 74%.

In this example the two syllable match ratios happen to be equal, but this is not always the case: the portions being compared usually contain different numbers of syllables, so the denominators differ.

Finally, the match ratio of the compared portions in the preliminary inspection must be determined. If, for example, the average is used,

final match ratio = (original match ratio + revision match ratio) / 2 ≈ 0.737, or about 74%.

This value is the similarity of the compared portions.
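The preliminary inspection of this example can be reproduced with a multiset intersection of syllables. This is a minimal sketch under stated assumptions: the helper names `window` and `preliminary_similarity` are hypothetical, each Hangul syllable is assumed to be a single precomposed (NFC) character, and the averaging of the two ratios follows the example above rather than a fixed rule.

```python
from collections import Counter


def window(words, position, radius):
    """Words from `radius` before to `radius` after the anchor position."""
    start = max(0, position - radius)
    return words[start:position + radius + 1]


def preliminary_similarity(portion_a, portion_b):
    """Return (matched syllables, ratio vs. A, ratio vs. B, average ratio)."""
    syllables_a = Counter("".join(portion_a))
    syllables_b = Counter("".join(portion_b))
    # Counter & Counter keeps the minimum count of each shared syllable
    matched = sum((syllables_a & syllables_b).values())
    ratio_a = matched / sum(syllables_a.values())
    ratio_b = matched / sum(syllables_b.values())
    return matched, ratio_a, ratio_b, (ratio_a + ratio_b) / 2


original = "가슴으로 닥쳐오는 숨 가쁜 벗들의 말발굽 소리".split()
revision = "가슴으로 닥쳐오는 숨 가쁜 자연의 생명의 소리".split()
matched, ratio_a, ratio_b, final = preliminary_similarity(original, revision)
```

For the two portions around '가쁜', the sketch finds 14 matched syllables out of 19 on each side, so both ratios and their average equal 14 / 19, in agreement with the figures in the text.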

If the threshold for passing the preliminary inspection is set to, for example, "15% or more", the compared portions become subject to the in-depth inspection.

<In-depth Inspection>

The following describes the case in which the similarity obtained for the anchor '가쁜' in the preliminary inspection is at or above the threshold, so that the anchor becomes subject to the in-depth inspection. In the in-depth inspection, the window around the anchor is extended to more words (for example, 20 to 50 words) than in the preliminary inspection (three words in the example above). To keep the example simple, an extension of only five words is used here. Extending five words before and after the anchor gives:

Original: 당신의 고요한 가슴으로 닥쳐오는 숨 가쁜 벗들의 말발굽 소리 누가 내게 (the breathless sound of friends' hoofbeats rushing into your calm heart; who ... to me)

Revision: 당신의 고요한 가슴으로 닥쳐오는 숨 가쁜 자연의 생명의 소리 누가 내게 (the breathless sound of nature's life rushing into your calm heart; who ... to me)

The in-depth inspection is performed on these portions. Here, for example, the syllables of each word are compared, a weight is added according to the position of each match, and local alignment is used to detect the region with the highest similarity. The scoring matrix for this is as follows. In this case, the entire sentence region is detected.

The scoring matrix holds cumulative values that accumulate the similarity of each region from the upper left to the lower right, and the similarity values of each region calculated by local alignment are as follows.

The sum of the per-word similarity scores of the revision relative to the original (the score row) is the absolute similarity, here 58.694. If the revision were identical to the original, the similarity computed in the same way (the perfect-match score row) would be 78.954, which is the absolute similarity for a perfect match.

The relative similarity of the revision with respect to the original is therefore

absolute similarity / absolute similarity for a perfect match = 58.694 / 78.954 = 74.3%.
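The in-depth inspection can be illustrated with a word-level local alignment (Smith-Waterman style) whose match score is a position-weighted syllable comparison. This is only a sketch: the weights (1/(i+1) for a match at syllable position i), the mismatch score, and the single gap penalty are all assumptions chosen for illustration, since the scoring values used in the patent are not reproduced in the text. It therefore will not reproduce the figures 58.694 and 78.954, and unlike the patent it does not weight insertions and deletions differently.

```python
def word_score(word_a, word_b, mismatch=-1.0):
    """Position-weighted syllable agreement of two words (hypothetical weights)."""
    score = sum(1.0 / (i + 1)
                for i, (x, y) in enumerate(zip(word_a, word_b)) if x == y)
    return score if score > 0 else mismatch


def local_alignment_score(words_a, words_b, gap=-0.5):
    """Best local alignment score between two word sequences."""
    rows, cols = len(words_a) + 1, len(words_b) + 1
    h = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(
                0.0,  # start a new local region
                h[i - 1][j - 1] + word_score(words_a[i - 1], words_b[j - 1]),
                h[i - 1][j] + gap,  # word missing from words_b
                h[i][j - 1] + gap,  # word missing from words_a
            )
            best = max(best, h[i][j])
    return best


def relative_similarity(reference, candidate):
    """Absolute similarity divided by the score of a perfect match."""
    absolute = local_alignment_score(reference, candidate)
    perfect = local_alignment_score(reference, reference)
    return absolute / perfect


original = "당신의 고요한 가슴으로 닥쳐오는 숨 가쁜 벗들의 말발굽 소리 누가 내게".split()
revision = "당신의 고요한 가슴으로 닥쳐오는 숨 가쁜 자연의 생명의 소리 누가 내게".split()
relative = relative_similarity(original, revision)
```

Dividing by the score of the reference aligned with itself mirrors the normalization in the text, so an identical copy scores exactly 1 and the revised window scores strictly less.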

The relative similarity is merely an indicator of how similar the compared portions are within the whole document. It can be used to decide whether the absolute similarity of those portions should be accumulated into the similarity used to judge whether the document as a whole is plagiarized.

The measure for judging whether the document as a whole is actually plagiarized, on the other hand, is the absolute similarity. An act of plagiarism covering at least a certain region can be regarded as plagiarism regardless of the size of that region.

If an absolute similarity of 40 or more is regarded as plagiarism, the compared portions in the example above constitute plagiarism.
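The two roles of the similarity values, relative similarity as a gate and absolute similarity as the accumulated measure, can be sketched as follows. The class and function names are hypothetical, the relative threshold of 0.5 is an assumed value for illustration only, and the absolute threshold of 40 is the example value given above. The second region in the usage lines is invented solely to show a portion being filtered out.

```python
from dataclasses import dataclass


@dataclass
class RegionScore:
    absolute: float  # absolute similarity of the compared portion
    perfect: float   # absolute similarity if the portion matched perfectly

    @property
    def relative(self):
        return self.absolute / self.perfect


def accumulate(regions, relative_threshold=0.5):
    """Sum the absolute similarity of regions whose relative similarity passes."""
    return sum(r.absolute for r in regions if r.relative >= relative_threshold)


def is_plagiarism(total_absolute, absolute_threshold=40.0):
    """Judge plagiarism from the accumulated absolute similarity."""
    return total_absolute >= absolute_threshold


example = RegionScore(absolute=58.694, perfect=78.954)
weak = RegionScore(absolute=3.0, perfect=60.0)  # illustrative low-similarity region
total = accumulate([example, weak])
```

With the values from the text, the example region has a relative similarity of about 0.743, passes the gate, and alone contributes 58.694, which exceeds the example threshold of 40.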

<Experimental Example 3>

To illustrate the calculation of a plagiarism path, consider the following six documents. Each was produced by deliberately plagiarizing, in arbitrary ways, only the document immediately before it in the sequence.

<Original>

이르면 2008년부터 부인이 출산하면 배우자가 3일 동안 출산휴가를 갈 수 있게 된다. 또 근로자들이 육아기 동안 평상시보다 근로시간을 단축해 근무할 수 있도록 하는 근로시간 단축제도 도입된다. As early as 2008, if her wife gives birth, her spouse can go on maternity leave for three days. In addition, a reduction in working hours is introduced to allow workers to work shorter than usual during parenting.

노동부 관계자는 6일 "저출산ㆍ고령화 시대에 대비하고 일과 가정이 병행할 수 있는 사회 분위기 조성을 위해 이런 내용을 골자로 한 남녀고용평등법 개정안을 다음주께 입법예고하고 2008년부터 시행할 예정"이라고 밝혔다. 노동부는 부인이 출산하면 그 배우자가 정규 휴가와는 별도로 3일간 무급으로 출산휴가를 갈 수 있도록 할 계획이다. An official from the Ministry of Labor announced on the 6th that, in preparation for the era of low birth rate and aging, and to create a social atmosphere where work and family can be combined, the revised Equal Employment Act of the Gender Equality Act will be enacted next week and will be implemented from 2008. The Ministry of Labor plans to allow the spouse to take maternity leave for three days unpaid, in addition to the regular leave when the wife gives birth.

<Original -> Revision 1>

이르면 2008년부터 부인이 출산하면 배우자가 3일 동안 출산휴가를 쓸 수 있게 되는 법안이 시행된다. 또 근로자들이 육아기 동안 평상시보다 근로시간을 단축해 근무할 수 있도록 하는근로시간 단축제도 함께 도입된다. As early as 2008, a law is enacted that will allow a spouse to spend three days of maternity leave if the wife gives birth. It also introduces a system to reduce working hours, allowing workers to work shorter than usual during parenting.

노동부 관계자는 6일 "저출산, 고령화 시대를 맞이하여 일과 가사를 병행할 수 있도록 하는 사회 분위기 조성을 위해 이런 내용을 골자로 한 남녀고용평등법 개정안을 다음 주 내에 입법예고하고 2008년부터 시행할 예정"이라고 밝혔다. 노동부는 부인이 출산하면 남편이 정규 휴가와는 별도로 3일간 무급으로 출산휴가를 갈 수 있도록 할 계획이다.An official from the Ministry of Labor said on the 6th, "The Equal Employment Equality Act Amendment will be enacted next week and will be implemented from 2008 to create a social atmosphere that enables both work and housework in the face of low birth rates and aging population." Said. The Ministry of Labor plans to allow the husband to go on maternity leave for three days unpaid, apart from the regular leave.

<Revision 1 -> Revision 2>

빠르면 2008년부터 부인이 출산하면 배우자가 3일 동안 출산휴가를 신청할 수 있게 되는 법안이 시행될 예정이다. 또 근로자들이 자녀의 육아기 동안 평상시보다 근로시간을 줄일 수 있도록 하는 근로시간 단축제도 함께 도입된다. As early as 2008, a bill will be implemented that will allow a spouse to apply for maternity leave for three days if the wife gives birth. It also introduces a reduction in working hours, which allows workers to cut working hours more than usual during the child's parenting period.

노동부 관계자는 6일 "저출산, 고령화 시대를 맞이하여 일과가사를 병행할 수 있도록 하는 사회 분위기를 조성하기 위해 이런 내용을 바탕으로 한 남녀고용평등법 개정안을 다음 주 이내에 입법예고하고 2008년부터 시행하도록 할 예정"이라고 밝혔다. 노동부는 부인이 출산하게 되면 남편이 정규 휴가와는 무관하게 3일간 무급휴가를 갈 수 있도록 할 계획이다.On the 6th, an official from the Ministry of Labor said, “In order to create a social atmosphere that enables both families to work together in the era of low birth rates and aging, the revision of the Equal Employment Act, which is based on this information, will be enacted within the next week. " When the wife gives birth, the ministry plans to allow the husband to go on unpaid leave for three days, regardless of the regular leave.

<Revision 2 -> Revision 3>

빠르면 2008년부터 부인이 출산했을 시 배우자가 3일동안 출산휴가를 갈 수 있도록 하는 법안이 시행될 예정이다. 또 근로자들이 자녀의 육아기 동안 평상시보다 업무시간을 줄일 수 있도록 하는 근로시간 단축제도 함께 시행된다. As early as 2008, legislation will be implemented to allow a spouse to take maternity leave for three days if the wife is born. It is also implemented with a reduction of working hours, which allows workers to reduce their working hours more than usual during the childcare period.

노동부 관계자는 6일 "저출산, 고령화 시대에 맞춰 일과 가사를 병행할 수 있도록 하는 사회 분위기 조성을 위해 이와 같은 내용을 바탕으로 한 남녀고용평등법 개정안을 다음 주 내에 입법예고토록 하고 2008년부터 시행하도록 할 예정"이라고 밝혔다. 노동부는 부인이 출산하게 되면 남편이 정규 휴가 이외에 3일간의 무급휴가를 갈 수 있도록 할 계획이다.On the 6th, an official from the Ministry of Labor said, “In order to create a social atmosphere that enables both work and housework to meet both the low birth rate and the aging age, we plan to make legislative notices in the next week and implement it from 2008. " The Ministry of Labor plans to allow the husband to take three days of unpaid leave in addition to the regular leave when the wife gives birth.

<Revision 3 -> Revision 4>

빠르면 2008년부터 부인이 출산시 배우자가 3일간의 출산휴가를 갈 수 있도록 하는 법안이 시행될 예정이다. 또 근로자들이 자녀의 육아기 동안 평상시보다 업무시간을 줄일 수 있게끔 하는 근로시간단축제도 함께 시행된다. As early as 2008, legislation will be implemented to allow a wife to have three days of maternity leave when she gives birth. It is also implemented with a working time reduction system that allows workers to reduce their working hours more than usual during the childcare period.

노동부의 한 관계자는 6일 "저출산, 고령화 시대에 맞게끔 일과 가사를 병행할 수 있도록 사회 분위기를 조성하기 위해 이와 같은 내용을 골자로 한 남녀고용평등법 개정안을 다음 주 내에 국회에 통과시키기로 하고 2008년부터 시행하도록 할 예정"이라고 밝혔다. 노동부는 아내가 출산하게 되면 남편이 정규 휴가 이외에 3일간의 무급휴가를 갈 수 있도록 할 계획이다.An official from the Ministry of Labor said on the 6th of 2008 that, in order to create a social atmosphere for work and housework in line with the age of low birth rates and aging population, a draft amendment of the Gender Equality Employment Act will be passed to the National Assembly within next week. Will be implemented from now on. " The Ministry of Labor plans to allow three days of unpaid leave in addition to regular leave when the wife gives birth.

<Revision 4 -> Revision 5>

이르면 2008년부터 부인이 출산시 그 배우자가 3일간 출산휴가를 갈 수 있도록 하는 법안이 시행될 계획이다. 또 근로자들이 자녀의 육아기 동안 평상시보다 업무시간을 줄일 수 있도록 하는 근로시간단축제도 함께 의무화할 예정이다. As early as 2008, a bill is enacted that will allow a spouse to take three days of maternity leave when a woman gives birth. In addition, workers will be required to reduce working hours during their child-rearing periods.

노동부의 한 관계자는 6일 "저출산, 고령화 시대에 맞춰 일과 가사를 병행할 수 있도록 사회분위기를 조성하기 위해 이와 같은 내용을 골자로 한 남녀고용평등법 개정안을 다음 주 내에 국회에 통과시키기로 하고 2008년부터 시행하도록 할 예정"이라고 밝혔다. 노동부는 아내가 출산하게 되면 남편이 정규 휴가 이외에 3일간의 무급휴가를 보장받을 수 있도록 할 계획이다.An official from the Ministry of Labor said on the 6th that, “In order to create a social atmosphere for work and households in line with the low birth rate and aging age, we will pass the amendment to the Equal Employment Act of the Gender Equity Act next week to the National Assembly. Will be implemented. " When the wife gives birth, the Ministry plans to guarantee that the husband can receive three days of unpaid leave in addition to the regular leave.

The relative similarity values obtained by running the plagiarism search on the six documents, from the original through Revision 5, are as follows.

Looking at these relative similarities, the relative similarity of Revision 1 with respect to the original (→) is 389.830, for example, while that of the original with respect to Revision 1 (←) is 386.830. The two values differ because the weight given to the insertion of a new word by the plagiarist differs from the weight given to the deletion of an existing word. In other words, the similarity calculated in the in-depth inspection of the present invention is an asymmetric similarity.

Comparing the similarities in the two directions reveals the direction of plagiarism. Depending on how the insertion and deletion weights are assigned, plagiarism runs either toward the side with the larger similarity or toward the side with the smaller similarity. In this embodiment, the direction showing the larger of the two similarities is preferably regarded as the direction in which plagiarism took place. Experiments on deliberately plagiarized documents confirm that this result is always obtained.
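The direction rule of this embodiment reduces to a comparison of the two asymmetric similarities. In the sketch below, the function name and its return convention (a `(source, copy)` tuple, or `None` on a tie) are hypothetical; the numeric values are the ones quoted above for the original and Revision 1.

```python
def plagiarism_direction(doc_a, doc_b, sim_b_given_a, sim_a_given_b):
    """Return (source, copy) under the larger-similarity rule, or None on a tie.

    sim_b_given_a: similarity of doc_b measured with doc_a as the reference (a -> b)
    sim_a_given_b: similarity of doc_a measured with doc_b as the reference (b -> a)
    """
    if sim_b_given_a > sim_a_given_b:
        return (doc_a, doc_b)
    if sim_a_given_b > sim_b_given_a:
        return (doc_b, doc_a)
    return None  # equal similarities give no direction


direction = plagiarism_direction("original", "revision 1", 389.830, 386.830)
```

Because 389.830 exceeds 386.830, the sketch reports the original as the source and Revision 1 as the copy. With a weighting scheme in which plagiarism runs toward the smaller similarity, the comparison would simply be reversed.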

In the example above, the plagiarism was deliberate and proceeded in the order original → Revision 1 → Revision 2 → Revision 3 → Revision 4 → Revision 5. Applying the present invention confirms, as shown above, that the similarity of the right-hand document with respect to the left-hand document in Table 12 is the larger one.

As described above, the present invention reveals the order, or path, of plagiarism, which can therefore be presented visually for easy recognition. FIG. 18 is a graph showing the plagiarism path of Table 12.

In FIG. 18, the numbers 0 to 5 inside the circles of the graph on the right denote the original and Revisions 1 to 5, respectively. The arrows denote the direction in which plagiarism took place, and the numbers beside them (0.321, 0.253, 0.282, 0.225, and 0.156) denote the distance between documents, that is, the degree of non-plagiarism. A short distance (a small number) indicates a high likelihood of plagiarism, while a long distance (a large number) indicates that the documents are not similar. The distance is expressed as a value between 0 and 1.

The distance between documents is inversely proportional to the degree of plagiarism, and the degree of plagiarism can be determined from the document relative similarity, that is, the absolute similarity over the whole document divided by the absolute similarity for a perfect document match.

Looking at the data on the left, the original graph ("Original Graph") has six documents, that is, six vertices in total, and therefore 15 arrows to draw, that is, 15 edges in total, with an average weight of 0.393 over all edges. In this example, however, each document deliberately plagiarized only its immediate predecessor, so the arrows simply connect the documents one after another in a straight line. In the graph that takes this into account (labeled "Restructed Graph" in the figure), the number of vertices is still six, but the number of edges is reduced to five. The average of the weights over these five sequential comparisons is 0.247.
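The edge count and the average weight of the reduced graph can be checked with a few lines of arithmetic. The sketch below assumes one edge per unordered pair of documents, which gives the 15 edges stated for six vertices; the function names are hypothetical, and the five weights are the distances shown beside the arrows in FIG. 18.

```python
def complete_graph_edge_count(vertex_count):
    """Number of document pairs, i.e. edges of the full comparison graph."""
    return vertex_count * (vertex_count - 1) // 2


def mean_weight(weights):
    """Average distance over the given edges."""
    return sum(weights) / len(weights)


# Distances along the chain original -> Revision 1 -> ... -> Revision 5
chain_distances = [0.321, 0.253, 0.282, 0.225, 0.156]

total_edges = complete_graph_edge_count(6)
chain_average = mean_weight(chain_distances)
```

Six documents give 6 × 5 / 2 = 15 pairs, and the five chain distances sum to 1.237, so their mean is about 0.247, matching the values reported for the two graphs.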

<Experimental Example 4>

Of a total of 151 news articles returned by the Internet search engine Google for the keywords "피겨 김연아" ("figure skating Kim Yu-na"), only 21 documents were extracted to create example data for plagiarism inspection.

##### Nocut News 1 #####

'피겨 요정' 김연아(17. 군포수리고)가 자신의 첫 국내 공연이 예정됐던 서울 목동 아이스링크의 화재 진화 과정을 지켜보는 장면이 영상으로 포착됐다.The video shows that Kim Yu-na (17. Gunpo Surigo), the figure fairy, watches the fire evolution of the Mokdong Ice Rink in Seoul, where her first domestic performance was scheduled.

이날 화재는 발생 20여분만에 진화됐지만 이날 오후 열릴 예정인 '현대카드 슈퍼 매치 2007 슈퍼 스타즈 온 아이스(이하 슈퍼매치)'의 일정은 모두 취소됐다.The fire was extinguished in about 20 minutes, but the schedule of 'Hyundai Card Super Match 2007 Super Stars on Ice' (Super Match) scheduled to open this afternoon was canceled.

화재 당시 아이스링크에는 슈퍼매치를 위해 출연자들이 리허설을 위해 현장에 있었으며, 100여명의 초등학생들이 빙상 특별활동 수업을 받고 있었던 것으로 알려졌다.The ice rink at the time of the fire was said to have had performers on site for rematching, and about 100 elementary school students were taking ice skating classes.

김연아는 화재가 진화된 후 소방차들 사이에서 그 모습을 드러냈으며 걱정스러운 표정으로 현장을 지켜본 후 관계자들과 함께 자리를 떴다.Kim appeared after the fire was extinguished among the fire trucks. After watching the scene with a worried expression, she left with the officials.

한편 김연아, 안도 미키(일본) 등 세계 정상급 피겨 스타들이 총출동하는 슈퍼매치는 이날 오후 7시30분 목동 아이스링크에서의 첫 공연을 시작으로 16일까지 세 차례의 공연이 예정돼 있었다. On the other hand, a super match with world-class figure stars such as Kim Yu-na and Miki Ando (Japan) was scheduled for three performances until the 16th, starting with her first performance at Mok-dong Ice Rink at 7:30 pm on the day.

##### Nocut News 2 #####

목동 아이스링크에 발생한 화재로 '피겨 요정' 김연아(17. 군포수리고)가 공연할 예정이었던 '현대카드 슈퍼매치 2007 슈퍼 스타즈 온 아이스(이하 슈퍼매치)'의 개최가 무산됐다.The fire on the Mokdong Ice Rink has resulted in the failure of the 'Hyundai Card Super Match 2007 Super Stars on Ice' (Super Match), which Kim Yuna (17. Gunpo Surigo) was supposed to perform.

슈퍼매치 주최사인 현대카드는 "14일부터 3일간 예정된 슈퍼매치 공연을 취소한다"며 "고객과 선수들의 안전을 최우선적으로 고려해 슈퍼매치 취소를 결정했다"고 덧붙였다.Hyundai Card, the super match organizer, said, "We will cancel the super match performance scheduled for three days from the 14th." "We decided to cancel the super match considering the safety of customers and players as the top priority."

현대카드 관계자는 "건물 안전진단 결과, 화재가 지붕 일부에서 발생한 만큼 실내 링크를 비롯한 건물 전체의 구조적 안전에는 이상이 없는 것으로 확인되었다"며 "하지만 무리하게 공연을 강행할 경우 발생할 수 있는 안전 문제를 고려, 주최측으로서 모든 손실을 감수하고 행사를 취소하기로 결정했다"고 말했다.An official from Hyundai Card said, "As a result of building safety diagnosis, it was confirmed that there is no problem in the structural safety of the entire building including the interior link as much as the fire occurred on the part of the roof." As a organizer, I decided to take all the losses and cancel the event. ”

김연아, 안도 미키(일본) 등 세계 정상급 피겨 스타들이 나설 예정이었던 슈퍼매치는 목동 아이스링크에서 이날 오후 7시30분 첫 공연을 시작으로 16일까지 세 차례의 공연을 하기로 예정돼 있었으며, 김연아는 이를 위해 지난 10일 훈련지인 캐나다 토론토에서 입국했다.Supermatch, which was supposed to be released by world-class figure stars such as Kim Yu-na and Miki Ando (Japan), was scheduled to perform three times by the 16th at the Mokdong Ice Rink, starting at 7:30 pm on the day. To this end, I arrived in Toronto, Canada, on the 10th.

화재는 이날 오전 11시53분, 아이스링크 경기장 지붕에서 시작됐으며 긴급 출동한 소방대에 의해 20여분만에 진화됐다. 이날 불은 방수용 모르타르 작업을 하던 경기장 지붕에서 시작돼 3천㎡ 넓이의 지붕을 절반 이상 태웠으나 경기장 내부로는 번지지 않았다. The fire started at 11:53 am on the roof of the ice rink arena and was extinguished in about 20 minutes by a fire brigade dispatched to the scene. The fire started on the arena roof, where waterproofing mortar work was under way, and burned more than half of the 3,000 m² roof, but did not spread inside the arena.

당시 경기장 안에는 이날 저녁 열릴 예정이었던 슈퍼매치 공연을 준비하던 선수들과 주최측 관계자들을 비롯해 특강 중이었던 초등학생 100여명 등 180여명이 있었으나 모두 안전하게 대피해 다행히 인명피해는 없었다. At the time, about 180 people were inside the arena, including skaters and organizer staff preparing for the Super Match performance scheduled for that evening and about 100 elementary school students attending a special class, but all evacuated safely and fortunately there were no casualties.

특히 오후 1시로 예정된 슈퍼매치 리허설을 위해 목동 아이스링크에 도착한 김연아는 현장에서 화재를 목격한 뒤 곧바로 차를 돌려 숙소인 서울 소공동 롯데호텔로 돌아갔다.Kim Yu-na, who arrived at the Mokdong Ice Rink for the super match rehearsal scheduled for 1:00 pm, saw a fire at the scene and immediately returned to the Lotte Hotel, Sogong-dong, Seoul.

김연아의 에이전트사인 IB스포츠는 "김연아가 리허설을 위해 목동 아이스링크에 도착했을 당시, 불길을 치솟았고 이를 보고는 곧바로 호텔로 돌아가 공연 개최 여부가 결정되기를 기다리고 있었다"고 말했다.Kim Yu-na's agent, IB Sports, said, "When Kim arrived at Mok-dong ice rink for rehearsal, the flames soared, and she immediately returned to the hotel and waited for the performance to be decided."

한편 김연아는 공연 직후 기자회견을 갖고 "이번 아이스쇼를 위해 준비를 많이 했다. 스태프들도 고생을 많이 했는데 너무 아쉽고, 공연을 위해 한국을 찾은 다른 선수들에게도 미안한 마음이다"며 실망감을 표했다.On the other hand, Kim said at a press conference immediately after the performance, "I prepared a lot for this ice show. The staff also struggled a lot, and I am very sorry and sorry for the other players who came to Korea for the performance."

##### 노컷뉴스3 ########## No Cut News 3 #####

'피겨 요정' 김연아(17. 군포수리고)의 국내 첫 공연이 예정된 서울 목동 아이스링크에 화재가 발생했다. 화재는 발생 20여분만에 진화됐지만 이날 오후 열릴 예정인 '현대카드 슈퍼 매치 2007 슈퍼 스타즈 온 아이스(이하 슈퍼매치)'의 정상적인 개최 여부는 아직 미지수다. A fire broke out at the Mokdong Ice Rink in Seoul, where 'figure fairy' Kim Yu-na (17, Gunpo Suri High School) was scheduled to give her first performance in Korea. The fire was extinguished about 20 minutes after it started, but it is still unknown whether 'Hyundai Card Super Match 2007 Super Stars on Ice' (hereinafter Super Match), scheduled for this afternoon, will go ahead as planned.

이날 불은 오전 11시53분 아이스링크 지붕에서 일어났고 소방차 16대와 진화인원 50여명이 출동해 20여분만에 진화했다. 현장 목격자에 따르면 "당시 아이스링크 지붕에서 인부 4명이 공연과 무관하게 페인트칠 작업을 하고 있었고, 지붕에서부터 불길이 치솟았다"고 말했다. 그러나 다행히 불길은 지붕에서 더 이상 번지지 않았고, 공사 인부들을 비롯한 인명 피해도 없었다. The fire broke out on the roof of the ice rink at 11:53 am, and 16 fire trucks and about 50 firefighters were dispatched and put it out in about 20 minutes. An on-site witness said, "At the time, four workers were painting on the roof of the ice rink, work unrelated to the performance, and the flames shot up from the roof." Fortunately, the flames did not spread beyond the roof, and there were no casualties among the construction workers or anyone else.

이날 아이스링크에는 슈퍼매치를 위해 김연아를 비롯한 출연자들이 리허설을 위해 현장에 있었으며, 100여명의 초등학생들이 빙상 특별활동 수업을 받고 있었던 것으로 알려졌다.On the ice rink, Kim Yu-na and other performers were on site to rehearse for the super match, and about 100 elementary school students were taking ice skating classes.

슈퍼매치 주관사인 세마스포츠마케팅의 한 관계자는 "오후 1시부터 리허설이 예정되어 있어 출연자들이 모두 도착해있는 상태였다"면서 "그러나 실내에서는 화재가 난 사실도 잘 몰랐고, 화재 당시 모두 밖으로 안전하게 빠져 나갔다. 출연자들은 현재 목동 주변에서 상황을 주시하고 있다"고 말했다. An official from Sema Sports Marketing, which is organizing Super Match, said, "A rehearsal was scheduled from 1:00 pm, so all the performers had already arrived," and added, "However, those inside were hardly aware that a fire had broken out, and everyone got out safely at the time. The performers are currently watching the situation from around Mok-dong."

한편 이날 오후 7시30분으로 예정된 슈퍼매치 행사의 취소 여부는 오후 1시 현재 결정되지 않았다. 세마 측은 "실내 링크 쪽에는 피해가 없었던 만큼 대회 개최 여부는 긍정적인 상황"이라면서도 "전기 등 모든 시설들을 점검한 뒤 대회 진행 여부를 결정할 것"이라고 밝혔다. Meanwhile, whether to cancel the Super Match event scheduled for 7:30 pm that day had not been decided as of 1:00 pm. Sema said, "Since there was no damage to the indoor rink, the outlook for holding the event is positive," but added, "We will decide whether to proceed after checking all facilities, including the electrical systems."

김연아, 안도 미키(일본) 등 세계 정상급 피겨 스타들이 총출동하는 슈퍼매치는 이날 오후 7시30분 목동 아이스링크에서의 첫 공연을 시작으로 16일까지 세 차례의 공연이 예정돼 있다.The supermatch, which is attended by world-class figure stars such as Kim Yu-na and Miki Ando (Japan), is scheduled for three performances by the 16th, starting with the first performance at Mok-dong Ice Rink at 7:30 pm on the day.

##### 데일리안 ########## Dailian #####

김연아의 올해 첫 국내 무대로 관심을 모았던 '현대카드 슈퍼매치 V-슈퍼스타스 온 아이스' 공연이 14일 오전 발생한 화재로 취소됐다. The 'Hyundai Card Super Match V-Superstars on Ice' show, which had drawn attention as Kim Yu-na's first appearance in Korea this year, was canceled because of a fire that broke out on the morning of the 14th.

14일부터 16일까지 열릴 예정이었던 목동아이스링크의 갑작스런 화재로 3일간의 모든 일정 전일이 취소된 것. A sudden fire at the Mokdong Ice Rink, where the show was to run from the 14th to the 16th, led to the cancellation of the entire three-day schedule.

공연 첫날로 예정됐던 이날 오전 9시 시작된 목동아이스링크 지붕 우레탄 보수 공사가 마무리된 직후인 오전 11시53분 발생된 이날 화재는 건물 지붕 일부를 태운 후 20여분 후에 진압됐다. The fire broke out at 11:53 am on what was to be the first day of the show, shortly after urethane repair work on the Mokdong Ice Rink roof, begun at 9 am, had been completed; it burned part of the building's roof and was put out about 20 minutes later.

이번 행사를 주최한 현대카드 측은 "화재가 완전히 진압된 후 실시된 안전점검 결과 행사를 진행하는 데 문제가 없다는 판정이 나왔다. 그러나 관객의 안전을 최우선으로 생각해 1%의 위험성이라도 있는 상황에서 행사를 강행할 수는 없다는 판단으로 이번 행사를 취소하게 됐다"고 밝혔다. Hyundai Card, which hosted the event, said, "The safety inspection conducted after the fire was completely put out found no problem with holding the event. However, putting the safety of the audience first, we judged that we could not push ahead with the event while even a 1% risk remained, and so we decided to cancel it."

이날 화재 진압 직후에 실시된 안전점검 결과, 행사를 진행하는 데 문제가 없다는 판정이 나왔다. 안전점검을 실시한 한국건설안전기술원의 오강호 박사는 "구조적으로 큰 문제가 없어, 행사를 진행하는 데에는 무리가 없다"고 진단했다. The safety inspection conducted immediately after the fire was put out found no problem with holding the event. Dr. Oh Kang-ho of the Korea Institute of Construction Safety and Technology, who conducted the inspection, concluded, "There is no major structural problem, so there is no difficulty in holding the event."

그러나 현대카드 측은 자체 회의를 통해 "주말께 태풍을 동반한 비가 예고돼있는데다, 세계적인 선수들과 관객들의 안전을 볼모로 행사를 치를 수는 없다는 결론을 내렸다"고 말했다. However, Hyundai Card said that, after an internal meeting, "we concluded that, with rain accompanying a typhoon forecast for the weekend, we could not hold the event at the expense of the safety of world-class skaters and the audience."

이번 대회를 주관한 세마스포츠마케팅 관계자는 "예기치 못한 사고로 국내 최고의 피겨스케이팅 축제가 취소된 것에 대해 아쉬움을 금할 수 없다"며 "티켓을 예매한 모든 팬들에게 책임지고 100% 환불할 것"이라고 말했다. An official from Sema Sports Marketing, which organized the event, said, "We deeply regret that Korea's premier figure skating festival was canceled because of an unexpected accident," adding, "We will take responsibility and give a 100% refund to all fans who bought tickets."

현대카드 측은 김연아 등을 비롯한 세계적인 피겨스타들의 공연을 보지 못하게 된 팬들의 아쉬움을 달래기 위해 16일 오후 3시 롯데월드 아이스링크에서 '김연아 시범공연(무료)'을 선보인다고 밝혔다. Hyundai Card announced that it will present a 'Kim Yu-na demonstration performance (free)' at the Lotte World Ice Rink at 3:00 pm on the 16th to console fans who will no longer be able to see Kim Yu-na and other world-famous figure skating stars perform.

한편, 김연아는 18일 국정홍보처 국가홍보대사 위촉식에 참석한 뒤 예정대로 20일 캐나다로 출국한다. Meanwhile, Kim Yu-na will attend the Government Information Agency's ceremony appointing her a national public relations ambassador on the 18th and then leave for Canada on the 20th as scheduled.

##### 동아일보1 ########## Dong-A Ilbo 1 #####

피겨 요정' 김연아(17·군포 수리고)를 비롯해 세계 정상급 피겨 선수들이 출전하는 '현대카드 슈퍼매치 V-07 슈퍼스타스 온 아이스' 공연이 열릴 예정이었던 목동 아이스링크 경기장에서 불이 나 시민들이 급히 대피하는 소동이 벌어졌다. A fire broke out at the Mokdong Ice Rink, where the 'Hyundai Card Super Match V-07 Superstars on Ice' show featuring 'figure fairy' Kim Yu-na (17, Gunpo Suri High School) and other world-class figure skaters was to be held, forcing people to evacuate in a hurry.

14일 오전 11시 50분경 서울 양천구 목동 아이스링크 경기장 지붕에서 불이 나 3000m² 넓이의 지붕 절반을 태우고 24분 만에 꺼졌다. At around 11:50 am on the 14th, a fire broke out on the roof of the Mok-dong Ice Rink in Yangcheon-gu, Seoul, burning half of the 3,000 m² roof before it was put out 24 minutes later.

당시 경기장에는 아이스링크장을 찾은 초등학생과 시민 등 200여 명이 있었다. 하지만 불길이 경기장 내부로 번지지 않아 다행히 인명피해는 없었다. At the time, there were about 200 elementary school students and citizens who visited the ice rink. However, because the flame did not spread inside the stadium, there were no casualties.

소방대 관계자는 "옥상의 방수 시설을 보수하는 작업 도중에 불꽃이 튀어 불이 난 것으로 보고 있다"며 "정확한 화재 원인과 피해 규모를 조사 중"이라고 말했다. An official from the fire brigade said, "We believe sparks flew during work to repair the rooftop waterproofing and started the fire," adding, "We are investigating the exact cause of the fire and the extent of the damage."

화재로 인해 14일 오후 7시 반부터 열릴 예정이던 아이스쇼 공연은 취소됐다. 이벤트를 주최한 현대카드 측은 "실내아이스링크 내부 시설에는 문제가 없는 것으로 파악됐지만 혹시 생길지 모를 안전사고에 대비해 이번 공연을 취소하기로 결정했다"고 밝혔다. 현대카드는 이에 따라 이날 오후 7시 반 첫 공연을 시작으로 15일과 16일까지 3일간 예정됐던 공연 일정을 모두 취소하고 예매된 입장권에 대해 환불 조치를 하기로 했다. The ice show scheduled to begin at 7:30 pm on the 14th was canceled because of the fire. Hyundai Card, which hosted the event, said, "We found no problem with the facilities inside the indoor ice rink, but we decided to cancel the show as a precaution against any possible safety accident." Hyundai Card accordingly decided to cancel the entire three-day run, which was to start with the first performance at 7:30 pm that day and continue on the 15th and 16th, and to refund reserved tickets.

현대카드 측은 김연아가 일요일인 16일 오후 3시 서울 송파구 롯데월드 아이스링크에서 이번 아이스쇼 공연을 위해 준비했던 2가지 프로그램 중 '원스 어폰 어 드림(Once Upon A Dream)' 시범 공연을 한다고 밝혔다. 관람은 무료. Hyundai Card announced that at 3 pm on Sunday the 16th, at the Lotte World Ice Rink in Songpa-gu, Seoul, Kim Yu-na will give a demonstration performance of 'Once Upon A Dream', one of the two programs she had prepared for this ice show. Admission is free.

##### 동아일보2 ########## Dong-A Ilbo 2 #####

김연아(17) 등 세계적인 빙상스타들이 대거 출전하는 꿈의 공연을 볼 수 없게 됐다. It is no longer possible to see the dream performance of world-famous ice star such as Kim Yu-na (17).

14일부터 16일까지 사흘간 서울 목동아이스링크에서 열릴 예정이었던 '현대카드 슈퍼매치 V-슈퍼스타스 온 아이스'가 갑작스러운 화재로 취소됐다. The Hyundai Card Supermatch V-Superstars on Ice, which was scheduled to be held at Mokdong Ice Rink for three days from 14 to 16, was canceled due to a sudden fire.

행사를 주최한 현대카드는 "화재가 진압된 후 실시된 안전점검 결과 행사를 진행하는데 문제가 없다는 판정이 나왔다. 하지만 관객과 선수들의 안전을 위해 1%의 위험성이라도 있는 상황에서 행사를 강행할 수는 없다는 판단으로 이번 행사를 취소하게 됐다"고 밝혔다. Hyundai Card, which hosted the event, said, "The safety inspection conducted after the fire was put out found no problem with holding the event. However, for the safety of the audience and the skaters, we judged that we could not push ahead with the event while even a 1% risk remained, and so we decided to cancel it."

김연아, 안도 미키, 스티븐 랑비 등 세계최고의 선수들은 공연을 위해 모든 준비를 마친 상태. 하지만 당일 발생한 화재로 한국팬들앞에서의 공연은 다음 기회로 미뤄지게 됐다. The world's best skaters, including Kim Yu-na, Miki Ando and Steven Langby, had completed all their preparations for the show. However, because of the fire that broke out on the day, their performance before Korean fans has been put off until another occasion.

이번 대회를 주관한 세마스포츠마케팅 관계자는 "예기치 못한 사고로 국내 최고의 피겨스케이팅 축제가 취소된 것에 대해 아쉬움을 금할 수 없다"며 "티켓을 예매한 모든 팬들에게 책임지고 100% 환불을 해드리겠다"고 말했다. An official from Sema Sports Marketing, which organized the event, said, "We deeply regret that Korea's premier figure skating festival was canceled because of an unexpected accident," adding, "We will take responsibility and give a 100% refund to all fans who bought tickets."

이날 화재는 목동아이스링크 지붕 우레탄 보수 공사가 마무리된 직후인 오전 11시 53분쯤 발생했으며 건물 지붕 일부를 태운 후 20여분만에 진압됐다. The fire broke out at around 11:53 am, shortly after urethane repair work on the Mokdong Ice Rink roof had been completed, and was put out in about 20 minutes after burning part of the building's roof.

##### 동아일보3 ########## Dong-A Ilbo 3 #####

##### 매일경제 ########## Maeil Economy #####

김연아와 일본의 안도미키 등을 초대해 화제가 됐던 현대카드 슈퍼매치V 행사가 전격 취소됐다. The Hyundai Card Super Match V event, which had drawn attention for inviting Kim Yu-na, Japan's Miki Ando and others, was abruptly canceled.

현대카드는 당초 14일부터 3일간 예정됐던 '슈퍼매치 브이-수퍼스타즈 온 아이스(V- Superstars on Ice)' 공연을 취소한다고 14일 밝혔다. Hyundai Card announced on the 14th that it will cancel the 'V- Superstars on Ice' performance that was scheduled for three days from the 14th.

현대카드는 "14일 오전 11시 50분 경 목동 아이스링크 지붕에서 화재사고가 발생해 고객과 선수들의 안전이 우려돼 공연취소를 결정했다"고 설명했다. Hyundai Card explained, "A fire broke out on the roof of the Mok-dong Ice Rink at around 11:50 am on the 14th, and out of concern for the safety of customers and skaters we decided to cancel the performance."

현대카드는 안전진단을 실시한 결과 건물 전체 안전에는 이상이 없는 것으로 확인됐지만 혹시 있을 관람객의 안전 문제를 고려해 행사를 취소키로 했다고 설명했다. Hyundai Card explained that although a safety inspection confirmed there was no problem with the safety of the building as a whole, it decided to cancel the event in consideration of possible safety issues for spectators.

현대카드는 티켓을 구매한 모든 고객을 대상으로 환불절차를 진행하겠다고 밝혔다. 현대카드는 홈페이지와 전화, 현장안내 등을 통해 고객들에게 대회 취소 사실을 고지하고 있다.Hyundai Card said it will proceed with the refund process to all customers who have purchased tickets. Hyundai Card notifies customers of cancellations through its homepage, telephone, and site guidance.

##### 서울경제 ########## Seoul Economy #####

'피겨요정' 김연아(17ㆍ군포 수리고2)를 비롯해 세계 정상급 피겨선수들이 출전할 예정이던 '현대카드 슈퍼매치Ⅴ-07 슈퍼스타스 온 아이스'가 14일 오전 발생한 서울 목동아이스링크 지붕 화재 사고로 인해 취소됐다. 'Hyundai Card Super Match V-07 Superstars on Ice', in which 'figure fairy' Kim Yu-na (17, second year at Gunpo Suri High School) and other world-class figure skaters were to appear, was canceled because of a fire that broke out on the roof of the Mokdong Ice Rink in Seoul on the morning of the 14th.

이벤트를 주관하는 현대카드는 "실내아이스링크 내부 시설에는 문제가 없는 것으로 파악됐지만 혹시나 생길지 모를 안전사고를 대비해 이번 공연을 취소하기로 결정했다"고 밝혔다. 현대카드는 이에 따라 이날 오후7시30분 첫 공연을 시작으로 15일과 16일까지 3일간 예정됐던 공연일정을 모두 취소하고 예매된 입장권에 대해 환불조치를 해주기로 했다. Hyundai Card, which is organizing the event, said, "We found no problem with the facilities inside the indoor ice rink, but we decided to cancel the show as a precaution against any possible safety accident." Hyundai Card accordingly decided to cancel the entire three-day run, which was to start with the first performance at 7:30 pm that day and continue on the 15th and 16th, and to refund reserved tickets.

##### 스포츠투데이 ########## Sports Today #####

오늘 오전 서울 양천구 목동 아이스링크에 발생한 화재로 인해, 14~16일 열릴 예정이었던 '현대카드 슈퍼매치 V-슈퍼스타스 온 아이스' 공연이 취소됐습니다. Because of a fire at the Mok-dong Ice Rink in Yangcheon-gu, Seoul this morning, the 'Hyundai Card Super Match V-Superstars on Ice' show, which was scheduled for the 14th through the 16th, has been canceled.

이번 공연은 김연아, 안도 미키 등이 참가할 예정으로 관심을 모았었습니다. The show had drawn attention because Kim Yu-na, Miki Ando and others were to take part.

이번 행사를 주최한 현대카드 측은 화재가 진압된 후 안전점검을 실시한 결과, 행사를 진행하는 데 문제가 없다는 판정이 나왔다고 전했습니다. Hyundai Card, which hosted the event, said that the safety inspection carried out after the fire was put out found no problem with holding the event.

그러나 1%라도 위험성이 있는 상황에서 행사를 강행할 수는 없어 취소하게 됐다고 밝혔습니다. However, it said it decided to cancel because it could not push ahead with the event while even a 1% risk remained.

또한 주말께 태풍을 동반한 비가 예고돼 있는 상황이라 행사를 치를 수는 없다는 결론을 내렸다고 말했습니다. It also said it had concluded that the event could not be held because rain accompanying a typhoon was forecast for the weekend.

공연 첫날로 예정됐던 오전 9시경 목동 아이스링크는 지붕 우레탄 보수공사의 마무리를 지었습니다. At around 9 am on what was to be the first day of the show, the Mokdong Ice Rink finished its roof urethane repair work.

그러나 그 직후, 화재가 발생해 20여분만에 진압됐습니다. Shortly afterward, however, a fire broke out and was put out in about 20 minutes.

##### 이타임즈 ########## The Times #####

'피겨요정' 김연아 선수가 참가하는 아이스쇼가 열릴 예정이었던 목동 아이스링크 경기장에서 공연을 불과 7시간여 앞두고화재가 발생했다. A fire broke out at the Mokdong Ice Rink, where an ice show featuring 'figure fairy' Kim Yu-na was to be held, just over seven hours before the performance.

주최측은 안전 문제를 이유로 사흘간 열릴 예정이었던 행사 전체를 취소하고 입장권을 예매했던 시민들에게는 전액 환불 및 보상조치하겠다는 입장을 밝혔다. The organizers said they would cancel the entire event, which had been scheduled for three days for safety reasons, and offer a full refund and compensation to citizens who reserved tickets.

14일 오전 11시 53분께 서울 양천구 목동 아이스링크 경기장 지붕에서 불이 났으나 출동한 소방대에 의해 24분 만에 진화됐다. At about 11:53 am on the 14th, a fire broke out on the roof of the Mok-dong Ice Rink in Yangcheon-gu, Seoul, but it was extinguished in 24 minutes by a fire brigade dispatched to the scene.

이날 불은 인부들이 방수용 모르타르 작업을 벌이던 경기장 지붕에서 시작돼 3천㎡ 넓이의 지붕 가운데 500여㎡를 태워 500여만 원(소방서 추산)의 재산피해를 냈다. The fire started on the roof of the stadium where workers were working on the waterproof mortar and burned about 500 square meters of roofs of 3,000 square meters in size, causing damage of more than 5 million won (estimated by the fire station).

화재 당시 경기장에는 이날 저녁 열릴 예정이었던 아이스쇼 공연에 앞서 피겨남자 국가대표 이동훈 선수가 리허설을 하고 있었으며 지하 아이스링크 장에서 피겨강습을 받던 초등학생 150여 명 등 모두 270여 명이 건물 내부에 있었으나 긴급 대피해 다행히 인명 피해는 없었다. At the time of the fire, men's national figure skater Lee Dong-hoon was rehearsing in the arena ahead of the ice show scheduled for that evening, and about 270 people in all, including some 150 elementary school students taking figure skating lessons at the basement ice rink, were inside the building, but they evacuated quickly and fortunately there were no casualties.

김연아 선수는 이날 오후 1시로 예정된 공연 리허설에 참가하기 위해 차량을 타고 목동 아이스링크 경기장으로 향하던 중 화재를 목격했으며 아이스링크 현장 근처까지 왔다가 "경기장에 오지 말라"는 주최 측의 연락을 받고 그대로 호텔로 돌아간것으로 확인됐다. Kim Yu-na saw the fire while being driven to the Mok-dong Ice Rink to take part in the rehearsal scheduled for 1 pm that day; it was confirmed that she came close to the site but returned straight to her hotel after the organizer told her, "Do not come to the arena."

이날 경기장 옥상에서 작업을 하던 인부들은 경찰에서 "방수공사를 하기 위해접착제를 발라 놓은 뒤 점심식사를 하러 내려갔다 오니 불이 붙어 있었다"고 진술했다. Workers who had been working on the roof of the arena told police, "We applied adhesive for the waterproofing work and went down for lunch, and when we came back it was on fire."

경찰 관계자는 "작업에 쓰인 접착제는 휘발성과 발화성이 강한 특성을 갖고 있다"며 "이날 낮 햇볕이 강하게 내리 쬔 만큼 자연발화 가능성이 있는지 여부를 면밀히 따지고 있다"고 말했다. A police official said, "The adhesive used in the work is highly volatile and flammable," adding, "Since the sun was beating down strongly at midday, we are closely examining whether spontaneous ignition was possible."

불이 난 목동 아이스링크 경기장에서는 이날 오후 7시 30분부터 김연아 선수 등이 참가하는 '현대카드 슈퍼매치Ⅴ-07 슈퍼스타스 온 아이스' 공연이 열릴 예정이었으나 주최 측인 현대카드는 화재사고에 따른 안전 문제를 이유로 행사를 전면 취소했다. At the Mok-dong Ice Rink where the fire broke out, the 'Hyundai Card Super Match V-07 Superstars on Ice' show featuring Kim Yu-na and others was to be held from 7:30 pm that day, but the organizer, Hyundai Card, canceled the event entirely, citing safety concerns arising from the fire.

현대카드 관계자는 "한국건설안전기술원의 긴급 안전진단 결과 외부 지붕만 불에 타 공연은 가능하다는 결과가 나왔으나 단 1%의 문제도 있어서는 안된다는 판단에 따라 이날부터 사흘간 열릴 예정이었던 공연 전체를 취소키로 했다"고 말했다. An official from Hyundai Card said, "The emergency safety inspection by the Korea Institute of Construction Safety and Technology found that only the outer roof had burned and that the performance could go ahead, but judging that there must not be even a 1% problem, we decided to cancel the entire run, which was to be held over three days starting today."

현대카드 측은 공연 입장권을 예매했던 시민들에게는 전액 환불조치 하고 추후보상방안을 마련할 방침이다. Hyundai Card plans to give a full refund to citizens who reserved tickets for the performance and to make plans for compensation later.

##### 중앙일보1 ########## JoongAng Ilbo 1 #####

피겨 요정' 김연아(17·군포 수리고)의 국내 공연이 무산됐다. The domestic performance of 'figure fairy' Kim Yu-na (17, Gunpo Suri High School) has fallen through.

김연아를 비롯해 세계 최정상급 피겨스타들이 출전하는 '현대카드 슈퍼매치 V-슈퍼스타스 온 아이스' 공연이 14일 오전 목동아이스링크 지붕 화재 사고로 전격 취소됐다. The 'Hyundai Card Super Match V-Superstars on Ice' show, featuring Kim Yu-na and other top-ranked figure skating stars, was abruptly canceled because of a fire on the roof of the Mokdong Ice Rink on the morning of the 14th.

이번 행사를 주최한 현대카드는 "화재가 완전히 진압된 후 실시된 안전점검 결과 행사를 진행하는 데 문제가 없다는 판정이 나왔지만 선수들과 관객의 안전을 최우선으로 생각해 이번 행사를 취소하게 됐다"고 밝혔다. Hyundai Card, which hosted the event, said, "The safety inspection conducted after the fire was completely put out found no problem with holding the event, but we decided to cancel it, putting the safety of the skaters and the audience first."

이날 오전 11시53분 아이스링크 지붕에서 발생된 화재는 건물 지붕 일부를 태운 뒤 20여분 만에 긴급 출동한 소방대에 의해 진압됐다. The fire, which broke out on the roof of the ice rink at 11:53 am, burned part of the building's roof and was put out in about 20 minutes by a fire brigade dispatched to the scene.

현대카드는 이에 따라 이날부터 16일까지 3일간 예정됐던 공연일정을 모두 취소하고 예매된 입장권에 대해 환불조치를 해주기로 했다. Hyundai Card accordingly decided to cancel all performances scheduled for the three days from that day through the 16th and to refund reserved tickets.

##### 중앙일보2 ########## JoongAng Ilbo 2 #####

현대카드가 14일부터 3일간 예정된 '슈퍼매치 V- Superstars on Ice'의 공연을 취소한다고 밝혔다. Hyundai Card said it will cancel the 'Super Match V- Superstars on Ice' performances scheduled for three days from the 14th.

14일, 오전 11시50분경 목동 아이스링크 지붕에서 화재사고가 발생함에 따라 고객과 선수들의 안전을 최우선적으로 고려해 슈퍼매치 취소를 결정했다고 현대카드는 설명했다. Hyundai Card explained that, because a fire broke out on the roof of the Mok-dong Ice Rink at around 11:50 am on the 14th, it decided to cancel Super Match, putting the safety of customers and skaters first.

현대카드 관계자는 "건물 안전진단 결과, 화재가 지붕 일부에서 발생한 만큼 실내 링크를 비롯한 건물 전체의 구조적 안전에는 이상이 없는 것으로 확인되었지만 무리하게 공연을 강행할 경우 발생할 수 있는 안전 문제를 고려, 주최 측으로서 모든 손실을 감수하고 행사를 취소하기로 결정했다"고 말했다. An official from Hyundai Card said, "The building safety inspection confirmed that, since the fire broke out on only part of the roof, there is no problem with the structural safety of the entire building, including the indoor rink, but considering the safety problems that could arise if the performance were forced through, we as the organizer decided to bear all losses and cancel the event."

정태영 현대카드 사장은 "예기치 못한 화재로 인해 많은 국내외 팬들이 고대하던 '현대카드 슈퍼매치Ⅴ'를 선사할 수 없게 되어 유감"이라며 "고객과 선수의 안전을 최우선으로 고려한 결정으로 이해해 주시길 바라며, 앞으로 더욱 훌륭한 슈퍼매치 시리즈로 찾아 뵐 것을 약속 드린다"고 밝혔다. Chung Tae-young, president of Hyundai Card, said, "We regret that, because of an unexpected fire, we are unable to present 'Hyundai Card Super Match V', which many fans at home and abroad had been looking forward to," adding, "We hope you will understand this as a decision that put the safety of customers and skaters first, and we promise to return with an even better Super Match series."

한편 현대카드는 홈페이지와 전화, 현장안내 등을 통해 고객들에게 대회 취소 사실을 고지 중이며, 현대카드로 티켓을 구매한 고객은 전산작업을 통해 일괄 취소할 예정이다. 이외에 기타 방법을 사용해 구매한 고객에게도 취소 및 환불절차를 통해 고객의 불편이 없도록 할 방침이다. Meanwhile, Hyundai Card is notifying customers of the cancellation through its website, by telephone, and through on-site guidance, and tickets bought with a Hyundai Card will be canceled in a batch through computer processing. For customers who purchased by other means, it also plans to handle cancellation and refund procedures so that they are not inconvenienced.

##### 중앙일보3 ########## JoongAng Ilbo 3 #####

'피겨요정' 김연아 선수가 참가하는 아이스쇼가열릴 예정이었던 목동 아이스링크 경기장에서 공연을 불과 7시간여 앞두고 화재가 발생했다. A fire broke out at the Mokdong Ice Rink, where an ice show featuring 'figure fairy' Kim Yu-na was to be held, just over seven hours before the performance.

14일 오전 11시 53분께 서울 양천구 목동 아이스링크 경기장 지붕에서 불이 났으나 출동한 소방대에 의해 24분 만에 진화됐다. 이날 불은 방수용 모르타르 작업을 벌이던 경기장 지붕에서 시작돼 3천㎡ 넓이의 지붕 가운데 절반 가량을 태웠다. At about 11:53 am on the 14th, a fire broke out on the roof of the Mok-dong Ice Rink in Yangcheon-gu, Seoul, but it was extinguished in 24 minutes by a fire brigade dispatched to the scene. The fire started on the arena roof, where waterproofing mortar work was under way, and burned about half of the 3,000 m² roof.

화재 당시 경기장 안에는 이날 저녁 열릴 예정이었던 아이스쇼 공연을 준비하던김연아 선수와 주최 측 관계자들을 비롯해 아이스링크장을 찾은 초등학생 등 180여 명이 있었으나 다행히 긴급 대피해 인명피해는 없었다. At the time of the fire, about 180 people were inside the arena, including Kim Yu-na and organizer staff preparing for the ice show scheduled for that evening and elementary school students visiting the ice rink, but fortunately they evacuated quickly and there were no casualties.

소방대 관계자는 "방수작업을 벌이던 지붕에서 불길이 시작됐으나 내부로는 번지지 않았다"며 "정확한 화재 원인과 피해규모는 조사중"이라고 말했다.An official from the fire brigade said, "The fire started from the roof during the waterproofing work, but it did not spread inside." "The exact cause of fire and the magnitude of the damage are being investigated."

불이 난 목동 아이스링크 경기장에서는 이날 오후 7시 30분부터 김연아 선수 등이 참가하는 '현대카드 슈퍼매치Ⅴ-07 슈퍼스타스 온 아이스' 공연이 열릴 예정이었으며 오후 1시부터는 김연아 선수 등의 리허설이 예정돼 있었던 것으로 알려졌다. At the Mok-dong Ice Rink where the fire broke out, the 'Hyundai Card Super Match V-07 Superstars on Ice' show featuring Kim Yu-na and others was to be held from 7:30 pm that day, and a rehearsal by Kim Yu-na and others was reportedly scheduled from 1 pm.

공연 주최 측 관계자는 "리허설은 일단 취소했으나 예정대로 공연을 계속할지 여부는 경기장 내부 전기 시설과 조명 등을 점검한 뒤 판단할 것"이라고 말했다. An official from the show's organizer said, "The rehearsal has been canceled for now, but we will decide whether to go ahead with the performance as scheduled after checking the electrical facilities and lighting inside the arena."

##### 중앙일보4 ########## JoongAng Ilbo 4 #####

서울 목동 아이스링크에서 화재가 발생해 20분만에 진화됐다. A fire broke out at the Mok-dong Ice Rink in Seoul and was extinguished in 20 minutes.

이날 불은 오전11시53분쯤 아이스링크 지붕에서 일어났다. 소방차 16대와 진화인력 50여명이 출동했다. The fire broke out on the roof of the ice rink around 11:53 am. 16 fire trucks and 50 firefighters were dispatched.

화재 당시 아이스링크 지붕에서 인부 4명이 페인트칠을 하고 있었고 불길이 순식간에 치솟은 것으로 알려졌다. 그러나 불길이 지붕에서 더 이상 번지지 않아 공사 인부를 비롯해 인명 피해는 없었다. Four workers were painting on the roof of the ice rink at the time of the fire, and the flames reportedly shot up in an instant. However, because the flames did not spread beyond the roof, there were no casualties among the construction workers or anyone else.

이날 목동 아이스링크에서는 '피겨 스타' 김연아의 국내 첫 공연이 예정돼 있었다. 그러나 이번 화재로 김연아의 공연은 안전문제로 취소됐다. The first domestic performance by 'figure star' Kim Yu-na had been scheduled at the Mok-dong Ice Rink that day. However, following the fire, her performance was canceled over safety concerns.

김연아를 비롯한 출연자들은 리허설을 준비하기 위해 현장에 있었고 100여명의 초등학생들이 특별활동 수업을 받고 있었던 것으로 전해졌다. Kim and other performers were on site to prepare for the rehearsal, and about 100 elementary school students were reported to be taking extracurricular classes.

##### 쿠키뉴스 ########## Cookie News #####

김연아 등 빙상계의 슈퍼스타들의 공연이 예정돼있던 아이스링크에서 불이나 180여명이 대피했다. A fire broke out at the ice rink where Kim Yu-na and other skating superstars were scheduled to perform, and about 180 people evacuated.

14일 경찰 및 양천소방서에 따르면 이날 오전 11시50분쯤 서울 목동 아이스링크 지붕에서 불이난 뒤 24분만에 진화됐다. 이날 화재는 경기장 지붕에서 방수 작업을 벌이던 중 불이 옮겨붙은 것으로 추정되며 지붕 3000㎡가운데 절반 가량을 태웠다. According to police and the Yangcheon Fire Station on the 14th, a fire broke out on the roof of the Mokdong Ice Rink in Seoul at around 11:50 am and was put out 24 minutes later. The fire is presumed to have started when flames spread during waterproofing work on the arena roof, and it burned about half of the 3,000 m² roof.

김연아 선수를 비롯, 초등학생 등 경기장 안에 있던 사람들은 화재 발생 후 즉시 긴급 대피해 다행히 인명피해는 없었다.People in the stadium, including Kim Yu-na and elementary school students, were evacuated immediately after the fire, and fortunately there were no casualties.

불이 난 목동 아이스링크 경기장에서는 이날 오후 7시 30분부터 김연아 선수 등이 참가하는 '현대카드 슈퍼매치Ⅴ-07 슈퍼스타스 온 아이스' 공연이 열릴 예정이었으며 오후 1시부터는 김연아 선수 등의 리허설이 예정돼 있었던 것으로 알려졌다. At the Mok-dong Ice Rink where the fire broke out, the 'Hyundai Card Super Match V-07 Superstars on Ice' show featuring Kim Yu-na and others was to be held from 7:30 pm that day, and a rehearsal by Kim Yu-na and others was reportedly scheduled from 1 pm.

소방서 관계자는 "경기장 지붕에서 불길이 치솟았다는 증언이 있지만 정확한 화재원인은 아직 밝혀지지 않았다"고 말했다.A fire department official said, "There is a testimony that the flames have soared from the roof of the stadium, but the exact cause of the fire has not yet been identified."

##### 투데이코리아 ########## Today Korea #####

목동 아이스링크 화재 관계로, 김연아 선수<사진> 등을 초대해 화제가 됐던 현대카드V 매치가 취소됐다. Because of the fire at the Mok-dong Ice Rink, the Hyundai Card V match, which had drawn attention for inviting Kim Yu-na <photo> and others, was canceled.

현대카드는 당초 14일부터 3일간 예정됐던 '슈퍼매치 브이-수퍼스타즈 온 아이스(V- Superstars on Ice)' 공연을 취소한다고 밝혔다. Hyundai Card said it will cancel the 'V- Superstars on Ice' concert that was scheduled for three days from the 14th.

현대카드는 "14일 오전 11시 50분 경 목동 아이스링크 지붕에서 화재가 발생했던 관계로 공연취소를 결정했다. 건물에는 안전이 없는 것으로 일단 진단 결과가 나왔으나 혹시 모를 안전사고를 대비하기 위해서다"라고 설명했다. Hyundai Card explained, "We decided to cancel the performance because a fire broke out on the roof of the Mok-dong Ice Rink at around 11:50 am on the 14th. The inspection found no safety problem with the building for now, but this is to guard against any unforeseen safety accident."

현대카드는 티켓을 구매한 모든 고객을 대상으로 환불절차를 진행하겠다는 입장이며, 홈페이지와 전화, 현장안내 등을 통해 고객들에게 대회 취소 사실을 고지하고 있다.Hyundai Card is in a position to proceed with a refund process for all customers who have purchased tickets, and informs customers of cancellations through the website, telephone, and site guidance.

##### 한국경제 ########## Korea Economy #####

14일 오후 7시 30분부터 서울 목동 아이스링크 경기장에서 갖기로 했던 김연아 선수 등이 참가하는 '현대카드 슈퍼매치Ⅴ-07 슈퍼스타스 온 아이스' 공연이 아이스링크 경기장 일부 화재로 전면 취소됐다. The performance of Hyundai Card Super Match V-07 Superstars on Ice, which was scheduled to be held at Seoul's Mokdong Ice Rink Stadium from 7:30 pm on the 14th, was canceled due to a partial fire at the Ice Rink Stadium.

주최 측인 현대카드 관계자는 "한국건설안전기술원의 긴급 안전진단 결과 외부 지붕만 불에 타 공연은 가능하다는 결과가 나왔으나 단 1%의 문제도 있어서는 안된다는 판단에 따라 사흘간의 공연 전체를 취소키로 했다"고 말했다. An official from the organizer, Hyundai Card, said, "The emergency safety inspection by the Korea Institute of Construction Safety and Technology found that only the outer roof had burned and that the performance could go ahead, but judging that there must not be even a 1% problem, we decided to cancel all three days of performances."

공연 입장권을 예매했던 시민들에게는 전액 환불조치 하고 추후 보상방안을 마련할 방침이다. Citizens who reserved tickets for the performance will receive a full refund, and a compensation plan is to be prepared later.

이날 오전 11시 53분께 목동 아이스링크 경기장 지붕에서 인부들이 방수용 모르타르 작업을 벌이던 중 불이 나 500여만 원(소방서 추산)의 재산피해를 냈다. At about 11:53 am that day, a fire broke out on the roof of the Mok-dong Ice Rink while workers were doing waterproofing mortar work, causing about 5 million won in property damage (fire station estimate).

##### 한국일보 ########## Hankook Ilbo #####

14일 오전 서울 양천구 목동 아이스링크장 지붕에서 불이 나 시커먼 연기가 치솟고 있다. 이 불로 이날 오후'피겨 요정' 김연아 선수가 참가한 가운데 열릴 예정이던 아이스쇼가 취소됐다. Black smoke billows from a fire on the roof of the Mok-dong ice rink in Yangcheon-gu, Seoul, on the morning of the 14th. Because of the fire, the ice show that was to be held that afternoon with 'figure fairy' Kim Yu-na taking part was canceled.

'피겨 요정' 김연아(17)가 참가하는 아이스 쇼가 열릴 예정이던 목동 아이스링크에서 불이 나 공연이 전면 취소됐다. A fire broke out at the Mokdong Ice Rink, where an ice show featuring 'figure fairy' Kim Yu-na (17) was to be held, and the performances were canceled entirely.

14일 오전 11시53분께 서울 양천구 목동 아이스링크 지붕에서 불이 나 출동한 소방대에 의해 24분만에 꺼졌다. 이날 불은 인부들이 방수용 모르타르 작업을 하던 아이스링크 지붕에서 시작돼 3,000㎡ 넓이의 지붕 가운데 500여㎡를 태우고 500만원(소방서 추산)의 재산피해를 냈다. At about 11:53 am on the 14th, a fire broke out on the roof of the Mok-dong ice rink in Yangcheon-gu, Seoul, and was put out in 24 minutes by a fire brigade dispatched to the scene. The fire started on the ice rink roof, where workers were doing waterproofing mortar work, burned about 500 m² of the 3,000 m² roof, and caused 5 million won in property damage (fire station estimate).

당시 아이스링크 안에는 공연을 앞두고 리허설 중인 피겨 남자 국가대표 선수들과 피겨강습을 받던 초등학생 150여명 등 270명이 있었으나, 긴급 대피해 인명 피해는 발생하지 않았다. 김연아는 오후 1시로 예정된 공연 리허설에 참가하기 위해 차량을 타고 아이스링크 인근에 도착한 뒤 화재를 목격했으며, "공연장에 오지 말라"는 주최 측의 연락을 받고 호텔로 돌아갔다. 인부들은 경찰에서 "방수공사를 하기 위해 접착제를 발라 놓은 뒤 점심을 먹고 왔는데 불이 붙어 있었다"고 진술했다. 경찰 관계자는 "작업에 사용된 접착제는 휘발성과 발화성이 강한 만큼 강한 햇볕을 받아 자연 발화됐을 가능성도 있다"고 말했다. At the time, 270 people were inside the ice rink, including men's national figure skaters rehearsing ahead of the show and about 150 elementary school students taking figure skating lessons, but they evacuated quickly and there were no casualties. Kim Yu-na saw the fire after arriving near the ice rink by car to take part in the rehearsal scheduled for 1:00 pm, and returned to her hotel after the organizer told her, "Do not come to the venue." The workers told police, "We applied adhesive for the waterproofing work and went to have lunch, and when we came back it was on fire." A police official said, "Because the adhesive used in the work is highly volatile and flammable, it may have ignited spontaneously in the strong sunlight."

김연아 등이 참가하는 '현대카드 슈퍼매치Ⅴ-07 슈퍼스타스 온 아이스' 공연은 이날 오후 7시30분 열릴 예정이었다. 주최측인 현대카드 관계자는 "한국건설안전기술원의 긴급 안전진단 결과 외부 지붕만 불에 타 공연은 할 수 있다는 결론이 나왔지만 안전 상의 이유로 사흘간 열릴 예정이었던 공연 전체를 취소키로 했다"며 "공연 입장권을 예매했던 시민들에게는 전액 환불조치하고 추후 보상 방안을 마련할 방침"이라고 말했다. The 'Hyundai Card Super Match V-07 Superstars on Ice' show, featuring Kim Yu-na and others, was to be held at 7:30 pm that day. An official from the organizer, Hyundai Card, said, "The emergency safety inspection by the Korea Institute of Construction Safety and Technology concluded that only the outer roof had burned and that the performance could go ahead, but for safety reasons we decided to cancel the entire run, which was to be held over three days," adding, "We will give a full refund to citizens who reserved tickets and prepare a compensation plan later."

##### kbs ########## kbs #####

<앵커 멘트> <Anchor comment>

안녕하세요. 이선영입니다. Hi. I'm Sunyoung Lee.

김연아의 출전으로 관심을 모았던 피겨 아이스쇼가 예기치 못한 화재로 전격 취소됐습니다. The figure skating ice show that had drawn attention because of Kim Yu-na's appearance was abruptly canceled due to an unexpected fire.

다행히 인명피해는 없었지만 좀 황당하네요. 정현숙 기자입니다. Fortunately there were no casualties, but it is rather baffling. Reporter Jung Hyun-sook reports.

<리포트> <Report>

김연아와 안도미키의 우정의 대결, 플루첸코와 야구딘의 카리스마 넘치는 공연. A friendly showdown between Kim Yu-na and Miki Ando, and charismatic performances by Plushenko and Yagudin.

세계적인 피겨스타들의 멋진 무대를 고대해온 국내팬들의 꿈이 한순간에 무너져내렸습니다. The dreams of domestic fans, who have been looking forward to the wonderful stage of world-class figure stars, have collapsed in an instant.

오늘 오전 12시, 목동 아이스링크에서 발생한 예기치 않은 화재가 원인입니다. This is due to an unexpected fire at the Mokdong Ice Rink at 12 am today.

<Kim Yu-na interview> I found out on the way there.

At first I thought it was a joke...

<Coach Orser interview> It was a little scary. I'm glad I wasn't inside the rink at the time.

With all the performances scheduled for three days from today canceled, Kim Yu-na, who had promised a passionate show, also expressed deep regret.

<Kim Yu-na interview> I'm sorry for the other skaters who came to Korea for the show.

Devoted fans who had followed the figure skating stars' every move for the past week even shed tears at the sudden news of the cancellation.

<Figure skating fan interview>

I was so disappointed that I cried.

Instead of the canceled show, Kim Yu-na will perform at the Lotte World Ice Rink this coming Sunday

Once Upon a Dream, which has never been shown in public, to ease the fans' disappointment.

##### kbs Sports News #####

The figure skating ice show that drew attention because of Kim Yu-na's appearance has been abruptly canceled due to an unexpected fire.

Kim Yu-na, who had promised Korean fans a great show, expressed deep regret at the sudden cancellation.

Reporter Jung Hyun-sook reports.

The cause was an unexpected fire that broke out at the Mokdong Ice Rink during waterproofing work.

<Interview> Kim Yu-na: "I found out on the way there. At first I thought it was a joke."

<Interview> Coach Orser: "It was a little scary. I'm glad I wasn't inside the rink at the time..."

With all the performances scheduled over three days canceled, Kim Yu-na, who had promised a passionate show, also expressed deep regret.

<Interview> Kim Yu-na: "I'm sorry for the other skaters who came to Korea for the show."

Devoted fans who had followed the figure skating stars' every move for the past week even shed tears at the sudden news of the cancellation.

<Interview> Figure skating fan: "I was so disappointed that I cried."

Instead of the canceled show, Kim Yu-na will perform Once Upon a Dream, which has never been shown in public, at the Lotte World Ice Rink this coming Sunday to ease the fans' disappointment.

The present invention was applied to the above 21 documents, asymmetric similarities were calculated, and the flow of plagiarism was derived. The result is shown as a plagiarism path graph in FIG. 19.

The number next to each arrow indicates the distance between the two documents. A short distance means that plagiarism is likely, and a long distance means that the documents are not similar to each other. The distance is expressed as a value between 0 and 1.

Dividing the document set on the basis of the graph of FIG. 19 yields the six groups below.

Similar group    Newspapers
1                JoongAng Ilbo 2, Nocut News 2, Maeil Economy, Today Korea
2                Hankook Ilbo, E-Times, JoongAng Ilbo 3, Cookie News
3                Dailian, Dong-A Ilbo 2, Dong-A Ilbo 3, Sports Today, JoongAng Ilbo 1
4                Nocut News 1, Nocut News 3, JoongAng Ilbo
5                kbs Sports News, kbs
6                Dong-A Ilbo 1, Seoul Economy
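For illustration only, a grouping of this kind can be approximated by joining documents whose distance is at or below a cutoff and collecting the connected sets. This section does not state how the six groups were read off FIG. 19, so the function name `group_by_distance`, the cutoff of 0.5, and the sample distances in the usage note are assumptions and not values from the experiment.

```python
from collections import defaultdict


def group_by_distance(distances, threshold=0.5):
    """Group documents linked by a distance <= threshold (union-find).

    distances: dict mapping (doc_a, doc_b) -> distance in [0, 1].
    threshold: hypothetical cutoff; not taken from the patent.
    """
    parent = {}

    def find(x):
        # Register unseen documents as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        root_a, root_b = find(a), find(b)
        if d <= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for doc in list(parent):
        groups[find(doc)].append(doc)
    return sorted(sorted(g) for g in groups.values())
```

With made-up distances such as `{("a", "b"): 0.1, ("b", "c"): 0.2, ("c", "d"): 0.9}`, documents a, b and c end up in one group and d in a group of its own.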

The results of this plagiarism search demonstrate that newspapers indiscriminately distribute articles that are identical or only slightly modified, and show that Google's duplicate-prevention function does not effectively block such distribution.

<Variations>

The above embodiment describes a plagiarism check performed in only two stages, a preliminary inspection and an in-depth inspection, but the present invention is not limited to this.

For example, an intermediate inspection whose speed and accuracy lie between those of the preliminary inspection and the in-depth inspection may be added between them, and the resulting three-stage configuration is still fully covered by the technical idea of the present application. Unlike the preliminary inspection, which judges by calculating the degree of syllable agreement, and the in-depth inspection, which uses local alignment, the intermediate inspection is preferably configured to judge similarity by a different criterion.

Further, where the preliminary inspection, the in-depth inspection, or the additional intermediate inspection compares a portion extended by several words before and after each common anchor, the extension may itself be carried out in multiple stages. For example, the preliminary inspection may be divided into three stages. The first preliminary inspection extends the portion by 3 words before and after the anchor and compares 7 words in total. If this is passed, the second preliminary inspection extends it by 5 words before and after the anchor and compares 11 words in total. If this is passed, the final third preliminary inspection extends it by 7 words before and after the anchor and compares 15 words in total, and if this too is passed, processing moves on to the next stage, that is, the intermediate inspection or the in-depth inspection. The inspection can be subdivided, or the verification strengthened, in this way. The same approach can be applied to the intermediate inspection and the in-depth inspection.
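The staged extension described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the helper names, the threshold of 0.6, and the use of a multiset overlap of characters as the syllable match ratio are assumptions. Only the extension widths of 3, 5 and 7 words on each side come from the text. For Hangul text, one precomposed character corresponds to one syllable, which is why characters are counted.

```python
from collections import Counter


def window(words, pos, radius):
    """Anchor word at `pos` plus up to `radius` words on each side."""
    return words[max(0, pos - radius): pos + radius + 1]


def syllable_match_ratio(win_a, win_b):
    """Share of the syllables of win_a that also occur in win_b.

    Assumption: order-free multiset overlap of characters.
    """
    syl_a = Counter("".join(win_a))
    syl_b = Counter("".join(win_b))
    total = sum(syl_a.values())
    return sum((syl_a & syl_b).values()) / total if total else 0.0


def staged_preliminary_inspection(words_a, pos_a, words_b, pos_b,
                                  radii=(3, 5, 7), threshold=0.6):
    """Return True only if every stage reaches the threshold.

    radii 3, 5, 7 give windows of 7, 11 and 15 words;
    the threshold value is hypothetical.
    """
    for radius in radii:
        ratio = syllable_match_ratio(window(words_a, pos_a, radius),
                                     window(words_b, pos_b, radius))
        if ratio < threshold:
            return False  # fail early; later, wider stages are skipped
    return True
```

Stopping at the first failed stage is the point of the design: most anchor pairs are rejected after the cheap 7-word comparison, and only promising pairs pay for the wider windows.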

The present invention is not limited to stand-alone operation. For example, at least one of the steps from the initial input of the documents to be compared to the final output of the plagiarism result or the plagiarism path may be carried out by data transmission and reception over a network such as the Internet. With this configuration, when a client requesting a plagiarism check enters the target material and sends it over the network, the service provider receives and processes the data and sends the result back to the client over the network, and the client can display or print the result to check it. Alternatively, when an Internet search engine or the like searches Internet documents by an entered search term and presents the results to the user, it may also present information on the original and the plagiarized copies of the information related to that search term.

Alternatively, distributed processing, in which data processing is spread across several computers over a network, may be applied to the present invention. In this case, for example, the preliminary inspection may be assigned to system A, the in-depth inspection to system B, and the calculation of the document plagiarism level to system C.

It goes without saying that other modifications made by those skilled in the art within the scope of the technical idea derived from the above embodiments should be construed as falling within the scope of the present invention.

The present invention can be used in the computer program industry and the web service industry for plagiarism search, as well as in computer-related industries that search for plagiarism paths from an original and display or summarize them.

FIG. 1 is an overall flowchart illustrating the plagiarism search method of the present invention.

FIG. 8 is a detailed flowchart of the first similarity calculation step (S330) of the preliminary inspection (S300).

FIG. 9 is a detailed flowchart of the second similarity calculation step (S430) of the in-depth inspection (S400).

FIG. 11 is a conceptual diagram illustrating an example of the dictionary structure of a document.

FIG. 12 is a distribution chart of the number of syllables per word.

FIG. 13 is a graph showing, for 400 independent, non-plagiarized documents (each of about 2,000 words), how often expressions occur that are similar enough to be mistaken for plagiarism even though they are not.

FIG. 14 is a graph of a general Gumbel function.

FIG. 15 is a graph that overlays the Gumbel function on a graph showing, from an experiment different from that of FIG. 13, how often similar expressions occur in a number of independent, non-plagiarized documents even though they are not plagiarized.

FIG. 16 is a performance graph illustrating the effect of the multi-stage plagiarism search of the present invention, for example a two-stage search.

FIG. 17 is a graph showing the plagiarism path in an experimental example.

FIG. 18 is a graph showing the plagiarism path in another experimental example.

FIG. 19 is a graph showing the plagiarism path in another experimental example.

Claims

A method for searching for plagiarism among a plurality of documents entered as data, using an apparatus comprising at least storage means, input means, output means and control means, the method including:

A partial similarity calculation step of dividing each pair of documents to be compared, which are stored in the storage means, into predetermined division units and calculating a similarity for each division unit; and

A document plagiarism determination step of accumulating the partial similarities that are equal to or greater than a predetermined threshold, calculating a document similarity for each pair of documents, and determining from it whether the document is plagiarized,

wherein the partial similarity calculation step is divided into:

A preliminary inspection step of inspecting the preliminary similarity of the compared portions of each pair of documents by calculating the match ratio of all division units, without overlapping the division units; and

An in-depth inspection step, performed only when the similarity from the preliminary inspection is equal to or greater than a predetermined threshold, of inspecting the in-depth similarity by reflecting weights according to the matching position within the division unit.

The method according to claim 1,

wherein the division unit is a word.

The method according to claim 1,

wherein the division unit is limited to a predetermined maximum number k of syllables, and for a word exceeding the maximum number of syllables, k-mer word segments are formed by repeatedly dividing off k syllables at a time, in order from the first syllable.
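As an illustration only, the k-mer division of a long word might look like the sketch below. The claim wording does not make clear whether consecutive k-mers overlap, so the hypothetical `overlapping` flag exposes both readings; the function name and the default `k=4` are likewise assumptions. Each character is treated as one syllable, which holds for precomposed Hangul.

```python
def kmer_segments(word, k=4, overlapping=True):
    """Split a word longer than k syllables into k-mer segments.

    overlapping=True  -> slide one syllable at a time from the first syllable.
    overlapping=False -> cut consecutive blocks of k syllables; the last
                         block may be shorter than k.
    Words of k syllables or fewer are returned unchanged.
    """
    if len(word) <= k:
        return [word]
    if overlapping:
        return [word[i:i + k] for i in range(len(word) - k + 1)]
    return [word[i:i + k] for i in range(0, len(word), k)]
```

For a six-syllable word and k = 4, the overlapping reading yields three segments and the non-overlapping reading yields two.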

The method according to claim 1,

wherein each document divided into division units is provided as a dictionary structure that uses each division unit as an anchor, serving as the key, and the positions at which that division unit appears in the document as the reference values, and

the comparison that calculates the similarity of the compared portion centered on the division unit is performed for all anchors common to the dictionary structures of the respective documents.
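A minimal sketch of such a dictionary structure is given below, assuming that division units are whitespace-separated words. The names `build_index` and `common_anchor_pairs` are hypothetical and serve only to show how the common anchors of two documents lead to the position pairs that are then compared.

```python
from collections import defaultdict


def build_index(text):
    """Map each word (anchor) to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(text.split()):
        index[word].append(pos)
    return dict(index)


def common_anchor_pairs(index_a, index_b):
    """Yield (anchor, pos_in_a, pos_in_b) for every anchor in both documents."""
    for anchor in sorted(index_a.keys() & index_b.keys()):
        for pos_a in index_a[anchor]:
            for pos_b in index_b[anchor]:
                yield anchor, pos_a, pos_b
```

Only words that occur in both documents produce candidate positions, so portions with no shared anchor are never compared at all.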

The method according to claim 1,

wherein, before the comparison that calculates the similarity of the compared portion centered on the division unit, a process is performed that removes stop words, which are division units that do not affect the plagiarism determination, while keeping meaningful words, which are division units that do affect it.

The method according to claim 1,

wherein the preliminary inspection step calculates the similarity as the ratio of the number of matching syllables to the total number of syllables in the compared portion.

The method according to claim 1,

wherein the compared portion of the preliminary inspection step is a portion extended by a predetermined number of words before and after the division unit, centered on its position in the document.

The method according to claim 7,

wherein the extension by words is made within a range in which the division units do not overlap.

The method according to claim 1,

wherein the preliminary inspection step consists of predetermined multiple stages and proceeds to a later stage only when the similarity in the previous stage is determined to be equal to or greater than a predetermined threshold, and

the compared portion is a portion extended by a predetermined number of words before and after the division unit, centered on its position in the document, with the number of extended words increasing in later stages.

The method according to claim 1,

wherein the in-depth inspection step calculates the similarity by locally aligning the compared portion word by word, giving the syllables of each word weights according to their position within the word.
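One way to picture this step is a word-level local alignment in the style of Smith-Waterman, with each word pair scored by position-weighted syllable matches. The sketch below is an assumption-laden illustration: the weights `(3.0, 2.0, 1.0)`, the choice to weight earlier syllables more heavily, the mismatch score and the gap penalty are all hypothetical, and the claim does not name a specific alignment algorithm.

```python
def word_score(w1, w2, weights=(3.0, 2.0, 1.0), mismatch=-1.0):
    """Score two words by syllables that match at the same position.

    Assumption: syllable i contributes weights[i]; positions beyond the
    weight table reuse the last weight. Words sharing no syllable at any
    position get the mismatch score.
    """
    score = 0.0
    for i, (s1, s2) in enumerate(zip(w1, w2)):
        if s1 == s2:
            score += weights[min(i, len(weights) - 1)]
    return score if score > 0 else mismatch


def local_alignment_score(words_a, words_b, gap=-1.0):
    """Best local alignment score between two word sequences."""
    prev = [0.0] * (len(words_b) + 1)
    best = 0.0
    for wa in words_a:
        cur = [0.0]
        for j, wb in enumerate(words_b, start=1):
            cell = max(0.0,                               # start a new alignment
                       prev[j - 1] + word_score(wa, wb),  # align wa with wb
                       prev[j] + gap,                     # wa left unaligned
                       cur[j - 1] + gap)                  # wb left unaligned
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

Scoring a portion against itself gives the highest value that portion can reach, which is convenient when an absolute score later needs to be normalized.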

The method according to claim 1,

wherein the compared portion of the in-depth inspection step is a portion extended by a predetermined number b of words before and after the division unit, centered on its position in the document.

The method according to claim 11,

The method according to claim 1,

wherein the in-depth inspection step consists of predetermined multiple stages and proceeds to a later stage only when the similarity in the previous stage is determined to be equal to or greater than a predetermined threshold,

The method according to claim 1,

wherein, regarding the partial similarity of each compared portion,

the similarity calculated in the in-depth inspection step is called the absolute similarity,

the similarity obtained when the compared portion matches perfectly is called the perfect-match similarity, and

the ratio of the absolute similarity to the perfect-match similarity is called the relative similarity, and

in the document plagiarism determination step, the partial similarity accumulated to calculate the document similarity is the absolute similarity, and the partial similarity used as the criterion for whether the threshold is reached is the relative similarity.

The method according to claim 14,

wherein, regarding the document similarity of the whole document,

the accumulation of the absolute similarities of the compared portions is called the document absolute similarity,

the similarity obtained when the documents match perfectly is called the document perfect-match similarity, and

the ratio of the document absolute similarity to the document perfect-match similarity is called the document relative similarity, and

whether the document is plagiarized is determined from the document relative similarity.
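The relationship between the absolute, perfect-match and relative similarities can be shown with a short sketch. It assumes that each compared portion supplies a pair (absolute score, perfect-match score) and that the document perfect-match similarity is the sum of the per-portion perfect-match scores; the text does not spell out the latter, and the function name and the threshold of 0.5 are hypothetical.

```python
def document_relative_similarity(parts, threshold=0.5):
    """Compute the document relative similarity from per-portion scores.

    parts: list of (absolute, perfect) pairs, one per compared portion.
    A portion's absolute score is accumulated only when its relative
    similarity (absolute / perfect) reaches the threshold.
    """
    doc_absolute = sum(a for a, p in parts if p > 0 and a / p >= threshold)
    doc_perfect = sum(p for _, p in parts)  # assumption: sum over all portions
    return doc_absolute / doc_perfect if doc_perfect else 0.0
```

With portions scored (16, 20), (5, 20) and (20, 20), the middle portion falls below the threshold and is ignored, so the result is 36 / 60.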

The method according to claim 1,

wherein whether the document is plagiarized is determined by matching the document similarity against a predetermined probability model.

The method according to claim 16,

wherein the probability model is a statistical function derived by compiling, from comparisons among a number of independent, non-plagiarized documents, the probability that expressions similar enough to be suspected of plagiarism appear by chance even though they are not in fact plagiarized.

The method according to claim 17,

wherein, in the plagiarism search using the probability model, a specific probability value indicating the degree of plagiarism is obtained from the function value of the probability model corresponding to the document similarity, and is output.
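The drawings refer to a Gumbel function (FIG. 14). Assuming that this is the statistical function of the probability model, the mapping from a document similarity to a probability could be sketched as below. The location `mu` and scale `beta` would have to be fitted to scores from independent documents; the values used in the usage note are invented, and reporting the upper-tail probability is one possible reading of the "specific probability value", not a statement of the claimed method.

```python
import math


def gumbel_cdf(x, mu, beta):
    """Gumbel cumulative distribution: exp(-exp(-(x - mu) / beta))."""
    return math.exp(-math.exp(-(x - mu) / beta))


def chance_probability(score, mu, beta):
    """Probability that independent documents reach at least `score`.

    A small value means the similarity is unlikely to be a coincidence.
    """
    return 1.0 - gumbel_cdf(score, mu, beta)
```

For example, with hypothetical parameters `mu=0.2` and `beta=0.1`, a score of 0.9 gives a much smaller chance probability than a score of 0.3.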

The method according to claim 10,

wherein, when the documents of the pair to be compared are called Document A and Document B,

the similarity of the compared portion is an asymmetric similarity, in which the similarity of Document B with Document A as the reference and the similarity of Document A with Document B as the reference differ from each other, and

the weights are set so that the weight for an inserted portion, which is added in the compared document relative to the reference document, differs from the weight for a deleted portion, which is removed in the compared document relative to the reference document.

The method according to claim 19,

wherein the in-depth inspection calculates both the similarity of Document B with Document A as the reference and the similarity of Document A with Document B as the reference,

the document plagiarism determination step calculates, from those similarities respectively, the document similarity of Document B with Document A as the reference and the document similarity of Document A with Document B as the reference, and

the direction of plagiarism is determined by comparing the document similarity value of Document B with Document A as the reference against the document similarity value of Document A with Document B as the reference.

The method of claim 20,

wherein the plagiarism direction is calculated for all pairs of documents to be compared, and a plagiarism path diagram in which the documents are connected by arrows according to the plagiarism direction is displayed.
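As a rough illustration, the arrows of such a path diagram can be derived from the two asymmetric document similarities of every pair. Several points in the sketch are assumptions rather than statements of the claims: the rule that the reference document of the larger similarity is treated as the source, the conversion of a similarity into a distance as 1 minus the similarity, the cutoff `min_sim`, and the function name.

```python
def plagiarism_edges(sim, min_sim=0.3):
    """Build directed edges (source, copy, distance) from asymmetric scores.

    sim[(ref, other)]: document similarity of `other` with `ref` as reference,
    assumed to lie in [0, 1].
    """
    docs = sorted({d for pair in sim for d in pair})
    edges = []
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            ab = sim.get((a, b), 0.0)
            ba = sim.get((b, a), 0.0)
            strongest = max(ab, ba)
            if strongest < min_sim:
                continue  # too dissimilar to draw an arrow
            # ASSUMPTION: the reference document of the larger value is the source.
            src, dst = (a, b) if ab >= ba else (b, a)
            edges.append((src, dst, round(1.0 - strongest, 3)))
    return edges
```

Each returned tuple corresponds to one arrow, and the third element plays the role of the number written next to the arrow.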

An apparatus for searching for plagiarism among a plurality of documents entered as data, comprising at least storage means, input means, output means, and control means, the apparatus having at least:

A document database provided in the storage means, which systematically stores the input documents; and

Control means that reads a pair of documents to be compared from the document database, performs the comparison, and determines whether the document is plagiarized,

wherein the control means is configured to read the pair of documents to be compared from the document database and perform the plagiarism search method according to any one of claims 1 to 21.

The apparatus according to claim 22,

wherein the plagiarism search apparatus is a server further provided with communication means,

the server is connected over a network to a terminal comprising communication means that communicates with the communication means of the server, and at least storage means, input means, output means and control means, and

the control means of the server

controls so that, when the plurality of documents to be compared are entered through the input means of the terminal and sent through the communication means of the terminal to the communication means of the server, the plurality of documents are entered as data from the communication means of the server into the input means of the server, and

controls so that, when the plagiarism determination result for the plurality of documents has been calculated, the result is output as data from the output means of the server to the communication means of the server and is output at the output means of the terminal through the communication means of the terminal connected to the communication means of the server.