KR20070041918A

KR20070041918A - Design and implementation of a text plagiarism detection method using OMUCS and sequence alignment technique

Info

Publication number: KR20070041918A
Application number: KR1020050097563A
Authority: KR
Inventors: 김지수; 한상용
Original assignee: 중앙대학교 산학협력단
Priority date: 2005-10-17
Filing date: 2005-10-17
Publication date: 2007-04-20
Also published as: KR100711277B1

Abstract

The present invention relates to a text plagiarism detection system that checks whether a text has been plagiarized, and to a plagiarism checking method therefor. The method comprises: an original-document sentence segmentation step of dividing an input original document into sentences; an original-document word segmentation step of dividing the sentences obtained in the original-document sentence segmentation step into words; a comparison-document sentence segmentation step of dividing an input comparison document into sentences; a comparison-document word segmentation step of dividing the sentences obtained in the comparison-document sentence segmentation step into words; an identical-word identification step of comparing each original sentence with each comparison sentence, both divided into words by the two word segmentation steps, and finding the words common to the compared pair; an OMUCS calculation step of applying the identical words found in the identical-word identification step to OMUCS[original sentence][comparison sentence], a modified form of cosine similarity; and a sentence similarity determination step of comparing the result of the OMUCS calculation step with a first threshold to decide whether the original sentence and the comparison sentence are similar.

Description

Design and implementation of a text plagiarism detection method using OMUCS and sequence alignment technique

FIG. 1 shows the geometric configuration of the vector model applied to the text plagiarism checking method according to the present invention, together with its formula.

FIG. 2 shows an example illustrating how plagiarism is determined in the text plagiarism checking method according to the present invention.

FIG. 3 is a block diagram of the text plagiarism detection system according to the present invention.

FIG. 4 is a flowchart showing the steps of the text plagiarism checking method according to the present invention in order.

FIG. 5 illustrates, by way of example, the document analysis process of the document analysis module that forms part of the text plagiarism detection system according to the present invention.

FIGS. 6 and 7 are tables listing the original English sentences and the comparison English sentences used as examples in carrying out the text plagiarism checking method according to the present invention.

FIG. 8 is a graph showing the results of applying the original and comparison English sentences of FIGS. 6 and 7 to conventional cosine similarity and to OMUCS according to the present invention, respectively.

FIG. 9 is a graph showing the results of applying the original and comparison English sentences of FIGS. 6 and 7 to the sequence alignment process according to the present invention.

The present invention relates to a text plagiarism detection system that checks whether a text has been plagiarized, and to a plagiarism checking method therefor.

Technological advances in areas such as computer storage, processors, networks, database systems, scanning systems, and user interfaces have made digital libraries possible. These advances, together with the emergence of the Internet, have also made it easy to upload and share information on World Wide Web servers through web browsers such as Internet Explorer and Netscape, and the resulting abundance of information has brought about many changes in the politics, economy, and culture of our society.

However, the popularity of the Internet has had negative as well as positive effects on society.

Information that exists online, in digital libraries and on the web, takes the form of digital documents. Easier access to information in digital-document form increases the threat of illegal copying and distribution and has created the complex problem of intellectual property infringement. These problems are becoming more serious with the development of search sites such as Google, AltaVista, Yahoo, Naver, and Empas.

In addition, document editors such as Hangul and Word are widely used as convenient editing tools by those who want to plagiarize documents.

Advances in search sites and editing tools are useful for those who want to obtain information quickly and easily, but they can also serve as tools for infringing intellectual property. The convenience of document editors, which can quickly and easily change a document's typeface, font size, and paragraph structure, makes plagiarism even easier.

As a result, the convenience of searching and document editing has made the forms of document plagiarism more sophisticated and has made it harder to determine whether a document has been plagiarized.

There are two approaches to preventing document plagiarism: copy prevention and copy detection. Copy prevention includes storing the document on a physically separate medium such as a CD so that it cannot be plagiarized, granting access rights to the document's information, and encrypting the document with a program.

However, these methods require considerable time and cost, and they merely restrict access to and distribution of a document; they cannot detect whether a document has already been plagiarized.

Copy detection checks whether a document has been plagiarized. However, conventional copy detection methods are weak at measuring document similarity and at finding copies of partial phrases, and systems that use hash functions suffer from hash collisions. In addition, they cannot perform copy detection on documents whose syntactic characteristics are unknown, nor can they detect plagiarism of varying sizes.

The present invention has been devised to solve the above problems, and its technical objective is to provide a text plagiarism detection system and a text plagiarism checking method that increase the reliability of plagiarism determination when performing copy detection.

To achieve the above technical objective, the present invention provides a text plagiarism checking method comprising:

an original-document sentence segmentation step of dividing an input original document into sentences;

an original-document word segmentation step of dividing the sentences obtained in the original-document sentence segmentation step into words;

a comparison-document sentence segmentation step of dividing an input comparison document into sentences;

a comparison-document word segmentation step of dividing the sentences obtained in the comparison-document sentence segmentation step into words;

an identical-word identification step of comparing each original sentence with each comparison sentence, both divided into words by the two word segmentation steps, and finding the words common to the compared pair;

an OMUCS calculation step of applying the identical words found in the identical-word identification step to OMUCS[original sentence][comparison sentence], a modified form of cosine similarity; and

a sentence similarity determination step of comparing the result of the OMUCS calculation step with a first threshold to decide whether the original sentence and the comparison sentence are similar.

In the text plagiarism checking method for achieving the above technical objective,

the method further comprises a sequence alignment step of obtaining a maximum value by applying one or more methods selected from global alignment, local alignment, and semi-global alignment to each original sentence and comparison sentence found to be similar in the similarity determination step.

The method further comprises a sentence plagiarism determination step of comparing the sum of the maximum values computed in the sequence alignment step with a second threshold to decide whether the comparison sentence plagiarizes the original sentence.

The method further comprises: a COS calculation step of comparing each original sentence with each comparison sentence, both divided into words by the two word segmentation steps, and applying them to COS[original sentence][comparison sentence], the cosine similarity;

a sentence similarity determination step of comparing the result of the COS calculation step with a third threshold to decide whether the original sentence and the comparison sentence are similar;

a sequence alignment step of obtaining a maximum value by applying one or more methods selected from global alignment, local alignment, and semi-global alignment to each original sentence and comparison sentence found to be similar in the similarity determination step; and

a sentence plagiarism determination step of comparing the sum of the maximum values computed in the sequence alignment step with a fourth threshold to decide whether the comparison sentence plagiarizes the original sentence.

The present invention is described in detail below with reference to the accompanying drawings.

To detect whether a document has been plagiarized, the most important decision is the choice of the unit of comparison between documents. The components of a document can be said to be characters, symbols, and so on, but these are not units that can express the document's meaning; the word is the smallest unit of meaning.

Accordingly, the unit of comparison in the text plagiarism detection system and checking method according to the present invention is the word.

OMUCS (Overlap Measure Using Cosine Similarity) is based on cosine similarity, which measures the similarity between documents using the vector model, the model most widely used in information retrieval for evaluating the degree of copying between documents. Rather than using cosine similarity to measure how similar two documents are, OMUCS uses it to measure how much they overlap, which is what document copy detection requires. Because OMUCS does not take word order into account, the text plagiarism detection system and checking method according to the present invention additionally apply sequence alignment techniques.

As described above, OMUCS represents each sentence of a document as a vector and uses cosine similarity to measure the degree of overlap between an original sentence and the sentence compared with it.

The vector model applied to the text plagiarism detection system and checking method according to the present invention is shown in FIG. 1 (the geometric configuration of the vector model and its formula), where j and k denote terms (words), S a sentence, and W a weight.

For a term (word) i and the sentence pair (Sj, Sk), the weight Wij is a positive, non-binary value, and the vector of sentence Sj is expressed as (W1j, W2j, ..., Wtj), where t is the total number of index terms in the system. (Because vector notation cannot be typeset in this document, vector values are shown in bold and underlined to distinguish the 'vector' from the 'sentence'.)

Sentences Sj and Sk are each represented as a t-dimensional vector, where t is the total number of index terms. In the vector model, the similarity between sentence Sj and sentence Sk is obtained as the correlation between the two vectors Sj and Sk; this correlation can be quantified as the cosine of the angle between the two vectors, as illustrated in FIG. 1(a) and given by the formula in FIG. 1(b).

Here |Sj| and |Sk| are the norms of the two sentence vectors, and they provide normalization in the sentence space.

Because Wij and Wik are greater than or equal to 0, sim(Sj, Sk) takes a value between 0 and 1. The vector model therefore does more than predict whether two sentences are related: it ranks them by degree of similarity. A threshold is placed on sim(Sj, Sk), and that threshold serves as the basis for deciding whether a sentence is plagiarized in document plagiarism detection.
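The formula of FIG. 1(b) is not reproduced in this text. Assuming it is the standard cosine measure of the vector model, which is consistent with the norms |Sj| and |Sk| and the weights Wij and Wik described above, it can be written as:

```latex
\mathrm{sim}(S_j, S_k)
  = \frac{\vec{S_j} \cdot \vec{S_k}}{|\vec{S_j}|\,|\vec{S_k}|}
  = \frac{\sum_{i=1}^{t} W_{ij}\, W_{ik}}
         {\sqrt{\sum_{i=1}^{t} W_{ij}^{2}}\;\sqrt{\sum_{i=1}^{t} W_{ik}^{2}}}
```

With all weights non-negative the numerator is non-negative, and by the Cauchy-Schwarz inequality it cannot exceed the denominator, which gives the range of 0 to 1 stated above.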

The occurrence vector for the list of words in a sentence S is defined as O(S), and the frequency vector as F(S). Fi(S) is the frequency with which word i occurs in sentence S. sim(S1, S2) is the similarity between sentences S1 and S2, computed by applying the cosine similarity measure.

For example, let r be the original sentence and Sj a new sentence to be compared with r, and define O(r)={a, b, c}, O(S1)={a, b}, O(S2)={a, b, c}, and O(S3)={a, b, c, d, e, f, g, h}, where a, b, c, d, e, f, g, h are words. Copy detection should report that r and S1 are very similar and that S2 and S3 are exact copies of r. However, applying the conventional SCAM method gives the results shown in [Equation 1].

Analyzing the results, the values of sim(r, S1) and sim(r, S2) are quite satisfactory, so plagiarism can be determined for S1 and S2. For S3, however, the value is unsatisfactory even though S3 contains every word of r. That is, the original sentence r has the words a, b, c and sentence S3 has a, b, c, d, e, f, g, h, so S3 fully contains the original sentence, yet the value is only 0.61.

This is because S3, although it has all of the words a, b, c of the original sentence r, also has the unrelated words d, e, f, g, h.
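[Equation 1] is not reproduced in this text. As a rough check, the sketch below assumes binary occurrence vectors (each word counted once) and plain cosine similarity; the function name `cosine` is illustrative and does not come from the source. Under those assumptions the value for S3 comes out at about 0.61, matching the figure quoted above.

```python
import math


def cosine(a: set[str], b: set[str]) -> float:
    """Cosine similarity of two word sets treated as binary occurrence vectors."""
    if not a or not b:
        return 0.0
    # Dot product of binary vectors = number of shared words;
    # the norm of a binary vector = sqrt(number of words).
    return len(a & b) / math.sqrt(len(a) * len(b))


r = {"a", "b", "c"}
s1 = {"a", "b"}
s2 = {"a", "b", "c"}
s3 = {"a", "b", "c", "d", "e", "f", "g", "h"}

print(round(cosine(r, s1), 2))  # 0.82
print(round(cosine(r, s2), 2))  # 1.0
print(round(cosine(r, s3), 2))  # 0.61
```

The extra words d through h enlarge the norm of S3 without adding to the dot product, which is why the score drops even though S3 contains all of r.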

FIG. 2 illustrates how plagiarism is determined in the text plagiarism detection system and checking method according to the present invention; the description below refers to it.

FIG. 2(a) depicts the relationship between r and S1 as sets, where r∩S1={a, b}. Thus O(r)={a, b, c} and O(r∩S1)={a, b}. Substituting these into the formula of FIG. 1(b) to measure cosine similarity gives [Equation 2].

According to this value, r and S1 are judged to be 'considerably similar'.

FIG. 2(b) depicts the relationship between r and S2 as sets, where r∩S2={a, b, c}. Thus O(r)={a, b, c} and O(r∩S2)={a, b, c}. Substituting these into the formula of FIG. 1(b) to measure cosine similarity gives [Equation 3].

According to this value, r and S2 are judged to be 'very similar'.

FIG. 2(c) depicts the relationship between r and S3 as sets, where r∩S3={a, b, c}. Thus O(r)={a, b, c} and O(r∩S3)={a, b, c}. Substituting these into the formula of FIG. 1(b) to measure cosine similarity gives [Equation 4].

According to this value, r and S3 are judged to be 'very similar'.

Generalizing the OMUCS according to the present invention described above, it is defined as follows.

For reference, the conventional cosine similarity formula, generalized in the same way, is as follows.

Here 'i' denotes the original sentence and 'j' the comparison sentence, and Si and Sj denote the word-based sets of the original sentence and the comparison sentence, respectively.
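The generalized OMUCS and cosine formulas are not reproduced in this text. Judging from the worked examples of FIG. 2, OMUCS appears to be the cosine between the original sentence's word set and its intersection with the comparison sentence's word set. The sketch below encodes that reading with binary word sets; the function names and the set-based simplification are assumptions, not the patent's own definition.

```python
import math


def cosine(a: set[str], b: set[str]) -> float:
    """Conventional cosine similarity on binary word sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def omucs(original: set[str], comparison: set[str]) -> float:
    """Assumed OMUCS: cosine between the original set and the shared words only.

    Words that appear only in the comparison sentence are dropped before the
    cosine is taken, so they can no longer dilute the score.
    """
    return cosine(original, original & comparison)


r = {"a", "b", "c"}
s3 = {"a", "b", "c", "d", "e", "f", "g", "h"}

print(round(cosine(r, s3), 2))  # 0.61 (penalised by the unrelated words d..h)
print(omucs(r, s3))             # 1.0  (S3 fully contains r)
```

Under this reading the measure is asymmetric: it asks how much of the original sentence is covered by the comparison sentence, not the reverse.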

Sequence alignment is a technique used in bioinformatics to analyze the sequence information of DNA or RNA. A sequence alignment expresses the correlation between protein or nucleic acid sequences: it shows how closely the sequences are related functionally or evolutionarily, and which parts of the sequences carry that relationship. Analyzing sequences after aligning them is analogous to comparing two organisms by pairing organs of common origin, such as head with head, leg with leg, and tail with tail.

Sequence alignment is used to find sequences with high homology to a sequence of interest, in order to infer the function of that sequence or to predict quantitative correlations or related functional sites among related sequences. It is also needed when designing DNA probes for a specific population or when preparing for nucleic acid amplification reactions. Representative sequence alignment methods include global alignment, local alignment, and semi-global alignment. In the text plagiarism detection system and checking method according to the present invention, these alignment techniques are used to compare the similarity of strings that have a sequential order when detecting document plagiarism.

Global alignment assumes that both sequences are already known and must be aligned over their entire length. A representative algorithm for global alignment is the Needleman-Wunsch algorithm. It sweeps the matrix from the upper left to the lower right, building the optimal alignment from high-scoring alignments of sub-matrices; only the path that records the highest scores through the matrix is traced back, which yields the optimal alignment.

An example is described in detail below.

Sentence 1

I am a boy who is a handsome guy and he is handsome and powerful.

Word pattern

boy - handsome - guy - handsome - powerful

Character matching

A - C - T - C - G

Sentence 2

I am a boy who is a handsome boy and a powerful guy and a boy is powerful.

Word pattern

boy - handsome - boy - powerful - guy - boy - powerful

Character matching

A - C - A - G - T - A - G
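The word-to-character step above can be sketched as follows. The source does not say how symbols are assigned; this sketch assumes each distinct word receives the next unused symbol in order of first appearance, with an alphabet that starts with A, C, T, G so that it reproduces the example. The helper names are hypothetical.

```python
import string

# Assumed symbol order: the four letters used in the example first,
# then the rest of the uppercase alphabet.
ALPHABET = "ACTG" + "".join(c for c in string.ascii_uppercase if c not in "ACTG")


def build_symbol_table(*word_lists: list[str]) -> dict[str, str]:
    """Assign one symbol per distinct word, in order of first appearance."""
    table: dict[str, str] = {}
    for words in word_lists:
        for word in words:
            if word not in table:
                if len(table) >= len(ALPHABET):
                    raise ValueError("more distinct words than available symbols")
                table[word] = ALPHABET[len(table)]
    return table


def encode(words: list[str], table: dict[str, str]) -> str:
    """Turn a word pattern into a character sequence for alignment."""
    return "".join(table[word] for word in words)


sentence1 = ["boy", "handsome", "guy", "handsome", "powerful"]
sentence2 = ["boy", "handsome", "boy", "powerful", "guy", "boy", "powerful"]
table = build_symbol_table(sentence1, sentence2)

print(encode(sentence1, table))  # ACTCG
print(encode(sentence2, table))  # ACAGTAG
```

Both sentences must share one symbol table; otherwise the same word could map to different characters and matches would be lost.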

A score is assigned for each of three cases: when the characters representing words match, when they do not match, and when a gap is inserted. Among the possible alignments, the one with the highest total score is sought.

In the present invention, a match scores +1, a mismatch scores 0, and an inserted gap scores -1.

Aligning the sequence ACTCG with the sequence ACAGTAG using [Equation 5] gives the result shown in [Table 1].

Here a[0,0]=0, and score(i,j) is the score determined by whether the i-th and j-th characters match. The last value, at the bottom right, is the result of the global alignment. The global alignment value of the two sequences is therefore 2.

[Table 1] Global alignment (columns: sentence 1, ACTCG; rows: sentence 2, ACAGTAG)
                  A(boy)  C(handsome)  T(guy)  C(handsome)  G(powerful)
               0      -1           -2      -3           -4           -5
A(boy)        -1       1            0      -1           -2           -3
C(handsome)   -2       0            2       1            0           -1
A(boy)        -3      -1            1       2            1            0
G(powerful)   -4      -2            0       1            2            2
T(guy)        -5      -3           -1       1            1            2
A(boy)        -6      -4           -2       0            1            1
G(powerful)   -7      -5           -3      -1            0            2
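[Equation 5] is not reproduced in this text. The sketch below assumes the standard Needleman-Wunsch recurrence with the scores stated above (match +1, mismatch 0, gap -1); the function name and parameters are illustrative. For ACAGTAG and ACTCG it arrives at the same bottom-right value of 2.

```python
def global_alignment_score(x: str, y: str,
                           match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Needleman-Wunsch score: both sequences aligned over their full length."""
    rows, cols = len(x) + 1, len(y) + 1
    a = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        a[i][0] = a[i - 1][0] + gap
    for j in range(1, cols):
        a[0][j] = a[0][j - 1] + gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          a[i - 1][j] + gap,    # gap in y
                          a[i][j - 1] + gap)    # gap in x
    return a[-1][-1]


print(global_alignment_score("ACAGTAG", "ACTCG"))  # 2
```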

Local alignment searches a sequence database with a given sequence in order to find sequences that are not yet known, or finds the part that matches a query sequence within very long DNA such as a genome. In other words, it finds the best-matching subsequence; a representative algorithm is the Smith-Waterman algorithm.

Applying [Equation 6] to the sequences for local alignment produces [Table 2]. The maximum value in [Table 2] corresponds to the longest matching subsequence, and that value is 2.

[Table 2] Local alignment (columns: sentence 1, ACTCG; rows: sentence 2, ACAGTAG)
                  A(boy)  C(handsome)  T(guy)  C(handsome)  G(powerful)
               0      -1           -2      -3           -4           -5
A(boy)        -1       1            0       0            0            0
C(handsome)   -2       0            2       1            0            0
A(boy)        -3       0            1       0            0            0
G(powerful)   -4       0            0       0            0            1
T(guy)        -5       0            0       1            0            0
A(boy)        -6       1            0       0            0            0
G(powerful)   -7       0            0       0            0            1

Semi-global alignment is a global alignment technique with the addition that unscored gaps are allowed at the beginning and end of a sequence.

Applying [Equation 7] produces [Table 3], as follows.

Here a[i,0]=a[0,j]=a[0,0]=0. The semi-global alignment value is 3.

[Table 3] Semi-global alignment (columns: sentence 1, ACTCG; rows: sentence 2, ACAGTAG)
                  A(boy)  C(handsome)  T(guy)  C(handsome)  G(powerful)
               0       0            0       0            0            0
A(boy)         0       1            0      -1            0            0
C(handsome)    0       0            2       1            0            0
A(boy)         0       1            1       2            1            0
G(powerful)    0       0            1       1            2            2
T(guy)         0       0            0       2            1            2
A(boy)         0       1            0       0            2            1
G(powerful)    0       0            1       0            1            3
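[Equation 7] is not reproduced in this text. The sketch below assumes a common textbook form of semi-global alignment: the first row and column are initialized to 0 so that leading gaps are free, and the result is the best value in the last row or last column so that trailing gaps are free. It uses the scores stated earlier (match +1, mismatch 0, gap -1). Individual cells may differ from [Table 3], but for ACAGTAG and ACTCG it arrives at the same final score of 3. The function name is illustrative.

```python
def semi_global_alignment_score(x: str, y: str,
                                match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Semi-global alignment score with free gaps at both ends (assumed variant)."""
    rows, cols = len(x) + 1, len(y) + 1
    # First row and first column stay at 0: leading gaps are not penalised.
    a = [[0] * cols for _ in range(rows)]

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + s,
                          a[i - 1][j] + gap,
                          a[i][j - 1] + gap)

    # Trailing gaps are free: take the best cell on the bottom or right edge.
    best_last_row = max(a[-1])
    best_last_col = max(row[-1] for row in a)
    return max(best_last_row, best_last_col)


print(semi_global_alignment_score("ACAGTAG", "ACTCG"))  # 3
```

Because end gaps cost nothing, a short sentence embedded in a much longer one is not penalized for the unmatched head and tail of the longer sentence.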

As described above, global alignment measures how well two sequences match over their whole length, local alignment finds the best-matching subsequence, and semi-global alignment is effective for comparing sequences whose lengths differ greatly.

Accordingly, the text plagiarism detection system and checking method according to the present invention can cope with partial copying of sentences, changes in sentence order, and deletion or insertion of words within a sentence.

A text plagiarism checking method using the text plagiarism detection technique according to the present invention described above is now explained. The following embodiment uses English as the target text, but the invention is not limited to it; sentences written in Korean and various other languages can also be subject to plagiarism detection.

FIG. 3 is a block diagram of the text plagiarism detection system according to the present invention, and FIG. 4 is a flowchart showing the steps of the text plagiarism checking method in order; the description below refers to them.

S10: Dividing the original document into sentences

The text plagiarism detection system (100) according to the present invention includes a user interface (110) that guides the user through entering an original document and a comparison document when the user wants to check the degree of plagiarism between them, and that also carries out all related processes, such as outputting the degree of plagiarism between the original document and the comparison document to the user.

The original document entered into the text plagiarism detection system (100) through the user interface (110) is sent to the document analysis module (120) and divided into sentences.

Since sentences are generally delimited by a period (.), the document analysis module (120) uses the period as the criterion for dividing an original document containing multiple sentences.
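Step S10 can be sketched as a plain split on periods. This is a minimal illustration that assumes every period ends a sentence, as the description states; it would mis-split abbreviations and decimal numbers, and the function name is hypothetical.

```python
def split_sentences(document: str) -> list[str]:
    """Divide a document into sentences, using the period as the only delimiter."""
    # Drop empty fragments left by a trailing period or repeated periods.
    return [part.strip() for part in document.split(".") if part.strip()]


print(split_sentences("I am a boy. He is powerful."))
# ['I am a boy', 'He is powerful']
```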

S20 ; 분류된 원본 문장을 단어단위로 분류하는 단계S20; Classifying the classified original sentences in word units

Each sentence obtained in step S10 is then split into words by the document analysis module (120). Some of the elements that make up a sentence exist only for grammatical purposes, such as stop words and inflectional endings. The document analysis module (120) finds and removes these so that only complete content words are extracted.

When the sentences are in English, this step may also include converting uppercase letters to lowercase.

FIG. 5 illustrates, by way of example, the document analysis process performed in the document analysis module (120).

As shown in FIG. 5, words such as They, are, and who are stop words that serve only to make the sentence grammatical, and the 's' attached to a noun is an ending that marks the plural. The document analysis module (120) removes both and reduces the sentence to its bare words.

In Korean, grammatical particles (josa) and pronouns, among others, can be treated as stop words.

The pronouns given above as examples of English and Korean stop words may be treated either as stop words or as ordinary words, depending on need. The criteria for selecting stop words can therefore be changed as required.

The words obtained in this way are collected into a word-based vector, and the vectors are classified and stored sentence by sentence.
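The word classification of step S20 might look like the following Python sketch. Everything named here is an assumption made for illustration: the contents of `STOP_WORDS`, the naive `strip_plural` rule, and the choice of a `Counter` as the word-based vector are not taken from the patent.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the actual list is a design choice.
STOP_WORDS = {"they", "are", "who", "is", "a", "an", "the"}


def strip_plural(word: str) -> str:
    """Naively drop one trailing plural 's' (assumed rule, not a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def sentence_to_vector(sentence: str) -> Counter:
    """Lowercase, tokenize, remove stop words and plural endings, count words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    kept = [strip_plural(word) for word in words if word not in STOP_WORDS]
    return Counter(kept)


vector = sentence_to_vector("They are students who read Books")
```

A production system would more likely use an established stemmer and a curated stop-word list; the point here is only the order of operations: lowercase, tokenize, filter, normalize, count.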

S30; Classifying the comparison document into sentences

The user enters, through the user interface (110), the comparison document to be checked for plagiarism against the original document classified above. The document analysis module (120) splits this comparison document into sentences.

S40; Classifying the comparison sentences into words

Each sentence obtained in step S30 is split into words by the document analysis module (120). The procedure is the same as described for step S20, so the description is not repeated here.

S50; Measuring sentence-by-sentence similarity between the word-classified original and comparison documents

The original document and the comparison document, each collected into word-based vectors by the document analysis module (120) and organized by sentence, are passed to the word overlap checking module (130), which determines their degree of overlap using the OMUCS measure.

As described above, the OMUCS measure is computed individually between every sentence vector of the original document and every sentence vector of the comparison document. Any pair whose result falls below a predetermined threshold is excluded from further comparison (S60, S70).

The threshold is a predetermined value within the range of the cosine similarity result (minimum 0, maximum 1) for a given original sentence and comparison sentence. It may be chosen arbitrarily, or it may be set on the basis of statistics.

For example, with the threshold set to 0.75, if the OMUCS measure for a pair yields a result below 0.75, the original sentence and comparison sentence in that pair are not compared any further. If the result is 0.75 or higher, the pair proceeds to the sequence alignment process.
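The threshold filter of steps S50 to S70 can be sketched as below. The OMUCS formula itself is not given in this passage, so this sketch substitutes ordinary cosine similarity over word-count vectors as a stand-in; the function names and the default threshold argument are hypothetical, with 0.75 taken from the example above.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Plain cosine similarity between two word-count vectors (OMUCS stand-in)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(originals, comparisons, threshold=0.75):
    """Return (original index, comparison index, score) for pairs at or above threshold."""
    pairs = []
    for i, original in enumerate(originals):
        for j, comparison in enumerate(comparisons):
            score = cosine_similarity(original, comparison)
            if score >= threshold:  # pairs below the threshold are dropped here
                pairs.append((i, j, score))
    return pairs


originals = [Counter(["student", "read", "book"])]
comparisons = [Counter(["student", "read", "book"]), Counter(["cat", "sleep"])]
survivors = candidate_pairs(originals, comparisons)
```

Only the surviving pairs are handed to sequence alignment, which keeps the more expensive alignment step off sentence pairs that share few words.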

S80; Aligning sequences word by word

For each original sentence and comparison sentence that cleared the threshold in step S50, the sequence alignment relationship verification module (140) performs sequence alignment on the words.

Here the sequence alignment relationship verification module (140) aligns the collected words by applying the global alignment, local alignment, and semi-global alignment techniques described above.
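The three alignment variants can be illustrated with one dynamic-programming routine over word sequences. This is a sketch under stated assumptions: the scoring values (match +1, mismatch -1, gap -1) and the function name `align_score` are chosen for illustration and are not taken from this passage.

```python
def align_score(a, b, mode="global", match=1, mismatch=-1, gap=-1):
    """Best alignment score between word lists a and b.

    mode: "global" (whole sequences), "local" (best matching region),
    or "semiglobal" (leading and trailing gaps are free).
    """
    if mode not in ("global", "local", "semiglobal"):
        raise ValueError(f"unknown mode: {mode}")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    if mode == "global":
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, n + 1):
            H[i][0] = i * gap
        for j in range(1, m + 1):
            H[0][j] = j * gap
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if mode == "local":
                cell = max(cell, 0)  # a local alignment may restart anywhere
            H[i][j] = cell
            best = max(best, cell)
    if mode == "global":
        return H[n][m]
    if mode == "local":
        return best
    # Semi-global: trailing gaps are free, so take the best of last row/column.
    return max(max(H[n]), max(row[m] for row in H))


original = ["student", "read", "book"]
comparison = ["many", "student", "read", "book"]
scores = {mode: align_score(original, comparison, mode)
          for mode in ("global", "local", "semiglobal")}
```

With these assumed scores the extra leading word costs one gap under global alignment, while local and semi-global alignment ignore it, which is why the three methods can give different values for the same sentence pair.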

S90; Checking the sum of the numerical alignment scores

The plagiarism determination module (150) adds up all the result values from the sequence alignment relationship verification module (140) and compares the total with a threshold.

That is, the results of global alignment over all the vectorized sentences of the original and comparison documents, the results of local alignment over the same sentences, and the results of semi-global alignment over the same sentences are added together, and the total is compared with the threshold.

Alternatively, only one or more methods selected from global alignment, local alignment, and semi-global alignment may be applied, in which case only the results of the applied methods are summed and compared.

As a further alternative, after one or more methods selected from global alignment, local alignment, and semi-global alignment have been applied, the results may be averaged per sentence and per alignment method, and the average compared with a threshold defined for averages.

If the result is below the threshold, the degree of plagiarism between the original document and the comparison document is judged to be low (S120). If the result is at or above the threshold, the degree of plagiarism is judged to be high (S110).
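The decision in steps S90 to S120 reduces to comparing a sum, or optionally a mean, with a threshold. The sketch below is one possible reading: the function name, the `use_mean` flag, and the threshold values in the usage lines are hypothetical, and the mean is taken over all supplied scores rather than separately per sentence and per method.

```python
def judge_plagiarism(scores, threshold, use_mean=False):
    """Return True (high degree of plagiarism) when the aggregate meets the threshold."""
    if not scores:
        return False  # nothing survived the similarity filter
    total = sum(scores)
    value = total / len(scores) if use_mean else total
    return value >= threshold


# Alignment scores for one comparison; thresholds are made-up examples.
alignment_scores = [67, 75, 75]
high_by_sum = judge_plagiarism(alignment_scores, threshold=200)
high_by_mean = judge_plagiarism(alignment_scores, threshold=80, use_mean=True)
```

Because a sum grows with the number of sentences and methods while a mean does not, the two variants need different thresholds, which matches the text's mention of a separate threshold for averages.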

As explained above, the threshold may be chosen arbitrarily or determined from statistics. It is not a fixed value and may be changed according to the type and content of the document and various other factors.

Experimental examples of the text plagiarism check method according to the present invention are described below.

The first experiment checks whether a verbatim copy of the original document can be detected.

Document 1 is an original news article and Document 2 is an exact copy of it. Both cosine similarity and the OMUCS of the present invention were applied.

Both cosine similarity and the OMUCS of the present invention returned a value of 1, confirming that Document 1 and Document 2 are the same document. The two measures agree because the sentences of the two documents are identical, so this experiment does not demonstrate any improvement of OMUCS over ordinary cosine similarity.

Along with the result values, the positions of the matching sentences in Document 1 and Document 2 are also displayed. This is possible because the method splits each document into sentences and can assign every sentence a sequence number that distinguishes it from the others.

In the next experiment, an arbitrary sentence from Document 1, the original document, is inserted into Document 2, which is a different document.

Both cosine similarity and the OMUCS of the present invention return a value of 1, and the position of the sentence is also displayed.

Next, a phrase smaller than a full sentence was copied from Document 1, the original document, and inserted into Document 2, a different document. In this case cosine similarity returned 0.67, whereas the OMUCS of the present invention returned 0.98.

An experiment that shows the effect of the text plagiarism check method according to the present invention more clearly is described next, together with its results.

FIGS. 6 and 7 are tables listing the original English sentences and the comparison English sentences used to carry out the text plagiarism check method according to the present invention. FIG. 8 is a graph showing the results of applying conventional cosine similarity and the OMUCS of the present invention to the sentences in FIGS. 6 and 7. The description below refers to these figures.

As FIG. 8 shows, the original and comparison English sentences of FIGS. 6 and 7 receive different similarity values under cosine similarity and under OMUCS. The comparison sentences in FIG. 7 are in fact the original sentences of FIG. 6 with only their form changed, and their content is nearly identical. A higher measured similarity between the two is therefore the more accurate result, which indicates that the text plagiarism check method according to the present invention improves on the conventional approach.

FIG. 9 is a graph showing the results of applying the sequence alignment process of the present invention to the original and comparison English sentences of FIGS. 6 and 7.

Analysis of FIG. 9 shows that global alignment, local alignment, and semi-global alignment produce different values. This is because the original English sentences were not copied verbatim but were altered. Out of a total of 124 word patterns, global alignment scored 67 points, local alignment 75 points, and semi-global alignment 75 points. The text plagiarism check method according to the present invention is therefore effective at detecting plagiarism in documents whose sentences have been modified from the original.

As explained above, the threshold applied to the sequence alignment values when deciding on plagiarism can be adjusted using statistics and can also be changed according to the content of the document.

According to the present invention as described above, whether a comparison document has copied an original document can be determined easily in numerical terms. In addition, copying of the substance of a sentence can be detected even when the form of the sentence has been changed, which enables more precise plagiarism detection.

Claims

A text plagiarism check method comprising:

an original document sentence classification step of classifying an input original document into sentences;

an original document word classification step of classifying the sentences obtained in the original document sentence classification step into words;

a comparison document sentence classification step of classifying an input comparison document into sentences;

a comparison document word classification step of classifying the sentences obtained in the comparison document sentence classification step into words;

an identical word checking step of comparing the original sentences and the comparison sentences, each classified into words through the original document word classification step and the comparison document word classification step, and finding the identical words in each compared original sentence and comparison sentence;

an OMUCS calculation step of applying the identical words found in the identical word checking step to OMUCS[original sentence][comparison sentence], a modified form of cosine similarity; and

a sentence similarity determination step of comparing the result output by the OMUCS calculation step with a first threshold value and determining whether the original sentence and the comparison sentence are similar according to the similarity between them.

The method of claim 1, further comprising a sequence alignment step of obtaining a maximum value by applying one or more methods selected from global alignment, local alignment, and semi-global alignment to the original sentence and the comparison sentence identified as similar in the similarity determination step.

The method of claim 2, further comprising a plagiarism determination step of comparing the sum of the maximum values calculated in the sequence alignment step with a second threshold value and determining whether plagiarism has occurred according to the similarity between the original sentence and the comparison sentence.

A text plagiarism check method comprising:

a COS calculation step of comparing the original sentences and the comparison sentences, each classified into words through an original document word classification step and a comparison document word classification step, and applying them to the cosine similarity COS[original sentence][comparison sentence];

a similarity determination step of comparing the result output by the COS calculation step with a third threshold value and determining whether the original sentence and the comparison sentence are similar according to the similarity between them; and

a sequence alignment step of obtaining a maximum value by applying one or more methods selected from global alignment, local alignment, and semi-global alignment to the original sentence and the comparison sentence identified as similar in the similarity determination step.

The method of claim 4, further comprising a plagiarism determination step of comparing the sum of the maximum values calculated in the sequence alignment step with a fourth threshold value and determining whether plagiarism has occurred according to the similarity between the original sentence and the comparison sentence.