KR20020009077A

KR20020009077A - Method of searching for piracy and steal on a piece of writing

Info

Publication number: KR20020009077A
Application number: KR1020000042346A
Authority: KR
Inventors: 김회율; 서창덕; 추현곤
Original assignee: 김회율
Priority date: 2000-07-24
Filing date: 2000-07-24
Publication date: 2002-02-01
Also published as: KR100406671B1

Abstract

PURPOSE: A sentence plagiarism search method is provided that extracts the sentence patterns or code patterns contained in documents and automatically finds plagiarized sentences by comparing those patterns across documents. CONSTITUTION: The method comprises the steps of: searching for new web documents using a robot agent (201); extracting morphological feature vectors from the newly found web documents (202); building a database from the extracted feature vectors (203); receiving, from a remote user, a plagiarism query on a specific sentence through a query input interface connected to an application server (204); extracting, at the application server, feature vectors from the input sentence and checking their similarity against the feature vectors stored in the database (205); and transmitting the comparison result from the server to the user over a network.

Description

Method of searching for piracy and steal on a piece of writing

The present invention relates to a method of searching for plagiarism (or theft) of documents and source code on the web, and to a computer-readable recording medium storing a program that implements the method. In particular, it extracts features of the sentence patterns or code patterns contained in a document, determines plagiarism between documents on the web, and automatically searches for plagiarized documents.

In recent years, as the use of data on the Internet has grown rapidly and electronic publishing has become widespread, the need for copyright protection of technical, industrial, and other documents and of program code has steadily increased.

However, conventional Internet search engines build their databases only from simple feature words or classifications by topic, so they offer no way to check a document's uniqueness or to detect plagiarism. To check whether their own work had been illegally copied or plagiarized, an individual had to search for related sites and documents one by one using specific words or topics, and then read the contents to judge whether plagiarism had occurred. When the number of related documents is large, this is practically impossible and takes considerable time and effort.

Therefore, to protect the copyright of documents and code on the Internet and to prevent plagiarism and theft, a way is needed to determine whether the contents of two documents match by using their feature elements, without the user having to read each document.

The present invention, devised to solve the above problems, aims to provide a plagiarism and theft search method that protects the copyright of documents and code on the Internet and prevents plagiarism and theft by determining whether the contents of two documents match using their feature elements, without the user having to read each document, and to provide a computer-readable recording medium storing a program for implementing the method.

FIG. 1 is an exemplary configuration diagram of a plagiarism and theft search system to which the present invention is applied.

FIG. 2 is a flowchart of one embodiment of the sentence plagiarism and theft search method according to the present invention.

FIG. 3A is an exemplary diagram of the database structure used in the present invention.

FIG. 3B is an exemplary diagram of the query structure used in the present invention.

FIG. 4 is an exemplary diagram of the user interface for the sentence plagiarism and theft search service used in the present invention.

FIG. 5 is an exemplary diagram of sentence feature vector extraction used in the present invention.

* Explanation of reference numerals for the main parts of the drawings *

11: ViRobot    12: user interface unit

13: application server    14: database

To achieve the above object, the present invention provides a method for checking plagiarism/theft of documents and code on a network including the Internet, the method comprising: a first step of extracting a first feature vector for the sentences of a document; a second step of building a database using the extracted first feature vector; a third step of extracting, when a plagiarism/theft query is input by a remote user over the network, a second feature vector of the queried document or sentence; and a fourth step of determining whether plagiarism/theft has occurred by checking the similarity between the second feature vector and the first feature vector previously stored in the database.

The present invention may further comprise a fifth step of notifying the user, over the network, of the plagiarism/theft determination result according to the determination of the fourth step.

The present invention may further comprise a sixth step of notifying the user, over the network, of the plagiarism/theft search result together with additional information stored in the database, according to the search result of the fourth step.

To achieve the above object, the present invention also provides a computer-readable recording medium storing a program for causing a document (sentence) plagiarism/theft search system having a processor, for checking plagiarism/theft of documents and code on a network, to realize: a first function of extracting a first feature vector for the sentences of a document; a second function of building a database using the extracted first feature vector; a third function of extracting, when a plagiarism/theft query is input by a remote user over the network, a second feature vector of the queried document or sentence; and a fourth function of determining whether plagiarism/theft has occurred by checking the similarity between the second feature vector and the first feature vector previously stored in the database.

The present invention also provides a computer-readable recording medium storing a program for further realizing a fifth function of notifying the user, over the network, of the plagiarism/theft determination result according to the determination of the fourth function.

The present invention also provides a computer-readable recording medium storing a program for further realizing a fifth function of notifying the user, over the network, of the plagiarism/theft search result together with additional information stored in the database, according to the search result of the fourth function.

The present invention searches the contents of documents on the web, extracts the feature elements of each sentence in a document, and builds a feature database of documents. When a user submits a plagiarism query about their original text, the feature database is searched for documents that have the same feature elements, that is, plagiarized documents.

The present invention discloses a method of searching for plagiarism and theft of documents and source code on the web. It is applicable to the contents of HyperText Markup Language (HTML) documents and other documents registered on the web.

To this end, the present invention uses an Internet search robot to retrieve HTML and other documents on the Internet, detects the feature elements of the original text of each document, and stores them in a database (DB). When a user suspects that a document they wrote has been plagiarized and submits a query over the Internet using that document, the existing DB is searched for the feature elements of the queried document and the result of the plagiarism check is shown. The feature elements used for the plagiarism search are extracted from each document with a character extraction algorithm that uses the semantic and syntactic relations between the words of a sentence. The search uses an existing approximate matching algorithm, which allows both whole and partial comparison.

According to the present invention, plagiarism between documents on the web can be determined by extracting features of the character patterns or code patterns contained in a document, and plagiarized documents can be searched for automatically.

The above objects, features, and advantages will become clearer from the following detailed description taken together with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is an exemplary configuration diagram of a plagiarism and theft search system to which the present invention is applied.

As shown in FIG. 1, the plagiarism and theft search system for documents and source code on the web uses the ViRobot (11), a web search robot, to continuously explore documents on the web and find new ones. The application server (13) extracts, from the contents of each new document found, a feature vector based on the formal or morphological features of its characters and stores it in the database (14) together with the document's additional information. When a user suspects that a document they wrote has been plagiarized and submits a query with that document to the application server (13) through the user interface (12), the application server (13) searches the existing database (14) for the feature elements of the queried document and shows the result of the plagiarism check. The feature elements used for the plagiarism search are extracted from each document with a text watermarking algorithm that uses the relations between letters. The search uses an existing character matching algorithm, which allows both whole and partial comparison.

In this system, the ViRobot (11) fetches documents from the web, and the application server (13) extracts a word-form feature vector for each sentence of a document and stores the vectors in the database (14) document by document. On the basis of this database (14), when a user directly submits a query using the document of their own work that they suspect has been copied or plagiarized, the application server (13) extracts feature vectors from that document and compares them with the feature vectors already stored in the database (14) to search for plagiarism or theft.

To extract a feature vector for each sentence of a document, the present invention, as shown in FIG. 3A, extracts from each sentence the letters or characters that characterize the form of each key word in the sentence and assembles them into the feature vector for that sentence. The feature vector is organized into a table together with the document's additional information (location, date, subject, and so on), and this table makes up the database (14). That is, a robot with the same function as the ViRobot (11) of current Internet search systems is used to explore documents on the Internet, extract a feature vector for each document, and store it in the database (14).

Thereafter, a user query interface and query service are implemented so that, when a user remotely submits a query with their own data over a general network including the Internet, a feature vector is extracted from the user's data and the database (14) is searched (see FIG. 3B). This requires an algorithm that quickly compares the feature vector of the queried sentence or document with the feature vectors in the existing database; the feature vectors can be compared as a whole or part by part. The comparison result is then sent back to the remote user, indicating whether the data is plagiarized, together with the additional information stored in the database (14).

The plagiarism and theft search method using the formal features of document sentences according to the present invention will now be described in more detail.

FIG. 2 is a flowchart of one embodiment of the plagiarism and theft search method according to the present invention.

As shown in FIG. 2, to search for plagiarism or theft of documents on a network including the Internet, the method uses the ViRobot (11) to continuously explore documents on the web and find new documents (201), extracts formal or morphological feature vectors of sentences from the contents of each new document found (202), and builds the database (14) from the extracted feature vectors (203).

Thereafter, when a remote user submits a plagiarism or theft query on a specific document or sentence through the query input interface (see FIG. 4) connected to the application server (13) over the network (204), the application server (13) extracts a feature vector from the queried document or sentence and uses it to check similarity against the feature vectors stored in the database (14) (205). The search result is then transmitted to the user over the network.

In the process of building the database (14) (203), which may run offline or online, the ViRobot (11) crawling the Internet continuously explores documents on the web to find new documents (201); a feature vector based on the formal or morphological features of sentences is extracted from the contents of each new document (202) and stored in the database (14) together with the document's additional information (203).

A document's feature vector can be built from the morphological features of the key words in each sentence of the document. For example, the feature vector can be built from various elements of word form, such as the consonants, vowels, doubled consonants, or doubled vowels of the important key words, excluding articles, prepositions, and the like. FIG. 5 shows an example in which the feature vector is built from the first consonant of each key word.

The feature vector extracted from a document is combined with several pieces of additional information, as shown in FIG. 5. It forms a single record together with information such as the location on the web, the author, the date and time of production, and keywords, and this record is stored in the database (14).
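The record described above can be pictured as a small data structure. The following is a minimal sketch, not the schema of FIG. 3A or FIG. 5 (which are not reproduced in this text): the class name `DocumentRecord`, its field names, and the example URL are all hypothetical, and only the kinds of information listed in the description (location, author, production date, keywords, per-sentence feature strings) are taken from the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentRecord:
    """One database record: per-sentence feature strings plus document metadata.

    All field names are illustrative assumptions, not the patent's schema.
    """
    url: str                      # location of the document on the web
    author: str
    created: str                  # production date/time, kept as plain text here
    keywords: List[str] = field(default_factory=list)
    sentence_features: List[str] = field(default_factory=list)


# Example record using the feature strings from the description's own example.
record = DocumentRecord(
    url="http://example.com/head-tracking.html",  # hypothetical address
    author="unknown",
    created="2000-07-24",
    keywords=["head tracking"],
    sentence_features=["ithtuvicp", "hmtmc"],
)
```

Keeping one feature string per sentence, rather than one per document, matches the description's statement that a document with 100 sentences yields 100 strings.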

In the interface for remote users, a user querying about their own work should be able to submit their text document directly as a file, and also to query on a specific sentence of the document. The user can also enter the Internet address of a web document they created in place of the query text. The feature vector for the queried document or sentence is built with the same algorithm that was used when building the database (14).

The feature vector generated from the remote user interface is compared with the feature vectors in the database (14). The similarity between the query feature vector and the stored feature vectors is computed for every document registered in the database (14) to decide whether each feature vector matches.

To extract a feature vector, the formal and syntactic features of the words making up a sentence must be identified, which is done through morphological analysis. The language of the document is identified and a morphological analyzer for that language is selected to process the sentences. This splits each sentence into several unit morphemes. From these, the meaningful words are selected and a feature vector string is generated that consists only of the first letter of each such word. Because nouns carry the most meaning in a sentence, the feature vector consists of the first letters of the nouns. A document with 100 sentences yields 100 feature vector strings, and these strings differ in length.

Japanese and Chinese sentences have no spaces between words, so meaningful nouns cannot be extracted by a simple mechanical method and a morphological analyzer must be used. For English, which is spaced word by word, and Korean (Hangul), which is spaced by word phrase, a morphological analyzer is not strictly necessary. For Japanese and Chinese, where morphological analysis is performed, the feature vector is built by taking only the first character of each extracted noun; where no morphological analysis is performed, only the first letter of each meaningful word is taken. Without morphological analysis the syntactic information of the sentence is unknown, which makes it difficult to pick out meaningful words. However, linguistically and statistically, the shorter a word is, the more likely it is to be a meaningless word or stop word such as a particle or preposition, so the first letter is taken only from words of three or more characters. For English, words of four or more letters are used. In Korean, unlike English, several morphemes combine into a single word phrase, but the meaningless elements such as particles and endings follow the word, so this causes no problem.
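The English rule described above (no morphological analysis, first letter of every word of four or more letters) can be sketched as follows. This is an illustrative reading of the text, not the applicants' implementation: the function names, the regular expressions used for word and sentence splitting, and the lowercasing are assumptions.

```python
import re
from typing import List


def sentence_feature(sentence: str, min_len: int = 4) -> str:
    """Build a feature string from the first letters of words with >= min_len letters."""
    # Assumption: a "word" is a run of ASCII letters, so "3D" contributes only "D".
    words = re.findall(r"[A-Za-z]+", sentence)
    return "".join(w[0].lower() for w in words if len(w) >= min_len)


def document_features(text: str) -> List[str]:
    """Return one feature string per sentence (naive split on ., ! or ?)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [sentence_feature(s) for s in sentences]


db_text = ("An improved technique for 3D head tracking under varying "
           "illumination conditions is proposed. "
           "The head is modeled as a texture mapped cylinder.")
print(document_features(db_text))
print(sentence_feature("The head was modeled as a contour mapped cylinder."))
```

Applied to the example sentences given in this description, the sketch yields the same strings the description lists for S and Q. The three-character threshold mentioned for other languages would correspond to `min_len=3`, though word splitting for those languages would need a different approach.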

Now, let Q denote the feature vector extracted from the query sentence or query document, and S the feature vector extracted from a document being compared to check for plagiarism. An example is as follows.

DB: An improved technique for 3D head tracking under varying illumination conditions is proposed. The head is modeled as a texture mapped cylinder. ...

Query: The head was modeled as a contour mapped cylinder.

S: ithtuvicp hmtmc ...

Q: hmcmc

One string is built for each sentence of the documents in the DB. Here the query string is similar to, but not identical to, a string in the DB: the second string of S, 'hmtmc', and the string of Q, 'hmcmc', differ only in the middle character. In this case neither exact matching nor partial matching works; only approximate matching is possible. Accordingly, an approximate matching algorithm is used to decide, from the similarity between feature vectors, whether they match and hence whether plagiarism has occurred.

Approximate matching is used rather than exact or partial matching because, although plagiarized sentences are sometimes copied without any change, in many cases some words or the phrasing are altered. One approximate matching algorithm is the n-gram based method, which is applied here to the comparison of two feature vectors.

The similarity between two strings is measured with a similarity coefficient formula. The main ones are Dice's coefficient, Jaccard's coefficient, the cosine coefficient, the overlap coefficient, and the Tanimoto coefficient; there is also Angell's coefficient.
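For orientation, four of the coefficients named above have well-known set-based forms. They are written here in terms of gram counts, with $g_c$ the number of grams shared by Q and S, and $g_Q$, $g_S$ the number of grams in each, treating the grams of a string as a set. These are the textbook definitions, given as background; they are not formulas reproduced from this publication, and the Tanimoto and Angell coefficients are left out.

```latex
\begin{align*}
\text{Dice:}    \quad & \mathrm{Sim}(Q,S) = \frac{2\,g_c}{g_Q + g_S} \\
\text{Jaccard:} \quad & \mathrm{Sim}(Q,S) = \frac{g_c}{g_Q + g_S - g_c} \\
\text{Cosine:}  \quad & \mathrm{Sim}(Q,S) = \frac{g_c}{\sqrt{g_Q\,g_S}} \\
\text{Overlap:} \quad & \mathrm{Sim}(Q,S) = \frac{g_c}{\min(g_Q,\,g_S)}
\end{align*}
```

All four equal 1 when the two gram sets are identical and 0 when they share no gram; they differ in how strongly they penalize a difference in size between the two sets.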

Equation 1 below is a formula that is relatively widely used compared with the other coefficients, and it is used for the n-gram based approximate matching of Equation 2.

where g_c is the number of grams common to Q and S, and g_Q and g_S are the numbers of grams in Q and S, respectively.

Here, a gram is a unit obtained by dividing a feature vector string into segments of a fixed size. Several segmentation methods are possible, but the string is usually divided into units of one, two, or three characters, and the segments may or may not overlap.
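A short sketch may help make the gram segmentation and the gram-count similarity concrete. The published text does not reproduce the similarity equation, so the coefficient the applicants used cannot be recovered here; the sketch below assumes Dice's coefficient, 2·g_c / (g_Q + g_S), purely for illustration. Treating the grams of each string as a set (so repeated grams count once) and dropping a trailing partial gram in the non-overlapping case are further assumptions, as are the function names.

```python
from typing import List


def ngrams(s: str, n: int = 2, overlap: bool = True) -> List[str]:
    """Split a feature string into grams of n characters.

    overlap=True slides one character at a time; overlap=False cuts
    disjoint segments and drops any trailing partial gram (an assumption).
    """
    step = 1 if overlap else n
    return [s[i:i + n] for i in range(0, len(s) - n + 1, step)]


def dice_similarity(q: str, s: str, n: int = 2) -> float:
    """Assumed similarity: Dice's coefficient over the sets of overlapping n-grams."""
    gq, gs = set(ngrams(q, n)), set(ngrams(s, n))
    if not gq or not gs:
        return 0.0
    g_c = len(gq & gs)                      # grams common to Q and S
    return 2.0 * g_c / (len(gq) + len(gs))  # 2*g_c / (g_Q + g_S)


print(ngrams("hmtmc"))                 # overlapping bigrams
print(ngrams("hmtmc", overlap=False))  # disjoint bigrams
print(dice_similarity("hmcmc", "hmtmc"))
```

For the description's example pair, the overlapping bigram sets are {hm, mc, cm} for 'hmcmc' and {hm, mt, tm, mc} for 'hmtmc', so two grams are shared and the assumed Dice value is 4/7.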

Whether plagiarism or theft has occurred is determined by Equation 3 below (the plagiarism and theft decision rule).

Sim(Q, S) > T

In Equation 3, Sim(Q, S) is the similarity between the query vector and one record in the database (14), and T is a reference (threshold) value.
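The decision rule Sim(Q, S) > T, applied to every stored record, might look like the sketch below. It is an illustration under stated assumptions: the similarity is a bigram Dice coefficient (the source does not reproduce its similarity equation), the threshold value 0.5 is arbitrary, and the dictionary-based `db`, the example URL, and the function names are hypothetical stand-ins for the database (14).

```python
from typing import Dict, List, Tuple


def bigram_dice(q: str, s: str) -> float:
    """Assumed Sim(Q, S): Dice's coefficient over sets of overlapping bigrams."""
    gq = {q[i:i + 2] for i in range(len(q) - 1)}
    gs = {s[i:i + 2] for i in range(len(s) - 1)}
    if not gq or not gs:
        return 0.0
    return 2.0 * len(gq & gs) / (len(gq) + len(gs))


def find_suspects(query: str, db: Dict[str, List[str]],
                  threshold: float) -> List[Tuple[str, str, float]]:
    """Return (url, stored string, similarity) for every record with Sim > T."""
    hits = []
    for url, features in db.items():          # scan every registered document
        for s in features:                    # one feature string per sentence
            sim = bigram_dice(query, s)
            if sim > threshold:               # decision rule: Sim(Q, S) > T
                hits.append((url, s, sim))
    return sorted(hits, key=lambda h: h[2], reverse=True)


db = {"http://example.com/a.html": ["ithtuvicp", "hmtmc"]}  # hypothetical record
hits = find_suspects("hmcmc", db, threshold=0.5)
print(hits)
```

With this assumed similarity, the query 'hmcmc' exceeds the threshold only against the stored string 'hmtmc'. Raising T makes the check stricter, trading missed paraphrases for fewer false alarms; the source gives no guidance on how T is chosen.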

As described above, the present invention extracts features of the sentence patterns or code patterns contained in documents, determines plagiarism between documents on the web, and automatically searches for plagiarized documents. When the user clicks a search result with the mouse, the user interface unit (12) can link directly to the web document suspected of plagiarism or theft. The user can also add specific subject words to narrow or widen the scope of the search results.

The present invention described above is not limited to the foregoing embodiments and the accompanying drawings, since a person of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the invention.

As described above, the present invention enables more accurate and faster searching for writing copied without permission on the Internet, thereby ensuring protection of the copyright of electronic publications on the web. Users no longer need to hunt for and read documents themselves to detect theft and plagiarism; a simple comparison of document feature vectors shows whether a document on the web has been plagiarized. In addition, building a database of feature vectors has the effect of pre-registering the copyright of existing electronic publications, which deters plagiarism attempts and illegal copying and helps prevent unauthorized copying of academic, commercial, and technical press documents.

Claims

A sentence plagiarism and theft search method for checking plagiarism/theft of documents and code on a network including the Internet, the method comprising:

a first step of extracting a first feature vector for the sentences of a document;

a second step of building a database using the extracted first feature vector;

a third step of extracting, when a plagiarism/theft query is input by a remote user over the network, a second feature vector of the queried document or sentence; and

a fourth step of determining whether plagiarism/theft has occurred by checking the similarity between the second feature vector and the first feature vector previously stored in the database.

The method of claim 1, further comprising

a fifth step of notifying the user, over the network, of the plagiarism/theft determination result according to the determination of the fourth step.

The method of claim 1, further comprising

a fifth step of notifying the user, over the network, of the plagiarism/theft search result together with additional information stored in the database, according to the search result of the fourth step.

The method according to any one of claims 1 to 3,

wherein the first and second feature vectors are formed using the formal or morphological features of sentences.

The method of claim 4,

wherein, in the fourth step of determining whether plagiarism/theft has occurred by checking the correspondence between the second feature vector and the first feature vector previously stored in the database, the feature vectors can be compared both as a whole and part by part.

The method of claim 4,

wherein the plagiarism/theft query input by the remote user over the network in the third step is substantially a document, a sentence, or the Internet address of a web document.

The method of claim 4,

wherein the correspondence (similarity) check of the fourth step measures the similarity Sim(Q, S) of the first and second feature vectors according to the following equation.

Sim(Q, S) =

(where g_c is the number of grams common to Q and S, g_Q and g_S are the numbers of grams in Q and S, respectively, and a gram is a unit obtained by dividing a feature vector string into segments of a fixed size.)

The method of claim 4,

wherein the fourth step determines whether plagiarism/theft has occurred according to the following expression.

Sim(Q, S) > T (where Sim(Q, S) is the similarity between the second feature vector and one record in the database, and T is a reference value)

The method of claim 3,

wherein the additional information substantially comprises the location of the document, its author, its date and time of production, and subject (keyword) information.

A computer-readable recording medium storing a program for causing a document (sentence) plagiarism/theft search system having a processor, for checking plagiarism/theft of documents and code on a network, to realize:

a first function of extracting a first feature vector for the sentences of a document;

a second function of building a database using the extracted first feature vector;

a third function of extracting, when a plagiarism/theft query is input by a remote user over the network, a second feature vector of the queried document or sentence; and

a fourth function of determining whether plagiarism/theft has occurred by checking the similarity between the second feature vector and the first feature vector previously stored in the database.

The recording medium of claim 10,

wherein the program further realizes a fifth function of notifying the user, over the network, of the plagiarism/theft determination result according to the determination of the fourth function.

The recording medium of claim 10,

A fifth function of notifying a user through a network of plagiarism / theft search results with additional information stored in the database according to the search result of the fourth function