KR101663454B1

KR101663454B1 - Apparatus of sentence similarity calculation using keyword weight and method thereof

Info

Publication number: KR101663454B1
Application number: KR1020160098920A
Authority: KR
Inventors: 이재청; 이상우; 이성근
Original assignee: 주식회사 비욘드테크
Priority date: 2016-08-03
Filing date: 2016-08-03
Publication date: 2016-10-07

Abstract

The present invention relates to an apparatus and a method for calculating sentence similarity using a keyword weight. According to the present invention, a method of calculating sentence similarity using a sentence similarity calculation apparatus comprises the steps of: extracting a plurality of morphemes from a comparison target sentence; generating a hash value of the comparison target sentence using a character code value and a position code value of each extracted morpheme; determining a morpheme change type of the comparison target sentence by comparing the hash value of the comparison target sentence with a hash value of a stored original sentence; determining the number of matching core morphemes by comparing the hash value of the comparison target sentence with a stored keyword database; calculating the similarity between the comparison target sentence and the original sentence using the morpheme change type and the number of matching core morphemes; and determining whether the comparison target sentence is plagiarized using the calculated similarity and a preset threshold value. The morpheme change type includes a first type in which the content of a morpheme in the comparison target sentence is changed, a second type in which the arrangement of morphemes in the comparison target sentence is changed, and a third type in which a morpheme is omitted from the comparison target sentence. Because the similarity is calculated according to the morpheme change type, the apparatus can calculate the similarity between sentences accurately.

Description

APPARATUS OF SENTENCE SIMILARITY CALCULATION USING KEYWORD WEIGHT AND METHOD THEREOF

The present invention relates to an apparatus and method for calculating sentence similarity using keyword weights, and more particularly, to an apparatus and method for calculating sentence similarity using keyword weights that improve the accuracy of inter-sentence similarity calculation by using morpheme change information and keyword information in sentences contained in special documents such as national project reports.

With the widespread adoption of the Internet, cases of unauthorized use of other people's works have recently become frequent. Accordingly, acts that infringe the rights of others, such as plagiarism and copyright infringement, have become a social problem. Such acts take the form of copying or imitating part or all of another person's literary work or thesis and publishing it as if it were one's own creation. Because the number of works and creations is very large and the methods of judgment are demanding, such acts are not easy to judge.

In the case of documents in particular, plagiarism or copyright infringement occurs by changing some words in a sentence or changing the order of words, which makes plagiarism or infringement harder to judge than for other kinds of works. Moreover, a document consists of a large number of sentences and the number of documents is itself very large, so it is impossible for a person to judge manually whether a document has been misappropriated.

Accordingly, methods of judging plagiarism or copyright infringement of documents by means of document similarity calculation algorithms are being studied. However, existing algorithms for calculating document similarity simply compare sentence patterns or words, so their accuracy is low, and they fail to detect similar documents when another person's document is misappropriated by, for example, changing the order of words within a sentence.

The background art of the present invention is disclosed in Korean Patent No. 10-1626247 (published June 1, 2016).

The technical object of the present invention is to provide an apparatus and method for calculating sentence similarity using keyword weights that improve the accuracy of inter-sentence similarity calculation by using morpheme change information and keyword information in sentences.

According to an embodiment of the present invention for achieving this technical object, a sentence similarity calculation method using a sentence similarity calculation apparatus includes: extracting a plurality of morphemes from a comparison target sentence; generating a hash value of the comparison target sentence using the character code value and position code value of each of the extracted morphemes; determining a morpheme change type of the comparison target sentence by comparing the hash value of the comparison target sentence with the hash value of a pre-stored original sentence; determining the number of matching core morphemes by comparing the hash value of the comparison target sentence with a pre-stored keyword database; calculating the similarity between the comparison target sentence and the original sentence using the morpheme change type and the number of matching core morphemes; and determining whether the comparison target sentence is plagiarized using the calculated similarity and a preset threshold value. The morpheme change type includes a first type in which the content of a morpheme in the comparison target sentence is changed, a second type in which the arrangement of morphemes in the comparison target sentence is changed, and a third type in which a morpheme is omitted from the comparison target sentence. In the calculating of the similarity, when the first type is determined, the similarity (S) is calculated using the following equation.

Here, L_tn is the total number of morphemes in the original sentence, A_ln is the number of morphemes that match between the comparison target sentence and the original sentence, C_n is the number of morphemes located between the center of the original sentence and the changed morpheme in the comparison target sentence, α is the synonym weight, and β is the keyword weight.

The keyword weight (β) may be calculated using the following equation.

Here, k is the number of matching core morphemes.

When the first type is determined, the method may further include determining, using a pre-stored synonym database and antonym database, whether the changed morpheme of the comparison target sentence is in a synonym relation or an antonym relation with the corresponding morpheme of the original sentence.

The synonym weight (α) may be set to 1 if a synonym relation is determined, and to -1 if an antonym relation is determined.

In the calculating of the similarity, when the second type is determined, the similarity (S) may be calculated using the following equation.

Here, C_d is the number of morphemes located between the morphemes whose arrangement has been changed.

In the calculating of the similarity, when the third type is determined, the similarity (S) may be calculated using the following equation.

Here, C_c is the number of morphemes located between the center of the original sentence and the morpheme omitted from the comparison target sentence.

The original sentence may be a sentence contained in a national research report.

A sentence similarity calculation apparatus according to another embodiment of the present invention includes: an extraction unit that extracts a plurality of morphemes from a comparison target sentence; a generation unit that generates a hash value of the comparison target sentence using the character code value and position code value of each of the extracted morphemes; a determination unit that determines a morpheme change type of the comparison target sentence by comparing the hash value of the comparison target sentence with the hash value of a pre-stored original sentence, and determines the number of matching core morphemes by comparing the hash value of the comparison target sentence with a pre-stored keyword database; a calculation unit that calculates the similarity between the comparison target sentence and the original sentence using the morpheme change type and the number of core morphemes; and a detection unit that determines whether the comparison target sentence is plagiarized using the calculated similarity and a preset threshold value. The morpheme change type includes a first type in which the content of a morpheme in the comparison target sentence is changed, a second type in which the arrangement of morphemes in the comparison target sentence is changed, and a third type in which a morpheme is omitted from the comparison target sentence. When the first type is determined, the calculation unit calculates the similarity (S) using the following equation.

Here, L_tn is the total number of morphemes in the original sentence, A_ln is the number of morphemes that match between the comparison target sentence and the original sentence, C_n is the number of morphemes located between the center of the original sentence and the changed morpheme in the comparison target sentence, α is the synonym weight, and β is the keyword weight.

As described above, according to the present invention, the similarity of sentences is calculated according to the morpheme change type, so the similarity between sentences can be determined accurately. In addition, because the similarity is calculated in consideration of the distance between changed morphemes and the distance from the center of the sentence to a changed morpheme, the similarity can be calculated with higher accuracy than by simple pattern comparison. Furthermore, words related to the subject of the plagiarism are detected and reflected in the similarity calculation, so whether a sentence is plagiarized can be detected accurately.

FIG. 1 is a block diagram of a sentence similarity calculation apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a sentence similarity calculation method according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating step S220 of FIG. 2 in detail.
FIG. 4 is a diagram illustrating a method of determining the first type according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating a method of determining the second type according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a method of determining the third type according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the center of an original sentence according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Embodiments of the present invention are now described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them.

First, a sentence similarity calculation apparatus according to an embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram of the sentence similarity calculation apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the sentence similarity calculation apparatus 100 according to an embodiment of the present invention includes an extraction unit 110, a generation unit 120, a determination unit 130, a calculation unit 140, and a detection unit 150.

First, the extraction unit 110 extracts a plurality of morphemes from a comparison target sentence. The comparison target sentence is the sentence to be judged for plagiarism and may be received from a user terminal or server connected via a communication link.

The generation unit 120 generates a hash value of the comparison target sentence using the character code value and position code value of each of the extracted morphemes.

Next, the determination unit 130 compares the hash value of the comparison target sentence with the hash value of the pre-stored original sentence to determine the morpheme change type of the comparison target sentence.

Here, the morpheme change type includes a first type in which the content of a morpheme in the comparison target sentence is changed, a second type in which the arrangement of morphemes in the comparison target sentence is changed, and a third type in which a morpheme is omitted from the comparison target sentence.

When the morpheme change type of the comparison target sentence is determined to be the first type, the determination unit 130 uses the pre-stored synonym database and antonym database to determine whether the changed morpheme of the comparison target sentence is in a synonym relation or an antonym relation with the corresponding morpheme of the original sentence.

The determination unit 130 also compares the hash value of the comparison target sentence with the pre-stored keyword database to determine the number of matching core morphemes.

Next, the calculation unit 140 calculates the similarity between the comparison target sentence and the original sentence using the morpheme change type and the number of matching core morphemes. Specifically, the calculation unit 140 calculates the similarity between the comparison target sentence and the original sentence using an equation preset for each morpheme change type.

Next, the detection unit 150 determines whether the comparison target sentence is plagiarized using the calculated similarity and a preset threshold value.

According to an embodiment of the present invention, the original sentence may be a sentence contained in a national research report.

Next, a sentence similarity calculation method using the sentence similarity calculation apparatus 100 according to an embodiment of the present invention is described with reference to FIGS. 2 to 6. FIG. 2 is a flowchart of the sentence similarity calculation method according to an embodiment of the present invention.

First, the extraction unit 110 extracts a plurality of morphemes from the comparison target sentence (S210).

For example, assume that the comparison target sentence is "강의 산소 농도가 기준치 이상이다." ("The oxygen concentration of the river is at or above the reference value."). The extraction unit 110 may then extract ten morphemes from the comparison target sentence: "강", "의", "산소", "농도", "가", "기준", "치", "이상", "이", and "다".

The generation unit 120 then generates a hash value of the comparison target sentence using the character code value and position code value of each of the extracted morphemes (S220). FIG. 3 is a diagram illustrating step S220 of FIG. 2 in detail.

Specifically, the generation unit 120 may extract the character code value of a morpheme from the character code table shown in FIG. 3 and extract the position code value from the position of the morpheme within the sentence.

For example, assume that the morpheme "강" is located first in the comparison target sentence. Referring to FIG. 3, the generation unit 120 may extract "B0AD" as the character code value of "강" and, because the morpheme is located first in the comparison target sentence, "01" as the position code value. The generation unit 120 then generates "B0AD01" as the hash value of the morpheme "강".

That is, as in the example above, the generation unit 120 may generate the hash value of the comparison target sentence by arranging the hash values corresponding to each of the morphemes.
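As a minimal sketch of step S220, the following Python code builds a per-morpheme hash by concatenating a character code with a two-digit position code. It assumes that the character code table of FIG. 3 corresponds to the EUC-KR encoding, under which "강" encodes to B0AD as in the example above. The function names and the two-digit, 1-based position format are illustrative assumptions, not details taken from the specification.

```python
def char_code(morpheme: str) -> str:
    """Return an upper-case hex character code for a morpheme.

    Assumption: the character code table of FIG. 3 is EUC-KR.
    """
    return morpheme.encode("euc_kr").hex().upper()


def morpheme_hash(morpheme: str, position: int) -> str:
    """Concatenate the character code and a two-digit, 1-based position code."""
    return f"{char_code(morpheme)}{position:02d}"


def sentence_hash(morphemes: list[str]) -> list[str]:
    """Arrange the per-morpheme hash values in sentence order."""
    return [morpheme_hash(m, i) for i, m in enumerate(morphemes, start=1)]


# The ten morphemes of the example sentence from step S210.
morphemes = ["강", "의", "산소", "농도", "가", "기준", "치", "이상", "이", "다"]
hashes = sentence_hash(morphemes)
print(hashes[0])  # B0AD01
```

A two-digit position code only distinguishes up to 99 morphemes; a longer sentence would need a wider field, which the text does not address.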

Next, the determination unit 130 compares the hash value of the comparison target sentence with the hash value of the pre-stored original sentence (S230). Specifically, the determination unit 130 determines whether any hash values differ between the comparison target sentence and the pre-stored original sentence.

Here, the hash value of the original sentence is generated in the same manner as the hash value of the comparison target sentence in step S220 and is stored in the sentence similarity calculation apparatus 100 according to an embodiment of the present invention.

The determination unit 130 then determines the morpheme change type of the comparison target sentence (S240).

Here, the morpheme change type includes a first type in which the content of a morpheme in the comparison target sentence is changed, a second type in which the arrangement of morphemes in the comparison target sentence is changed, and a third type in which a morpheme is omitted from the comparison target sentence.

The method by which the determination unit 130 according to an embodiment of the present invention determines the first to third types is now described in detail with reference to FIGS. 4 to 6. FIG. 4 is a diagram illustrating a method of determining the first type according to an embodiment of the present invention, FIG. 5 is a diagram illustrating a method of determining the second type, and FIG. 6 is a diagram illustrating a method of determining the third type.

First, if the comparison in step S230 shows that, among the hash values of the comparison target sentence and the original sentence, there is a morpheme hash value whose position code value is the same but whose character code value differs, the determination unit 130 determines the morpheme change type of the comparison target sentence to be the first type.

For example, as shown in FIGS. 4(a) and 4(b), assume that the original sentence consists of the morphemes A to G arranged in order. In the comparison target sentence of FIG. 4(a) the fourth morpheme is K, and in the comparison target sentence of FIG. 4(b) the first morpheme is K.

That is, the character code value in the hash value of the fourth morpheme in FIG. 4(a) and of the first morpheme in FIG. 4(b) differs from that in the corresponding hash value of the original sentence. In such a case, the determination unit 130 determines the morpheme change type of the comparison target sentences to be the first type.

Next, if the comparison in step S230 shows that, among the hash values of the comparison target sentence and the original sentence, there is a morpheme hash value whose character code value is the same but whose position code value differs, the determination unit 130 determines the morpheme change type of the comparison target sentence to be the second type.

For example, as shown in FIGS. 5(a) and 5(b), assume that the original sentence consists of the morphemes A to G arranged in order. In FIG. 5(a), C is third and D is fourth in the original sentence, whereas C is fourth and D is third in the comparison target sentence. In FIG. 5(b), A is first and F is sixth in the original sentence, whereas A is sixth and F is first in the comparison target sentence.

That is, for two of the morphemes in each of FIGS. 5(a) and 5(b), the position code value in the hash value differs from that in the corresponding hash value of the original sentence. In such a case, the determination unit 130 determines the morpheme change type of the comparison target sentences shown in FIG. 5 to be the second type.

Next, if the comparison in step S230 shows that any one of the hash values of the original sentence does not exist in the comparison target sentence, the determination unit 130 determines the morpheme change type of the comparison target sentence to be the third type.

For example, as shown in FIGS. 6(a) and 6(b), assume that the original sentence consists of the morphemes A to G arranged in order. In FIG. 6(a), however, the morpheme E, which is fifth in the original sentence, does not exist in the comparison target sentence, and in FIG. 6(b) the morpheme G, which is seventh in the original sentence, does not exist in the comparison target sentence.

That is, the hash value of one of the seven morphemes of the original sentence is missing from the comparison target sentence. In such a case, the determination unit 130 determines the morpheme change type of the comparison target sentences shown in FIG. 6 to be the third type.
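The three determinations of step S240 can be sketched roughly as follows, using the A to G examples of FIGS. 4 to 6. This is only an illustration under simplifying assumptions: each morpheme string stands in for its character code, its 1-based index stands in for its position code, exactly one kind of change is present, and no morpheme is repeated within a sentence. The function name `classify_change`, the returned labels, and the order of the checks are hypothetical and are not specified in the text.

```python
def classify_change(original: list[str], target: list[str]) -> str:
    """Classify how `target` differs from `original` (types 1 to 3)."""
    orig_hashes = [(m, i) for i, m in enumerate(original, start=1)]
    targ_hashes = [(m, i) for i, m in enumerate(target, start=1)]
    if orig_hashes == targ_hashes:
        return "identical"

    missing = set(original) - set(target)  # character codes only in the original
    added = set(target) - set(original)    # character codes only in the target
    same_length = len(original) == len(target)

    if missing and added and same_length:
        return "type1"  # same position code, different character code
    if not missing and not added and same_length:
        return "type2"  # same character codes, different position codes
    if missing and not added:
        return "type3"  # an original morpheme is absent from the target
    return "unclassified"


original = list("ABCDEFG")
print(classify_change(original, list("ABCKEFG")))  # type1, as in FIG. 4(a)
print(classify_change(original, list("ABDCEFG")))  # type2, as in FIG. 5(a)
print(classify_change(original, list("ABCDFG")))   # type3, as in FIG. 6(a)
```

Sentences that combine several kinds of change fall through to "unclassified" here; how the apparatus handles such cases is not described in this passage.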

When the determination unit 130 has determined the morpheme change type of the comparison target sentence to be the first type, it uses the pre-stored synonym database and antonym database to determine whether the changed morpheme of the comparison target sentence is in a synonym relation or an antonym relation with the corresponding morpheme of the original sentence (S250).

Specifically, the determination unit 130 compares the character code values of the original-sentence morpheme and the comparison-target morpheme with the pre-stored synonym database and antonym database. The determination unit 130 may determine a synonym relation if both character code values are present in the synonym database, and an antonym relation if both character code values are present in the antonym database.

Here, the synonym database and the antonym database may be stored in advance in the sentence similarity calculation apparatus 100 according to an embodiment of the present invention.

Next, the determination unit 130 compares the hash value of the comparison target sentence with the pre-stored keyword database to determine the number of matching core morphemes (S260).

Specifically, the determination unit 130 determines the number of core morphemes by comparing the character code values in the hash value of the comparison target sentence with the hash values of the keywords included in the keyword database and checking whether they match.

Here, the keywords included in the keyword database are the key words for the subject of a sentence. For example, when the sentence similarity calculation apparatus 100 according to an embodiment of the present invention is used to determine whether a sentence on the subject of "water quality improvement" (수질 개선) is plagiarized, the keywords included in the keyword database may be the words "수질" (water quality) and "개선" (improvement). The keywords stored in the keyword database may be modified by the user according to the research topic or research content of the national research report.
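Step S260 can be sketched as a lookup of each morpheme's character code in a set of keyword codes. The sketch below assumes EUC-KR character codes and a trailing two-digit position code, and it counts every matching morpheme occurrence as one core morpheme. The morpheme list is a hypothetical illustration, and the names `char_code` and `count_core_morphemes` do not come from the specification.

```python
def char_code(morpheme: str) -> str:
    """Upper-case hex character code (assumption: EUC-KR)."""
    return morpheme.encode("euc_kr").hex().upper()


def count_core_morphemes(sentence_hashes: list[str], keyword_codes: set[str]) -> int:
    """Count hash values whose character code part matches a stored keyword."""
    # Drop the trailing two-digit position code before comparing.
    return sum(1 for h in sentence_hashes if h[:-2] in keyword_codes)


# Keyword database for the subject "수질 개선" (water quality improvement).
keyword_codes = {char_code("수질"), char_code("개선")}

# Hypothetical comparison target morphemes, hashed as code + position.
morphemes = ["수질", "을", "개선", "하", "다"]
sentence_hashes = [f"{char_code(m)}{i:02d}" for i, m in enumerate(morphemes, start=1)]

k = count_core_morphemes(sentence_hashes, keyword_codes)
print(k)  # 2
```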

The calculation unit 140 then calculates the similarity between the comparison target sentence and the original sentence using the morpheme change type (S270).

First, when the first type is determined, the calculating unit 140 calculates the similarity (S) using Equation 1 below.

Here, L_tn is the total number of morphemes in the original sentence, A_ln is the number of morphemes of the comparison target sentence that match morphemes of the original sentence, C_n is the number of morphemes located between the center of the original sentence and the morpheme changed in the comparison target sentence, α is the synonym weight, and β is the keyword weight.

Here, the calculating unit 140 sets the synonym weight (α) to 1 when a synonym relationship was determined in step S250, and to -1 when an antonym relationship was determined.

For example, if the word "증가" ("increase") in the original sentence is changed to "감소" ("decrease") in the comparison target sentence, the synonym weight (α) is set to -1 even if all other morphemes are identical.
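The synonym-weight rule can be illustrated with a short sketch. It assumes, for illustration only, that the synonym and antonym databases are sets of unordered pairs and that the morpheme strings stand in for character code values; the description does not say what happens when a pair is in neither database, so the sketch returns `None` in that case. All names are hypothetical.

```python
from typing import Optional

# Hypothetical databases: unordered pairs of morphemes (stand-ins for
# the character code values the description actually compares).
SYNONYMS = {frozenset(("증가", "상승"))}
ANTONYMS = {frozenset(("증가", "감소"))}


def synonym_weight(original: str, changed: str) -> Optional[int]:
    """Return alpha: 1 for a synonym pair, -1 for an antonym pair.

    Returns None when the pair is in neither database; that case is
    not specified in the description.
    """
    pair = frozenset((original, changed))
    if pair in SYNONYMS:
        return 1
    if pair in ANTONYMS:
        return -1
    return None


print(synonym_weight("증가", "감소"))  # -1
```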

FIG. 7 is a diagram for explaining the center of the original sentence according to an embodiment of the present invention. FIG. 7(a) illustrates the center of the original sentence when the number of morphemes is odd, and FIG. 7(b) illustrates it when the number of morphemes is even.

For example, when the original sentence is as shown in FIG. 7(a), the number of morphemes is seven, and the center of the original sentence is the fourth morpheme, "폭락" ("plunge"). In this case, the calculating unit 140 calculates the number of morphemes located between the fifth morpheme, "하", and the center of the original sentence as 0.5, and the number of morphemes located between the sixth morpheme, "였", and the center of the original sentence as 1.5.

On the other hand, when the original sentence is as shown in FIG. 7(b), the number of morphemes is eight, and the center of the original sentence lies between the fourth morpheme, "의", and the fifth morpheme, "발전" ("development"). In this case, the calculating unit 140 calculates the number of morphemes located between the sixth morpheme, "과", and the center of the original sentence as 1.
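The three FIG. 7 values are consistent with a simple counting rule: take the center as position (n + 1)/2 for n morphemes, and count the distance from the center to the near edge of the morpheme in question. The sketch below assumes that rule, which is inferred from the FIG. 7 values rather than stated in the description; clamping to zero for the center morpheme itself is a further assumption, and the function name is hypothetical.

```python
def morphemes_from_center(total: int, position: int) -> float:
    """Number of morphemes between the sentence center and a morpheme.

    total:    number of morphemes in the original sentence
    position: 1-based index of the morpheme of interest
    Assumed rule: |position - (total + 1) / 2| - 0.5, clamped at 0.
    """
    center = (total + 1) / 2
    return max(0.0, abs(position - center) - 0.5)


print(morphemes_from_center(7, 5))  # 0.5  (FIG. 7(a), fifth morpheme)
print(morphemes_from_center(7, 6))  # 1.5  (FIG. 7(a), sixth morpheme)
print(morphemes_from_center(8, 6))  # 1.0  (FIG. 7(b), sixth morpheme)
```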

The number of matches between morphemes of the comparison target sentence and morphemes of the original sentence is the number of morphemes for which both the character code value and the position code value match. For example, if a morpheme has the hash value B0AD04 in the original sentence and B0AD02 in the comparison target sentence, the character code values match but the position code values do not, so that morpheme is not counted in the number of matches (A_ln).
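Because a morpheme hash already combines the character code and the position code, the match count A_ln can be sketched as the number of hashes the two sentences share. This is an illustrative sketch only; the function name and the extra sample hashes are hypothetical.

```python
def count_matches(original_hashes, comparison_hashes):
    """Count morphemes whose full hash (character code + position code)
    appears in both sentences."""
    return len(set(original_hashes) & set(comparison_hashes))


# B0AD04 vs. B0AD02: same character code, different position code,
# so this morpheme does not count as a match.
original = ["AC1C01", "C21802", "B0AD04"]
comparison = ["AC1C01", "C21802", "B0AD02"]

print(count_matches(original, comparison))  # 2
```

A set intersection is sufficient here because each position occurs once per sentence, so no hash repeats within a sentence.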

In addition, the calculating unit 140 may calculate the keyword weight (β) using the number of core morphemes determined in step S260. For example, the calculating unit 140 may calculate the keyword weight (β) using Equation 2 below.

Here, k is the number of matching core morphemes.
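Equation 2 itself is not reproduced in this text. The keyword weights quoted elsewhere in this description (3/4 for two, 4/5 for three, and 5/6 for four matching core morphemes) are all consistent with β = (k + 1)/(k + 2). The sketch below assumes that form, which is inferred from those three values rather than taken from the equation; the function name is hypothetical.

```python
from fractions import Fraction


def keyword_weight(k: int) -> Fraction:
    """Keyword weight beta for k matching core morphemes.

    Assumed form: (k + 1) / (k + 2), inferred from the worked values
    in the description, not from Equation 2 itself.
    """
    return Fraction(k + 1, k + 2)


for k in (2, 3, 4):
    print(k, keyword_weight(k))  # 2 3/4, then 3 4/5, then 4 5/6
```

Under this assumed form the weight grows with k but stays below 1, which agrees with the stated trend that similarity increases with the number of matching core morphemes.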

Next, consider an example of the similarity calculation when the first type is determined. Assume that the original sentence is "공기 중 미세 먼지 농도가 감소하였다" ("the concentration of fine dust in the air decreased"), the comparison target sentence is "공기 중 미세 먼지 농도가 증가하였다" ("the concentration of fine dust in the air increased"), and the keyword database concerns air pollution and contains the morphemes "공기" ("air"), "미세" ("fine"), "먼지" ("dust"), and "농도" ("concentration").

Then, the total number of morphemes in the original sentence is 10, and the number of matches between morphemes of the comparison target sentence and morphemes of the original sentence is 9. Since the seventh morpheme of the original sentence has been changed, the number of morphemes located between the center of the original sentence and the morpheme changed in the comparison target sentence is 1.5. In addition, "증가" ("increase") and "감소" ("decrease") are antonyms, so the synonym weight is -1, and since four keywords are included, the keyword weight is 5/6.

Accordingly, the calculating unit 140 calculates the similarity between the original sentence and the comparison target sentence as 30%.

Next, when the second type is determined, the calculating unit 140 calculates the similarity (S) using Equation 3 below.

Next, consider an example of the similarity calculation when the second type is determined. Assume that the original sentence is "중산층과 재벌의 소득 격차가 증가하고 있다" ("the income gap between the middle class and the chaebol is increasing"), the comparison target sentence is "재벌과 중산층의 소득 격차가 증가하고 있다" ("the income gap between the chaebol and the middle class is increasing"), and the keyword database concerns income classes and contains the morphemes "중산층" ("middle class"), "재벌" ("chaebol"), and "소득" ("income").

Then, the total number of morphemes in the original sentence is 12, and the number of matches between morphemes of the comparison target sentence and morphemes of the original sentence is 10. Since the first and third morphemes of the original sentence have been rearranged in the comparison target sentence, the number of morphemes located between the rearranged morphemes is 1. In addition, since three keywords are included, the keyword weight is 4/5.

Accordingly, the calculating unit 140 calculates the similarity between the original sentence and the comparison target sentence as 62.96%.

When the third type is determined, the calculating unit 140 calculates the similarity (S) using Equation 4 below.

Here, C_c is the number of morphemes located between the center of the original sentence and the morpheme missing from the comparison target sentence.

Next, consider an example of the similarity calculation when the third type is determined. Assume that the original sentence is "가구 지출 중 월세가 차지하는 비중이 매우 높다" ("the share of monthly rent in household spending is very high"), the comparison target sentence is "가구 지출 중 월세가 차지하는 비중이 높다" ("the share of monthly rent in household spending is high"), and the keyword database concerns housing policy and contains the morphemes "가구" ("household") and "월세" ("monthly rent").

Then, the total number of morphemes in the original sentence is 13, and the number of matches between morphemes of the comparison target sentence and morphemes of the original sentence is 12. Since the eleventh morpheme of the original sentence is missing from the comparison target sentence, the number of morphemes located between the center of the original sentence and the missing morpheme is 3.5. In addition, since two keywords are included, the keyword weight is 3/4.

Accordingly, the calculating unit 140 calculates the similarity between the original sentence and the comparison target sentence as 62.13%.

Next, the detection unit 150 determines whether the comparison target sentence is plagiarized using the similarity calculated in step S270 and a preset threshold (S280). The threshold may be modified by a person skilled in the art in consideration of, for example, the number of morphemes in the sentence.

According to an embodiment of the present invention, the similarity of an entire document can be determined by repeating the above sentence similarity calculation method for each sentence. The sentence similarity calculation apparatus 100 according to an embodiment of the present invention may also be implemented as a computer-readable recording medium on which a program for executing the sentence similarity calculation method is recorded.
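The threshold decision of step S280, repeated per sentence, can be sketched as follows. This is a hedged illustration: the description gives no threshold value and does not say whether the comparison is inclusive, so the threshold of 0.6 and the use of ">=" are assumptions, and the function names are hypothetical. The sample scores are the three similarities from the worked examples above.

```python
def is_plagiarized(similarity: float, threshold: float) -> bool:
    """Flag a sentence whose similarity reaches the threshold.

    Assumption: the comparison is inclusive (>=).
    """
    return similarity >= threshold


def flag_sentences(similarities, threshold):
    """Apply the per-sentence decision to every sentence of a document."""
    return [is_plagiarized(s, threshold) for s in similarities]


scores = [0.30, 0.6296, 0.6213]  # similarities from the worked examples
THRESHOLD = 0.6                  # hypothetical value, not from the patent

print(flag_sentences(scores, THRESHOLD))  # [False, True, True]
```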

The following describes simulation results of the sentence similarity calculation method according to an embodiment of the present invention with reference to Tables 1 to 3, which show the simulation results for FIGS. 4 to 6, respectively. The simulations assume two and three matching core morphemes, with the keyword weight (β) set to 3/4 and 4/5, respectively, according to Equation 2.

Table 1 below shows the similarity results for FIGS. 4(a) and 4(b), calculated using Equations 1 and 2.

As shown in Table 1, the similarity decreases as the changed morpheme lies farther from the center of the sentence, and it decreases when an antonym is included. The similarity increases as the number of matching core morphemes increases.

Table 2 below shows the similarity results for FIGS. 5(a) and 5(b), calculated using Equations 2 and 3.

As shown in Table 2, the similarity decreases as the distance between the rearranged morphemes increases, and it increases as the number of matching core morphemes increases.

Table 3 below shows the similarity results for FIGS. 6(a) and 6(b), calculated using Equations 2 and 4.

As shown in Table 3, the similarity decreases as the missing morpheme lies farther from the center of the sentence, and it increases as the number of matching core morphemes increases.

According to an embodiment of the present invention, the similarity of a sentence is calculated according to the morpheme modification type, so the similarity between sentences can be determined accurately. Moreover, because the similarity is calculated in consideration of the distance between modified morphemes and the distance from the center of the sentence to a modified morpheme, the result is more accurate than a similarity obtained by simple pattern comparison.

In addition, words related to the topic subject to plagiarism detection are detected and reflected in the similarity calculation, so whether a sentence is plagiarized can be detected accurately.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible therefrom. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

100: sentence similarity calculation apparatus 110: extraction unit
120: generation unit 130: determination unit
140: calculating unit 150: detection unit

Claims

A method of calculating sentence similarity using a sentence similarity calculation apparatus, the method comprising:
extracting a plurality of morphemes from a comparison target sentence;
generating a hash value of the comparison target sentence using a character code value and a position code value of each of the extracted morphemes;
determining a morpheme modification type of the comparison target sentence by comparing the hash value of the comparison target sentence with a stored hash value of an original sentence;
determining the number of matching core morphemes by comparing the hash value of the comparison target sentence with a pre-stored keyword database;
calculating a similarity between the comparison target sentence and the original sentence using the morpheme modification type and the number of matching core morphemes; and
determining whether the comparison target sentence is plagiarized using the calculated similarity and a preset threshold,
wherein the morpheme modification type includes
a first type in which the content of a morpheme in the comparison target sentence is changed, a second type in which the arrangement of morphemes in the comparison target sentence is changed, and a third type in which a morpheme is missing from the comparison target sentence,
wherein the method further comprises, when the first type is determined,
determining, using pre-stored synonym and antonym databases, whether the changed morpheme of the comparison target sentence is in a synonym relationship or an antonym relationship with the corresponding morpheme of the original sentence, and
wherein the calculating of the similarity includes
calculating the similarity (S) using the following equation when the first type is determined:

where L_tn is the total number of morphemes in the original sentence, A_ln is the number of morphemes of the comparison target sentence that match morphemes of the original sentence, C_n is the number of morphemes located between the center of the original sentence and the morpheme changed in the comparison target sentence, α is the synonym weight, and β is the keyword weight, and wherein the synonym weight (α) is set to 1 when the synonym relationship is determined and to -1 when the antonym relationship is determined.

The method according to claim 1,
wherein the keyword weight (β) is calculated by the following equation:

where k is the number of matching core morphemes.

(Deleted)

The method according to claim 1,
wherein the calculating of the similarity includes
calculating the similarity (S) using the following equation when the second type is determined:

where C_d is the number of morphemes located between the morphemes whose arrangement is changed.

The method according to claim 1,
wherein the calculating of the similarity includes
calculating the similarity (S) using the following equation when the third type is determined:

where C_c is the number of morphemes located between the center of the original sentence and the morpheme missing from the comparison target sentence.

The method according to claim 1,
wherein the original sentence is a sentence written in a national research report.

A sentence similarity calculation apparatus comprising:
an extraction unit for extracting a plurality of morphemes from a comparison target sentence;
a generation unit for generating a hash value of the comparison target sentence using a character code value and a position code value of each of the extracted morphemes;
a determination unit for determining a morpheme modification type of the comparison target sentence by comparing the hash value of the comparison target sentence with a stored hash value of an original sentence, and for determining the number of matching core morphemes by comparing the hash value of the comparison target sentence with a pre-stored keyword database;
a calculating unit for calculating a similarity between the comparison target sentence and the original sentence using the morpheme modification type and the number of matching core morphemes; and
a detection unit for determining whether the comparison target sentence is plagiarized using the calculated similarity and a preset threshold,
wherein the morpheme modification type includes
a first type in which the content of a morpheme in the comparison target sentence is changed, a second type in which the arrangement of morphemes in the comparison target sentence is changed, and a third type in which a morpheme is missing from the comparison target sentence,
wherein the determination unit,
when the first type is determined, determines, using pre-stored synonym and antonym databases, whether the changed morpheme of the comparison target sentence is in a synonym relationship or an antonym relationship with the corresponding morpheme of the original sentence, and
wherein the calculating unit
calculates the similarity (S) using the following equation when the first type is determined:

The apparatus according to claim 7,
wherein the keyword weight (β) is calculated by the following equation:

where k is the number of matching core morphemes.

(Deleted)

The apparatus according to claim 7,
wherein the calculating unit
calculates the similarity (S) using the following equation when the second type is determined:

The apparatus according to claim 7,
wherein the calculating unit
calculates the similarity (S) using the following equation when the third type is determined:

The apparatus according to claim 7,
wherein the original sentence is a sentence written in a national research report.