KR102337543B1

KR102337543B1 - Method and apparatus for normalization of modified post for copyright protection

Info

Publication number: KR102337543B1
Application number: KR1020200059830A
Authority: KR
Inventors: 이태진; 황찬웅
Original assignee: 호서대학교 산학협력단
Priority date: 2020-05-19
Filing date: 2020-05-19
Publication date: 2021-12-09
Also published as: KR20210142985A; WO2021235593A1

Abstract

The present invention relates to a method and apparatus for normalizing modified posts for copyright protection and, more particularly, to a method and apparatus that restore and normalize all modified content posts so that illegal content can be blocked for copyright protection.
According to the present invention, all modified content posts are restored and normalized in order to block illegal content. This provides a method and apparatus that protect copyright not only by blocking search terms used to find illegal content, but also by comparing the similarity between copyrighted and non-copyrighted content.

Description

Method and apparatus for normalization of modified post for copyright protection

The present invention relates to a method and apparatus for normalizing modified posts for copyright protection and, more particularly, to a method and apparatus that restore and normalize all modified content posts so that illegal content can be blocked for copyright protection.

With recent advances in Internet technology and the widespread use of personal broadcasting services, the illegal sharing of copyrighted works through webhard and P2P sites has become a problem. These services are convenient for users, but they must be provided lawfully while protecting copyrighted information.

Technologies for copyright protection have advanced considerably, but there are still limits to protecting every copyright. Blocking all services for the sake of copyright protection in order to overcome these limits would run counter to the information age that the development of the Internet has brought about.

Earlier copyright protection technologies relied on Digital Rights Management (DRM) or digital watermarking. DRM controls the number of installations, the form of use, the period of use, transfer rights, and so on according to how the user purchased the content, and at the same time encrypts the content and gives the user a key to decrypt it. Its drawback is that copyright can no longer be protected once the original content is compressed or slightly altered. Digital watermarking embeds confidential information, such as copyright information, in digital data such as photos and videos and manages it. It can protect digital data from illegal distribution, but it cannot reveal who illegally created and distributed the data, or how and where.

Technologies currently in use for copyright protection include title filtering, string comparison, filtering by file type, hash value comparison filtering, and audio/video recognition filtering. Most of them block a request when the user enters a predefined search keyword, so that access to copyrighted information is blocked in advance. This approach requires the search term entered by the user to match the keyword of the original content, and distributors of illegal content therefore tend to use modified keywords to circumvent it.

Methods used to circumvent these copyright protection technologies include inserting special characters and, for Korean text, replacing consonants and vowels with similar-looking English letters (h, I, o, l, and so on). Other techniques separate the initial, medial, and final sounds of a syllable (으ㅣ, 오ㅐ, and so on) or replace digits (1, 0) with English letters (I, O) or circled characters (①, ⑴). The result is still readable to the human eye but difficult for a computer to recognize.

Accordingly, if the circumvention techniques applied to copyrighted content information as described above can be undone by restoration and normalization, and the content then blocked, illegally shared content can be minimized and the copyright of that content protected.

KR 10-0889728 B1

"Normalization Technique of Modified Korean SMS Text for Filtering Spam Text" (Journal of Information Processing Society) / Seungsik Kang / Vol.3, No.7, pp.271-276

The present invention was devised to solve these problems. Its object is to provide a method and apparatus that restore and normalize all modified content posts in order to block illegal content, and thereby protect copyright not only by blocking search terms used to find illegal content, but also by comparing the similarity between copyrighted and non-copyrighted content.

To achieve this object, a method according to the present invention, by which an apparatus for normalizing modified posts for copyright protection normalizes a modified post, includes: (a) collecting posted content data from a file sharing site on the Internet and extracting a post for each content item from it; (b) removing special characters from the extracted post; (c) converting digits that have been disguised as English letters in the post back into the corresponding digits; (d2) converting a single English letter, digit, or special character into a single Hangul phoneme using a Hangul phoneme conversion pattern database; (d4) determining whether the Hangul phoneme converted in step (d2) combines with the preceding or following character; and (d41) when the Hangul phoneme converted in step (d2) combines with neither the preceding nor the following character and therefore cannot form a Hangul syllable, judging the conversion in step (d2) to be an error and reverting the converted Hangul phoneme to the original character.
The removal of special characters may include leaving only one of a run of consecutively inserted space characters and removing the rest.
In step (c), an uppercase 'O' or lowercase 'o' in the post may be converted into the digit '0', and an uppercase 'I' may be converted into the digit '1'.
The conversion into the digit '0' or '1' may be performed when the character immediately preceding the uppercase 'O', lowercase 'o', or uppercase 'I' is a digit.
When the character immediately preceding the uppercase 'O', lowercase 'o', or uppercase 'I' is not a digit, the conversion into digits may be performed on a word-by-word basis.
Between steps (c) and (d2), the method may further include (c1) removing the spaces from the post that was separated into words.
Before step (d2), the method may further include (d1) converting a character enclosed in a circle or another symbol so that only the character itself remains.
After step (d2), the method may further include (d3) combining separately and consecutively written Hangul consonant and vowel phonemes into a single Hangul character using a Hangul phoneme combination pattern database.
After step (d41), the method may further include (d5) converting any part of the post that remains unconverted after the steps up to (d41), using conversion patterns stored in a conversion pattern database (hereinafter, the 'composite pattern database').
According to another aspect of the present invention, a computer program for normalizing a modified post for copyright protection is stored in a non-transitory storage medium and includes instructions that cause a processor to execute: (a) collecting posted content data from a file sharing site on the Internet and extracting a post for each content item from it; (b) removing special characters from the extracted post; (c) converting digits that have been disguised as English letters in the post back into the corresponding digits; (d2) converting a single English letter, digit, or special character into a single Hangul phoneme using a Hangul phoneme conversion pattern database; (d4) determining whether the Hangul phoneme converted in step (d2) combines with the preceding or following character; and (d41) when the Hangul phoneme converted in step (d2) combines with neither the preceding nor the following character and therefore cannot form a Hangul syllable, judging the conversion in step (d2) to be an error and reverting the converted Hangul phoneme to the original character.
According to yet another aspect of the present invention, an apparatus for normalizing a modified post for copyright protection includes at least one processor and at least one memory storing computer-executable instructions. The instructions stored in the at least one memory cause the at least one processor to execute: (a) collecting posted content data from a file sharing site on the Internet and extracting a post for each content item from it; (b) removing special characters from the extracted post; (c) converting digits that have been disguised as English letters in the post back into the corresponding digits; (d2) converting a single English letter, digit, or special character into a single Hangul phoneme using a Hangul phoneme conversion pattern database; (d4) determining whether the Hangul phoneme converted in step (d2) combines with the preceding or following character; and (d41) when the Hangul phoneme converted in step (d2) combines with neither the preceding nor the following character and therefore cannot form a Hangul syllable, judging the conversion in step (d2) to be an error and reverting the converted Hangul phoneme to the original character.
According to yet another aspect of the present invention, an apparatus for normalizing a modified post for copyright protection includes: a site crawling unit that collects posted content data from a file sharing site on the Internet and extracts a post for each content item from it; a basic preprocessing unit that removes special characters from the extracted post and converts digits disguised as English letters back into the corresponding digits; a Hangul phoneme conversion pattern database that stores patterns for converting a single English letter, digit, or special character into a single Hangul phoneme; and a pattern application conversion unit. Using the conversion patterns stored in the Hangul phoneme conversion pattern database, the pattern application conversion unit converts a single English letter, digit, or special character into a single Hangul phoneme, determines whether the converted Hangul phoneme combines with the preceding or following character, and, when it combines with neither and therefore cannot form a Hangul syllable, judges the conversion to be an error and reverts the converted Hangul phoneme to the original character.
The apparatus for normalizing modified posts for copyright protection may further include a circled character pattern database that stores patterns for converting a character enclosed in a circle or another symbol so that only the character itself remains. Using the circled character pattern database, the pattern application conversion unit may further remove the circle, parentheses, or other enclosing symbol from circled characters, parenthesized characters, or characters enclosed in other symbols, leaving only the character.
The apparatus may further include a Hangul phoneme combination pattern database that stores patterns for combining separately and consecutively written Hangul consonant and vowel phonemes into a single Hangul character. Using the Hangul phoneme combination pattern database, the pattern application conversion unit may further combine such separated consonant and vowel phonemes into a single Hangul character.
The apparatus may further include a composite pattern database. The composite pattern database may store only conversion patterns that are not stored in the circled character pattern database, the Hangul phoneme conversion pattern database, or the Hangul phoneme combination pattern database, or some of its stored patterns may be identical to patterns stored in those databases. Using the composite pattern database, the pattern application conversion unit may further detect modified posts and perform the normalizing conversion.

(deleted)

According to the present invention, all modified content posts are restored and normalized in order to block illegal content. This has the effect of providing a method and apparatus that protect copyright not only by blocking search terms used to find illegal content, but also by comparing the similarity between copyrighted and non-copyrighted content.

FIG. 1 is a diagram showing a network configuration in which the normalization of modified posts for copyright protection according to the present invention is performed.
FIG. 2 is a diagram showing an example of post data crawled from P2P (webhard) websites and containing modified Korean sentences.
FIG. 3 is a diagram showing an example in which special characters and duplicate spaces have been removed from the posts.
FIG. 4 is a diagram showing an example in which alphanumeric conversion has been performed on the posts.
FIG. 5 is a diagram showing an example in which a circled character in a post is changed into the corresponding plain character.
FIG. 6 is a diagram showing an example in which English letters, digits, or special characters in a post are converted into Hangul phonemes.
FIG. 7 is a diagram showing an example in which Hangul phonemes in a post are combined to form syllables.
FIG. 8 is a diagram showing an example in which converted Hangul phonemes have been reverted in order to solve a problem that can arise in the Hangul phoneme conversion.
FIG. 9 is a diagram showing an example of parts that remain unnormalized even after the normalization process of FIGS. 2 to 7.
FIG. 10 is a diagram showing an example of the normalization process that applies composite patterns to the posts.
FIG. 11 is a diagram showing an example of posts normalized by applying FIGS. 2 to 10.
FIG. 12 is a flowchart of the method of FIGS. 2 to 10 for normalizing modified posts for copyright protection.
FIG. 13 is a diagram showing the configuration of an apparatus that performs the method for normalizing modified posts for copyright protection according to the present invention.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with the meanings and concepts that are consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical idea, and it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 1 is a diagram showing a network configuration in which the normalization of modified posts for copyright protection according to the present invention is performed.

The term 'P2P server 200' collectively denotes servers that provide Internet file sharing sites of various kinds, such as P2P and webhard. Hereinafter, terms such as 'P2P server' or 'P2P site' likewise refer collectively to all servers or sites that provide file sharing services over the Internet.

The apparatus 100 for normalizing modified posts for copyright protection performs the modified-post normalization of the present invention. Its configuration is described with reference to FIG. 13, and the normalization it performs is described in detail below with reference to FIGS. 2 to 12. Here, 'normalization' means converting a modified post into the characters that express the post's original meaning.

Such P2P sites hold a large volume of illegal works that are illegally downloaded and distributed. Detecting these works without any preprocessing of, for example, the titles of the files is fast and free of cost, but it can be circumvented by techniques such as altering the title or changing the spacing, so the detection rate is low. The modified-post normalization method provided by the apparatus 100 therefore preprocesses these posts effectively so that even posts with modified text can be detected as far as possible.

FIG. 2 is a diagram showing an example of post data crawled from P2P (webhard) websites and containing modified Korean sentences.

The apparatus 100 for normalizing modified posts for copyright protection first crawls the data being shared on the P2P site 200. FIG. 2 shows an example of data crawled in this way from various P2P sites on the network. The goal of the present invention is to produce post data from which illegal works can be detected, by converting the parts of each title that were modified to evade detection back into normal text.

FIG. 3 is a diagram showing an example in which special characters and duplicate spaces have been removed from the posts.

The titles of illegal posts often contain unnecessary symbols such as special characters, or runs of consecutive spaces, so these symbols and duplicate spaces are removed. Examples of the special characters to be removed are as follows.

- Examples of symbols to be removed

[-=+_#}/\?;:{^$,.@*\"※~&%ㆍ!＋：』\\‘|\[\]\<\>`\'…》「◁◀▷▶▲□■」○◆⇒√☆♥★！→◐『』™●]

FIG. 3 shows the illegal post titles on the left and, on the right, the sentences produced by removing the symbols and unnecessary spaces from them.
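The symbol and duplicate-space removal described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `strip_special` and the decision to build a regular-expression character class from the listed symbols are assumptions made here.

```python
import re

# Symbols to strip, taken from the example list in the text.
SPECIAL_CHARS = "-=+_#}/\\?;:{^$,.@*\"※~&%ㆍ!＋：』‘|[]<>`'…》「◁◀▷▶▲□■」○◆⇒√☆♥★！→◐『』™●"

# re.escape makes every symbol safe to place inside a character class.
_SPECIAL_RE = re.compile("[" + re.escape(SPECIAL_CHARS) + "]")
_SPACES_RE = re.compile(r" {2,}")


def strip_special(title: str) -> str:
    """Remove the listed symbols, then keep only one space from each run of spaces."""
    without_symbols = _SPECIAL_RE.sub("", title)
    return _SPACES_RE.sub(" ", without_symbols)


print(strip_special("★영화■  제목!!   2018"))
```

Symbols are deleted before spaces are collapsed, so a gap that only appears once a symbol between two spaces is removed is also reduced to a single space.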

FIG. 4 is a diagram showing an example in which alphanumeric conversion has been performed on the posts.

One technique frequently used in the titles of illegal posts is to replace the digit '0' with an uppercase 'O' and the digit '1' with an uppercase 'I' to evade detection. FIG. 4 shows these being detected and converted back: uppercase 'O' into the digit '0' and uppercase 'I' into the digit '1'. Because alphabetic characters are converted into digits, this is referred to here as alphanumeric conversion. In some cases a lowercase 'o' is used instead of the uppercase 'O', and it is converted in the same way.

First, the title is split into words at single spaces.

Then, for example, when an uppercase 'O' is found in a word, the character immediately before it is checked, and if that character is a digit, the 'O' is converted into the digit '0'. The same applies to converting an uppercase 'I' into the digit '1'. The title indicated by the upper arrow on the left of FIG. 4 is an example: '2O18', which contains an uppercase 'O', is converted into '2018', which contains the digit '0'.

'2OI8' cannot be handled by this rule. The uppercase 'O' and uppercase 'I' follow the '2' consecutively, so the character before 'I' is an uppercase letter and 'I' is not converted. In this case '2OI8' is treated as a single word, and when the word is found it is converted into '2018', which consists only of digits. Alternatively, every word that might be modified in this way can be stored in a database and converted using the 'patterns' described later.
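The two rules above (the preceding-digit rule and the word-level fallback) might be sketched as below. The names `convert_word`, `alphanumeric_convert`, and `WORD_PATTERNS` are hypothetical, and the one-entry word table merely stands in for whatever word list or pattern database an actual system would use.

```python
LOOKALIKE_TO_DIGIT = {"O": "0", "o": "0", "I": "1"}
ASCII_DIGITS = set("0123456789")

# Hypothetical word-level table for cases the per-character rule misses.
WORD_PATTERNS = {"2OI8": "2018"}


def convert_word(word: str) -> str:
    """Convert look-alike letters in one word back into digits."""
    # Word-level rule: a known disguised word is replaced as a whole.
    if word in WORD_PATTERNS:
        return WORD_PATTERNS[word]
    out = []
    for i, ch in enumerate(word):
        # Per-character rule: check the ORIGINAL preceding character,
        # and convert only when that character is a digit.
        if ch in LOOKALIKE_TO_DIGIT and i > 0 and word[i - 1] in ASCII_DIGITS:
            out.append(LOOKALIKE_TO_DIGIT[ch])
        else:
            out.append(ch)
    return "".join(out)


def alphanumeric_convert(title: str) -> str:
    """Split the title at single spaces and convert each word."""
    return " ".join(convert_word(word) for word in title.split(" "))


print(alphanumeric_convert("Movie 2O18"))
print(alphanumeric_convert("Movie 2OI8"))
```

The word table is consulted first because the per-character rule alone would turn '2OI8' into '20I8', after which the whole-word match would no longer apply.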

After this alphanumeric conversion, the spaces are removed from the word-separated sentence so that all syllables form one connected string.

FIG. 5 is a diagram showing an example in which a circled character in a post is changed into the corresponding plain character.

That is, a circled character is converted into the character that remains once the surrounding circle is removed. Only circled characters are used as the example here, but parenthesized characters (for example, (1) or 1)) and other characters enclosed in a particular symbol are also treated as 'circled characters' for this purpose; the enclosing symbol is removed and only the enclosed character is kept. This conversion can use a circled character pattern database (151, see FIG. 13) that pairs the patterns before and after conversion, as shown in FIG. 5.
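A lookup table of this kind might look like the sketch below. It is an illustration only: the table `CIRCLED_PATTERNS` and the function `unwrap_enclosed` are hypothetical names, and the table covers just the Unicode circled numbers ① to ⑳ (U+2460 to U+2473) and parenthesized numbers ⑴ to ⒇ (U+2474 to U+2487), not the full pattern database the text refers to.

```python
# Hypothetical stand-in for the circled character pattern database:
# each enclosed number maps to the plain digits it contains.
CIRCLED_PATTERNS = {}
for n in range(1, 21):
    CIRCLED_PATTERNS[chr(0x2460 + n - 1)] = str(n)  # circled numbers 1 to 20
    CIRCLED_PATTERNS[chr(0x2474 + n - 1)] = str(n)  # parenthesized numbers 1 to 20


def unwrap_enclosed(text: str) -> str:
    """Replace each enclosed number with its plain digits; leave other characters alone."""
    return "".join(CIRCLED_PATTERNS.get(ch, ch) for ch in text)


print(unwrap_enclosed("20①8"))
```

Forms typed as separate ASCII characters, such as "(1)" or "1)", are not single code points and would need additional patterns; they are left out of this sketch.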

FIG. 6 is a diagram illustrating an embodiment in which English letters, numbers, or special characters in a post are converted into Hangul phonemes.

FIGS. 6 to 10 illustrate embodiments in which a specific character is converted into another specific character using patterns stored in the pattern databases (152 to 154, see FIG. 13).

In FIG. 6, a single English letter, number, or special character that resembles a Hangul phoneme is converted into that Hangul phoneme. These conversion patterns are stored in the Hangul phoneme conversion pattern database (152, see FIG. 13). A 'Hangul phoneme' means an individual vowel or consonant of Hangul.

That is, as another way to evade detection, the titles of illegal works use English letters or numbers of similar shape in place of Hangul phonemes. Referring to FIG. 6(a), for example, 'r' is used where 'ㅏ' belongs, or the number '1' is used where 'ㅣ' belongs. The present invention restores these to 'ㅏ' and 'ㅣ' using the Hangul phoneme conversion pattern database 152. FIG. 6(b) shows an embodiment restored in this way.
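As a non-limiting illustration only, the one-to-one substitution described above could be sketched with a small lookup table. The names `PHONEME_PATTERNS` and `to_hangul_phonemes` are hypothetical; the 'r' and '1' entries come from the FIG. 6 example and the 'L' entry from the FIG. 8 example, and a real database 152 would presumably hold many more pairs.

```python
# Assumed subset of the Hangul phoneme conversion patterns.
PHONEME_PATTERNS = {"r": "ㅏ", "1": "ㅣ", "L": "ㄴ"}

# str.maketrans builds a per-character translation table from the dict.
_TABLE = str.maketrans(PHONEME_PATTERNS)


def to_hangul_phonemes(text: str) -> str:
    """Replace each look-alike character with its Hangul phoneme."""
    return text.translate(_TABLE)
```

Because the substitution is unconditional, `to_hangul_phonemes("Lost")` returns `"ㄴost"`, which is the kind of over-conversion that a later inverse conversion has to undo.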

FIG. 7 is a diagram illustrating an embodiment in which Hangul phonemes in a post are combined to generate syllables.

In FIG. 7, separated consonants and vowels are combined and converted into a single Hangul character (syllable). These conversion patterns are stored in the Hangul phoneme combination pattern database (153, see FIG. 13).

That is, as another way to evade detection, the titles of illegal works separate the consonant and vowel, that is, the phonemes, of a single syllable. Referring to FIG. 7(a), for example, '파' is written as 'ㅍㅏ'. The present invention restores this to '파' using the Hangul phoneme combination pattern database 153. The sentence in the rightmost column of FIG. 7(b) shows an embodiment restored in this way.
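As a non-limiting illustration only, the combination of a separated consonant and vowel could be sketched as follows. The text performs this step with a pattern database; this sketch instead substitutes the standard Unicode Hangul syllable arithmetic (syllable code point = 0xAC00 + (initial index × 21 + vowel index) × 28), which is an assumption made for brevity. The names `CHO`, `JUNG`, and `combine_phonemes` are hypothetical, and final consonants are deliberately not handled.

```python
# Initial consonants and vowels in Unicode Hangul syllable order.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"


def combine_phonemes(text: str) -> str:
    """Merge each consonant+vowel phoneme pair into one Hangul syllable."""
    out = []
    i = 0
    while i < len(text):
        if i + 1 < len(text) and text[i] in CHO and text[i + 1] in JUNG:
            # Syllable without a final consonant.
            code = 0xAC00 + (CHO.index(text[i]) * 21 + JUNG.index(text[i + 1])) * 28
            out.append(chr(code))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this sketch, `combine_phonemes("ㅍㅏ")` gives `"파"`, matching the example in the text.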

FIG. 8 is a diagram illustrating an embodiment in which inverse conversion has been performed on converted Hangul phonemes to solve a problem that can arise in Hangul phoneme conversion.

In the Hangul phoneme conversion process described above, legitimate English letters or numbers may be replaced with Hangul phonemes. For example, referring to FIG. 8, 'Lost' in the post at the upper left may be changed to 'ㄴost' at the upper right; that is, the uppercase English letter 'L' has been replaced with the Hangul phoneme 'ㄴ'. In this case, it is determined whether 'ㄴ' can form a Hangul phoneme combination with the character that follows it. In the case of FIG. 8, the lowercase English letter 'o' follows 'ㄴ', so no Hangul phoneme combination is possible. The conversion to 'ㄴ' is therefore judged to be an error, and 'ㄴ' is inversely converted back to the uppercase English letter 'L'. This inverse conversion does not require a separate database; the Hangul phoneme conversion pattern database (152, see FIG. 13) described with reference to FIG. 6 is simply applied in the opposite direction.

As another embodiment, an 'r' in a post may have been converted to 'ㅏ' by the Hangul phoneme conversion process. Here too, if it combines with the preceding character as Hangul phonemes, the combined Hangul syllable is kept; if no Hangul phoneme combination is possible and no Hangul syllable results, it is inversely converted back to 'r'.
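As a non-limiting illustration only, the conversion followed by inverse conversion could be sketched as follows. The names `PHONEME_PATTERNS`, `CONSONANTS`, `VOWELS`, and `convert_with_revert` are hypothetical. The combinability test is a simplifying assumption: a converted consonant is kept only if a vowel phoneme follows it, and a converted vowel is kept only if a consonant phoneme precedes it; a consonant acting as a final consonant is not considered.

```python
# Assumed subset of the Hangul phoneme conversion patterns (database 152).
PHONEME_PATTERNS = {"r": "ㅏ", "1": "ㅣ", "L": "ㄴ"}
CONSONANTS = set("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
VOWELS = set("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")


def convert_with_revert(text: str) -> str:
    """Convert look-alikes to Hangul phonemes, undoing conversions that cannot combine."""
    converted = [PHONEME_PATTERNS.get(ch, ch) for ch in text]
    result = list(converted)
    for i, (orig, new) in enumerate(zip(text, converted)):
        if orig == new:
            continue  # this position was not converted
        prev_ch = converted[i - 1] if i > 0 else ""
        next_ch = converted[i + 1] if i + 1 < len(converted) else ""
        if new in CONSONANTS:
            combinable = next_ch in VOWELS
        else:
            combinable = prev_ch in CONSONANTS
        if not combinable:
            # Inverse conversion: the same table read in the opposite direction.
            result[i] = orig
    return "".join(result)
```

Under these assumptions, 'Lost' is left as 'Lost' and the '1' in '2018' stays a digit, while 'ㅍr' becomes 'ㅍㅏ' and is ready for phoneme combination.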

FIG. 9 is a diagram showing an embodiment of portions that remain unnormalized even after the normalization process of FIGS. 2 to 7, and FIG. 10 is a diagram showing an embodiment of a normalization process that applies complex patterns to a post.

The arrows in FIG. 9 point to such unnormalized portions. For example, the portion indicated by the top arrow contains 'or', composed of two lowercase English letters, and 'oㅏ', composed of the lowercase English letter 'o' and the Hangul phoneme 'ㅏ'. Likewise, 'oㅣ' at the middle arrow and 'ㄹI' at the bottom arrow are such cases.

Patterns that cannot be detected by conversion using the Hangul phoneme conversion pattern database (152, see FIG. 13) and the Hangul phoneme combination pattern database (153, see FIG. 13) described with reference to FIGS. 6 to 8 are converted and normalized using the complex pattern database (154, see FIG. 13). FIG. 10(a) shows an embodiment of conversion using the complex pattern database 154, and FIG. 10(b) shows an embodiment of a post before (left) and after (right) it is converted and corrected in this way.

Examples of such conversions include a case where a Hangul syllable is expressed with two English letters and a case where the final consonant is separated, as in writing '땅' as '따ㅇ'. The conversions are not limited to these, and a variety of patterns may be stored in the complex pattern database 154.
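As a non-limiting illustration only, a lookup of multi-character patterns could be sketched as follows. The names `COMPLEX_PATTERNS` and `apply_complex_patterns` are hypothetical. Only the '따ㅇ' to '땅' entry is taken from the text; the replacement targets for 'oㅏ', 'oㅣ', and 'ㄹI' are guesses, since the text names these as unnormalized portions without stating what they become.

```python
# Assumed stand-in for the complex pattern database (154).
COMPLEX_PATTERNS = {
    "따ㅇ": "땅",  # from the text: separated final consonant
    "oㅏ": "아",   # assumed target (not stated in the text)
    "oㅣ": "이",   # assumed target (not stated in the text)
    "ㄹI": "리",   # assumed target (not stated in the text)
}


def apply_complex_patterns(text: str, patterns: dict = COMPLEX_PATTERNS) -> str:
    """Replace every stored pattern, trying longer patterns first."""
    for src in sorted(patterns, key=len, reverse=True):
        text = text.replace(src, patterns[src])
    return text
```

Keeping the patterns in a plain mapping means a newly observed variant can be supported by adding one entry, without changing the replacement logic.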

The advantage of conversion using the complex pattern database 154 is that it can apply a variety of conversion patterns that are not caught by normalization using the circled character pattern database 151, the Hangul phoneme conversion pattern database 152, and the Hangul phoneme combination pattern database 153. Furthermore, new, previously unseen transformation patterns appearing on P2P sites can be continuously detected and easily added to the complex pattern database 154, which makes the approach highly extensible and enables detection and normalization of nearly all of the latest modified posts.

FIG. 11 is a diagram illustrating an embodiment of posts that have been normalized by applying FIGS. 2 to 10.

The left side shows post data crawled from a P2P site, and the right side shows the corresponding posts after normalization by applying FIGS. 2 to 10.

FIG. 12 is a flowchart of the modified post normalization method for copyright protection of FIGS. 2 to 10, and FIG. 13 is a diagram showing the configuration of a device 100 that performs the modified post normalization method for copyright protection according to the present invention.

Since the modified post normalization method for copyright protection of the present invention has already been described in detail with reference to FIGS. 2 to 10, the process is only briefly summarized with reference to FIG. 12, and the device 100 that performs the method is briefly described with reference to FIG. 13, focusing on its configuration.

The control unit 110 controls each of the following components of the device 100 that performs the modified post normalization method for copyright protection, thereby carrying out the series of processes for normalizing modified posts.

The site crawling unit 120 of the modified post normalization device 100 for copyright protection first performs crawling, in which it collects posted content data from a file sharing site 200 on the Internet and extracts from it the post published for each content item (S1201). On the collected post data, the basic preprocessing unit 130 then performs, as described with reference to FIGS. 3 and 4, removal of various symbols such as special characters and of duplicate spaces (S1202), alphanumeric conversion (S1203), and space removal (S1204).
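As a non-limiting illustration only, the order of the basic preprocessing steps S1202 to S1204 could be sketched as follows. The function names and the symbol set `ASSUMED_SYMBOLS` are hypothetical; which symbols are actually removed is not specified in this passage. The alphanumeric conversion (S1203) is passed in as a callable applied word by word, with an identity function as the placeholder default.

```python
import re

# Assumed set of symbols to strip; the real set is not given in this passage.
ASSUMED_SYMBOLS = "[]{}<>!?~*_+=|^#$%&@:;,"


def remove_symbols_and_extra_spaces(text: str) -> str:
    """S1202: delete the assumed symbols, then keep one of each run of spaces."""
    text = text.translate({ord(c): None for c in ASSUMED_SYMBOLS})
    return re.sub(r" {2,}", " ", text)


def remove_spaces(text: str) -> str:
    """S1204: join all words into one connected string."""
    return text.replace(" ", "")


def basic_preprocess(text: str, alphanumeric=lambda word: word) -> str:
    """Run S1202, S1203 (word by word), and S1204 in that order."""
    text = remove_symbols_and_extra_spaces(text)
    text = " ".join(alphanumeric(word) for word in text.split(" "))
    return remove_spaces(text)
```

Spaces are removed only after the word-level step, because the alphanumeric conversion relies on word boundaries that disappear once the string is joined.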

The pattern application conversion unit 140 then converts and normalizes the posts using the various pattern databases 150. First, using the circled character pattern database 151, it removes the circle, parentheses, or other symbol from circled characters, parenthesized characters, or characters enclosed in other symbols, converting them into the characters alone (S1205).

Next, the pattern application conversion unit 140 performs the Hangul phoneme conversion described with reference to FIG. 6 using the Hangul phoneme conversion pattern database 152 (S1206), and performs the Hangul phoneme combination described with reference to FIG. 7 using the Hangul phoneme combination pattern database 153 (S1207). For characters that were wrongly converted during Hangul phoneme conversion, it also performs the inverse Hangul phoneme conversion described with reference to FIG. 8 (S1208), for which the Hangul phoneme conversion pattern database 152 is used.

For the various patterns that cannot be detected with the patterns described above, the pattern application conversion unit 140 detects and normalizes the modified posts using the complex pattern database 154, as described with reference to FIG. 10. The complex pattern database 154 may store only conversion patterns that are not included in the circled character pattern database 151, the Hangul phoneme conversion pattern database 152, and the Hangul phoneme combination pattern database 153; in some cases, however, some of the patterns stored in those databases may also be stored redundantly in the complex pattern database 154.

With conversion using the complex pattern database 154, new, previously unseen transformation patterns appearing on the P2P site 200 can be continuously detected and easily added to the complex pattern database 154. The approach is therefore highly extensible and enables detection and normalization of nearly all of the latest modified posts.

Claims

A modified post normalization method for copyright protection, by which a modified post normalization device for copyright protection normalizes modified posts for copyright protection, the method comprising:
(a) collecting posted content data from a file sharing site on the Internet and extracting from it the post for each content item;
(b) removing special characters from the extracted posts;
(c) converting numbers in the post that have been disguised as English letters into the corresponding numbers;
(d2) converting one English letter, number, or special character into one Hangul phoneme using a Hangul phoneme conversion pattern database;
(d4) determining whether the Hangul phoneme converted in step (d2) forms a Hangul phoneme combination with the preceding character or the following character; and
(d41) if the Hangul phoneme converted in step (d2) cannot form a Hangul syllable because it does not combine with either the preceding or the following character, determining that the conversion in step (d2) was an error and inversely converting the Hangul phoneme converted in step (d2) back to the original character before conversion.

The method according to claim 1,
wherein removing the special characters comprises retaining only one of consecutively inserted 'space' characters and removing the rest.

The method according to claim 1,
wherein step (c) comprises, in the post,
converting an uppercase English 'O' or a lowercase English 'o' to the number '0', and
converting an uppercase English 'I' to the number '1'.

4. The method according to claim 3,
wherein, when the character immediately preceding the uppercase English 'O', the lowercase English 'o', or the uppercase English 'I' is a number, that letter is converted to the number '0' or the number '1'.

5. The method according to claim 4,
wherein, when the character immediately preceding the uppercase English 'O', the lowercase English 'o', or the uppercase English 'I' is not a number, the conversion to numbers is performed in units of words.

The method according to claim 1,
further comprising, between steps (c) and (d2):
(c1) removing the 'space' characters from the post, which is separated into words.

(Deleted)

The method according to claim 1,
further comprising, before step (d2):
(d1) converting a character enclosed in a circle or another symbol so that only the character is displayed.

(Deleted)

The method according to claim 1,
further comprising, after step (d2):
(d3) converting, using a Hangul phoneme combination pattern database, a Hangul consonant phoneme and a Hangul vowel phoneme that are written separately and consecutively into one Hangul character.

(Deleted)

The method according to claim 1,
further comprising, after step (d41):
(d5) for portions of the post that remain unconverted even after the process through step (d41), performing conversion using conversion patterns stored in a conversion pattern database (hereinafter, the 'complex pattern database').

A computer program for normalizing modified posts for copyright protection, stored in a non-transitory storage medium and comprising instructions that, when executed by a processor, cause the following steps to be performed:
(a) collecting posted content data from a file sharing site on the Internet and extracting from it the post for each content item;
(b) removing special characters from the extracted posts;
(c) converting numbers in the post that have been disguised as English letters into the corresponding numbers;
(d2) converting one English letter, number, or special character into one Hangul phoneme using a Hangul phoneme conversion pattern database;
(d4) determining whether the Hangul phoneme converted in step (d2) forms a Hangul phoneme combination with the preceding character or the following character; and
(d41) if the Hangul phoneme converted in step (d2) cannot form a Hangul syllable because it does not combine with either the preceding or the following character, determining that the conversion in step (d2) was an error and inversely converting the Hangul phoneme converted in step (d2) back to the original character before conversion.

A modified post normalization device for copyright protection, comprising:
at least one processor; and
at least one memory for storing computer-executable instructions,
wherein the computer-executable instructions stored in the at least one memory, when executed by the at least one processor, cause the following steps to be performed:
(a) collecting posted content data from a file sharing site on the Internet and extracting from it the post for each content item;
(b) removing special characters from the extracted posts;
(c) converting numbers in the post that have been disguised as English letters into the corresponding numbers;
(d2) converting one English letter, number, or special character into one Hangul phoneme using a Hangul phoneme conversion pattern database;
(d4) determining whether the Hangul phoneme converted in step (d2) forms a Hangul phoneme combination with the preceding character or the following character; and
(d41) if the Hangul phoneme converted in step (d2) cannot form a Hangul syllable because it does not combine with either the preceding or the following character, determining that the conversion in step (d2) was an error and inversely converting the Hangul phoneme converted in step (d2) back to the original character before conversion.

A modified post normalization device for copyright protection, comprising:
a site crawling unit that collects posted content data from a file sharing site on the Internet and extracts from it the post for each content item;
a basic preprocessing unit that removes special characters from the extracted posts and converts numbers disguised as English letters into the corresponding numbers;
a Hangul phoneme conversion pattern database that stores patterns for converting one English letter, number, or special character into one Hangul phoneme; and
a pattern application conversion unit that, using the conversion patterns stored in the Hangul phoneme conversion pattern database, converts one English letter, number, or special character into one Hangul phoneme; determines whether the converted Hangul phoneme forms a Hangul phoneme combination with the preceding character or the following character; and, if the converted Hangul phoneme cannot form a Hangul syllable because it does not combine with either the preceding or the following character, determines that the conversion to the Hangul phoneme was an error and inversely converts the converted Hangul phoneme back to the original character.

17. The device of claim 16,
further comprising a circled character pattern database that stores patterns for converting a character enclosed in a circle or another symbol so that only the character is displayed,
wherein the pattern application conversion unit
further performs, using the circled character pattern database, a function of converting circled characters, parenthesized characters, or characters enclosed in other symbols into the characters alone by removing the circle, parentheses, or other symbol.

(Deleted)

17. The device of claim 16,
further comprising a Hangul phoneme combination pattern database that stores patterns for combining a Hangul consonant phoneme and a Hangul vowel phoneme that are written separately and consecutively into one Hangul character,
wherein the pattern application conversion unit
further performs, using the Hangul phoneme combination pattern database, a function of combining a Hangul consonant phoneme and a Hangul vowel phoneme that are written separately and consecutively into one Hangul character.

17. The device of claim 16,
further comprising a complex pattern database,
wherein the complex pattern database
stores only conversion patterns that are not stored in the circled character pattern database, the Hangul phoneme conversion pattern database, and the Hangul phoneme combination pattern database,
or stores conversion patterns some of which are the same as conversion patterns stored in the circled character pattern database, the Hangul phoneme conversion pattern database, or the Hangul phoneme combination pattern database, and
wherein the pattern application conversion unit
further performs, using the complex pattern database, a function of detecting and normalizing modified posts.