KR20210112955A

KR20210112955A - Swearwords detection system based on hangul jamo similarity and method of detecting the swearwords

Info

Publication number: KR20210112955A
Application number: KR1020200028675A
Authority: KR
Inventors: 최영준; 전민건; 주재린; 전혜진; 이환희
Original assignee: 아주대학교산학협력단
Priority date: 2020-03-06
Filing date: 2020-03-06
Publication date: 2021-09-15
Also published as: KR102358553B1

Abstract

A profanity detection system includes a preprocessing unit; an initial, medial, and final separation unit; an initial consonant distance calculation unit; a medial vowel distance calculation unit; a final consonant distance calculation unit; and a profanity determination unit. The preprocessing unit separates an input phrase into words. The separation unit separates each syllable in a word of the input phrase into an initial consonant, a medial vowel, and a final consonant. The initial consonant distance calculation unit calculates an initial consonant distance by comparing the initial consonant of a syllable of the input phrase with the initial consonant of a syllable of a profanity sample. The medial vowel distance calculation unit calculates a medial vowel distance by comparing the medial vowel of the syllable of the input phrase with the medial vowel of the syllable of the profanity sample. The final consonant distance calculation unit calculates a final consonant distance by comparing the final consonant of the syllable of the input phrase with the final consonant of the syllable of the profanity sample. The profanity determination unit compares the average of the initial consonant distance, the medial vowel distance, and the final consonant distance with a threshold to determine whether the input phrase includes profanity. The accuracy of profanity detection can thereby be greatly improved.

Description

System and method for detecting profanity based on similarity between Hangul jamo

The present invention relates to a system and method for detecting profanity based on similarity between Hangul jamo, and more specifically to a system and method that divide an input phrase into words, divide each syllable within a word into an initial consonant, a medial vowel, and a final consonant, and compare them with profanity samples.

Many celebrities suffer greatly from indiscriminate malicious comments. In recent years, several entertainers have taken their own lives after failing to overcome depression caused by malicious comments and rumors. Non-celebrities, influencers, and members of the general public are also becoming victims of malicious comments. As access to the Internet improves, this is becoming an increasingly serious social problem.

Accordingly, many community sites use profanity filtering systems, but in most cases these systems cannot filter a word unless it exactly matches a registered profanity. Users exploit this weakness to bypass existing filtering programs by replacing letters with similar jamo and writing modified profanity. As a result, text containing such modified profanity is not filtered, and operators must go through the trouble of reading reported sentences and filtering them manually.

An object of the present invention is to provide a profanity detection system based on similarity between Hangul jamo that improves accuracy by dividing an input phrase into words, dividing each syllable within a word into an initial consonant, a medial vowel, and a final consonant, and comparing them with profanity samples.

Another object of the present invention is to provide a profanity detection method that uses the profanity detection system based on similarity between Hangul jamo.

A profanity detection system according to an embodiment for achieving the above object includes a preprocessing unit; an initial, medial, and final separation unit; an initial consonant distance calculation unit; a medial vowel distance calculation unit; a final consonant distance calculation unit; and a profanity determination unit. The preprocessing unit separates an input phrase into words. The separation unit separates each syllable in a word of the input phrase into an initial consonant, a medial vowel, and a final consonant. The initial consonant distance calculation unit calculates an initial consonant distance by comparing the initial consonant of a syllable of the input phrase with the initial consonant of a syllable of a profanity sample. The medial vowel distance calculation unit calculates a medial vowel distance by comparing the medial vowel of the syllable of the input phrase with the medial vowel of the syllable of the profanity sample. The final consonant distance calculation unit calculates a final consonant distance by comparing the final consonant of the syllable of the input phrase with the final consonant of the syllable of the profanity sample. The profanity determination unit compares the average of the initial consonant distance, the medial vowel distance, and the final consonant distance with a threshold to determine whether the input phrase includes profanity.

In an embodiment of the present invention, the preprocessing unit may remove English letters, digits, and special characters from the words of the input phrase.

In an embodiment of the present invention, the system may further include an n-gram setting unit that, when the number of syllables in a word of the input phrase is greater than the number of syllables in the profanity sample, splits the word into pieces whose syllable count matches that of the profanity sample.

In an embodiment of the present invention, the initial consonant distance calculation unit may calculate the initial consonant distance using an initial consonant similarity matrix that contains information on the similarity between Hangul initial consonants. The initial consonant similarity matrix may take at least three distinct values depending on the similarity between the initial consonants.

In an embodiment of the present invention, when the initial consonant of the syllable of the input phrase matches the initial consonant of the syllable of the profanity sample, the corresponding entry of the initial consonant similarity matrix may be 0. When the two initial consonants do not match, the entry may be a value from 0 to 1 inclusive, depending on the similarity between the initial consonants. The lower the similarity between the initial consonants, the closer the entry may be to 1.

In an embodiment of the present invention, the medial vowel distance calculation unit may calculate the medial vowel distance using a medial vowel similarity matrix that contains information on the similarity between Hangul medial vowels. The medial vowel similarity matrix may take at least three distinct values depending on the similarity between the medial vowels.

In an embodiment of the present invention, when the medial vowel of the syllable of the input phrase matches the medial vowel of the syllable of the profanity sample, the corresponding entry of the medial vowel similarity matrix may be 0. When the two medial vowels do not match, the entry may be a value from 0 to 1 inclusive, depending on the similarity between the medial vowels. The lower the similarity between the medial vowels, the closer the entry may be to 1.

In an embodiment of the present invention, the final consonant distance calculation unit may calculate the final consonant distance using a final consonant similarity matrix that contains information on the similarity between Hangul final consonants. The final consonant similarity matrix may take at least three distinct values depending on the similarity between the final consonants.

In an embodiment of the present invention, when the final consonant of the syllable of the input phrase matches the final consonant of the syllable of the profanity sample, the corresponding entry of the final consonant similarity matrix may be 0. When the two final consonants do not match, the entry may be a value from 0 to 1 inclusive, depending on the similarity between the final consonants. The lower the similarity between the final consonants, the closer the entry may be to 1.

In an embodiment of the present invention, the system may further include a threshold setting unit that sets the threshold. When the class that includes profanity is class 0 and the class that does not include profanity is class 1, the threshold setting unit may set the threshold to the value that maximizes the recall of class 1.

In an embodiment of the present invention, when the input phrase includes profanity, the profanity determination unit may use a K-NN (K-Nearest Neighbors) method to determine which of a plurality of profanity samples the input phrase corresponds to. Here, K may be 1.

A profanity detection method according to an embodiment for achieving the other object of the present invention includes: separating an input phrase into words; separating each syllable in a word of the input phrase into an initial consonant, a medial vowel, and a final consonant; calculating an initial consonant distance by comparing the initial consonant of a syllable of the input phrase with the initial consonant of a syllable of a profanity sample; calculating a medial vowel distance by comparing the medial vowel of the syllable of the input phrase with the medial vowel of the syllable of the profanity sample; calculating a final consonant distance by comparing the final consonant of the syllable of the input phrase with the final consonant of the syllable of the profanity sample; and determining whether the input phrase includes profanity by comparing the average of the initial consonant distance, the medial vowel distance, and the final consonant distance with a threshold.

In an embodiment of the present invention, the method may further include, when the number of syllables in a word of the input phrase is greater than the number of syllables in the profanity sample, splitting the word into pieces whose syllable count matches that of the profanity sample.

According to the system and method for detecting profanity based on similarity between Hangul jamo according to the present invention, an input phrase is divided into words, each syllable within a word is divided into an initial consonant, a medial vowel, and a final consonant, and profanity can be detected using an initial consonant distance, a medial vowel distance, and a final consonant distance that reflect the pronunciation similarity of Hangul.

Accordingly, not only profanity that exactly matches a profanity sample but also profanity modified from the sample can be detected, so the accuracy of profanity detection can be greatly improved.

FIG. 1 is a block diagram illustrating a profanity detection system according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating the operation of the preprocessing unit of FIG. 1.
FIG. 3 is a diagram illustrating the initial consonant similarity matrix used in the initial consonant distance calculation unit of FIG. 1.
FIG. 4 is a diagram illustrating the medial vowel similarity matrix used in the medial vowel distance calculation unit of FIG. 1.
FIG. 5 is a diagram illustrating the final consonant similarity matrix used in the final consonant distance calculation unit of FIG. 1.
FIG. 6 is a table showing an example of profanity detection accuracy according to the threshold set by the threshold setting unit of FIG. 1.
FIG. 7 is a table showing an example of profanity detection accuracy according to the threshold set by the threshold setting unit of FIG. 1.

The specific structural and functional descriptions of the embodiments of the present invention disclosed herein are given only for the purpose of describing those embodiments. The embodiments may be implemented in various forms and should not be construed as limited to the embodiments described herein.

Since the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to the specific disclosed forms, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to the other component, or intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, no intervening components are present. Other expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in the present application.

Meanwhile, where an embodiment can be implemented differently, the functions or operations specified in a particular block may occur in an order different from that specified in the flowchart. For example, two consecutive blocks may in fact be performed substantially simultaneously, or the blocks may be performed in reverse order depending on the functions or operations involved.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

FIG. 1 is a block diagram illustrating a profanity detection system according to an embodiment of the present invention. FIG. 2 is a conceptual diagram illustrating the operation of the preprocessing unit 100 of FIG. 1.

Referring to FIGS. 1 and 2, the profanity detection system based on similarity between Hangul jamo includes a preprocessing unit 100; an initial, medial, and final separation unit 300; an initial consonant distance calculation unit 400; a medial vowel distance calculation unit 500; a final consonant distance calculation unit 600; and a profanity determination unit 800. The profanity detection system may further include an n-gram setting unit 200 and a threshold setting unit 700.

The profanity detection system may be implemented through an electronic device, such as a mobile phone, smartphone, tablet, notebook computer, or desktop computer, and a software application.

The profanity detection system compares an input phrase with a plurality of profanity samples stored in a database to determine whether or not the input phrase includes profanity.

The preprocessing unit 100 separates the input phrase into words. In FIG. 2, the input phrase is "생각을 하고 사는거냐 병신". For example, the profanity samples include "씨발", "지랄", and "병신".

The preprocessing unit 100 may separate the input phrase "생각을 하고 사는거냐 병신" into the words "생각을", "하고", "사는거냐", and "병신".

Each word of the input phrase may be compared with the profanity samples. "생각을" neither matches nor resembles any profanity sample, so the profanity detection system does not judge "생각을" to be profanity. The same holds for "하고" and for "사는거냐", so neither is judged to be profanity. In contrast, "병신" matches one of the profanity samples, so the profanity detection system judges "병신" to be profanity. In this way, the preprocessing unit 100 may separate the input phrase into individual words.

The preprocessing unit 100 may remove English letters, digits, and special characters from the words of the input phrase. For example, when "병22222신", "병lll신", or "병!!!신" is input, the preprocessing unit 100 removes the English letters, digits, and special characters and converts each of them into "병신", which can improve the accuracy of profanity detection.
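
As an illustration only, the preprocessing described above could be sketched in Python roughly as follows. The function name `preprocess` and the decision to keep only characters in the precomposed Hangul syllable range (U+AC00 to U+D7A3) are assumptions of this sketch; the description states only that English letters, digits, and special characters are removed.

```python
import re

# Assumption: everything outside the precomposed Hangul syllable block is
# treated as removable (English letters, digits, special characters).
_NON_HANGUL = re.compile(r"[^\uAC00-\uD7A3]")


def preprocess(phrase: str) -> list[str]:
    """Split a phrase on whitespace and strip non-Hangul characters from each word."""
    words = []
    for raw in phrase.split():
        cleaned = _NON_HANGUL.sub("", raw)
        if cleaned:  # skip words that become empty after cleaning
            words.append(cleaned)
    return words
```

Under this sketch, the three obfuscated inputs mentioned above all reduce to the same two-syllable word, so a later comparison stage sees identical input for each of them.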

When the number of syllables in a word of the input phrase is greater than the number of syllables in the profanity sample, the n-gram setting unit 200 may split the word into pieces whose syllable count matches that of the profanity sample.

For example, when the input phrase is "씨발놈들이네", comparing it directly with the profanity sample "씨발" yields a low similarity because of the difference in word length, and the input phrase may be judged not to include profanity.

Accordingly, since the profanity sample "씨발" has two syllables, the n-gram setting unit 200 may set n to 2, the number of syllables in the profanity sample. With n set to 2, the input phrase "씨발놈들이네" may be split into the two-syllable pieces "씨발", "발놈", "놈들", "들이", and "이네".

The profanity detection system compares each of the pieces produced by the n-gram setting unit 200 ("씨발", "발놈", "놈들", "들이", "이네") with the profanity samples, and can therefore determine that the input phrase "씨발놈들이네" includes profanity.
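
The splitting step above amounts to a sliding window over syllables. A minimal Python sketch, assuming the word consists of precomposed Hangul syllables (one code point per syllable) and using the hypothetical name `syllable_ngrams`:

```python
def syllable_ngrams(word: str, n: int) -> list[str]:
    """Return every contiguous n-syllable window of `word`.

    A word that is not longer than n is returned unchanged as a single piece.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

For a six-syllable word and n = 2 this produces five overlapping pieces, which matches the example in the text.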

In the present embodiment, the profanity detection system separates the syllables in a word of the input phrase into initial consonants, medial vowels, and final consonants, separates the syllables of the profanity sample in the same way, and compares initial consonants with initial consonants, medial vowels with medial vowels, and final consonants with final consonants.

The initial, medial, and final separation unit 300 separates each syllable in a word of the input phrase into an initial consonant, a medial vowel, and a final consonant.
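
The description does not say how the separation is implemented. One common way, shown here only as a sketch, relies on the arithmetic layout of precomposed Hangul syllables in Unicode (19 initial consonants, 21 medial vowels, and 28 final positions including "no final consonant"). The names `CHOSEONG`, `JUNGSEONG`, `JONGSEONG`, and `decompose` are illustrative.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 medial vowels
# 28 final positions; index 0 means the syllable has no final consonant.
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

HANGUL_BASE = 0xAC00  # code point of "가", the first precomposed syllable
HANGUL_LAST = 0xD7A3  # code point of "힣", the last precomposed syllable


def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one precomposed Hangul syllable into (initial, medial, final)."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(code - HANGUL_BASE, 21 * 28)
    medial, final = divmod(rest, 28)
    return CHOSEONG[initial], JUNGSEONG[medial], JONGSEONG[final]
```

The final element is an empty string for open syllables, so a downstream distance function has to decide how to treat a missing final consonant.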

The initial consonant distance calculation unit 400 calculates an initial consonant distance by comparing the initial consonant of a syllable of the input phrase with the initial consonant of a syllable of the profanity sample.

The medial vowel distance calculation unit 500 calculates a medial vowel distance by comparing the medial vowel of the syllable of the input phrase with the medial vowel of the syllable of the profanity sample.

The final consonant distance calculation unit 600 calculates a final consonant distance by comparing the final consonant of the syllable of the input phrase with the final consonant of the syllable of the profanity sample.

The profanity determination unit 800 compares the average of the initial consonant distance, the medial vowel distance, and the final consonant distance with a threshold to determine whether the input phrase includes profanity.
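
A rough Python sketch of the averaging and threshold comparison follows. Several details are assumptions not fixed by the text above: the three distances are averaged over all syllables of equal-length inputs, a candidate is flagged when the average is at or below the threshold, the threshold value 0.2 is hypothetical, and `binary_distance` is only a placeholder for the similarity-matrix lookups.

```python
Jamo = tuple[str, str, str]  # (initial, medial, final) for one syllable


def binary_distance(a: str, b: str) -> float:
    """Placeholder for a similarity-matrix lookup: 0 if equal, otherwise 1."""
    return 0.0 if a == b else 1.0


def average_distance(word: list[Jamo], sample: list[Jamo], dist=binary_distance) -> float:
    """Mean of the initial, medial, and final distances over aligned syllables."""
    if not word or len(word) != len(sample):
        raise ValueError("inputs must be non-empty and have the same syllable count")
    total = 0.0
    for (i1, m1, f1), (i2, m2, f2) in zip(word, sample):
        total += dist(i1, i2) + dist(m1, m2) + dist(f1, f2)
    return total / (3 * len(word))


THRESHOLD = 0.2  # hypothetical value, for illustration only


def is_profanity(word: list[Jamo], sample: list[Jamo], threshold: float = THRESHOLD) -> bool:
    """Flag the word when its average distance to the sample is within the threshold."""
    return average_distance(word, sample) <= threshold
```

With this placeholder distance, a two-syllable candidate that differs from the sample in a single jamo has an average distance of 1/6, which falls under the hypothetical threshold.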

FIG. 3 is a diagram illustrating the initial consonant similarity matrix used in the initial consonant distance calculation unit 400 of FIG. 1.

Referring to FIG. 3, the initial consonant distance calculation unit 400 may calculate the initial consonant distance using an initial consonant similarity matrix that contains information on the similarity between Hangul initial consonants.

The initial consonant similarity matrix may take at least three distinct values depending on the similarity between the initial consonants. For example, when the initial consonant of the syllable of the input phrase matches the initial consonant of the syllable of the profanity sample, the corresponding entry of the matrix may be 0. When the two initial consonants do not match, the entry may be a value from 0 to 1 inclusive, depending on the similarity between the initial consonants. The lower the similarity between the initial consonants, the closer to 1 the entry may be set.

For example, in FIG. 3, "ㄱ" is identical to "ㄱ", so the value of "ㄱ"-"ㄱ" is 0. In terms of pronunciation and of its use for disguising profanity, "ㄲ" is extremely similar to "ㄱ", so the value of "ㄱ"-"ㄲ" is set to 0. Likewise, "ㅋ" is extremely similar to "ㄱ", so the value of "ㄱ"-"ㅋ" is set to 0. In contrast, "ㄱ" has extremely low similarity to "ㄴ", "ㄷ", "ㄹ", "ㅁ", and the like, so the values of "ㄱ"-"ㄴ", "ㄱ"-"ㄷ", "ㄱ"-"ㄹ", and "ㄱ"-"ㅁ" are set to 1.

As another example, in FIG. 3, "ㅅ" is identical to "ㅅ", so the value of "ㅅ"-"ㅅ" is 0. In terms of pronunciation and of its use for disguising profanity, "ㅆ" is extremely similar to "ㅅ", so the value of "ㅅ"-"ㅆ" is set to 0. "ㅅ" is similar to "ㅎ" to some degree, so the value of "ㅅ"-"ㅎ" is set to 0.3. In contrast, "ㅅ" has extremely low similarity to "ㄱ", "ㄴ", "ㄷ", "ㄹ", and the like, so the values of "ㅅ"-"ㄱ", "ㅅ"-"ㄴ", "ㅅ"-"ㄷ", and "ㅅ"-"ㄹ" are set to 1.
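
The values quoted above can be captured in a small lookup structure. The following Python sketch encodes only the pairs mentioned in the text; the full matrix of FIG. 3 is not reproduced here, so the default of 1.0 for unlisted pairs and the symmetry of the lookup are assumptions, and the names `INITIAL_OVERRIDES` and `initial_distance` are illustrative.

```python
# Pairs of distinct initial consonants whose value, per the text, is below 1.
# frozenset keys make the lookup order-independent (symmetry is assumed).
INITIAL_OVERRIDES = {
    frozenset("ㄱㄲ"): 0.0,
    frozenset("ㄱㅋ"): 0.0,
    frozenset("ㅅㅆ"): 0.0,
    frozenset("ㅅㅎ"): 0.3,
}


def initial_distance(a: str, b: str) -> float:
    """Look up the distance between two initial consonants."""
    if a == b:
        return 0.0
    # Assumption: any pair not listed above is treated as fully dissimilar.
    return INITIAL_OVERRIDES.get(frozenset((a, b)), 1.0)
```

Storing only the exceptions keeps the table short, at the cost of silently treating any pair missing from the table as maximally distant.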

FIG. 4 is a diagram illustrating the medial vowel similarity matrix used in the medial vowel distance calculation unit 500 of FIG. 1.

Referring to FIG. 4, the medial vowel distance calculation unit 500 may calculate the medial vowel distance using a medial vowel similarity matrix that contains information on the similarity between the medial vowels of Hangul.

The medial vowel similarity matrix may take at least three distinct values depending on the similarity between medial vowels. For example, when the medial vowel of a syllable of the input phrase matches the medial vowel of the syllable of the profanity sample, the matrix entry may be 0. When the two medial vowels do not match, the entry may be a value greater than or equal to 0 and less than or equal to 1, depending on the similarity between them. The lower the similarity between the medial vowels, the closer to 1 the entry may be set.

For example, in FIG. 4, "ㅏ" is identical to "ㅏ", so the "ㅏ"-"ㅏ" entry is 0. In terms of pronunciation and of how profanity is deliberately altered, "ㅏ" is extremely similar to "ㅑ", so the "ㅏ"-"ㅑ" entry is set to 0.1. On the same grounds, "ㅏ" is only slightly similar to "ㅓ", so the "ㅏ"-"ㅓ" entry is set to 0.7. By contrast, "ㅏ" has extremely low similarity to "ㅗ", "ㅜ", "ㅡ", "ㅣ", and the like, so the "ㅏ"-"ㅗ", "ㅏ"-"ㅜ", "ㅏ"-"ㅡ", and "ㅏ"-"ㅣ" entries are set to 1.

FIG. 5 is a diagram illustrating the final consonant similarity matrix used in the final consonant distance calculation unit 600 of FIG. 1.

Referring to FIG. 5, the final consonant distance calculation unit 600 may calculate the final consonant distance using a final consonant similarity matrix that contains information on the similarity between the final consonants of Hangul.

The final consonant similarity matrix may take at least three distinct values depending on the similarity between final consonants. For example, when the final consonant of a syllable of the input phrase matches the final consonant of the syllable of the profanity sample, the matrix entry may be 0. When the two final consonants do not match, the entry may be a value greater than or equal to 0 and less than or equal to 1, depending on the similarity between them. The lower the similarity between the final consonants, the closer to 1 the entry may be set.

The final consonants of FIG. 5 resemble initial consonants in that they are consonants, but they differ greatly from initial consonants in pronunciation, so the final consonant similarity matrix may be formed independently of the initial consonant similarity matrix.

For example, in FIG. 5, "ㄷ" is identical to "ㄷ", so the "ㄷ"-"ㄷ" entry is 0. In terms of pronunciation and of how profanity is deliberately altered, "ㄷ" is somewhat similar to "ㅅ", so the "ㄷ"-"ㅅ" entry is set to 0.3. On the same grounds, "ㄷ" is somewhat similar to "ㅌ", so the "ㄷ"-"ㅌ" entry is set to 0.3. By contrast, "ㄷ" has extremely low similarity to "ㄱ", "ㄴ", "ㄹ", "ㅁ", and the like, so the "ㄷ"-"ㄱ", "ㄷ"-"ㄴ", "ㄷ"-"ㄹ", and "ㄷ"-"ㅁ" entries are set to 1.

For example, the final consonant similarity matrix of FIG. 5 may include an entry for the case in which a syllable has no final consonant ("none"). The matrix may also include entries for compound final consonants such as ㄺ, ㄻ, ㄼ, ㄽ, and ㅄ.
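The "none" entry and the compound final consonants correspond to the 28 final-consonant slots of a precomposed Hangul syllable in Unicode. The sketch below is illustrative only. It assumes that syllables are separated with the standard Unicode syllable arithmetic, which this section does not specify, and the names `INITIALS`, `MEDIALS`, `FINALS`, and `decompose` are hypothetical.

```python
# Jamo tables in Unicode order for precomposed syllables (U+AC00..U+D7A3).
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
# Index 0 stands for "no final consonant"; compound finals are included.
FINALS = ["none"] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(syllable: str):
    """Split one Hangul syllable into (initial, medial, final)."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial = INITIALS[index // (21 * 28)]
    medial = MEDIALS[(index % (21 * 28)) // 28]
    final = FINALS[index % 28]
    return initial, medial, final


print(decompose("씨"))  # ('ㅆ', 'ㅣ', 'none')
print(decompose("닭"))  # ('ㄷ', 'ㅏ', 'ㄺ')
```

Representing the missing final consonant as an explicit "none" value lets it be looked up in the matrix like any other final consonant.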

As described above, the profanity determination unit 800 may determine whether the input phrase includes profanity by comparing the average of the initial consonant distance, the medial vowel distance, and the final consonant distance with a threshold.

For example, when a word of the input phrase is "기발" and the profanity sample is "씨발", the first-syllable initial consonant distance between "ㄱ" and "ㅆ" is 1, the first-syllable medial vowel distance between "ㅣ" and "ㅣ" is 0, and the first-syllable final consonant distance between "none" and "none" is 0. The second-syllable initial consonant distance between "ㅂ" and "ㅂ" is 0, the second-syllable medial vowel distance between "ㅏ" and "ㅏ" is 0, and the second-syllable final consonant distance between "ㄹ" and "ㄹ" is 0. The average distance between the input word and the profanity sample is therefore 1/6, or about 0.167. If the threshold is 0.1, "기발" is not judged to be profanity.

As another example, when a word of the input phrase is "싀발" and the profanity sample is "씨발", the first-syllable initial consonant distance between "ㅅ" and "ㅆ" is 0, the first-syllable medial vowel distance between "ㅢ" and "ㅣ" is 0.1, and the first-syllable final consonant distance between "none" and "none" is 0. The second-syllable initial consonant distance between "ㅂ" and "ㅂ" is 0, the second-syllable medial vowel distance between "ㅏ" and "ㅏ" is 0, and the second-syllable final consonant distance between "ㄹ" and "ㄹ" is 0. The average distance between the input word and the profanity sample is therefore 0.1/6, or about 0.0167. If the threshold is 0.1, "싀발" is judged to be profanity.
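The two worked examples can be reproduced with a short sketch. It is an illustration under stated assumptions rather than the patented implementation: the words are supplied already separated into (initial, medial, final) triples, only the matrix entries used in the examples are encoded, unlisted pairs default to 1, and all function and variable names are hypothetical.

```python
def jamo_distance(a, b, table):
    """0 for identical jamo, a listed value for similar jamo, otherwise 1."""
    if a == b:
        return 0.0
    return table.get(frozenset((a, b)), 1.0)


# Only the entries needed for the two examples; other pairs default to 1.
INITIAL = {frozenset("ㅅㅆ"): 0.0}
MEDIAL = {frozenset("ㅢㅣ"): 0.1}
FINAL = {}


def average_distance(word, sample):
    """word and sample are equal-length lists of (initial, medial, final)."""
    total = 0.0
    for (wi, wm, wf), (si, sm, sf) in zip(word, sample):
        total += jamo_distance(wi, si, INITIAL)
        total += jamo_distance(wm, sm, MEDIAL)
        total += jamo_distance(wf, sf, FINAL)
    return total / (3 * len(sample))


SAMPLE = [("ㅆ", "ㅣ", "none"), ("ㅂ", "ㅏ", "ㄹ")]   # profanity sample
GIBAL = [("ㄱ", "ㅣ", "none"), ("ㅂ", "ㅏ", "ㄹ")]    # "기발"
VARIANT = [("ㅅ", "ㅢ", "none"), ("ㅂ", "ㅏ", "ㄹ")]  # first syllable ㅅ + ㅢ
THRESHOLD = 0.1

for name, word in (("기발", GIBAL), ("variant", VARIANT)):
    d = average_distance(word, SAMPLE)
    print(name, round(d, 4), d <= THRESHOLD)
```

The loop prints 0.1667 with False for the first word and 0.0167 with True for the second, matching the 1/6 and 0.1/6 averages in the text.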

When the input phrase includes profanity, the profanity determination unit 800 may use the K-Nearest Neighbors (K-NN) method to determine which of a plurality of profanity samples the input phrase corresponds to.

Here, if K is set to a large number, the input may be classified as a different profanity even though a closest profanity exists, so K may be set to 1. That is, the profanity determination unit 800 may classify the profanity using the 1-NN method.
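With K = 1, the classification reduces to picking the sample at the smallest distance. The sketch below shows that selection step only. The function `nearest_sample` is a hypothetical name, and `toy_distance` is a stand-in used to keep the example self-contained; it is not the averaged jamo distance of this embodiment.

```python
def nearest_sample(word, samples, distance):
    """Return (sample, distance) for the sample closest to word (1-NN)."""
    best = min(samples, key=lambda s: distance(word, s))
    return best, distance(word, best)


def toy_distance(a, b):
    # Stand-in distance: fraction of positions that differ.
    # Assumes both strings have the same length.
    return sum(x != y for x, y in zip(a, b)) / len(b)


# Placeholder strings, not real samples.
print(nearest_sample("abcd", ["abcf", "xbcf", "wxyz"], toy_distance))
```

In a full system the `distance` argument would be the averaged initial, medial, and final distance, and `samples` would be the stored profanity samples.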

FIG. 6 is a table showing an example of profanity detection accuracy as a function of the threshold set by the threshold setting unit of FIG. 1. FIG. 7 is a table showing an example of profanity detection accuracy as a function of the threshold set by the threshold setting unit of FIG. 1.

Referring to FIGS. 1 to 7, when the average of the initial consonant, medial vowel, and final consonant distances between a word of the input phrase and the profanity sample is less than or equal to the threshold, the word may be judged to include the profanity. Conversely, when that average distance is greater than the threshold, the word may be judged not to include the profanity.

When one word is compared with another, the distance may increase by 1 for each insertion, deletion, or substitution. In this embodiment, the initial consonant, medial vowel, and final consonant are compared separately, so replacing any one of them with a dissimilar one yields a distance of 1/3. However, as the initial consonant, medial vowel, and final consonant similarity matrices show, when any one of them is replaced with a relatively similar one, the matrix entry is less than 1, so the distance may be less than 1/3.
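One way to write this down is shown below. The notation is introduced here for illustration and does not appear in the source: $d_I$, $d_M$, and $d_F$ denote the entries of the initial consonant, medial vowel, and final consonant similarity matrices, $w_i$ and $s_i$ denote the $i$-th syllables of the input word and the profanity sample, and $n$ is the number of syllables.

```latex
% Per-syllable distance: the mean of the three jamo distances.
d(w_i, s_i) = \frac{d_I(w_i, s_i) + d_M(w_i, s_i) + d_F(w_i, s_i)}{3}

% Word-level average distance over n syllables.
D(w, s) = \frac{1}{n} \sum_{i=1}^{n} d(w_i, s_i)
        = \frac{1}{3n} \sum_{i=1}^{n} \bigl( d_I(w_i, s_i) + d_M(w_i, s_i) + d_F(w_i, s_i) \bigr)
```

Replacing a single jamo with a dissimilar one gives $d(w_i, s_i) = 1/3$ for that syllable. For a two-syllable word this yields $D = 1/6$, which agrees with the "기발" example.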

For example, the threshold may be set to a value less than 0.5. FIG. 6 shows the accuracy, precision, and recall for thresholds of 0.2, 0.3, 0.4, and 0.5. The data set used in FIG. 6 includes 709 input phrases that do not contain profanity (Class 0) and 747 input phrases that contain profanity (Class 1). The proportion of profanity in this data set is 51.3%, which is markedly higher than the proportion of profane comments among typical comments.

The accuracy varies with the threshold, but it averaged 79% and peaked at 80.15%. The profanity detection system of this embodiment is markedly more accurate than a conventional system that simply filters against a list of profanity in a database, which reaches 48%.

However, catching profanity matters more, even at the cost of wrongly judging a profanity-free sentence to contain profanity. Accordingly, where the class containing profanity is Class 0 and the class not containing profanity is Class 1, the threshold setting unit 700 may set as the threshold the value that maximizes the recall of Class 1. In FIG. 6, the threshold may be set to 0.4, at which the recall of Class 1 reaches its maximum of 0.89.
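A recall-driven choice of threshold could be sketched as follows. This is an illustration under stated assumptions: the class whose recall is maximized is taken to be the profanity-containing class (Class 1 in the description of the FIG. 6 data set), ties are broken in favor of the smaller threshold, and the names `recall_at` and `pick_threshold` are hypothetical. The distances and labels in the example are made up to exercise the code and are not from FIG. 6.

```python
def recall_at(distances, labels, threshold):
    """Recall of the profanity class when distance <= threshold is flagged."""
    actual = sum(1 for y in labels if y)
    caught = sum(1 for d, y in zip(distances, labels) if y and d <= threshold)
    return caught / actual if actual else 0.0


def pick_threshold(distances, labels, candidates):
    """Return the candidate threshold with the highest recall."""
    # Assumption: on ties, max() keeps the first (smallest) candidate.
    return max(sorted(candidates),
               key=lambda t: recall_at(distances, labels, t))


# Hypothetical average distances and labels (True = contains profanity).
distances = [0.0, 0.15, 0.35, 0.45, 0.05]
labels = [True, True, True, False, False]
print(pick_threshold(distances, labels, [0.2, 0.3, 0.4, 0.5]))  # 0.4
```

Under this decision rule, raising the threshold can only flag more words, so recall never decreases. The tie-break toward the smaller threshold limits the number of ordinary words that are flagged.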

FIG. 7 shows the accuracy, precision, and recall for a data set that reflects real conditions more closely than the data set of FIG. 6, with a profanity proportion of 26% (Class 0: 710, Class 1: 250).

In FIG. 7, the precision of Class 0 was 96% and the recall of Class 1 was 93%. That is, the profanity detection system of this embodiment detects most actual profanity and misclassifies only 4% of ordinary words.

Here, let a be the number of cases in which the correct answer is True and the system predicts True, b the number in which the correct answer is True and the system predicts False, c the number in which the correct answer is False and the system predicts True, and d the number in which the correct answer is False and the system predicts False. Then the precision is a/(a+c), the recall is a/(a+b), and the accuracy is (a+d)/(a+b+c+d).
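These three definitions translate directly into code. The sketch below takes a and d as the correct predictions, b as the missed positives, and c as the false alarms, which is what the precision formula a/(a+c) requires. The function name is hypothetical, and the counts in the example are made up to exercise the formulas; they are not taken from FIG. 6 or FIG. 7.

```python
def classification_metrics(a, b, c, d):
    """a: true positives, b: false negatives,
    c: false positives, d: true negatives."""
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, accuracy


# Hypothetical counts, chosen only to illustrate the formulas.
print(classification_metrics(a=90, b=10, c=30, d=70))  # (0.75, 0.9, 0.8)
```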

According to this embodiment, an input phrase is divided into words, each syllable within a word is divided into an initial consonant, a medial vowel, and a final consonant, and profanity can be detected using an initial consonant distance, a medial vowel distance, and a final consonant distance that reflect the pronunciation similarity of Hangul.

The present invention can be applied to electronic devices such as mobile phones, smartphones, tablets, notebook computers, and desktop computers.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the following claims.

Claims

1. A profanity detection system comprising:
a preprocessing unit that separates an input phrase into words;
an initial, medial, and final separation unit that separates each syllable in a word of the input phrase into an initial consonant, a medial vowel, and a final consonant;
an initial consonant distance calculation unit that calculates an initial consonant distance by comparing the initial consonant of the syllable of the input phrase with the initial consonant of a syllable of a profanity sample;
a medial vowel distance calculation unit that calculates a medial vowel distance by comparing the medial vowel of the syllable of the input phrase with the medial vowel of the syllable of the profanity sample;
a final consonant distance calculation unit that calculates a final consonant distance by comparing the final consonant of the syllable of the input phrase with the final consonant of the syllable of the profanity sample; and
a profanity determination unit that determines whether the input phrase includes profanity by comparing an average distance of the initial consonant distance, the medial vowel distance, and the final consonant distance with a threshold.

2. The profanity detection system of claim 1, wherein the preprocessing unit removes alphabetic characters, numbers, and special characters from the words of the input phrase.

3. The profanity detection system of claim 1, further comprising an n-gram setting unit that, when the number of syllables in a word of the input phrase is greater than the number of syllables in the profanity sample, splits the word of the input phrase to match the number of syllables in the profanity sample.

4. The profanity detection system of claim 1, wherein the initial consonant distance calculation unit calculates the initial consonant distance using an initial consonant similarity matrix containing information on the similarity between the initial consonants of Hangul, and
the initial consonant similarity matrix has at least three values depending on the similarity between the initial consonants.

5. The profanity detection system of claim 4, wherein, if the initial consonant of the syllable of the input phrase matches the initial consonant of the syllable of the profanity sample, the initial consonant similarity matrix has a value of 0,
if the initial consonant of the syllable of the input phrase does not match the initial consonant of the syllable of the profanity sample, the initial consonant similarity matrix has a value greater than or equal to 0 and less than or equal to 1 depending on the similarity between the initial consonants, and
the lower the similarity between the initial consonants, the closer to 1 the value of the initial consonant similarity matrix.

6. The profanity detection system of claim 4, wherein the medial vowel distance calculation unit calculates the medial vowel distance using a medial vowel similarity matrix containing information on the similarity between the medial vowels of Hangul, and
the medial vowel similarity matrix has at least three values depending on the similarity between the medial vowels.

7. The profanity detection system of claim 6, wherein, if the medial vowel of the syllable of the input phrase matches the medial vowel of the syllable of the profanity sample, the medial vowel similarity matrix has a value of 0,
if the medial vowel of the syllable of the input phrase does not match the medial vowel of the syllable of the profanity sample, the medial vowel similarity matrix has a value greater than or equal to 0 and less than or equal to 1 depending on the similarity between the medial vowels, and
the lower the similarity between the medial vowels, the closer to 1 the value of the medial vowel similarity matrix.

8. The profanity detection system of claim 6, wherein the final consonant distance calculation unit calculates the final consonant distance using a final consonant similarity matrix containing information on the similarity between the final consonants of Hangul, and
the final consonant similarity matrix has at least three values depending on the similarity between the final consonants.

9. The profanity detection system of claim 8, wherein, if the final consonant of the syllable of the input phrase matches the final consonant of the syllable of the profanity sample, the final consonant similarity matrix has a value of 0,
if the final consonant of the syllable of the input phrase does not match the final consonant of the syllable of the profanity sample, the final consonant similarity matrix has a value greater than or equal to 0 and less than or equal to 1 depending on the similarity between the final consonants, and
the lower the similarity between the final consonants, the closer to 1 the value of the final consonant similarity matrix.

10. The profanity detection system of claim 1, further comprising a threshold setting unit that sets the threshold,
wherein, when the class containing the profanity is Class 0 and the class not containing the profanity is Class 1, the threshold setting unit sets as the threshold a value that maximizes the recall of Class 1.

11. The profanity detection system of claim 1, wherein, when the input phrase includes the profanity, the profanity determination unit uses a K-Nearest Neighbors (K-NN) method to determine which of a plurality of profanity samples the input phrase corresponds to, and
K = 1.

12. A method of detecting profanity, the method comprising:
separating an input phrase into words;
separating each syllable in a word of the input phrase into an initial consonant, a medial vowel, and a final consonant;
calculating an initial consonant distance by comparing the initial consonant of the syllable of the input phrase with the initial consonant of a syllable of a profanity sample;
calculating a medial vowel distance by comparing the medial vowel of the syllable of the input phrase with the medial vowel of the syllable of the profanity sample;
calculating a final consonant distance by comparing the final consonant of the syllable of the input phrase with the final consonant of the syllable of the profanity sample; and
determining whether the input phrase includes profanity by comparing an average distance of the initial consonant distance, the medial vowel distance, and the final consonant distance with a threshold.

13. The method of claim 12, further comprising, when the number of syllables in a word of the input phrase is greater than the number of syllables in the profanity sample, splitting the word of the input phrase to match the number of syllables in the profanity sample.