KR20190002202A

KR20190002202A - Method and Device for Detecting Slang Based on Learning

Info

Publication number: KR20190002202A
Application number: KR1020170082767A
Authority: KR
Inventors: 한요섭; 이호석; 이홍래
Original assignee: 연세대학교 산학협력단
Priority date: 2017-06-29
Filing date: 2017-06-29
Publication date: 2019-01-08
Also published as: KR102034346B1

Abstract

Disclosed are a method and a device for detecting slang based on learning. The device comprises: a vector extraction unit to extract a feature vector of a word to be detected from a feature vector database storing word-feature vectors; and a slang determination unit to extract N words corresponding to N feature vectors similar to the feature vector of the word to be detected from the feature vector database, and determine the word to be detected as slang if at least one among the extracted N words and the word to be detected is slang stored in a slang database. The feature vectors stored in the feature vector database are set by learning. The learning is carried out to allow words with a high possibility of appearance with a specific word in a learned sentence to have similar vectors. According to the device and the method, sentences indirectly expressing existing slang and slang of a new form are efficiently detected to prevent cyber violence.

Description

BACKGROUND OF THE INVENTION 1. Field of the Invention

Embodiments of the present invention relate to an apparatus and method for detecting profanity (slang), and more particularly to a learning-based profanity detection apparatus and method.

Users actively exchange opinions on the Internet through bulletin boards, news sites, and communities, and even outside public bulletin boards, private chat messages are frequently exchanged online.

On the Internet, profanity and abusive language are created and used faster than in any other environment, and such profanity tends to spread quickly. As a result, cyber violence, which in the past drew attention mainly as a problem affecting adolescents, now occurs across all age groups, and in many cases content is filled with profanity and abuse directed at other users rather than with an exchange of opinions.

As the indiscriminate use of profanity on the Internet has grown into cyber violence, the problem has been actively studied. The technique mainly used is to define disallowed profanity in advance (a blacklist) and to filter a word when it is used.

However, such blacklist-based filtering is in practice ineffective, because users coin new profanity and express profanity in roundabout ways. A user who knows that a particular profane word is filtered may use an indirect variant that is still recognizable as that word, or may create a new form of profanity that is not on the existing blacklist. Because new forms of profanity spread and come into use quickly online, existing blacklist techniques cannot cope with this form of cyber violence.

The present invention proposes a profanity detection apparatus and method that can prevent cyber violence in advance by efficiently detecting profanity that indirectly expresses existing profanity as well as new forms of profanity.

According to one aspect of the present invention, there is provided a learning-based profanity detection apparatus comprising: a feature vector extraction unit that extracts the feature vector of a detection target word from a feature vector database storing word-feature vectors; and a profanity determination unit that extracts, from the feature vector database, N words corresponding to N feature vectors similar to the feature vector of the detection target word, and determines that the detection target word is profanity if at least one of the extracted N words and the detection target word is a profane word stored in a profanity database, wherein the feature vectors stored in the feature vector database are set through learning, and the learning is performed so that words with a high likelihood of co-occurring with a specific word in the learned sentences have feature vectors similar to that of the specific word.

The learning-based profanity detection apparatus further includes a sentence analysis unit that analyzes an input sentence to extract detection target words and performs part-of-speech tagging on each detection target word.

The learning-based profanity detection apparatus further includes a feature vector learning unit that sets feature vectors for words through learning, using sentences input during a learning process.

The feature vector learning unit learns a word ID vector for each word and a transformation vector for transforming the word ID vector.

The feature vector learning unit includes a vector setting/updating unit and an error function feedback module. Initially, the word ID vector of each word and the transformation vector are set arbitrarily. The error function feedback module provides the vector setting/updating unit with an error value based on the co-occurrence probability of a specific word and a word that co-occurs with it and on the difference between the feature vectors of the two words, and the vector setting/updating unit updates the word ID vector of each word and the transformation vector based on the error value.

According to another aspect of the present invention, there is provided a learning-based profanity detection apparatus comprising: a feature vector extraction unit that extracts the feature vector of a detection target word from a feature vector database storing word-feature vectors; and a profanity determination unit that extracts, from the feature vector database, N words corresponding to N feature vectors similar to the feature vector of the detection target word, and determines that the detection target word is profanity if at least one of the extracted N words and the detection target word is a profane word stored in a profanity database, wherein the values of the feature vectors stored in the feature vector database are set based on co-occurrence probabilities between words.

According to yet another aspect of the present invention, there is provided a learning-based profanity detection method comprising: (a) extracting the feature vector of a detection target word from a feature vector database storing word-feature vectors; and (b) extracting, from the feature vector database, N words corresponding to N feature vectors similar to the feature vector of the detection target word, and determining that the detection target word is profanity if at least one of the extracted N words and the detection target word is a profane word stored in a profanity database, wherein the feature vectors stored in the feature vector database are set through learning, and the learning is performed so that words with a high likelihood of co-occurring with a specific word in the learned sentences have feature vectors similar to that of the specific word.

According to still another aspect of the present invention, there is provided a learning-based profanity detection method comprising: (a) extracting the feature vector of a detection target word from a feature vector database storing word-feature vectors; and (b) extracting, from the feature vector database, N words corresponding to N feature vectors similar to the feature vector of the detection target word, and determining that the detection target word is profanity if at least one of the extracted N words and the detection target word is a profane word stored in a profanity database, wherein the values of the feature vectors stored in the feature vector database are set based on co-occurrence probabilities between words.

According to the present invention, cyber violence can be prevented in advance by efficiently detecting profanity that indirectly expresses existing profanity as well as new forms of profanity.

FIG. 1 is a block diagram showing the overall configuration of a profanity detection apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the structure of a word-feature vector learning unit according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of setting a word ID vector for each word according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of feature vectors set as learning is performed according to an embodiment of the present invention.
FIG. 5 is a flowchart showing the overall flow of a learning-based profanity detection method according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope. Like reference numerals are used for like elements in describing each drawing.

Hereinafter, embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the overall configuration of a profanity detection apparatus according to an embodiment of the present invention.

Referring to FIG. 1, a profanity detection apparatus according to an embodiment of the present invention may include a sentence analysis unit 100, a feature vector extraction unit 120, a word-feature vector learning unit 130, a word-feature vector database 140, a profanity determination unit 150, and a profanity database 160.

A profanity detection apparatus according to an embodiment of the present invention receives a sentence written by a user and detects whether the sentence contains profanity. Here, the sentence may be one contained in a document, comment, or the like that the user posts to a bulletin board. When the user requests upload of a finished document, the apparatus receives the sentences contained in that document and determines whether profanity is present.

The present invention detects profanity through learning, and the profanity detection apparatus operates in two modes: a learning mode and a detection mode. The description below takes the detection mode, in which learning has been completed, as the main embodiment; operation in the learning mode is described with reference to separate drawings.

The sentence analysis unit 100 analyzes an input sentence and outputs detection target words. It removes particles (postpositions), word endings, and other unnecessary parts, and extracts the detection target words. For example, given the sentence “아버지가 방에 들어가신다” (“Father enters the room”), the “가” attached to “아버지” (father) and the “에” attached to “방” (room) are particles, and the “다” of “들어가신다” (enters) is an ending, so these parts are removed and “아버지”, “방”, and “들어가신” are output as detection target words. The sentence analysis unit 100 may also perform part-of-speech tagging on the extracted detection target words.
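As a non-limiting illustration, the particle and ending removal described above can be sketched with a toy suffix stripper. The names `PARTICLES`, `ENDINGS`, and `extract_target_words` are hypothetical, and the suffix lists cover only the example sentence; a practical sentence analysis unit would presumably rely on a Korean morphological analyzer rather than this kind of string matching.

```python
# Toy illustration only: PARTICLES and ENDINGS are hypothetical, deliberately
# tiny lists that cover just the example sentence from the description.
PARTICLES = ("가", "에")
ENDINGS = ("다",)

def extract_target_words(sentence):
    """Strip one trailing particle or ending from each space-separated token."""
    words = []
    for token in sentence.split():
        for suffix in PARTICLES + ENDINGS:
            # Only strip when something remains after removing the suffix.
            if len(token) > len(suffix) and token.endswith(suffix):
                token = token[: -len(suffix)]
                break
        words.append(token)
    return words

print(extract_target_words("아버지가 방에 들어가신다"))
```

Naive suffix matching of this kind would mis-handle words that merely end in the same syllable, which is why it is offered only as a sketch of the input and output of the unit.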

The feature vector extraction unit 110 extracts the feature vector of each detection target word output from the sentence analysis unit 100. The feature vector of a given word is stored in the word-feature vector database 140 as a result of learning, and the feature vector extraction unit 110 extracts the feature vector corresponding to the detection target word.

Here, a feature vector is a vector of feature values assigned to a specific word through learning. Its values are determined based on the words that co-occur with that word within a given sentence or a given window during the learning process.

Learning to determine the feature vector of each word is performed by the word-feature vector learning unit 130. Various sentences are input to the word-feature vector learning unit 130 for learning. The unit analyzes the sentences, determines how frequently words appear together, and performs learning so that the feature vector of a specific word and the feature vectors of words that co-occur with it with high probability take similar values.
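As a non-limiting illustration, the co-occurrence statistics mentioned above could be gathered as follows. The description does not define how the co-occurrence probability is computed, so the window size and the normalization used here (a pair's share of all counted pairs) are assumptions, and the function names are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs appearing within `window` positions of each other."""
    counts = Counter()
    for words in sentences:
        for i, word in enumerate(words):
            for other in words[i + 1 : i + 1 + window]:
                if other != word:
                    counts[tuple(sorted((word, other)))] += 1
    return counts

def cooccurrence_probability(counts, a, b):
    """Assumed definition: the pair's share of all counted pairs."""
    total = sum(counts.values())
    return counts[tuple(sorted((a, b)))] / total if total else 0.0

sentences = [["king", "loves", "queen"], ["queen", "loves", "king"]]
counts = cooccurrence_counts(sentences)
print(counts[("king", "queen")], cooccurrence_probability(counts, "king", "queen"))
```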

For example, a word ID vector (or matrix) is set for each word learned by the word-feature vector learning unit 130, and a transformation vector (or matrix) for transforming the word ID vector is set. The word-feature vector learning unit 130 learns the transformation vector and the word ID vectors so that the feature vectors of words that co-occur with a specific word become similar to the feature vector of that word. It will be apparent to those skilled in the art that the transformation vector may be divided into a plurality of sub-vectors.

FIG. 2 is a block diagram showing the structure of a word-feature vector learning unit according to an embodiment of the present invention.

Referring to FIG. 2, the word-feature vector learning unit includes a preprocessing unit 200, a sentence analysis unit 210, a vector setting/updating unit 220, and an error function feedback module 230.

The preprocessing unit 200 preprocesses the sentences input for learning. The preprocessing removes duplicate sentences and sentences recognized as articles or advertisements. For example, a retweeted sentence is clearly a duplicate, so it is removed by the preprocessing unit 200.
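As a non-limiting illustration, the preprocessing described above might look like the sketch below. The `"RT "` prefix as a retweet marker and the pluggable `is_ad_or_article` predicate are assumptions made for the example; the description does not say how retweets, articles, or advertisements are recognized.

```python
def preprocess(sentences, is_ad_or_article=lambda s: False):
    """Drop exact duplicates, retweets, and sentences flagged as ads or articles."""
    seen = set()
    kept = []
    for sentence in sentences:
        text = sentence.strip()
        if text.startswith("RT "):  # assumed retweet marker
            continue
        if text in seen or is_ad_or_article(text):
            continue
        seen.add(text)
        kept.append(text)
    return kept

print(preprocess(["hello", "RT hello", "hello", "bye"]))
```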

The sentence analysis unit 210 removes unnecessary particles and endings from the sentences preprocessed by the preprocessing unit 200 and performs tagging, which assigns a part of speech to each word. The sentence analysis performed by the sentence analysis unit 210 is the same as that performed by the sentence analysis unit 100.

The vector setting/updating unit 220 and the error function feedback module 230 are the modules that determine the final feature vectors by updating the word ID vectors and the transformation vector as learning proceeds.

At the start of learning, the word ID vector of each word and the transformation vector are set arbitrarily. As learning continues, information accumulates about the words that frequently co-occur with a specific word, from which co-occurrence probabilities can be obtained. The word ID vectors and the transformation vector are updated on this basis to determine the final feature vectors.

The error function feedback module 230 feeds an error value back to the vector setting/updating unit 220 based on the co-occurrence probability of a specific word and a word that co-occurs with it and on the feature vector values of the two words. The vector setting/updating unit 220 updates the word ID vector and the feature vector based on the error value fed back from the error function feedback module 230.

If the co-occurrence probability of a first word and a second word is high but the difference between their feature vectors is large, the error function feedback module 230 feeds back a relatively large error value, and the vector setting/updating unit 220 updates the word ID vectors and the transformation vector on that basis.
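As a non-limiting illustration of this feedback loop, the sketch below uses an assumed error function, namely the co-occurrence probability multiplied by the squared distance between the two feature vectors, and applies one gradient step. The description does not disclose the actual error function, and for brevity the sketch updates the feature vectors directly rather than the word ID vectors and the transformation vector. The names `error_value` and `update_step` are hypothetical.

```python
def error_value(p_cooccur, vec_a, vec_b):
    """Assumed error: co-occurrence probability times squared vector distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(vec_a, vec_b))
    return p_cooccur * squared_distance

def update_step(p_cooccur, vec_a, vec_b, learning_rate=0.1):
    """Move both vectors toward each other along the gradient of the error."""
    new_a, new_b = [], []
    for a, b in zip(vec_a, vec_b):
        grad = 2.0 * p_cooccur * (a - b)  # d(error)/d(a); d(error)/d(b) is -grad
        new_a.append(a - learning_rate * grad)
        new_b.append(b + learning_rate * grad)
    return new_a, new_b

a, b = [1.0, 0.0], [0.0, 1.0]
before = error_value(0.5, a, b)
a, b = update_step(0.5, a, b)
print(before, error_value(0.5, a, b))
```

With this assumed error, a frequently co-occurring pair with distant vectors produces a large error and a correspondingly large correction, which matches the behavior described above. A practical training objective would also need some mechanism that keeps unrelated words apart, which this sketch omits.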

The learning performed in the present invention is described below using actual sentences as an example.

FIG. 3 is a diagram showing an example of setting a word ID vector for each word according to an embodiment of the present invention.

FIG. 3 shows a case in which a total of six sentences are input, but this is only for convenience of explanation; it will be apparent to those skilled in the art that actual learning is performed with a far larger number of sentences.

The sentences given in FIG. 3 are as follows.

-the king loves the queen.

-the queen loves the king.

-the dwarf hates the king.

-the queen hates the dwarf.

-the dwarf poisons the king.

The words extracted from the above sentences are the, king, loves, queen, dwarf, hates, and poisons, for a total of seven distinct words.

Initially, the word ID vector of each word is set arbitrarily. As shown in FIG. 3, “the” may be set to 1,0,0,0,0,0,0 and “king” to 0,1,0,0,0,0,0. In the example of FIG. 3, a seven-dimensional vector is used because seven words are given, but because tens of millions of words may be learned in practice, it will be apparent to those skilled in the art that actual word ID vectors may take a different form.
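As a non-limiting illustration, the initial word ID vectors of the seven-word example can be generated as below. The vocabulary order follows the order in which the words are listed above; the names `VOCAB` and `word_id_vector` are hypothetical.

```python
# The seven words of the example, in the order listed in the description.
VOCAB = ["the", "king", "loves", "queen", "dwarf", "hates", "poisons"]

def word_id_vector(word, vocab=VOCAB):
    """Return a vector with a single 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

print(word_id_vector("the"))   # [1, 0, 0, 0, 0, 0, 0]
print(word_id_vector("king"))  # [0, 1, 0, 0, 0, 0, 0]
```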

As shown in FIG. 3, the word ID vectors are initially set arbitrarily, but as learning proceeds, the word ID vectors and the transformation vector that converts them into feature vectors are continually updated.

FIG. 4 is a diagram showing an example of feature vectors set as learning is performed according to an embodiment of the present invention.

As described above, the feature vector of a specific word can be determined by the product of the word ID vector of that word and the transformation vector, and FIG. 4 shows an example of feature vectors determined in this way.
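As a non-limiting illustration of this product, the sketch below multiplies a 1 x V word ID vector by a V x D transformation matrix to obtain a D-dimensional feature vector. The matrix values are made up for the example and do not come from FIG. 4; `feature_vector` is a hypothetical name.

```python
def feature_vector(word_id, transform):
    """Multiply a 1 x V word ID vector by a V x D transformation matrix."""
    dims = len(transform[0])
    return [
        sum(word_id[i] * transform[i][d] for i in range(len(word_id)))
        for d in range(dims)
    ]

# Made-up 3-word, 3-dimensional transformation matrix (one row per word).
transform = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]
print(feature_vector([0, 1, 0], transform))
```

When the word ID vector contains a single 1, the product simply selects the matching row of the transformation matrix, so learning the matrix amounts to learning one feature vector per word.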

Referring to FIG. 4, the feature vectors of the words shown in FIG. 3 are displayed, and each feature vector has three dimensions. In FIG. 3, w_k1 is the feature value of the first dimension, w_k2 is the feature value of the second dimension, and w_k3 is the feature value of the third dimension.

If the co-occurrence probability of a first word and a second word is high, the feature vectors of the first word and the second word are set to be similar.

At present, Internet service providers maintain their own profanity databases, and when a sentence written by a user contains profanity, the only measures offered are to mark it in some way or to block the upload of that sentence.

Internet users who are familiar with how service providers handle profanity constantly produce new profanity, and cyber violence that uses such newly coined profanity cannot currently be countered at all. Moreover, existing semantic or morphological inference alone cannot accurately determine whether a newly coined word is in fact profanity, so such a word could not be detected unless it had been in use long enough to be registered in the profanity database. By the time it is registered, yet another newly coined word is already in use, so cyber violence based on newly coined profanity has been difficult to counter in practice.

The present invention detects profanity in an entirely different way, based on the observation that newly coined profanity does not differ greatly from existing profanity in the sentence structures in which it is used or in the words used alongside it.

For example, given a sentence such as “야 이 (existing profanity)야” (roughly, “Hey, you (existing profanity)!”), the existing profanity will be detected and filtered. In that case, users typically keep the rest of the sentence unchanged and rewrite it as “야 이 (newly coined profanity)야”.

As a result, existing profanity often co-occurs with the words “야” and “이”, and newly coined profanity derived from it also often co-occurs with “야” and “이”. Therefore, once the learning according to the present invention is performed and the feature vectors are set, the newly coined profanity is likely to have a feature vector very similar to that of the existing profanity.

Referring again to FIG. 1, the feature vector extraction unit 110 extracts the feature vector of the detection target word from the feature vector database, which stores the feature vectors set through the learning described above.

The profanity determination unit 150 additionally extracts, from the feature vector database, N words whose feature vectors are similar to the feature vector of the detection target word. According to an embodiment of the present invention, cosine similarity may be used to extract these N words. Cosine similarity is a well-known technique, so a detailed description is omitted, and it will be apparent to those skilled in the art that the number N of extracted words can be set freely.
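As a non-limiting illustration, extraction of the N most similar words by cosine similarity could be sketched as follows. The feature vector database is modeled as a plain dictionary from word to vector, and the names `cosine_similarity` and `most_similar` are hypothetical; a large vocabulary would likely call for an indexed nearest-neighbor search instead of this linear scan.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(target, feature_db, n):
    """Return the n words in feature_db closest to `target`, excluding itself."""
    query = feature_db[target]
    scored = [
        (cosine_similarity(query, vec), word)
        for word, vec in feature_db.items()
        if word != target
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:n]]

feature_db = {"a": [1, 0], "b": [0.9, 0.1], "c": [0, 1], "d": [-1, 0]}
print(most_similar("a", feature_db, 2))
```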

Once the N words have been extracted, the profanity determination unit 150 determines whether at least one of the N words and the detection target word corresponds to a profane word stored in the profanity database 160. If at least one of the extracted N words and the detection target word corresponds to a profane word stored in the profanity database, the profanity determination unit 150 determines that the detection target word is profanity.
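As a non-limiting illustration, the determination rule above reduces to a membership test over the detection target word and its N neighbors. The profanity database is modeled as a set, and `is_profanity` is a hypothetical name.

```python
def is_profanity(target, neighbors, profanity_db):
    """True if the target or any of its N nearest neighbors is a listed profanity."""
    return any(word in profanity_db for word in [target, *neighbors])

profanity_db = {"oldbad"}
print(is_profanity("newword", ["oldbad", "hello"], profanity_db))  # True
print(is_profanity("hi", ["hello"], profanity_db))                 # False
```

Because a single listed neighbor is enough to flag the word, the choice of N trades detection coverage against the risk of flagging harmless words that happen to appear in similar contexts.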

If the detection target word is newly coined profanity derived from existing profanity, the words that co-occur with it in sentences will be similar to those that co-occur with the existing profanity, so the existing profanity and its newly coined variant will acquire similar feature vectors through learning. Therefore, when N words with feature vectors similar to that of the newly coined profanity are extracted, the existing profanity is likely to be among them, and when it is, the profanity determination unit 150 determines that the detection target word, the newly coined profanity, is profanity.

Meanwhile, the feature vector of an input detection target word may not be stored in the feature vector database. In this case, the word-feature vector learning unit 130 may perform learning on the word while assigning it an arbitrary word ID vector, and the feature vector extraction unit 120 may temporarily generate a feature vector for the word using the arbitrarily assigned word ID vector.

FIG. 5 is a flowchart showing the overall flow of a learning-based profanity detection method according to an embodiment of the present invention.

Referring to FIG. 5, a sentence uploaded by a user is first input (step 500).

When the uploaded sentence is input, the sentence is analyzed to extract detection target words, and part-of-speech tagging is performed on each word (step 502).

Once a detection target word is set, it is determined whether the feature vector of that word exists in the feature vector database (step 504).

If the feature vector of the detection target word does not exist in the feature vector database, a word ID vector is arbitrarily assigned to the word, a feature vector is temporarily generated using the learned transformation vector, and learning is performed on the word (step 506).

If the feature vector of the detection target word is stored in the feature vector database, that feature vector is extracted (step 508).

Once the feature vector has been obtained in step 506 or step 508, N words corresponding to feature vectors similar to that feature vector are extracted (step 510).

Once the N words have been extracted, they are used to determine whether the detection target word is a profanity (step 512). If at least one of the extracted N words and the detection target word is registered in the existing profanity database, the detection target word is determined to be a profanity. If none of the extracted N words and the detection target word is registered in the existing profanity database, the detection target word is determined not to be a profanity.
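The decision rule of step 512 can be sketched as a simple membership test. The function name `is_profanity`, the use of a Python set for the profanity database, and the example words are assumptions for illustration only.

```python
def is_profanity(target, neighbors, profanity_db):
    """Step 512: flag the target if it, or any of its N neighbors, is registered."""
    candidates = [target, *neighbors]
    return any(word in profanity_db for word in candidates)

# Hypothetical profanity database containing only the existing word.
toy_profanity_db = {"badword"}

# A variant whose neighbors include a registered word is flagged.
flagged = is_profanity("b4dw0rd", ["badword", "lunch"], toy_profanity_db)

# A word with no registered neighbor, and not registered itself, is not flagged.
clean = is_profanity("weather", ["lunch", "sunny"], toy_profanity_db)
```

Because one registered neighbor is enough to flag the target, a larger N catches more variants but also raises the chance of flagging an innocent word that merely shares context with a profanity.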

The present invention has been described above with reference to specific matters such as particular components, and to limited embodiments and drawings. These are provided only to aid a more general understanding of the invention. The invention is not limited to the above embodiments, and those of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should not be confined to the described embodiments, and not only the claims set out below but also everything equivalent to, or equivalently modified from, those claims falls within the scope of the spirit of the invention.

Claims

A learning-based profanity detection device comprising:
a feature vector extraction unit for extracting a feature vector of a detection target word from a feature vector database storing word-feature vectors; and
a profanity determining unit for extracting, from the feature vector database, N words corresponding to N feature vectors similar to the feature vector of the detection target word, and for determining that the detection target word is a profanity when at least one of the extracted N words and the detection target word is a profanity stored in a profanity database,
wherein the feature vectors stored in the feature vector database are set through learning, and the learning is performed such that, in the sentences to be learned, a specific word and other words having a high probability of appearing together with it have similar feature vectors.

The device according to claim 1,
further comprising a sentence analysis unit for analyzing an input sentence to extract detection target words and performing part-of-speech tagging for each detection target word.

The device according to claim 1,
further comprising a feature vector learning unit for setting a feature vector for a word through learning, using sentences input in a learning process.

The device according to claim 3,
wherein the feature vector learning unit learns a word ID vector for each word and a transform vector for transforming the word ID vector.

The device according to claim 4,
wherein the feature vector learning unit includes a vector setting/updating unit and an error function feedback module; the vector setting/updating unit initially sets the word ID vector and the transform vector of each word arbitrarily; the error function feedback module provides an error value based on a probability of occurrence and a difference value between the feature vectors of the specific word and a word appearing at the same time; and the vector setting/updating unit updates the word ID vector and the transform vector based on the error value.

A profanity detection device comprising:
a feature vector extraction unit for extracting a feature vector of a detection target word from a feature vector database storing word-feature vectors; and
a profanity determining unit for extracting, from the feature vector database, N words corresponding to N feature vectors similar to the feature vector of the detection target word, and for determining that the detection target word is a profanity when at least one of the extracted N words and the detection target word is a profanity stored in a profanity database,
wherein the feature vectors stored in the feature vector database are set based on probabilities of simultaneous appearance between words.

A learning-based profanity detection method comprising:
(a) extracting a feature vector of a detection target word from a feature vector database storing word-feature vectors; and
(b) extracting, from the feature vector database, N words corresponding to N feature vectors similar to the feature vector of the detection target word, and determining that the detection target word is a profanity when at least one of the extracted N words and the detection target word is a profanity stored in a profanity database,
wherein the feature vectors stored in the feature vector database are set through learning, and the learning is performed such that, in the sentences to be learned, a specific word and other words having a high probability of appearing together with it have similar feature vectors.

The method of claim 7,
further comprising, prior to step (a), analyzing an input sentence to extract detection target words and performing part-of-speech tagging for each detection target word.

The method of claim 7,
further comprising setting a feature vector for a word through learning, using sentences input in a learning process.

The method of claim 9,
wherein setting the feature vector through learning comprises learning a word ID vector for each word and a transform vector for transforming the word ID vector.

The method of claim 10,
wherein setting the feature vector through learning comprises:
a vector setting and updating step; and an error function feedback step,
wherein the vector setting and updating step initially sets the word ID vector and the transform vector of each word, the error function feedback step feeds back an error value based on a difference value with respect to the feature vector of a co-appearing word, and the vector setting and updating step updates the word ID vector and the transform vector of each word based on the error value.

A profanity detection method comprising:
(a) extracting a feature vector of a detection target word from a feature vector database storing word-feature vectors; and
(b) extracting, from the feature vector database, N words corresponding to N feature vectors similar to the feature vector of the detection target word, and determining that the detection target word is a profanity when at least one of the extracted N words and the detection target word is a profanity stored in a profanity database,
wherein the feature vectors stored in the feature vector database are set based on probabilities of simultaneous appearance between words.

A computer-readable recording medium on which a program for executing the method of any one of claims 7 to 13 is tangibly recorded.