KR102244982B1

KR102244982B1 - Text filtering method and device using the image learning

Info

Publication number: KR102244982B1
Application number: KR1020190179915A
Authority: KR
Inventors: 정윤경; 유주연
Original assignee: 성균관대학교산학협력단
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2021-04-27

Abstract

Disclosed are a method and device that generate an image of a text, learn the image to classify prohibited words contained in the text, and convert those words into special characters. The text filtering method and device using image learning according to the present invention may be linked with an artificial intelligence module, a robot, an augmented reality (AR) device, a virtual reality (VR) device, a device related to 5G services, and the like.

Description

Text filtering method and device using image learning {TEXT FILTERING METHOD AND DEVICE USING THE IMAGE LEARNING}

The present invention relates to a method and apparatus that generate an image of a text, learn the image to classify prohibited words contained in the text, and convert those words into special characters.

Recently, large numbers of users have been able to post text with all kinds of content online. Such text sometimes contains prohibited words such as slang or profanity.

Conventionally, prohibited words such as profanity in a text have mainly been detected by classifying them with a machine-learning-based algorithm.

However, because prohibited words are constantly being altered, it is difficult to build a database covering every variant. In addition, existing databases for machine learning do not hold much data on prohibited words and their variants, which limits how well machine learning can filter prohibited words contained in a text.

Moreover, Hangul can be written in units of syllables or phonemes, so when a prohibited word is written in Hangul and split into its initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), conventional filtering methods do not catch it.

An object of the present invention is to address the aforementioned needs and/or problems.

The text filtering method and apparatus using image learning according to the present invention provide a way to generate an image of a text, classify prohibited words by reading that image, and convert them into special characters.

In addition, the text filtering method and apparatus using image learning according to the present invention propose a text filtering method that uses image analysis and image deep learning to filter prohibited words and their variants, including prohibited words and variants that are not stored or registered in an existing database.

A text filtering method using image learning according to an embodiment of the present invention includes: a step in which a receiving unit receives text input from a user terminal; a step in which a first identification unit makes a primary identification of whether the text contains a first prohibited word; a step in which a second identification unit determines whether the text contains a first word or first character that is not included in the first prohibited words and is also not included in a normal word list stored in a database; a step in which the second identification unit generates a first image of the text; a step in which the second identification unit performs deep learning on the first image to make a secondary identification of whether a second prohibited word is present; and a step in which a transformation unit replaces the first and second prohibited words contained in the text with special characters. The first prohibited words include prohibited words stored in the database and prohibited words similar to them, and the second prohibited words are prohibited words not stored in the database and variants thereof.

The step in which the second identification unit performs deep learning on the first image to make the secondary identification of whether a second prohibited word is present may further include determining a degree of similarity indicating how similar the first word or the first character is to the first prohibited word or the second prohibited word, and deciding, according to the degree of similarity, whether to regard the first word or the first character as a second prohibited word.
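The similarity-based decision described above can be sketched as a simple threshold test. This is only an illustration under stated assumptions: the patent derives the degree of similarity from deep learning on the first image and does not give a formula, so the string ratio from Python's `difflib` is used here as a stand-in, and the names `similarity`, `is_second_prohibited`, and the threshold value 0.75 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(candidate: str, known: str) -> float:
    """Stand-in similarity in [0, 1]; the patent would obtain this from an image model."""
    return SequenceMatcher(None, candidate, known).ratio()


def is_second_prohibited(candidate: str, known_prohibited, threshold: float = 0.75) -> bool:
    """Treat the candidate as a second prohibited word if it is close enough to any known one."""
    best = max((similarity(candidate, k) for k in known_prohibited), default=0.0)
    return best >= threshold
```

A higher threshold reduces false positives on ordinary words at the cost of missing heavily altered variants; the appropriate value would have to be tuned on real data.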

The step in which the transformation unit replaces the second prohibited word contained in the text with special characters may further include determining which language the text is written in.

The step of determining which language the text is written in may further include confirming that the text is written in Hangul and applying a Hangul transformation algorithm to the second prohibited word contained in the text.

The step of applying the Hangul transformation algorithm to the second prohibited word contained in the text may further include: separating each individual character of the text into an initial consonant, a medial vowel, and a final consonant; randomly drawing special characters to substitute for each initial consonant, medial vowel, and final consonant of the second prohibited word; and replacing the initial consonants, medial vowels, and final consonants of the second prohibited word with the drawn special characters.
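The decomposition and replacement steps above might look like the following sketch. It relies on the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3); the special-character pool `SPECIALS` and the function names `decompose` and `mask_hangul` are assumptions for illustration, not taken from the patent.

```python
import random

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"       # 21 medial vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]     # 28 finals, "" = none
SPECIALS = "!@#$%^&*"  # hypothetical pool of replacement characters


def decompose(ch: str):
    """Split one precomposed Hangul syllable into (initial, medial, final)."""
    code = ord(ch) - 0xAC00
    if not 0 <= code <= 0xD7A3 - 0xAC00:
        return (ch, "", "")  # not a Hangul syllable: pass through as a single unit
    return (CHOSEONG[code // 588], JUNGSEONG[(code % 588) // 28], JONGSEONG[code % 28])


def mask_hangul(word: str, rng: random.Random) -> str:
    """Replace every jamo of the word with a randomly drawn special character."""
    out = []
    for ch in word:
        for jamo in decompose(ch):
            if jamo:  # skip the empty final consonant
                out.append(rng.choice(SPECIALS))
    return "".join(out)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible in tests; a deployed system could use an unseeded generator instead.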

After the step of replacing the initial consonants, medial vowels, and final consonants of the second prohibited word with the special characters, the method may further include: assigning the individual characters of the text an order from 1 to n; randomly selecting a plurality of specific positions from 1 to n and randomly selecting a special character to be added at each of those positions; and adding each selected special character at its selected position.
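The position-numbering and random-insertion steps can be sketched as follows. The text does not say whether the special character goes before or after the numbered position, so this sketch assumes it is inserted immediately after it; `SPECIALS`, `insert_specials`, and the parameter `k` (how many positions to pick) are hypothetical names.

```python
import random

SPECIALS = "!@#$%^&*"  # hypothetical pool of special characters


def insert_specials(text: str, k: int, rng: random.Random) -> str:
    """Number characters 1..n, pick k positions at random, add a random special after each."""
    n = len(text)
    k = min(k, n)
    positions = rng.sample(range(1, n + 1), k)           # distinct positions in 1..n
    chosen = {p: rng.choice(SPECIALS) for p in positions}
    out = []
    for i, ch in enumerate(text, start=1):
        out.append(ch)
        if i in chosen:
            out.append(chosen[i])                         # assumption: insert after position i
    return "".join(out)
```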

The step of determining which language the text is written in may further include confirming that the text is written in English and applying an English transformation algorithm to the second prohibited word contained in the text.

The step of applying the English transformation algorithm to the second prohibited word contained in the text may further include randomly drawing special characters to replace the individual characters of the text, and replacing the individual characters of the text with the drawn special characters.
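For English text no decomposition is needed, so the replacement can operate letter by letter. The sketch below assumes the replacement is applied only to the prohibited word inside the text, which is how the surrounding steps read; `SPECIALS` and `mask_english` are hypothetical names.

```python
import random
import re

SPECIALS = "!@#$%^&*"  # hypothetical pool of special characters


def mask_english(text: str, prohibited: str, rng: random.Random) -> str:
    """Replace each letter of every occurrence of the prohibited word with a random special."""
    def _mask(match):
        return "".join(rng.choice(SPECIALS) for _ in match.group(0))

    # re.escape keeps punctuation in the prohibited word from being read as regex syntax.
    return re.sub(re.escape(prohibited), _mask, text, flags=re.IGNORECASE)
```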

After the step of replacing the individual characters of the text with the drawn special characters, the method may further include: assigning the individual characters of the text an order from 1 to n; randomly selecting a plurality of specific positions from 1 to n and randomly selecting a special character to be added at each of those positions; and adding each selected special character at its selected position.

After the step in which the transformation unit replaces the second prohibited word contained in the text with special characters, the method may further include: generating a transformed word formed by the replacement with, or addition of, the special characters; storing the generated transformed word as a text file; and storing the transformed word in the database as a second image file.

A text filtering apparatus using image learning according to an embodiment of the present invention includes: a receiving unit that receives text from a user terminal; a database that stores first prohibited words including profanity and slang; a first identification unit that reads the list of first prohibited words from the database and identifies whether the text contains a first prohibited word; a second identification unit that identifies whether the text contains a second prohibited word, that is, a prohibited word not stored in the database or a variant thereof; and a transformation unit that replaces the first and second prohibited words contained in the text with special characters. The second identification unit generates a first image by capturing the text and performs deep learning on the first image to determine whether the text contains a second prohibited word.

The second identification unit further includes an image generation unit that generates the first image and a learning unit that performs deep learning on the first image.

The text filtering method and apparatus using image learning according to the present invention filter altered prohibited words on the basis of a prohibited-word list newly generated by deep learning on images of text, which can save time and effort compared with detecting prohibited words using conventional machine learning.

In addition, the text filtering method and apparatus using image learning according to the present invention can address the shortage of prohibited-word data stored or registered in existing databases.

In addition, the text filtering method and apparatus using image learning according to the present invention reduce the time, effort, and cost needed to collect prohibited words that are newly coined or altered from existing ones, and can thereby address the cold-start problem, in which data is hard to obtain.

FIG. 1 is a block diagram of a text filtering apparatus using image learning according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a text filtering method using image learning according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a process in which the second identification unit identifies whether a text contains a prohibited word, according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a process in which the transformation unit converts a prohibited word contained in a text into special characters, according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a process in which the transformation unit 130 generates a transformed word, according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the conditions under which the transformation unit 130 ends generation of transformed words.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar elements are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for elements in the following description are assigned or used interchangeably only for ease of drafting the specification and do not in themselves have distinct meanings or roles. In describing the embodiments, detailed descriptions of related known technologies are omitted where they would obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments; they do not limit the technical idea disclosed in this specification, which should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as first and second, may be used to describe various elements, but the elements are not limited by those terms. The terms are used only to distinguish one element from another.

When an element is said to be "coupled" or "connected" to another element, it may be directly coupled or connected to that other element, but it should be understood that intervening elements may also be present. By contrast, when an element is said to be "directly coupled" or "directly connected" to another element, it should be understood that no intervening elements are present.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "include" or "have" are intended to indicate that the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification are present, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Hereinafter, a text filtering apparatus using image learning according to an embodiment of the present invention is described with reference to FIG. 1.

FIG. 1 is a block diagram of a text filtering apparatus 10 using image learning according to an embodiment of the present invention.

The text filtering apparatus 10 using image learning according to an embodiment of the present invention may include a receiving unit 100, a first identification unit 110, a second identification unit 120, a transformation unit 130, and a database 140.

The receiving unit 100 receives predetermined data from a user terminal. The data received here is text data that a user enters using his or her user terminal.

The user terminal may include a communication terminal capable of performing the functions of a computing device (not shown). The user terminal exemplified in this embodiment may be a user-operated desktop computer, smartphone, notebook, tablet PC, smart TV, mobile phone, PDA (Personal Digital Assistant), laptop, media player, micro server, GPS (Global Positioning System) device, e-book reader, digital broadcasting terminal, navigation device, kiosk, MP3 player, digital camera, home appliance, or other mobile or non-mobile computing device, but is not limited thereto.

The user terminal may also be a wearable terminal with communication and data processing functions, such as a watch, glasses, a hair band, or a ring. The user terminal is not limited to the above; any terminal capable of web browsing may serve as a user terminal.

The first identification unit 110 receives the text entered by the user from the receiving unit 100 and identifies whether the text contains a first prohibited word.

In this embodiment, prohibited words include profanity, words or characters resembling profanity, and slang. A first prohibited word means a prohibited word stored in the database 140, or a list of such words. The first prohibited words may also include prohibited words similar to those stored in the database 140 and a list thereof. A second prohibited word means a prohibited word not stored in the database 140. That is, a second prohibited word is a variant of the profanity, profanity-like words or characters, or slang already stored in the database 140. A second prohibited word may also be newly coined profanity, a newly coined word or character resembling profanity, or newly coined slang. Accordingly, although second prohibited words are not stored in the database 140, they are profanity, words or characters resembling profanity, or slang, and therefore count as prohibited words.

The first identification unit 110 according to this embodiment may load the first prohibited words and/or the first prohibited word list from the database 140 and identify whether the text entered by the user contains any character or word corresponding to a first prohibited word.

Meanwhile, the first identification unit 110 may upload words or characters resembling the profanity included in the first prohibited words to the database 140, thereby expanding the scope of the first prohibited words.

The database 140 may store in advance commonly known profanity, words or characters resembling such profanity, and slang. As described above, the profanity, profanity-like words or characters, and slang stored in advance in the database 140 may be classified as first prohibited words. The database 140 may also store normal words or characters, that is, those that are not profanity. The database 140 may classify such non-profane, normal words or characters into a normal word list and store that list in advance.

The database 140 according to this embodiment may also be configured as a database server that provides the big data needed to apply various artificial intelligence algorithms, as well as data on text recognition. When the database 140 is configured as a server, that server may include a web server or application server that allows the user terminal to be controlled remotely through an application or web browser installed on the user terminal.

Meanwhile, the second identification unit 120 may identify, among the words or characters making up the text, those that are not included in the first prohibited words and are also not included in the normal word list stored in the database 140, and may determine whether they are slang, profanity, or words or characters resembling profanity. For convenience, words or characters in the text that are included in neither the first prohibited words nor the normal word list stored in the database 140 are referred to as first words or first characters.
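The split between first prohibited words, normal words, and the remaining "first words" can be sketched as a three-way triage. In this sketch two in-memory sets stand in for the lists held in the database 140, tokens are separated on whitespace, and matching is case-insensitive; all of these choices, and the name `triage`, are assumptions for illustration.

```python
def triage(text: str, first_prohibited: set, normal_words: set):
    """Return (first prohibited hits, candidate 'first words' needing image-based checking)."""
    hits, candidates = [], []
    for token in text.split():
        key = token.lower()
        if key in first_prohibited:
            hits.append(token)           # caught by the first identification unit
        elif key not in normal_words:
            candidates.append(token)     # in neither list: passed to the second unit
    return hits, candidates
```

Only the candidates need to go through the more expensive image-based check, which keeps the deep learning step focused on unknown tokens.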

To determine whether a first word or first character contained in the text is a second prohibited word, the second identification unit 120 is configured to capture or photograph the text, and for this purpose may further include an image generation unit 121.

The image generation unit 121 generates an image of the text by capturing or photographing the text that the user entered through the user terminal. The image generated by capturing or photographing the user's text is referred to as the first image.

When capturing or photographing the text entered by the user, the image generation unit 121 may generate images in fixed units. That is, the image generation unit 121 may generate the first image per character contained in the text, or per word. It may also generate the first image per syllable, phoneme, or letter of the words contained in the text.
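Before any image is rendered, the text has to be cut into the chosen unit. The sketch below covers only that segmentation step for three of the units mentioned (word, character, and Hangul phoneme); rendering each unit to an image would need an imaging library and is left out. The function name `split_units` and the unit labels are hypothetical, and Unicode NFD normalization is used as one convenient way to obtain jamo.

```python
import unicodedata


def split_units(text: str, unit: str):
    """Split text into the units from which per-unit images would be generated."""
    if unit == "word":
        return text.split()
    chars = [ch for ch in text if not ch.isspace()]
    if unit == "char":
        return chars
    if unit == "jamo":
        # NFD decomposes each precomposed Hangul syllable into its conjoining jamo.
        return [j for ch in chars for j in unicodedata.normalize("NFD", ch)]
    raise ValueError(f"unknown unit: {unit}")
```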

Accordingly, the first image may be a captured image of the entire text, or it may include captured images of the characters or words contained in the text.

Meanwhile, the second identification unit 120 may further include a learning unit 122 that performs deep learning on the text to recognize and distinguish the words and characters making up the text, so that the image generation unit 121 can capture or photograph the entire text and its constituent words or characters separately.

Accordingly, the learning unit 122 according to this embodiment is configured to let the image generation unit 121 capture or photograph the entire text and its constituent words or characters separately, and to perform deep learning on the first image generated by the image generation unit 121. The learning unit 122 may also store the results of deep learning on the first image in the database 140 and later apply the stored results to other deep learning tasks.

To perform such deep learning, the learning unit 122 according to this embodiment may itself include an AI processor (not shown), a memory (not shown), and/or a communication unit (not shown).

The AI processor may train a neural network using a program stored in the memory. In particular, the AI processor may train a neural network for recognizing device-related data. Here, the neural network for recognizing device-related data may be designed to simulate the structure of the human brain on a computer and may include a plurality of weighted network nodes that simulate the neurons of a human neural network. The network nodes may exchange data according to their connection relationships so as to simulate the synaptic activity of neurons, which send and receive signals through synapses. The neural network may include a deep learning model, which developed from the neural network model. In a deep learning model, the network nodes are located in different layers and may exchange data according to a convolution connection relationship. Examples of neural network models include deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), restricted Boltzmann machines (RBM), deep belief networks (DBN), and deep Q-networks, and these deep learning techniques may be applied to fields such as computer vision, speech recognition, natural language processing, and speech/signal processing.

Meanwhile, the AI processor that performs the functions described above may be a general-purpose processor (for example, a CPU) or a processor dedicated to AI training (for example, a GPU).

The memory may store the various programs and data needed for the operation of the learning unit 122. The memory may be implemented as non-volatile memory, volatile memory, flash memory, a hard disk drive (HDD), a solid state drive (SSD), or the like. The memory is accessed by the AI processor, which may read, write, modify, delete, and update data in it. The memory may also store the learning algorithm that the learning unit 122 uses to recognize the first image of the text and classify data, or a neural network model (for example, a deep learning model) generated by that algorithm.

Meanwhile, the AI processor according to this embodiment may further include a data learning unit that trains a neural network for data classification/recognition. The data learning unit may learn criteria for which training data to use to determine data classification/recognition and for how to classify and recognize data using that training data. The data learning unit may train the deep learning model by acquiring the training data to be used and applying it to the deep learning model.

The data learning unit may be manufactured in the form of at least one hardware chip and mounted on the learning unit 122. For example, the data learning unit may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or as part of a general-purpose processor (CPU) or a graphics processor (GPU), and mounted on the learning unit 122. The data learning unit may also be implemented as a software module. When it is implemented as a software module (or a program module including instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, at least one software module may be provided by an operating system (OS) or by an application.

In addition, the data learning unit may include a training data acquisition unit and a model learning unit.

The training data acquisition unit may acquire the training data required by the neural network model for classifying and recognizing data. For example, the training data acquisition unit may acquire, as training data, data on the first image to be input to the neural network model and/or sample data.

Using the acquired training data, the model learning unit may train the neural network model so that it has a criterion for how to classify given data. The model learning unit may train the neural network model through supervised learning, which uses at least part of the training data as the criterion. Alternatively, the model learning unit may train the neural network model through unsupervised learning, in which the model discovers the criterion by learning from the training data on its own, without supervision. The model learning unit may also train the neural network model through reinforcement learning, using feedback on whether the result of a situation determination made after learning is correct. In addition, the model learning unit may train the neural network model using a learning algorithm that includes error back-propagation or gradient descent.

Once the neural network model has been trained, the model learning unit may store the trained neural network model in the memory. The model learning unit may also store the trained neural network model in the database 140 and/or in the memory of a server connected to the learning unit 122 through a wired or wireless network.

The data learning unit may further include a training data preprocessing unit (not shown) and a training data selection unit (not shown) to improve the analysis results of the recognition model or to save the resources or time required to generate the recognition model.

The training data preprocessing unit may preprocess the acquired data so that it can be used in learning for situation determination. For example, the training data preprocessing unit may process the acquired data into a preset format so that the model learning unit can use the acquired training data in learning to recognize the first image.

In addition, the training data selection unit may select the data needed for learning from the training data acquired by the training data acquisition unit or the training data preprocessed by the preprocessing unit. The selected training data may be provided to the model learning unit. For example, the training data selection unit may detect only a specific region of the first image acquired through the image generation unit 121, and thereby select as training data only the data on the object included in that region, that is, the text in that region.

In addition, the data learning unit may further include a model evaluation unit (not shown) to improve the analysis results of the neural network model.

The model evaluation unit may input evaluation data to the neural network model and, when the analysis result output for the evaluation data does not satisfy a predetermined criterion, cause the model learning unit to train the model again. In this case, the evaluation data may be predefined data for evaluating the recognition model. As an example, the model evaluation unit may determine that the predetermined criterion is not satisfied when, among the analysis results of the trained recognition model for the evaluation data, the number or ratio of evaluation data items with inaccurate analysis results exceeds a preset threshold.
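The retraining criterion described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `needs_retraining` and the default threshold of 0.2 are assumptions introduced here, since the description only states that a preset threshold exists.

```python
def needs_retraining(predictions, labels, max_error_ratio=0.2):
    """Return True when the share of wrong predictions exceeds the threshold.

    max_error_ratio is a hypothetical preset threshold; the description
    does not give a concrete value.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return False  # nothing to evaluate, so no basis for retraining
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels) > max_error_ratio
```

A count-based criterion would work the same way, comparing `wrong` directly against a maximum number of errors.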

Meanwhile, the communication unit may transmit the result of AI processing by the AI processor to the outside of the device 10 according to the present embodiment.

Here, the outside of the device 10 according to the present embodiment includes external electronic devices, and such an external electronic device may be defined as a server, a user terminal, or a user terminal or server connected to these through a 5G network.

Meanwhile, the transforming unit 130 according to the present embodiment replaces the first and second prohibited words included in the text input by the user with special characters. For example, profanity, words or characters resembling profanity, and slang included in the text may be replaced with "*", or a special character such as "*" may be inserted within the prohibited words.
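As a rough illustration of the two options just described, the sketch below masks prohibited words with "*" or places "*" between their characters. It assumes plain substring matching and character-by-character insertion, neither of which the description specifies; `mask_words` and `interleave` are hypothetical names, and the sample word is a harmless placeholder.

```python
def mask_words(text, prohibited_words, mask_char="*"):
    """Replace every occurrence of each prohibited word with mask characters."""
    # Longer words first, so a short word does not break up a longer match.
    for word in sorted(prohibited_words, key=len, reverse=True):
        text = text.replace(word, mask_char * len(word))
    return text


def interleave(word, mask_char="*"):
    """Place a mask character between the characters of a word."""
    return mask_char.join(word)


print(mask_words("you idiot", ["idiot"]))
print(interleave("idiot"))
```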

In addition, the transforming unit 130 may directly remove the first and second prohibited words included in the text in response to commands from the first identification unit 110 and the second identification unit 120. The transforming unit 130 may also directly remove the first and second prohibited words included in the text after receiving the first prohibited word list from the database 140, or after receiving from the second identification unit 120 the second prohibited word list that the second identification unit 120 has recognized.

Accordingly, to filter from the text prohibited words that have been split into phonemes or characters or otherwise altered, the device 10 according to the present embodiment operates in two stages. First, the first identification unit 110 applies the prohibited word data stored in the database 140 to the text. Second, the second identification unit 120 identifies unknown or out-of-vocabulary (OOV) words in the text and, through image detection on them, checks whether there are altered prohibited words that, whether registered or not, closely resemble prohibited words.

Hereinafter, a method of filtering text using the text filtering device 10 using image learning according to an embodiment of the present invention is described with reference to FIGS. 2 to 6.

Referring to FIG. 2, the text filtering device 10 using image learning is configured to perform steps S100 to S160 to filter text in the present embodiment. FIG. 2 is a flowchart illustrating a text filtering method using image learning according to an embodiment of the present invention.

Although the steps are shown sequentially in FIG. 2, some of the steps may be performed simultaneously by the device 10 according to the present embodiment.

Referring to FIG. 2, the text filtering device 10 using image learning starts up when its power is turned on.

Thereafter, when a user inputs text through a user terminal, the receiving unit 100 receives the text (S100).

The first identification unit 110 performs a first identification of whether the received text contains a first prohibited word (S110). The first identification unit 110 loads the first prohibited word list stored in the database 140 and identifies whether the text input by the user contains a first prohibited word.

For example, the following online conversation takes place between two different users, A and B.

A: 아놔, 멍청한 새끼. ("Ugh, you stupid bastard.")

B: 까고 있네. 겜 접어 ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ. ("You're talking rubbish. Quit the game," followed by an expletive typed as separated jamo.)

Referring to the conversation above, A first entered the common swear word '새끼' as text, and B then entered the swear word 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ' split into individual characters or phonemes.

In the conversation above, it is assumed that '새끼', used by A, is a common swear word included in the first prohibited word list stored in the database 140, and that '까고 있네' and 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ', used by B, are not included in that list.

In this case, because '새끼' is stored in the first prohibited word list, the first identification unit 110 can identify '새끼' in the text as a prohibited word. However, because '까고 있네' and 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ' are not stored in the first prohibited word list, the first identification unit 110 cannot identify them as prohibited words.

Thereafter, the second identification unit 120 determines whether the text contains a first word or first character, that is, a word or character that is included neither in the first prohibited words nor in the normal word list stored in the database 140 (S120).

The first word or first character means a word or character in the text that is included neither in the first prohibited words nor in the normal word list stored in the database 140.

In the example above, the second identification unit 120 may recognize '까고 있네' and 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ' as first words or first characters, since they are included neither in the first prohibited words nor in the normal word list stored in the database 140.
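The split between first prohibited words (S110) and unknown first words or characters (S120) can be sketched as two set lookups. The sketch assumes whitespace tokenization and in-memory sets standing in for the lists in the database 140; the function name `classify_tokens` and the sample words are placeholders introduced for illustration.

```python
def classify_tokens(text, prohibited, normal):
    """Return (first prohibited words, unknown tokens) found in the text."""
    tokens = [t.strip(".,!?") for t in text.split()]
    tokens = [t for t in tokens if t]
    first_prohibited = [t for t in tokens if t in prohibited]  # S110
    # S120: tokens that are in neither list are candidates for image analysis.
    unknown = [t for t in tokens if t not in prohibited and t not in normal]
    return first_prohibited, unknown


prohibited_words = {"darn"}         # stands in for the first prohibited word list
normal_words = {"you", "are", "a"}  # stands in for the normal word list
print(classify_tokens("you are a darn f00l.", prohibited_words, normal_words))
```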

Then, the second identification unit 120 generates a first image for the text (S130). As described above, the first image may include an image of the entire text and images generated for each character or word constituting the text.

For example, the whole of '까고 있네. 겜 접어 ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ.' may be generated as a first image. Separate images of '까고 있네' and 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ' may also be generated as first images.

Then, the second identification unit 120 applies deep learning to the generated first image and performs a second identification of whether the first image contains a second prohibited word (S140). That is, the second identification unit 120 analyzes the first image to identify whether the first word or first character is profanity, slang, or a character resembling profanity that would qualify as a second prohibited word.

Here, as described above, a second prohibited word is profanity, slang, or a character or word resembling profanity that is included neither in the first prohibited words nor in the normal word list stored in the database 140.

If the second identification unit 120 determines, as a result of analyzing the first image, that the first word or first character corresponds to a second prohibited word (S150), the transforming unit 130 replaces all of the first and second prohibited words included in the text with special characters (S160).

If the text is found in step S120 to contain no first word or first character, the prohibited words that the first identification unit 110 determined to be first prohibited words are converted into special characters by the transforming unit 130 and output.

Likewise, if the second identification unit 120 determines in step S150 that the first word or first character does not correspond to a second prohibited word, the prohibited words that the first identification unit 110 determined to be first prohibited words are converted into special characters by the transforming unit 130 and output, and the text filtering may end.
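The branching across steps S110 to S160 can be summarized in a short control-flow sketch. It makes several assumptions: tokens are split on whitespace, the image-based deep learning check of steps S130 to S150 is abstracted into a caller-supplied predicate `is_second_prohibited`, and masking is delegated to a caller-supplied `mask` function. All of these names are hypothetical.

```python
def filter_text(text, prohibited, normal, is_second_prohibited, mask):
    """Mask first prohibited words and any unknown tokens judged prohibited."""
    tokens = text.split()
    to_mask = {t for t in tokens if t in prohibited}                          # S110
    unknown = [t for t in tokens if t not in prohibited and t not in normal]  # S120
    # S130-S150: the image-based classifier is represented by the predicate.
    to_mask |= {t for t in unknown if is_second_prohibited(t)}
    return " ".join(mask(t) if t in to_mask else t for t in tokens)           # S160


stars = lambda word: "*" * len(word)
print(filter_text("you are a darn f00l", {"darn"}, {"you", "are", "a"},
                  lambda t: t == "f00l", stars))
```

When the predicate rejects every unknown token, only the first prohibited words are masked, which matches the fallback behavior described for steps S120 and S150.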

Meanwhile, step S150 described above is explained in more detail with reference to FIG. 3. FIG. 3 is a flowchart illustrating a process in which the second identification unit 120 identifies whether a prohibited word is included in the text, according to an embodiment of the present invention.

Referring to FIG. 3, step S150 further includes a step in which the second identification unit 120 determines how similar the first word or first character included in the text is to a first or second prohibited word (S151). In this step, the second identification unit 120 determines how similar the first word or first character is to the first prohibited words already stored in the database 140 and/or to the second prohibited words identified through deep learning or an external server search (S152). The second identification unit 120 learns, through deep learning, the features of the syllables, phonemes, and meaning of the first word or first character. The second identification unit 120 then determines the degree to which the learned features match the syllable, phoneme, and meaning features of the first and/or second prohibited words (S152). For example, the unit may be configured to treat the two as substantially identical when the similarity between them is 70% or more.

If the second identification unit 120 determines that the similarity between the first word or first character and a first and/or second prohibited word is at or above a certain threshold (S152), the second identification unit 120 may regard the first word or first character as a second prohibited word (S153).
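The threshold decision of steps S152 and S153 can be illustrated with a stand-in similarity measure. The description relies on features learned through deep learning; the sketch below instead uses the character-level ratio from Python's standard `difflib.SequenceMatcher`, purely to show how a 70% cutoff would be applied. The function name `is_second_prohibited` and the sample words are placeholders.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.7  # the 70% example given in the description


def is_second_prohibited(candidate, known_prohibited):
    """Return True if the candidate is similar enough to any known prohibited word.

    SequenceMatcher is a stand-in for the learned similarity of S152.
    """
    return any(
        SequenceMatcher(None, candidate, word).ratio() >= SIMILARITY_THRESHOLD
        for word in known_prohibited
    )


print(is_second_prohibited("hallo", ["hello"]))
print(is_second_prohibited("world", ["hello"]))
```

A character-level ratio captures spelling variants only; the semantic similarity mentioned in the description would need a learned model.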

In the example above, assume that '깝치지마' is registered as a first prohibited word and that the second identification unit 120 has learned through deep learning that 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ' has the same meaning as the expletives '씨발' and '씨발놈'. The second identification unit 120 may then determine that the first word or first character '까고 있네' is semantically very similar to the first prohibited word '깝치지마' (S152), and that 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ' has the same meaning as '씨발' and '씨발놈'.

Thereafter, the second identification unit 120 may have the first word or first character regarded as a second prohibited word replaced with special characters by the transforming unit 130 (S160).

Meanwhile, if in the example above the first word or first character were '웃기지마' instead of '까고 있네', and 'ㄴㄴ ㅈㅎㅅㅇ' instead of 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ', the second identification unit 120 may determine that the first word or first character is not similar to the first and second prohibited words (S152).

That is, although '웃기지마' ("don't be ridiculous") is not registered as a first prohibited word, the deep learning result of the second identification unit 120 makes it difficult to regard it as profanity, a word or character resembling profanity, or slang. Likewise, the deep learning result may indicate that 'ㄴㄴ ㅈㅎㅅㅇ' stands for '너나 잘하세요' ("mind your own business"), so 'ㄴㄴ ㅈㅎㅅㅇ' also cannot be regarded as profanity, a word or character resembling profanity, or slang.

In such a case, however, the second identification unit 120 may generate a list for '웃기지마' and 'ㄴㄴ ㅈㅎㅅㅇ' (S154), or may add '웃기지마' and 'ㄴㄴ ㅈㅎㅅㅇ' to the existing first prohibited word list as word or character data that, although not similar to profanity or slang, can serve as a reference (S154).

The second identification unit 120 may then generate third images for '웃기지마' and 'ㄴㄴ ㅈㅎㅅㅇ', apply deep learning to them, and later use them as training data for inferring second prohibited words (S155).

In addition, step S160 is described in detail with reference to FIG. 4. FIG. 4 is a flowchart illustrating a process in which the transforming unit 130 converts prohibited words included in the text into special characters, according to an embodiment of the present invention.

Referring to FIG. 4, the transforming unit 130 checks which language is used overall in the text containing the first word or first character that the second identification unit 120 has regarded as a second prohibited word (S161).

The second identification unit 120 may first determine whether the text is written in Hangul (S162).

If the second identification unit 120 determines in step S162 that the text containing the first word or first character regarded as a second prohibited word is written in Hangul, the transforming unit 130 may apply the 'Hangul transformation algorithm' to the first and second prohibited words (S163). If the second identification unit 120 determines in step S162 that the text is not written in Hangul, the transforming unit 130 may apply the 'English transformation algorithm' to the first and second prohibited words (S164).
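The language check of step S162 and the resulting dispatch can be sketched with Unicode ranges. The sketch assumes that a text counts as Hangul if it contains any precomposed Hangul syllable (U+AC00 to U+D7A3) or compatibility jamo (U+3131 to U+318E); the description does not define the test, and the function names are hypothetical.

```python
def contains_hangul(text):
    """Return True if the text has any Hangul syllable or compatibility jamo."""
    return any(
        "\uac00" <= ch <= "\ud7a3" or "\u3131" <= ch <= "\u318e"
        for ch in text
    )


def choose_algorithm(text):
    """Pick the transformation algorithm for the text (S162 -> S163 or S164)."""
    return "hangul" if contains_hangul(text) else "english"


print(choose_algorithm("한글"))
print(choose_algorithm("hello"))
```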

The 'Hangul transformation algorithm' and 'English transformation algorithm' described above may be developed, for example, in C++/MFC with Visual Studio 2015 on Windows. These algorithms may read a text file containing second prohibited words, recognize the language, and then apply the transformation algorithm for that language to produce a word list of transformed prohibited words.

In particular, the algorithms may, for example, output the resulting word list of transformed prohibited words to an edit control, capture that region, and save it as an image file.

In addition, depending on the name of the text file being loaded, the algorithms may either create a word list of transformed prohibited words or capture each transformed prohibited word as is and save it as an image file.

In addition, when creating the word list of transformed prohibited words, the algorithms may additionally write the altered characters contained in the transformed prohibited words to a text file.

For example, when the first word or first character is 'ㅅㅅㅣㅂㅏㄹㄴㅓㅁㅇㅏ', the altered character may be 'ㅅㅅ'. That is, 'ㅅㅅ' is an altered form of 'ㅆ', and the second identification unit 120 may recognize 'ㅅㅅ' as an altered character that is likely to be part of a second prohibited word.
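One way to picture this kind of altered character is a lookup of doubled basic consonants that stand in for a tense consonant. The mapping below covers the five Korean tense consonants and is an illustrative assumption; the description only gives the 'ㅅㅅ' to 'ㅆ' case, and `collapse_doubled` and `find_altered` are hypothetical helper names.

```python
# Doubled basic consonants and the tense consonant each pair imitates.
DOUBLED = {"ㄱㄱ": "ㄲ", "ㄷㄷ": "ㄸ", "ㅂㅂ": "ㅃ", "ㅅㅅ": "ㅆ", "ㅈㅈ": "ㅉ"}


def find_altered(text):
    """List the doubled-consonant pairs that occur in the text."""
    return [pair for pair in DOUBLED if pair in text]


def collapse_doubled(text):
    """Rewrite each doubled pair as the tense consonant it imitates."""
    for pair, tense in DOUBLED.items():
        text = text.replace(pair, tense)
    return text


print(find_altered("ㅅㅅㅏㄷㅏ"))
print(collapse_doubled("ㅅㅅㅏㄷㅏ"))
```

Doubled consonants also occur in harmless chat abbreviations, so a match is a hint for further analysis rather than proof of a prohibited word.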

Accordingly, when creating a word list of transformed prohibited words, the 'Hangul transformation algorithm' of the transforming unit 130 may build the list as shown in Table 1 below, or may transform the prohibited words with reference to Table 1.

[Table 1]

Similarly, when creating a word list of transformed prohibited words, the 'English transformation algorithm' of the transforming unit 130 may build the list as shown in Table 2 below, or may transform the prohibited words with reference to Table 2.

[Table 2]

In addition, when the transforming unit 130 converts a transformed prohibited word into special characters, or adds special characters to it, so that the prohibited word is not displayed as is, it may use the special characters shown in Table 3 below.

[Table 3]

Meanwhile, the step in which the transforming unit 130 applies the 'Hangul transformation algorithm' to the first and second prohibited words (S163) includes more detailed steps, as shown in FIG. 4.

Referring to FIG. 4, the transforming unit 130 separates each individual word or character constituting the text into an initial consonant, a medial vowel, and a final consonant (S1631), randomly draws special characters to replace the initial, medial, and final of the second prohibited word (S1632), and replaces the initial, medial, and final of the second prohibited word with the drawn special characters (S1633).
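Steps S1631 to S1633 can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables, in which a syllable's offset from U+AC00 equals (initial × 21 + medial) × 28 + final. The special-character pool `SPECIALS` is a hypothetical stand-in for Table 3, the function names are illustrative, and a seeded `random.Random` is passed in so that runs are repeatable.

```python
import random

CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 medial vowels
JONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
        "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
        "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, "" = none
SPECIALS = "*#@!%&"  # hypothetical pool; Table 3 is not reproduced here


def decompose(syllable):
    """Split one precomposed Hangul syllable into (initial, medial, final)."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index <= 0xD7A3 - 0xAC00:
        raise ValueError("not a precomposed Hangul syllable")
    return CHO[index // 588], JUNG[(index % 588) // 28], JONG[index % 28]


def replace_jamo(word, rng):
    """Replace every jamo of every Hangul syllable with a random special character."""
    out = []
    for ch in word:
        if "\uac00" <= ch <= "\ud7a3":
            jamo = [j for j in decompose(ch) if j]          # S1631
            out.extend(rng.choice(SPECIALS) for _ in jamo)  # S1632-S1633
        else:
            out.append(ch)  # non-Hangul characters are left untouched
    return "".join(out)


print(decompose("한"))
print(replace_jamo("한글", random.Random(0)))
```

The sketch replaces every jamo; selecting only some of them, or only the syllables of an identified prohibited word, would be a small variation.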

After step S1633, the transforming unit 130 numbers the individual characters constituting the text from 1 to n (S1634), randomly selects a plurality of positions from 1 to n and randomly selects the special characters to be added at those positions (S1635), and adds the selected special characters at the selected positions (S1636), so that the prohibited words in the text are not displayed or are output in a form that a person cannot recognize.
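The insertion steps S1634 to S1636 can be sketched as follows. The sketch assumes that a special character is inserted immediately after the character at each chosen position, which the description leaves open; `insert_specials` and the pool `SPECIALS` are hypothetical.

```python
import random

SPECIALS = "*#@!"  # hypothetical pool standing in for Table 3


def insert_specials(text, count, rng):
    """Insert up to `count` random special characters at random positions."""
    n = len(text)
    # S1634-S1635: positions are numbered 1..n and sampled without repetition.
    positions = rng.sample(range(1, n + 1), min(count, n))
    chars = list(text)
    # S1636: insert from the back so earlier positions are not shifted.
    for pos in sorted(positions, reverse=True):
        chars.insert(pos, rng.choice(SPECIALS))
    return "".join(chars)


print(insert_specials("abcdef", 2, random.Random(1)))
```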

As described, the text filtering method using image learning according to the present embodiment applies the Hangul transformation algorithm when the text is in Hangul. The Hangul transformation algorithm separates the input Hangul into initial, medial, and final components and then replaces them with, or adds, special characters.

Because prohibited words in Hangul often consist of two or three characters (syllable blocks), one or two characters of the word are separated and the medial or final is written on the line below. The Hangul transformation algorithm then replaces the resulting words with special characters and/or adds special characters to them.

Likewise, the step in which the transforming unit 130 applies the 'English transformation algorithm' to the first and second prohibited words (S164) includes more detailed steps, as shown in FIG. 4.

Referring to FIG. 4, the transforming unit 130 randomly draws special characters to replace the individual words or characters constituting the text (S1641), and replaces the words or characters constituting the second prohibited word with the drawn special characters (S1642).

After step S1642, the transforming unit 130 numbers the individual characters constituting the text from 1 to n (S1643), randomly selects a plurality of positions from 1 to n and randomly selects the special characters to be added at those positions (S1644), and adds the selected special characters at the selected positions (S1645), so that the prohibited words in the text are not displayed or are output in a form that a person cannot recognize.

When the text filtering method using image learning according to the present embodiment applies the 'English transformation algorithm', no separation into initial, medial, and final components or into individual letters is performed, because English, unlike Hangul, is not written in composed syllable blocks. Because English prohibited words vary more in length than Korean ones, the 'English transformation algorithm' according to the present embodiment randomly selects some of the letters of the prohibited word and replaces them with special characters, and/or adds special characters between letters or words.
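The random letter replacement of the English transformation algorithm can be sketched as below. The share of letters to replace (`ratio`) and the pool `SPECIALS` are assumptions introduced for illustration, since the description only says that some letters are chosen at random; the sample word is a harmless placeholder.

```python
import random

SPECIALS = "*#@!"  # hypothetical pool standing in for Table 3


def mask_letters(word, ratio, rng):
    """Replace a random subset of the letters in `word` with special characters."""
    letter_positions = [i for i, ch in enumerate(word) if ch.isalpha()]
    # Replace at least one letter when the word has any letters at all.
    k = max(1, round(len(letter_positions) * ratio)) if letter_positions else 0
    chosen = set(rng.sample(letter_positions, k))
    return "".join(
        rng.choice(SPECIALS) if i in chosen else ch
        for i, ch in enumerate(word)
    )


print(mask_letters("examples", 0.5, random.Random(0)))
```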

Meanwhile, with reference to FIGS. 5 and 6, the following describes the process by which, after step S150, the text filtering method using image learning according to the present embodiment generates a transformed word, formed by replacing a prohibited word with special characters or by adding special characters to a prohibited word.

FIG. 5 is a flowchart illustrating a process in which the transforming unit 130 generates a transformed word according to an embodiment of the present invention, and FIG. 6 is a flowchart illustrating the condition under which the transforming unit 130 terminates the generation of transformed words.

Here, a transformed word means a word formed by replacing a prohibited word with special characters, or a word formed by adding special characters to a prohibited word. In addition, the second recognition unit 120 according to an embodiment of the present invention learns the transformed words through the learning unit 122 and, through the first and second prohibited words and the special characters used in generating the transformed words, can learn various transformation cases of prohibited words.

Referring to FIG. 5, after step S150, the transforming unit 130 generates a transformed word formed by replacement with special characters or by addition of special characters (S170).

The transforming unit 130 may then convert the generated transformed word into a text file and store it in the database 140 (S180).

In addition, the transforming unit 130 may send the generated transformed word to the second recognition unit 120 so that a second image of the transformed word is generated, and may cause the second image to be stored in the database 140.

In particular, as shown in FIG. 6, in generating transformed words, the transforming unit 130 may keep generating transformed words by repeating steps S163 and/or S164 described above (S171).

The transforming unit 130 may generate transformed words by repeating steps S163 and/or S164 N times, until the number of generated transformed words reaches a predetermined number N (S172). Thereafter, once the transforming unit 130 has repeated steps S163 and/or S164 N times in step S172, N being the number of words to be generated, the transforming unit 130 may end the generation of transformed words (S173).

This is the process by which the transforming unit 130 generates training data for the learning unit 122, so that the learning unit 122 of the second recognition unit 120 can learn second prohibited words, which are not registered or stored as first prohibited words, and various transformed variants of the second prohibited words.

For example, if a condition is set such that the transforming unit 130 generates 10,000 transformed words, the transforming unit 130 repeats steps S163 and/or S164 10,000 times and generates 10,000 transformed words.
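The termination condition of FIG. 6 (S171 to S173) amounts to a bounded loop. The sketch below is only an illustration: `make_variant` is a hypothetical stand-in for steps S163 and/or S164, and `SPECIAL_CHARS` is an assumed character pool that the source does not specify.

```python
import random

# Hypothetical pool of special characters; not taken from the source.
SPECIAL_CHARS = "!@#$%^&*~+"


def make_variant(word, rng):
    """Hypothetical stand-in for one pass of steps S163 and/or S164:
    optionally replace one letter, then add one special character."""
    chars = list(word)
    if chars and rng.random() < 0.5:
        chars[rng.randrange(len(chars))] = rng.choice(SPECIAL_CHARS)
    chars.insert(rng.randint(0, len(chars)), rng.choice(SPECIAL_CHARS))
    return "".join(chars)


def generate_variants(word, n, seed=None):
    """Repeat the transformation n times (S171, S172), then stop (S173)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        variants.append(make_variant(word, rng))
    return variants


if __name__ == "__main__":
    # With n = 10_000 this mirrors the 10,000-word example in the description.
    words = generate_variants("example", 10_000, seed=0)
    print(len(words), words[:3])
```

The sketch keeps duplicates, since the description counts repetitions rather than distinct words; deduplicating would be a separate design choice.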

In this process, the first and second prohibited words transformed by the transforming unit 130, the special characters used for replacement and addition, and the prohibited words in which special characters were substituted or added are all generated as training data for the learning unit 122. The learning unit 122 then learns from this training data and expands the range of second prohibited words on its own.

When the learning unit 122 learns in this way and expands the range of second prohibited words, then, for example, in a case where 140 Hangul prohibited words are used and the prohibited words are replaced with special characters or have special characters added to them, image files of the transformed versions of each of the 140 Hangul prohibited words may be stored in the database 140.

Meanwhile, an experiment was conducted using the apparatus 10 according to the present invention to distinguish Hangul prohibited-word classes from a normal-word class. The results are as follows.

The prohibited-word classes to be classified comprise 140 Hangul prohibited words in total, and each class contains Hangul prohibited-word image files produced by the Hangul transformation algorithm. At least 10,000 images per class were used in the experiment. The normal-word class, made up of untransformed non-prohibited words, consists of about 10,000 untransformed, non-duplicated words. The total number of images used in the experiment was 1,473,378.

In transformed Hangul, characters are separated into an initial consonant, a medial vowel, and a final consonant; the medial vowel or final consonant is sometimes written on the next line; special characters are sometimes added in between; and consonants and vowels are sometimes replaced with special characters that resemble them. To keep any regular pattern from emerging in character generation, the special characters to be used and the positions of the characters to be added or changed are selected at random each time.
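The separation into initial consonant, medial vowel, and final consonant can be done with the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3). The sketch below is an assumption-laden illustration: the `LOOKALIKES` table and the 50% replacement probability are hypothetical and are not taken from the patent.

```python
import random

# Compatibility jamo in Unicode syllable-composition order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 medial vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 27 finals + "none"

# Hypothetical look-alike table; the source does not give the real mapping.
LOOKALIKES = {"ㅇ": "0", "ㅣ": "|", "ㅡ": "-", "ㅅ": "^", "ㄴ": "L"}


def decompose(syllable):
    """Split one precomposed Hangul syllable into its jamo.
    Any other single character is returned unchanged."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 19 * 21 * 28:
        return (syllable,)
    initial, rest = divmod(code, 21 * 28)
    medial, final = divmod(rest, 28)
    parts = (CHOSEONG[initial], JUNGSEONG[medial], JONGSEONG[final])
    return tuple(p for p in parts if p)


def transform_hangul(word, rng):
    """Decompose every syllable and randomly swap jamo for look-alike characters."""
    out = []
    for ch in word:
        for jamo in decompose(ch):
            if jamo in LOOKALIKES and rng.random() < 0.5:
                jamo = LOOKALIKES[jamo]
            out.append(jamo)
    return "".join(out)


if __name__ == "__main__":
    print(decompose("한"))  # ('ㅎ', 'ㅏ', 'ㄴ')
    print(transform_hangul("한글", random.Random(0)))
```

Writing the medial vowel or final consonant on the next line, as the description mentions, would be an additional layout step applied when the image is rendered and is not covered by this sketch.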

Meanwhile, the neural network model included in the learning unit 122, that is, the CNN model, has a structure commonly seen in image learning, in which convolution layers and pooling layers are stacked. The number of model outputs equals the number of classes, and a softmax function is used so that the class with the highest probability is output.
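The building blocks named above (convolution, pooling, and a softmax output over the classes) can be illustrated in dependency-free Python. This is a teaching sketch only: the source does not disclose layer counts, kernel sizes, or weights, so every value below is an assumption, and a real model would use a deep learning framework with trained parameters.

```python
import math


def conv2d_valid(image, kernel):
    """Single-channel 2D convolution (cross-correlation form) with no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Element-wise rectifier applied to a feature map."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]


def softmax(logits):
    """Numerically stable softmax over one score per class."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return the index of the class with the highest softmax probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    image = [[r * 5 + c for c in range(5)] for r in range(5)]  # toy 5x5 "image"
    features = max_pool_2x2(relu(conv2d_valid(image, [[1, 0], [0, 1]])))
    print(features)
    print(predict_class([0.1, 2.0, -1.0]))
```

In the experiments described here the output layer would have one logit per class, that is, one per prohibited-word class plus the normal-word class.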

In addition, an experiment was conducted using the apparatus 10 according to the present invention to distinguish English prohibited-word classes from a normal-word class. The results are as follows.

The prohibited-word classes to be classified are 69 English prohibited words, and each class contains English prohibited-word image files produced by the English transformation algorithm. At least 10,000 images per class were used for training. The normal-word class consists of about 10,000 non-duplicated word images built from a corpus in the NLTK package. A total of 726,114 images were used in the experiment. Unlike Hangul, transformed English words are produced without separating characters, by adding or substituting special characters. For transformed English words as well, the special characters and their positions are chosen at random so that no regular pattern emerges.

These experiments demonstrated that the apparatus 10 according to the present invention, and the text filtering method using it, can transform words in accordance with the characteristics of the language of the text and produce transformed words. In addition, image files were created for the transformed words, and an edit-distance algorithm experiment and an OCR experiment were conducted on the transformed words. In both of those experiments it was difficult to predict the original word, whereas the transformed words could be recognized through CNN learning.
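For readers unfamiliar with the edit-distance baseline mentioned above, the following is a standard Levenshtein distance in Python. It is not the code used in the experiment, and the example words are hypothetical; it merely illustrates one plausible reason such a baseline struggles: a transformed word and an unrelated normal word can be equally close to the original.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical words: the transformed form "b@d" and the normal word "bed"
    # are both at distance 1 from "bad", so distance alone cannot separate them.
    print(levenshtein("b@d", "bad"), levenshtein("bed", "bad"))  # 1 1
    print(levenshtein("badword", "b@d*w0rd"))                    # 3
```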

That is, the apparatus 10 according to the present invention can recognize second prohibited words, which are not registered or classified as first prohibited words (the prohibited words stored in advance in the database 140), and can also recognize similar prohibited words obtained by transforming the second prohibited words.

The present invention can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of the computer-readable medium include a hard disk drive (HDD), a solid state disk (SSD), a silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device, and also include implementation in the form of a carrier wave (for example, transmission over the Internet). Accordingly, the above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

100: receiving unit
110: first identification unit
120: second identification unit
130: transforming unit
140: database

Claims

A text filtering method using image learning, comprising:
receiving, by a receiving unit, text input from a user terminal;
first identifying, by a first identification unit, whether a first prohibited word is present in the text;
determining, by a second identification unit, whether the text includes a first word or a first character that is not included in the first prohibited word and is also not included in a normal word list stored in a database;
generating, by the second identification unit, a first image of the text;
secondly identifying, by the second identification unit, whether a second prohibited word is present by performing deep learning on the first image; and
replacing the first and second prohibited words included in the text with special characters,
wherein the first prohibited word includes prohibited words and similar prohibited words stored in the database, and
wherein the second prohibited word includes prohibited words not stored in the database and modified prohibited words.

The method of claim 1,
wherein the step of secondly identifying, by the second identification unit, whether a second prohibited word is present by performing deep learning on the first image further comprises:
determining a degree of similarity indicating how similar the first word or the first character is to the first prohibited word or the second prohibited word; and
determining, according to the degree of similarity, whether to regard the first word or the first character as the second prohibited word.

The method of claim 2,
wherein the step of replacing, by the transforming unit, the second prohibited word included in the text with a special character
further comprises determining in which language the text is written.

The method of claim 3,
wherein the step of determining in which language the text is written further comprises:
confirming that the text is written in Korean; and
applying a Hangul transformation algorithm to the second prohibited word included in the text.

The method of claim 4,
wherein the step of applying a Hangul transformation algorithm to the second prohibited word included in the text further comprises:
separating each of the individual characters constituting the text into an initial consonant, a medial vowel, and a final consonant;
randomly extracting special characters to replace the initial consonants, medial vowels, and final consonants constituting the second prohibited word; and
replacing the initial consonants, medial vowels, and final consonants constituting the second prohibited word with the extracted special characters.

The method of claim 5,
further comprising, after the step of replacing the initial consonants, medial vowels, and final consonants constituting the second prohibited word with the special characters:
assigning sequence numbers from 1 to n to the individual characters constituting the text;
randomly selecting a plurality of specific positions from among 1 to n, and randomly selecting a special character to be added at each selected position; and
adding the selected special characters at the selected positions.

The method of claim 3,
wherein the step of determining in which language the text is written further comprises:
confirming that the text is written in English; and
applying an English transformation algorithm to the second prohibited word included in the text.

The method of claim 7,
wherein the step of applying an English transformation algorithm to the second prohibited word included in the text further comprises:
randomly extracting special characters to replace individual characters constituting the text; and
replacing the individual characters constituting the text with the extracted special characters.

The method of claim 8,
further comprising, after the step of replacing the individual characters constituting the text with the extracted special characters:
assigning sequence numbers from 1 to n to the individual characters constituting the text;
randomly selecting a plurality of specific positions from among 1 to n, and randomly selecting a special character to be added at each selected position; and
adding the selected special characters at the selected positions.

The method of claim 1,
further comprising, after the step of replacing, by the transforming unit, the second prohibited word included in the text with a special character:
generating a transformed word formed by replacement with the special character or by addition of the special character;
storing the generated transformed word as a text file; and
storing the transformed word in the database as a second image file.

A text filtering apparatus using image learning, comprising:
a receiving unit for receiving text from a user terminal;
a database storing first prohibited words including vulgar language and profanity;
a first identification unit for reading a list of the first prohibited words from the database and identifying whether the first prohibited words are included in the text;
a second identification unit for identifying whether the text includes second prohibited words, which are prohibited words not stored in the database and modified prohibited words; and
a transforming unit for replacing the first and second prohibited words included in the text with special characters,
wherein the second identification unit
generates a first image by photographing the text, and determines whether the second prohibited words are included in the text by performing deep learning on the first image.

The apparatus of claim 11,
wherein the second identification unit further comprises:
an image generator for generating the first image; and
a learning unit for performing deep learning on the first image.