KR102410582B1

KR102410582B1 - Apparatus, method and computer program for augmenting learning data for harmful words

Info

Publication number: KR102410582B1
Application number: KR1020210146563A
Authority: KR
Inventors: 박규병
Original assignee: 주식회사 튜닙
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-06-22

Abstract

An apparatus for augmenting learning data for harmful words includes: a harmful word detection unit that inputs a specific utterance included in the learning data into a pre-trained ethics model to calculate a plurality of probability values that the words and phrases included in the specific utterance belong to each of a plurality of preset unethical classes, and that detects a harmful word included in the specific utterance based on the plurality of probability values; a harmful word classification unit that classifies the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word; and a harmful word augmentation unit that generates new learning data by augmenting the harmful word, applying a predefined syllable transformation method to the harmful word based on the probability value of the at least one unethical class into which the harmful word is classified.
This technology was developed through the Seoul Innovation Challenge 2021 (Preliminary) (SC210060), "Development of Ethics Determination Engine for Open Domain Conversational Chatbot Business", of the Seoul Industry Promotion Agency, Seoul Metropolitan Government.

Description

Apparatus, method and computer program for augmenting learning data for harmful words

The present invention relates to an apparatus, a method, and a computer program for augmenting learning data for harmful words.

The recent controversy over sexual harassment and hate speech involving the artificial intelligence chatbot 'Iruda' ('이루다') has brought the ethical language expression and recognition of conversational artificial intelligence to the fore.

In the process of learning human conversations in natural language, artificial intelligence absorbs the various biases and unethical expressions inherent in the data as they are. In particular, unethical language such as profanity and hate speech is used more freely in online spaces where anonymity is guaranteed, so when such conversations are used as raw material for learning, the artificial intelligence inevitably becomes biased as well.

Meanwhile, compared with English, Korean unethical-language datasets for machine learning and related research are still lacking. For English, relatively diverse data are publicly available and the criteria for hate speech are finely subdivided. Hate speech is generally defined as an expression that curses, attacks, or insults an individual or group by targeting their identity.

However, the only Korean hate speech data currently public is the Korean Hate Speech Detection Dataset, which collects comments on entertainment news articles from domestic portal sites. In that dataset, 10,000 comment sentences were classified under three labels (Hate, Offensive, and None), and each was judged for whether it contains gender bias.

In this regard, there are recent technologies that use artificial intelligence to detect unethical utterances, such as profanity, insults, and sexism, contained in a specific utterance. The prior art detects unethical utterances contained in a specific utterance using an artificial intelligence-based learning model.

However, a learning model according to the prior art has no detailed classification of unethical utterances, which can be classified by various criteria, so it is not easy to apply flexibly to various services. For example, a first learning model detects utterances related to 'prejudice' sensitively but is relatively insensitive to utterances related to 'sexual harassment', while a second learning model detects utterances related to 'sexual harassment' sensitively but is relatively insensitive to utterances related to 'slander'. In addition, because a learning model according to the prior art makes a binary judgment on the presence or absence of unethical expressions, its detection accuracy and flexibility in data utilization may be reduced. Furthermore, classification criteria that emphasize only specific attributes may be limited.

This technology was developed through the Seoul Innovation Challenge 2021 (Preliminary) (SC210060), "Development of Ethics Determination Engine for Open Domain Conversational Chatbot Business", of the Seoul Industry Promotion Agency, Seoul Metropolitan Government.

Korean Registered Patent Publication No. 10-0662337 (registered December 21, 2006)
Korean Laid-Open Patent Publication No. 10-2011-0056999 (published May 31, 2011)

The present invention is intended to solve the problems of the prior art described above, and aims to provide a learning data augmentation apparatus that trains an ethics model for detecting harmful words so that the model can be applied flexibly to various services.

It further aims to provide a learning data augmentation apparatus that trains the model to accurately detect harmful words in various forms, such as harmful words whose form has been altered and harmful words from which part of the form has been omitted.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for achieving the above technical problems, one embodiment of the present invention may provide an apparatus for augmenting learning data for harmful words, the apparatus including: a harmful word detection unit that inputs a specific utterance included in the learning data into a pre-trained ethics model to calculate a plurality of probability values that the words and phrases included in the specific utterance belong to each of a plurality of preset unethical classes, and that detects a harmful word included in the specific utterance based on the plurality of probability values; a harmful word classification unit that classifies the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word; and a harmful word augmentation unit that generates new learning data by augmenting the harmful word, applying a predefined syllable transformation method to the harmful word based on the probability value of the at least one unethical class into which the harmful word is classified.

Another embodiment of the present invention may provide a method for augmenting learning data for harmful words, the method including: inputting a specific utterance included in the learning data into a pre-trained ethics model to calculate a plurality of probability values that the words and phrases included in the specific utterance belong to each of a plurality of preset unethical classes; detecting a harmful word included in the specific utterance based on the plurality of probability values; classifying the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word; and generating new learning data by augmenting the harmful word, applying a predefined syllable transformation method to the harmful word based on the probability value of the at least one unethical class into which the harmful word is classified.

Yet another embodiment of the present invention may provide a computer program stored in a computer-readable recording medium and including a sequence of instructions for augmenting learning data for harmful words, wherein the computer program, when executed by a computing device, causes the computing device to: input a specific utterance included in the learning data into a pre-trained ethics model to calculate a plurality of probability values that the words and phrases included in the specific utterance belong to each of a plurality of preset unethical classes; detect a harmful word included in the specific utterance based on the plurality of probability values; classify the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word; and generate new learning data by augmenting the harmful word, applying a predefined syllable transformation to the harmful word based on the probability value of the at least one unethical class into which the harmful word is classified.

The above means for solving the problems are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description of the invention.

According to any one of the above means for solving the problems, a harmful word included in a specific utterance can be detected based on detailed classification criteria for harmful words, and the harmful word can be classified according to those criteria. An ethics model can be trained on the harmful words classified according to the detailed classification criteria. It is therefore possible to provide a learning data augmentation apparatus that allows an ethics model for detecting harmful words to be applied to various services and used flexibly.

In addition, at least one harmful word can be augmented based on a harmful word included in a specific utterance. An augmented harmful word may have its form altered, or part of its form omitted, according to a predefined syllable transformation method. The ethics model can be trained using the augmented harmful words, and it is therefore possible to provide a learning data augmentation apparatus with which the trained ethics model accurately detects the various forms of harmful words included in a specific utterance.

FIG. 1 is a block diagram of a learning data augmentation apparatus according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram for explaining a process of pre-training an ethics model according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram for explaining an ethics model according to an embodiment of the present invention.
FIGS. 4 and 5 are exemplary diagrams for explaining a process of augmenting a harmful word according to an embodiment of the present invention.
FIGS. 6 and 7 are exemplary diagrams for explaining a process of augmenting a harmful phrase according to an embodiment of the present invention.
FIG. 8 is a flowchart of a learning data augmentation method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. However, the present invention may be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar reference numerals are given to similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not precluded in advance.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. In addition, one unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

Some of the operations or functions described in this specification as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a learning data augmentation apparatus according to an embodiment of the present invention. Referring to FIG. 1, the learning data augmentation apparatus 100 may include a harmful word detection unit 110, a harmful word classification unit 120, a harmful word augmentation unit 130, and a learning unit 140. However, the components 110 to 140 are merely an exemplary illustration of components that can be controlled by the learning data augmentation apparatus 100.

The components of the learning data augmentation apparatus 100 of FIG. 1, for example the harmful word detection unit 110, the harmful word classification unit 120, the harmful word augmentation unit 130, and the learning unit 140, may be connected simultaneously or at time intervals.

The learning data augmentation apparatus 100 may detect a harmful word included in a specific utterance based on detailed classification criteria for harmful words, and may classify the harmful word according to those criteria. For example, the learning data augmentation apparatus 100 may classify a harmful word detected in a specific utterance into at least one of a plurality of preset unethical classes including 'profanity', 'insult', 'violence/threat', and 'obscenity'.

The learning data augmentation apparatus 100 may train an ethics model based on the harmful words classified according to the detailed classification criteria. Accordingly, the learning data augmentation apparatus 100 allows the trained ethics model to be applied to various services and used flexibly.

The learning data augmentation apparatus 100 may augment at least one harmful word based on a harmful word included in a specific utterance. For example, the learning data augmentation apparatus 100 may augment a harmful word by altering its form according to a predefined syllable transformation method, by omitting part of its form, or by changing the order of the word segments that make up the harmful word. The learning data augmentation apparatus 100 may train the ethics model using the augmented harmful words.

Accordingly, the learning data augmentation apparatus 100 enables the trained ethics model to accurately detect the various forms of harmful words included in a specific utterance.

Each component of the learning data augmentation apparatus 100 is described below.

The harmful word detection unit 110 may input a specific utterance included in the learning data into a pre-trained ethics model to calculate a plurality of probability values that the words and phrases included in the specific utterance belong to each of a plurality of preset unethical classes. For example, the unethical classes may be subdivided into profanity, insult, violence/threat, obscenity, crime/incitement, gender, age, race/place of origin, sexual orientation, disability, religion, political orientation, and other hate, for a total of 13 unethical classes.

Specifically, the harmful word detection unit 110 may split the specific utterance into word segments (eojeol, the space-delimited units of Korean text). The harmful word detection unit 110 may input each segment into the pre-trained ethics model to calculate a plurality of probability values for at least one segment, and may detect a harmful word included in the specific utterance based on the plurality of probability values.

For example, the harmful word detection unit 110 may split a specific utterance such as '사이버 공간에서 행하는 범죄' ("a crime committed in cyberspace") into '사이버', '공간에서', '행하는', and '범죄', and may input each of these segments into the pre-trained ethics model. The harmful word detection unit 110 may calculate, for at least one segment such as '범죄' ("crime"), the probability value of belonging to each of the 13 unethical classes, and may determine from the calculated probability values whether '범죄' is a harmful word.

The harmful word classification unit 120 may classify the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word. For example, when a harmful word is detected in a specific utterance, say the segment '범죄', the harmful word classification unit 120 may classify '범죄' into the crime/incitement class among the 13 unethical classes, based on the probability values of '범죄' belonging to each of the 13 classes.
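
The detection and classification steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the whitespace split, the 0.5 detection threshold, and the argmax classification rule are all assumptions, since the description only says that detection and classification are "based on" the probability values. `toy_score_fn` is a hypothetical stand-in for the pre-trained ethics model, and its probabilities are taken from the dictionary example given later in the description.

```python
from typing import Callable, Dict, List, Tuple

# The 13 unethical classes named in the description (English glosses).
UNETHICAL_CLASSES = [
    "profanity", "insult", "violence/threat", "obscenity", "crime/incitement",
    "gender", "age", "race/origin", "sexual orientation", "disability",
    "religion", "political orientation", "other hate",
]

ScoreFn = Callable[[str], Dict[str, float]]


def detect_and_classify(utterance: str, score_fn: ScoreFn,
                        threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Split an utterance on whitespace and return (segment, class, probability)
    for every segment whose highest class probability reaches the threshold."""
    detected = []
    for segment in utterance.split():
        probs = score_fn(segment)              # one probability per class
        top_class = max(probs, key=probs.get)  # assumed rule: take the argmax
        if probs[top_class] >= threshold:      # assumed detection rule
            detected.append((segment, top_class, probs[top_class]))
    return detected


def toy_score_fn(segment: str) -> Dict[str, float]:
    """Hypothetical stand-in for the pre-trained ethics model."""
    probs = {name: 0.0 for name in UNETHICAL_CLASSES}
    if segment == "범죄":
        probs.update({"crime/incitement": 0.7, "other hate": 0.5,
                      "violence/threat": 0.4})
    return probs


print(detect_and_classify("사이버 공간에서 행하는 범죄", toy_score_fn))
```

Passing the scoring function as a parameter keeps the sketch independent of any particular model; a real system would wrap its own classifier behind the same signature.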

In this way, the learning data augmentation apparatus 100 may detect a harmful word included in a specific utterance based on unethical classes subdivided by category, calculate probability values indicating how strongly the detected harmful word belongs to each subdivided category, and classify the harmful word into one of those categories. Accordingly, the learning data augmentation apparatus 100 can be applied to services in various fields and can sensitively detect the harmful words relevant to each service field.

The harmful word augmentation unit 130 may generate new learning data by augmenting the harmful word, applying a predefined syllable transformation method to the harmful word based on the probability value of the at least one unethical class into which the harmful word is classified. For example, when a harmful word is detected in a specific utterance, say the segment '범죄' ("crime"), the harmful word augmentation unit 130 may apply a predefined syllable transformation method to '범죄' based on the probability value of the crime/incitement class into which '범죄' is classified, thereby augmenting the harmful word.

The predefined syllable transformation methods according to the present invention may include, for example, first to fifth methods. The first method separates the initial consonant (choseong), medial vowel (jungseong), or final consonant (jongseong) of the harmful word, and the second method replaces a vowel of the harmful word with a similar vowel. For example, the first method may separate the initial, medial, or final letters of '범죄' to generate 'ㅂㅓㅁㅈㅗㅣ', and the second method may replace a vowel of '범죄' with a similar vowel to generate '범좨'.

The third method replaces consonants and vowels of the harmful word with similar-looking characters, the fourth method inserts a character between the characters of the harmful word, and the fifth method changes the order in which the characters of the harmful word are arranged. For example, the third method may replace letters of '범죄' with similar-looking characters to generate '범조|', the fourth method may insert a character into '범죄' to generate '범@죄', and the fifth method may change the order of '범죄' to generate '죄범'.
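
The five syllable transformation methods can be illustrated with standard Unicode Hangul arithmetic (a precomposed syllable is 0xAC00 + (initial × 21 + medial) × 28 + final). The sketch below is only one possible reading: the substitution tables `SPLIT_VOWEL`, `SIMILAR_VOWEL`, and `LOOKALIKE_VOWEL`, the default separator '@', and the use of string reversal for the fifth method are assumptions chosen so that the description's '범죄' examples are reproduced; the source does not define the actual tables.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 medial vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

# Assumed tables, inferred only from the examples in the description.
SPLIT_VOWEL = {"ㅘ": "ㅗㅏ", "ㅙ": "ㅗㅐ", "ㅚ": "ㅗㅣ", "ㅝ": "ㅜㅓ",
               "ㅞ": "ㅜㅔ", "ㅟ": "ㅜㅣ", "ㅢ": "ㅡㅣ"}
SIMILAR_VOWEL = {"ㅚ": "ㅙ", "ㅐ": "ㅔ", "ㅔ": "ㅐ"}
LOOKALIKE_VOWEL = {"ㅚ": ("ㅗ", "|"), "ㅟ": ("ㅜ", "|"), "ㅢ": ("ㅡ", "|")}


def decompose(ch):
    """Return (initial, medial, final) indices, or None if ch is not a Hangul syllable."""
    code = ord(ch) - 0xAC00
    if not 0 <= code < 11172:
        return None
    return code // 588, (code % 588) // 28, code % 28


def compose(cho, jung, jong=0):
    """Build a precomposed Hangul syllable from jamo indices."""
    return chr(0xAC00 + (cho * 21 + jung) * 28 + jong)


def method1_separate(word):
    """First method: write each syllable as its separate letters."""
    out = []
    for ch in word:
        parts = decompose(ch)
        if parts is None:
            out.append(ch)
            continue
        cho, jung, jong = parts
        vowel = JUNG[jung]
        out.append(CHO[cho] + SPLIT_VOWEL.get(vowel, vowel) + JONG[jong])
    return "".join(out)


def method2_similar_vowel(word):
    """Second method: replace a vowel with a similar vowel."""
    out = []
    for ch in word:
        parts = decompose(ch)
        if parts is not None and JUNG[parts[1]] in SIMILAR_VOWEL:
            new_jung = JUNG.index(SIMILAR_VOWEL[JUNG[parts[1]]])
            ch = compose(parts[0], new_jung, parts[2])
        out.append(ch)
    return "".join(out)


def method3_lookalike(word):
    """Third method: replace part of a letter with a similar-looking character."""
    out = []
    for ch in word:
        parts = decompose(ch)
        if parts is not None and parts[2] == 0 and JUNG[parts[1]] in LOOKALIKE_VOWEL:
            base, tail = LOOKALIKE_VOWEL[JUNG[parts[1]]]
            ch = compose(parts[0], JUNG.index(base)) + tail
        out.append(ch)
    return "".join(out)


def method4_insert(word, filler="@"):
    """Fourth method: insert a character between the syllables."""
    return filler.join(word)


def method5_reorder(word):
    """Fifth method: change the order of the syllables (here: reverse)."""
    return word[::-1]


for fn in (method1_separate, method2_similar_vowel, method3_lookalike,
           method4_insert, method5_reorder):
    print(fn.__name__, fn("범죄"))
```

Working on jamo indices rather than on string replacement keeps the syllable's other letters intact when only the vowel changes, which is what the '범좨' example requires.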

When the detected harmful word is a single word and, among the probability values of the at least one unethical class into which the word is classified, the highest probability value is greater than or equal to a preset probability value, the harmful word augmentation unit 130 may apply at least one of the first to fifth methods to the word.

For example, for the harmful word '범죄', the harmful word augmentation unit 130 may take the crime/incitement class, which has the highest probability value among the crime/incitement, other hate, and violence/threat classes into which the word is classified, and compare its probability value with the preset probability value. Depending on the result of this comparison, the harmful word augmentation unit 130 may apply at least one of the predefined syllable transformation methods, that is, the first to fifth methods.

In addition, the harmful word augmentation unit 130 may apply at least one of the predefined syllable transformation methods, that is, the first to fifth methods, according to the letter structure of the word. When every character of the word consists only of an initial consonant and a medial vowel, the harmful word augmentation unit 130 may apply all of the first to fifth methods.

When every character of the word consists of an initial consonant, a medial vowel, and a final consonant, the harmful word augmentation unit 130 may apply the first, fourth, and fifth methods; when at least one character of the word consists only of an initial consonant and a medial vowel, it may apply the first to third methods.
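
The selection rules above can be summarized in a small helper. This is a sketch under stated assumptions: the name `applicable_methods`, the default preset probability of 0.5, and the order in which the overlapping rules are checked (all characters without a final consonant first, then all with a final, then the mixed case) are illustrative choices, not details given in the description.

```python
def has_final(ch: str) -> bool:
    """True if a precomposed Hangul syllable carries a final consonant."""
    return (ord(ch) - 0xAC00) % 28 != 0


def applicable_methods(word: str, top_prob: float, min_prob: float = 0.5) -> set:
    """Return the numbers of the syllable transformation methods to apply."""
    if top_prob < min_prob:                 # highest class probability too low
        return set()
    finals = [has_final(ch) for ch in word if 0xAC00 <= ord(ch) <= 0xD7A3]
    if not finals:                          # no Hangul syllables to inspect
        return set()
    if not any(finals):                     # every character: initial + medial only
        return {1, 2, 3, 4, 5}
    if all(finals):                         # every character: initial + medial + final
        return {1, 4, 5}
    return {1, 2, 3}                        # mixed: at least one open syllable


print(applicable_methods("범죄", top_prob=0.7))
```

Under these assumed rules, '범죄' falls in the mixed case because '범' has a final consonant and '죄' does not.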

The harmful word augmentation unit 130 may build a harmful word dictionary in which, for each word, the augmented words and the probability values of the at least one unethical class into which the word is classified are mapped and stored. For example, the harmful word augmentation unit 130 may build a harmful word dictionary in which the word '범죄' is mapped to its augmented words 'ㅂㅓㅁㅈㅗㅣ', '범좨', '범조|', '범@죄', and '죄범', and to the probability values of its unethical classes: crime/incitement class/0.7, other hate class/0.5, and violence/threat class/0.4.
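
One way to picture the harmful word dictionary is as a mapping from each base word to its augmented forms and class probabilities. The sketch below uses the values from the '범죄' example; the `DictionaryEntry` type and the `register` and `lookup` helpers are hypothetical names introduced for illustration, and the source does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DictionaryEntry:
    augmented: List[str] = field(default_factory=list)            # augmented forms
    class_probs: Dict[str, float] = field(default_factory=dict)   # class -> probability


harmful_word_dictionary: Dict[str, DictionaryEntry] = {}


def register(word: str, augmented: List[str], class_probs: Dict[str, float]) -> None:
    """Store a harmful word with its augmented forms and class probabilities."""
    harmful_word_dictionary[word] = DictionaryEntry(list(augmented), dict(class_probs))


def lookup(surface: str) -> Optional[str]:
    """Return the base word whose entry covers this surface form, or None."""
    for word, entry in harmful_word_dictionary.items():
        if surface == word or surface in entry.augmented:
            return word
    return None


register(
    "범죄",
    ["ㅂㅓㅁㅈㅗㅣ", "범좨", "범조|", "범@죄", "죄범"],
    {"crime/incitement": 0.7, "other hate": 0.5, "violence/threat": 0.4},
)
print(lookup("범@죄"))
```

A lookup that also matches augmented forms lets a later step recognize a disguised variant as an already-known harmful word.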

When the detected harmful word is a phrase, the harmful word augmentation unit 130 may determine whether the phrase is included in the harmful word dictionary. When the detected harmful word is a phrase and the phrase is not included in the harmful word dictionary, the harmful word augmentation unit 130 may augment the phrase by applying at least one of the first to fifth methods to it.

For example, when the detected harmful word is a phrase, say '유해 환경을 차단하자' ("let's block harmful environments"), the harmful word augmentation unit 130 may split the phrase into word segments and determine whether the segments are included in the harmful word dictionary. When the segments of the phrase are not included in the harmful word dictionary, the harmful word augmentation unit 130 may augment the phrase based on the predefined syllable transformation methods, that is, the first to fifth methods.

When the sum of the probability values of a preset number of top-ranked unethical classes, among the probability values of the at least one unethical class assigned to the phrase, is equal to or greater than a preset threshold, the harmful word augmentation unit 130 may augment the phrase by applying a predefined phrase transformation method to it.

For example, for the detected harmful phrase '유해 환경을 차단하자', the harmful word augmentation unit 130 may sum the probability values of the top three unethical classes (the violence threat class, the crime promotion class, and the other hate class) and compare that sum with the preset threshold. When the sum of the top three probability values, 0.7, is greater than the preset threshold (e.g., 0.5), the unit may augment the phrase by applying the predefined phrase transformation method to it.
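The top-three check can be sketched as follows. The function names and the individual class probabilities are hypothetical (the text gives only the sum, 0.7, and the threshold, 0.5), and the comparison uses "equal to or greater than" as in the general statement of the condition.

```python
from typing import Dict


def top_k_sum(class_probs: Dict[str, float], k: int = 3) -> float:
    """Sum the k largest class probabilities."""
    return sum(sorted(class_probs.values(), reverse=True)[:k])


def should_augment_phrase(class_probs: Dict[str, float],
                          k: int = 3,
                          threshold: float = 0.5) -> bool:
    """True when the top-k probability sum reaches the preset threshold."""
    return top_k_sum(class_probs, k) >= threshold


# Illustrative values only; chosen so that the top three sum to about 0.7.
example_probs = {
    "violence_threat": 0.4,
    "crime_promotion": 0.2,
    "other_hate": 0.1,
    "insult": 0.05,
}
print(should_augment_phrase(example_probs))
```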

The predefined phrase transformation method according to the present invention may include sixth to eighth methods. The sixth method replaces a word in the phrase with a similar word, the seventh method changes the order of the word segments in the phrase, and the eighth method deletes a word from the phrase. For example, applied to '유해 환경을 차단하자', the sixth method may replace a word with a similar word to generate '유해 환경을 배제하자' (let's exclude the harmful environment), the seventh method may reorder the word segments to generate '환경 유해을 차단하자', and the eighth method may delete a word to generate '유해을 차단하자'.
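A simplified sketch of the sixth to eighth methods is shown below. It operates on whitespace-delimited segments, which is an assumption: the examples in the text leave particles such as '을' in place, which would need morphological analysis, so the reordering and deletion outputs here differ from those examples. The function names and the similar-word table are hypothetical.

```python
from typing import Dict


def replace_with_similar(phrase: str, similar: Dict[str, str]) -> str:
    """Sixth method (sketch): swap each segment for a similar one when the table has it."""
    return " ".join(similar.get(seg, seg) for seg in phrase.split())


def swap_segments(phrase: str, i: int, j: int) -> str:
    """Seventh method (sketch): exchange the segments at positions i and j."""
    segs = phrase.split()
    segs[i], segs[j] = segs[j], segs[i]
    return " ".join(segs)


def delete_segment(phrase: str, i: int) -> str:
    """Eighth method (sketch): drop the segment at position i."""
    segs = phrase.split()
    del segs[i]
    return " ".join(segs)


phrase = "유해 환경을 차단하자"
print(replace_with_similar(phrase, {"차단하자": "배제하자"}))
print(swap_segments(phrase, 0, 1))
print(delete_segment(phrase, 1))
```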

The harmful word augmentation unit 130 may input a phrase transformed by the predefined phrase transformation method into the ethical model to calculate a plurality of probability values that the transformed phrase belongs to each of the plurality of preset unethical classes.

For example, the harmful word augmentation unit 130 may input the transformed phrases '유해 환경을 배제하자', '환경 유해을 차단하자', and '유해을 차단하자' into the ethical model to calculate the probability value that each transformed phrase belongs to each of the 13 unethical classes.

When the sum of the probability values of the preset number of top-ranked unethical classes is equal to or greater than the preset threshold, the harmful word augmentation unit 130 may build a harmful phrase dictionary in which, for each phrase, the transformed phrases and the probability values of the at least one unethical class assigned to each transformed phrase are mapped and stored.

For example, when the sum of the probability values of the top three unethical classes for the transformed phrase '유해 환경을 배제하자' (e.g., 0.7) is greater than the preset threshold (e.g., 0.5), the harmful word augmentation unit 130 may build a harmful phrase dictionary that maps the phrase '유해 환경을 차단하자' to the transformed phrase '유해 환경을 배제하자' together with the probability value of the crime promotion class (0.7), the probability value of the violence threat class (0.4), and the probability value of the other hate class (0.3).
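The re-scoring and filtering step might look like the sketch below. The ethical model is replaced by a stub scorer that returns fixed numbers, so `stub_score`, `build_phrase_dict`, and the class keys are all hypothetical; only the keep-if-top-k-sum-reaches-threshold rule comes from the text.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str], Dict[str, float]]


def build_phrase_dict(original: str,
                      variants: List[str],
                      score: Scorer,
                      k: int = 3,
                      threshold: float = 0.5) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Keep only variants whose top-k class probabilities sum to at least the threshold."""
    kept: Dict[str, Dict[str, float]] = {}
    for variant in variants:
        probs = score(variant)
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if sum(p for _, p in top) >= threshold:
            kept[variant] = dict(top)
    return {original: kept} if kept else {}


def stub_score(phrase: str) -> Dict[str, float]:
    """Stand-in for the ethical model; returns fixed, illustrative probabilities."""
    if phrase == "유해 환경을 배제하자":
        return {"crime_promotion": 0.7, "violence_threat": 0.4,
                "other_hate": 0.3, "insult": 0.05}
    return {"crime_promotion": 0.1, "other_hate": 0.1}


phrase_dict = build_phrase_dict(
    "유해 환경을 차단하자",
    ["유해 환경을 배제하자", "유해을 차단하자"],
    stub_score,
)
```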

The learning unit 140 may pre-train a generator model (Generator) to transform at least one of a plurality of tokens generated from a specific utterance into a fake token, and may pre-train a discriminator model (Discriminator) to identify the fake tokens among the plurality of tokens. For example, the learning unit 140 may pre-train the generator model and the discriminator model using a Transformer encoder.

The ethical model according to the present invention may include the discriminator model and a feed-forward neural network (Feed Forward Neural Network; written 순환 신경망 in the original text). For example, the learning unit 140 may generate the ethical model by adding the feed-forward neural network to the pre-trained discriminator model. The learning unit 140 may fine-tune the pre-trained discriminator model so that it performs classification token by token.

The learning unit 140 may train the ethical model by updating the feed-forward neural network based on the harmful word dictionary and the harmful phrase dictionary. For example, the learning unit 140 may retrain the ethical model using the augmented harmful words.

Accordingly, the learning data augmentation apparatus 100 can accurately detect harmful words in their various forms within a specific utterance. For example, the learning data augmentation apparatus 100 can accurately detect harmful words in a specific utterance whose form has been altered or partly omitted.

FIG. 2 is an exemplary diagram illustrating the process of pre-training the ethical model according to an embodiment of the present invention. Referring to FIG. 2, the learning data augmentation apparatus 100 may pre-train the generator model 230 and the discriminator model 250.

For example, the learning data augmentation apparatus 100 may generate a plurality of tokens 210 from a specific utterance and input them into the generator model 230 to transform one or more of the tokens into fake tokens. The learning data augmentation apparatus 100 may pre-train the generator model 230 to generate, from the plurality of tokens 210, a plurality of tokens 240 that includes fake tokens.

Specifically, the learning data augmentation apparatus 100 may determine the set of positions to mask in the plurality of tokens 210 generated from the specific utterance. For example, each position to be masked is an integer between 1 and n and may be expressed as Equation 1 below.

<Equation 1>

In Equation 1, m denotes the positions at which masking is performed within the plurality of tokens 210, and k denotes the number of positions to mask. For example, k may be 0.15n, in which case the learning data augmentation apparatus 100 masks about 15% of all the tokens 210. The tokens at the positions selected for masking may be expressed as Equation 2 below.

<Equation 2>

For example, from the plurality of tokens 210, {the, chef, cooked, the, meal}, the learning data augmentation apparatus 100 may mask the first and third tokens to generate the masked plurality of tokens 220, {[MASK], chef, [MASK], the, meal}.
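The masking step can be sketched in a few lines. Several details here are assumptions rather than statements from the text: positions are drawn uniformly without replacement, k is rounded from 0.15n, and at least one position is always masked. The function names are hypothetical.

```python
import random
from typing import List, Optional


def choose_mask_positions(n: int, ratio: float = 0.15,
                          rng: Optional[random.Random] = None) -> List[int]:
    """Pick k = round(ratio * n) distinct 1-based positions (at least one; an assumption)."""
    rng = rng or random.Random()
    k = max(1, round(ratio * n))
    return sorted(rng.sample(range(1, n + 1), k))


def apply_mask(tokens: List[str], positions: List[int],
               mask: str = "[MASK]") -> List[str]:
    """Replace the tokens at the given 1-based positions with the mask symbol."""
    chosen = set(positions)
    return [mask if i in chosen else tok
            for i, tok in enumerate(tokens, start=1)]


tokens = ["the", "chef", "cooked", "the", "meal"]
# Positions 1 and 3 are the ones used in the example in the text.
print(apply_mask(tokens, [1, 3]))
```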

For example, the learning data augmentation apparatus 100 may input the masked plurality of tokens 220 into the generator model 230 to predict what the masked tokens are. The generator model 230 may be expressed as Equation 3 below.

<Equation 3>

In Equation 3, e(·) denotes the embedding; through Equation 3, the output layer and the embedding layer of the generator model 230 may share weights.

For example, using Equation 3, the learning data augmentation apparatus 100 may predict, through the generator model 230, the meaning of the masked tokens contained in the masked plurality of tokens 220, and may generate fake tokens from those predictions. That is, the learning data augmentation apparatus 100 may input the masked plurality of tokens 220 into the generator model 230 to generate the plurality of tokens 240 that includes fake tokens.

For example, the learning data augmentation apparatus 100 may input the masked plurality of tokens 220, {[MASK], chef, [MASK], the, meal}, into the generator model 230 to generate the plurality of tokens 240 that includes a fake token, {the, chef, ate, the, meal}. The generation of fake tokens may be expressed as Equation 4 below.

<Equation 4>

Specifically, through the generator model 230, the learning data augmentation apparatus 100 may predict the masked first token as 'the', the same as the original, and may predict the masked third token as 'ate', which differs from the original, and output it as a replaced token. That is, the learning data augmentation apparatus 100 may generate, through the generator model 230, the plurality of tokens 240 in which the third token has been replaced by a fake token.

For example, the learning data augmentation apparatus 100 may input the plurality of tokens 240 that includes the fake token into the discriminator model 250 to determine whether a fake token is present. That is, the learning data augmentation apparatus 100 may pre-train the discriminator model 250 to identify the fake tokens contained in the plurality of tokens 240.

Specifically, the discriminator model 250 may be pre-trained as a binary classifier that distinguishes, for each token in the plurality of tokens 240, whether it is an original token or a fake (replaced) token produced by the generator model 230. Here, an original token is one that is the same as in the input, and a fake token is one that was replaced by the generator model 230 and therefore differs from the input.
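The binary targets for the discriminator follow directly from comparing the original tokens with the generator output. The sketch below assumes the label convention 0 = original and 1 = fake (replaced); the function name is hypothetical.

```python
from typing import List


def replaced_labels(original: List[str], corrupted: List[str]) -> List[int]:
    """Label each position 1 if the generator changed the token, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("token sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


original_tokens = ["the", "chef", "cooked", "the", "meal"]
generator_output = ["the", "chef", "ate", "the", "meal"]
# The first token was masked but predicted as 'the' again, so it counts as original.
print(replaced_labels(original_tokens, generator_output))
```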

For example, the learning data augmentation apparatus 100 may pre-train the generator model 230 and the discriminator model 250 using a loss function. The loss function may be expressed as Equation 5 below.

<Equation 5>

Referring to Equation 5, the loss function may be the mean squared error (MSE). For example, the learning data augmentation apparatus 100 may pre-train the generator model 230 and the discriminator model 250 using a loss function of the form of Equation 5.
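As an illustration of a mean squared error, the sketch below scores hypothetical per-token discriminator outputs against 0/1 targets. Which quantities Equation 5 actually compares, and how the generator and discriminator terms are combined, is not stated in this passage, so this pairing is an assumption.

```python
from typing import Sequence


def mean_squared_error(predictions: Sequence[float],
                       targets: Sequence[float]) -> float:
    """Average of squared differences between predictions and targets."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Hypothetical discriminator scores for {the, chef, ate, the, meal}; 1 marks a fake token.
scores = [0.1, 0.2, 0.9, 0.1, 0.0]
labels = [0, 0, 1, 0, 0]
print(mean_squared_error(scores, labels))
```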

FIG. 3 is an exemplary diagram illustrating the ethical model according to an embodiment of the present invention. Referring to FIG. 3, the learning data augmentation apparatus 100 may generate the ethical model 300 based on the discriminator model 320 for which pre-training has been completed.

For example, the learning data augmentation apparatus 100 may generate the ethical model 300 by adding the feed-forward neural network 330 to the pre-trained discriminator model 320. The learning data augmentation apparatus 100 may fine-tune the pre-trained discriminator model 320 so that it classifies each token into a class.

For example, the learning data augmentation apparatus 100 may input the plurality of tokens 310 generated from a specific utterance into the ethical model 300 to classify each token into one or more of the plurality of preset unethical classes 340. Here, the preset unethical classes 340 may be subdivided into a total of 13 classes, for example: profanity, insult, violence threat, obscenity, crime promotion, gender, age, race/origin, sexual orientation, disability, religion, political orientation, and other hate.

For example, the learning data augmentation apparatus 100 may input the plurality of tokens 310 generated from a specific utterance, e.g., {the, chef, cooked, the, meal}, into the ethical model 300 to calculate, for each token, e.g., the {meal} token, the probability value of belonging to each of the 13 unethical classes 340. Based on the probability values, the learning data augmentation apparatus 100 may classify each token, e.g., the {meal} token, into the other hate class.
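Turning the 13 per-class probabilities of a token into class labels could be sketched as below. The English class identifiers, the `classify_token` name, the 0.5 cut-off, and the probability vector for the {meal} token are all illustrative assumptions; the model that produces the probabilities is not shown.

```python
from typing import List, Sequence

# English identifiers for the 13 classes listed in the text (naming is an assumption).
UNETHICAL_CLASSES = [
    "profanity", "insult", "violence_threat", "obscenity", "crime_promotion",
    "gender", "age", "race_origin", "sexual_orientation", "disability",
    "religion", "political_orientation", "other_hate",
]


def classify_token(probs: Sequence[float], min_prob: float = 0.5) -> List[str]:
    """Return every class whose probability reaches min_prob, highest first."""
    if len(probs) != len(UNETHICAL_CLASSES):
        raise ValueError("expected one probability per class")
    ranked = sorted(zip(UNETHICAL_CLASSES, probs), key=lambda cp: cp[1], reverse=True)
    return [name for name, p in ranked if p >= min_prob]


# Illustrative probabilities for the {meal} token; only 'other_hate' passes the cut-off.
meal_probs = [0.0] * 12 + [0.6]
print(classify_token(meal_probs))
```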

FIGS. 4 and 5 are exemplary diagrams illustrating the process of augmenting a harmful word according to an embodiment of the present invention. Referring to FIG. 4, when the detected harmful word is a single word, the learning data augmentation apparatus 100 may compare the highest probability value (x) among the unethical-class probability values for that word with a preset probability value (e.g., 0.5) (S410).

For example, for the detected harmful word '범죄' (crime), the learning data augmentation apparatus 100 may take the higher of the crime promotion class probability (e.g., 0.7) and the other hate class probability (e.g., 0.4), namely the crime promotion class probability (e.g., 0.7), and compare it with the preset probability value (e.g., 0.5).

When the unethical-class probability value x for the word is greater than the preset probability value, e.g., 0.5 (S411), the learning data augmentation apparatus 100 may determine whether every syllable of the word consists only of an initial consonant and a medial vowel (S420).

For example, when the crime promotion class probability for '범죄' (e.g., 0.7) is greater than the preset probability value (e.g., 0.5), the learning data augmentation apparatus 100 may determine whether every syllable of '범죄' consists only of an initial consonant and a medial vowel.

When every syllable of the word consists only of an initial consonant and a medial vowel (S421), the learning data augmentation apparatus 100 may apply all of the first to fifth methods among the predefined syllable transformation methods (S430).

For example, for a word such as '바보' (fool), whose syllables consist only of an initial consonant and a medial vowel, the learning data augmentation apparatus 100 may separate the initial consonants, medial vowels, or final consonants of '바보'; replace its vowels with similar vowels; replace its consonants and vowels with similar-looking characters; insert a character between its syllables; and change the order of its syllables, thereby augmenting the learning data for '바보' with 'ㅂㅏㅂㅗ', '뱌뵤', 'ㅂr보', '바@보', and '보바'.
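Four of the five syllable transformations can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables (code point = 0xAC00 + (initial × 21 + medial) × 28 + final). The function names, the similar-vowel table, and the default inserted character are hypothetical; the third method (look-alike characters such as 'r' for 'ㅏ') is omitted for brevity.

```python
from typing import Optional, Tuple

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                 # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"            # 21 medial vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

SIMILAR_VOWEL = {"ㅏ": "ㅑ", "ㅗ": "ㅛ"}  # hypothetical similar-vowel table


def split_syllable(ch: str) -> Optional[Tuple[int, int, int]]:
    """Return (initial, medial, final) indices, or None if ch is not a Hangul syllable."""
    idx = ord(ch) - 0xAC00
    if not 0 <= idx < 11172:
        return None
    return idx // 588, (idx % 588) // 28, idx % 28


def to_jamo(word: str) -> str:
    """First method (sketch): separate each syllable into its jamo."""
    out = []
    for ch in word:
        parts = split_syllable(ch)
        if parts is None:
            out.append(ch)
        else:
            c, v, f = parts
            out.append(CHOSEONG[c] + JUNGSEONG[v] + JONGSEONG[f])
    return "".join(out)


def swap_vowels(word: str) -> str:
    """Second method (sketch): replace each vowel with a similar vowel and recompose."""
    out = []
    for ch in word:
        parts = split_syllable(ch)
        if parts is None:
            out.append(ch)
            continue
        c, v, f = parts
        v2 = JUNGSEONG.index(SIMILAR_VOWEL.get(JUNGSEONG[v], JUNGSEONG[v]))
        out.append(chr(0xAC00 + (c * 21 + v2) * 28 + f))
    return "".join(out)


def insert_char(word: str, ch: str = "@", pos: int = 1) -> str:
    """Fourth method (sketch): insert a character between syllables."""
    return word[:pos] + ch + word[pos:]


def reverse_syllables(word: str) -> str:
    """Fifth method (sketch): change the syllable order by reversing it."""
    return word[::-1]
```

Applied to '바보', these functions reproduce the forms 'ㅂㅏㅂㅗ', '뱌뵤', '바@보', and '보바' listed in the text.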

On the other hand, when not every syllable of the word consists only of an initial consonant and a medial vowel (S422), the learning data augmentation apparatus 100 may determine whether every syllable of the word consists of an initial consonant, a medial vowel, and a final consonant (S440).

For example, when not every syllable consists only of an initial consonant and a medial vowel, as in '범죄', the learning data augmentation apparatus 100 may determine whether every syllable of '범죄' consists of an initial consonant, a medial vowel, and a final consonant.

When every syllable of the word consists of an initial consonant, a medial vowel, and a final consonant (S441), the learning data augmentation apparatus 100 may apply the first, fourth, and fifth methods among the predefined syllable transformation methods (S450).

For example, for a word such as '형벌' (punishment), in which every syllable consists of an initial consonant, a medial vowel, and a final consonant, the learning data augmentation apparatus 100 may separate the initial consonants, medial vowels, or final consonants of '형벌', insert a character between its syllables, and change the order of its syllables, thereby augmenting the learning data for '형벌' with 'ㅎㅕㅇ ㅂㅓㄹ', '형@벌', and '벌형'.

On the other hand, when not every syllable of the word consists of an initial consonant, a medial vowel, and a final consonant (S442), the learning data augmentation apparatus 100 may determine whether at least one syllable of the word consists only of an initial consonant and a medial vowel (S460).

For example, when not every syllable consists of an initial consonant, a medial vowel, and a final consonant, as in '무근', the learning data augmentation apparatus 100 may determine whether at least one syllable of '무근' consists only of an initial consonant and a medial vowel.

When at least one syllable of the word consists only of an initial consonant and a medial vowel (S461), the learning data augmentation apparatus 100 may apply the first to third methods among the predefined syllable transformation methods (S470).

For example, for a word such as '나쁜' (bad), in which at least one syllable consists only of an initial consonant and a medial vowel, the learning data augmentation apparatus 100 may separate the initial consonants, medial vowels, or final consonants of '나쁜', replace its vowels with similar vowels, and replace its consonants and vowels with similar-looking characters, thereby augmenting the learning data for '나쁜' with 'ㄴㅏㅃㅡㄴ', '냐쁜', and 'ㄴr쁜'.
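The branching of FIG. 4 (S410 to S470) can be summarized as a small selection function. This is a sketch that assumes the word contains only precomposed Hangul syllables; the function names are hypothetical, and the returned integers stand for the first to fifth methods.

```python
from typing import List


def has_final(ch: str) -> bool:
    """True if a precomposed Hangul syllable carries a final consonant."""
    idx = ord(ch) - 0xAC00
    if not 0 <= idx < 11172:
        raise ValueError("expected a precomposed Hangul syllable")
    return idx % 28 != 0


def select_methods_fig4(word: str, x: float, min_prob: float = 0.5) -> List[int]:
    """Return the method numbers to apply, following the FIG. 4 branches."""
    if x <= min_prob:                      # S410: highest class probability too low
        return []
    finals = [has_final(ch) for ch in word]
    if not any(finals):                    # S421: every syllable is initial + medial only
        return [1, 2, 3, 4, 5]             # S430
    if all(finals):                        # S441: every syllable also has a final
        return [1, 4, 5]                   # S450
    return [1, 2, 3]                       # S461: mixed word, S470


for w in ["바보", "형벌", "나쁜"]:
    print(w, select_methods_fig4(w, 0.7))
```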

Referring to FIG. 5, when the detected harmful word is a single word, the learning data augmentation apparatus 100 may compare the highest probability value (x) among the unethical-class probability values for the word with preset probability values (e.g., 0.5 and 0.3) (S510).

For example, for the detected harmful word '바보', the learning data augmentation apparatus 100 may take the higher of the crime promotion class probability (e.g., 0.2) and the other hate class probability (e.g., 0.4), namely the other hate class probability (e.g., 0.4), and compare it with the preset probability values (e.g., 0.5 and 0.3).

When the unethical-class probability value x for the word is less than one preset probability value, e.g., 0.5, but greater than another, e.g., 0.3 (S511), the learning data augmentation apparatus 100 may determine whether every syllable of the word consists only of an initial consonant and a medial vowel (S520).

For example, when the other hate class probability for '바보' (e.g., 0.4) is less than 0.5 and greater than 0.3, the learning data augmentation apparatus 100 may determine whether every syllable of '바보' consists only of an initial consonant and a medial vowel.

When every syllable of the word consists only of an initial consonant and a medial vowel (S521), the learning data augmentation apparatus 100 may apply all of the first to fifth methods among the predefined syllable transformation methods (S530).

On the other hand, when not every syllable of the word consists only of an initial consonant and a medial vowel (S522), the learning data augmentation apparatus 100 may compare the highest probability value (x) among the unethical-class probability values for the word with a preset probability value (e.g., 0.3), and may compare the sum of the probability values of a preset number (e.g., three) of top-ranked unethical classes with a preset threshold (e.g., 0.3) (S540).

For example, when not every syllable consists only of an initial consonant and a medial vowel, as in '형벌', the learning data augmentation apparatus 100 may compare the crime promotion class probability for '형벌' (e.g., 0.2) with the preset probability value (e.g., 0.3), and may compare the sum of the top three class probabilities for '형벌', namely the crime promotion class (e.g., 0.2), the religion class (e.g., 0.1), and the other hate class (e.g., 0.1), with the preset threshold (e.g., 0.3).

When the unethical-class probability value x for the word is less than the preset probability value, e.g., 0.3, but the sum of the probability values of the preset number of top-ranked unethical classes, e.g., the top three, is greater than the preset threshold, e.g., 0.3 (S540), the learning data augmentation apparatus 100 may apply all of the first to fifth methods among the predefined syllable transformation methods (S550).

For example, when the crime promotion class probability for '형벌', 0.2, is less than the preset probability value, e.g., 0.3, but the sum of the top three class probabilities, namely the crime promotion class (e.g., 0.2), the religion class (e.g., 0.1), and the other hate class (e.g., 0.1), is 0.4 and thus greater than the preset threshold, e.g., 0.3, the learning data augmentation apparatus 100 may separate the initial consonants, medial vowels, or final consonants of '형벌'; replace its vowels with similar vowels; replace its consonants and vowels with similar-looking characters; insert a character between its syllables; and change the order of its syllables, thereby augmenting the learning data for '형벌' with forms such as 'ㅎㅕㅇㅂㅓㄹ', '형별', '형%벌', and '벌형'.
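The two conditions of FIG. 5 under which all five methods are applied can be sketched as follows. The text does not make the nesting of the checks fully explicit, so evaluating the syllable-shape test first is an assumption, as are the function name and the class keys; the word is assumed to contain only precomposed Hangul syllables.

```python
from typing import Dict


def apply_all_methods_fig5(word: str,
                           class_probs: Dict[str, float],
                           upper: float = 0.5,
                           lower: float = 0.3,
                           k: int = 3,
                           sum_threshold: float = 0.3) -> bool:
    """True when FIG. 5 would apply the first to fifth methods (S530 or S550)."""
    x = max(class_probs.values())
    # A syllable has no final consonant when its offset from 0xAC00 is a multiple of 28.
    open_only = all((ord(ch) - 0xAC00) % 28 == 0 for ch in word)
    if open_only:
        return lower < x < upper                      # S511 and S521 -> S530
    top_sum = sum(sorted(class_probs.values(), reverse=True)[:k])
    return x < lower and top_sum > sum_threshold      # S540 -> S550


print(apply_all_methods_fig5("바보", {"crime_promotion": 0.2, "other_hate": 0.4}))
print(apply_all_methods_fig5(
    "형벌", {"crime_promotion": 0.2, "religion": 0.1, "other_hate": 0.1}))
```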

FIGS. 6 and 7 are exemplary diagrams illustrating the process of augmenting a harmful phrase according to an embodiment of the present invention. Referring to FIG. 6, when the detected harmful word is a phrase, the learning data augmentation apparatus 100 may compare the highest probability value (x) among the unethical-class probability values for the phrase with a preset probability value (e.g., 0.5) (S610).

For example, for the detected harmful phrase '유해 환경을 배제하자', the learning data augmentation apparatus 100 may take the higher of the crime promotion class probability (e.g., 0.7) and the other hate class probability (e.g., 0.4), namely the crime promotion class probability (e.g., 0.7), and compare it with the preset probability value (e.g., 0.5).

When the unethical-class probability value x for the phrase is greater than the preset probability value, e.g., 0.5 (S611), the learning data augmentation apparatus 100 may determine whether the phrase is included in the harmful word dictionary (S620).

For example, when the crime promotion class probability for '유해 환경을 배제하자' (e.g., 0.7) is greater than the preset probability value (e.g., 0.5), the learning data augmentation apparatus 100 may determine whether each word segment contained in '유해 환경을 배제하자', namely '유해', '환경을', and '배제하자', is included in the harmful word dictionary.

When the phrase is not included in the harmful word dictionary (S621), the learning data augmentation apparatus 100 may augment the phrase by applying at least one of the first to fifth methods among the predefined syllable transformation methods to it (S630).

For example, when the word segments '유해', '환경을', and '배제하자' are not included in the harmful word dictionary, the learning data augmentation apparatus 100 may, for each segment, separate the initial consonants, medial vowels, or final consonants; replace vowels with similar vowels; replace consonants and vowels with similar-looking characters; insert a character within the segment; or change the order of the syllables within the segment.
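The FIG. 6 decision (S610 to S630) can be sketched as a probability check followed by a dictionary lookup per word segment. Splitting on whitespace and requiring that none of the segments be present in the dictionary are assumptions about details the text leaves open; the function names are hypothetical.

```python
from typing import List, Set


def segments_missing_from_dict(phrase: str, harmful_words: Set[str]) -> List[str]:
    """Return the whitespace-delimited segments that are absent from the dictionary."""
    return [seg for seg in phrase.split() if seg not in harmful_words]


def should_apply_syllable_methods(phrase: str, x: float,
                                  harmful_words: Set[str],
                                  min_prob: float = 0.5) -> bool:
    """True when x exceeds the preset value and no segment is in the dictionary."""
    if x <= min_prob:                                   # S610 / S611
        return False
    missing = segments_missing_from_dict(phrase, harmful_words)
    return len(missing) == len(phrase.split())          # S620 / S621 -> S630


print(should_apply_syllable_methods("유해 환경을 배제하자", 0.7, set()))
```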

Referring to FIG. 7, when the detected harmful word is a phrase, the learning data augmentation apparatus 100 may compare the sum of a preset number (e.g., three) of the highest unethical class probability values for the phrase with a preset threshold (e.g., 0.3) (S710).

For example, for the detected harmful phrase '유해 환경을 배제하자' ("Let's exclude harmful environments"), the learning data augmentation apparatus 100 may compare the sum (e.g., 0.6) of the top three probability values, namely those of the crime promotion class (e.g., 0.3), the violence threat class (e.g., 0.1) and the other hate class (e.g., 0.2), with the preset threshold (e.g., 0.3).

When the sum of the preset number (for example, the top three) of unethical class probability values for the phrase is greater than the preset threshold, for example 0.3 (S711), the learning data augmentation apparatus 100 may augment the phrase by applying a predefined phrase transformation method to it (S720).

For example, when the sum (e.g., 0.6) of the top three probability values for '유해 환경을 배제하자' ("Let's exclude harmful environments"), namely those of the crime promotion class (e.g., 0.3), the violence threat class (e.g., 0.1) and the other hate class (e.g., 0.2), is greater than the preset threshold (e.g., 0.3), the learning data augmentation apparatus 100 may replace a word contained in '유해 환경을 배제하자' with a similar word, change the arrangement order of its words, and delete a word contained in the phrase, thereby generating '유해 환경을 배제하자', '환경 유해을 차단하자' and '유해을 차단하자' from '유해 환경을 배제하자' to augment the learning data.
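The threshold check of S710/S711 and the phrase transformations of S720 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class names, the similar-word table `SIMILAR_WORDS`, and the specific choices of which words to swap or delete are hypothetical and are not taken from this document; only the top-3 sum and the 0.3 threshold follow the example values above.

```python
# Illustrative sketch; names and the similar-word table are hypothetical.
from typing import Dict, List

SIMILAR_WORDS = {"배제하자": "차단하자"}  # assumed example mapping


def top_k_sum(class_probs: Dict[str, float], k: int = 3) -> float:
    """Sum of the k highest unethical-class probability values."""
    return sum(sorted(class_probs.values(), reverse=True)[:k])


def passes_phrase_gate(class_probs: Dict[str, float],
                       k: int = 3, threshold: float = 0.3) -> bool:
    """S710/S711 (sketch): proceed only if the top-k sum exceeds the threshold."""
    return top_k_sum(class_probs, k) > threshold


def transform_phrase(phrase: str) -> List[str]:
    """S720 (sketch): similar-word substitution, reordering, and deletion."""
    words = phrase.split()
    substituted = " ".join(SIMILAR_WORDS.get(w, w) for w in words)
    reordered = " ".join(words[1:2] + words[0:1] + words[2:])  # swap first two
    deleted = " ".join(words[1:])                               # drop first word
    return [substituted, reordered, deleted]


if __name__ == "__main__":
    probs = {"crime_promotion": 0.3, "violence_threat": 0.1,
             "other_hate": 0.2, "profanity": 0.05}
    if passes_phrase_gate(probs):
        for variant in transform_phrase("유해 환경을 배제하자"):
            print(variant)
```

Each transformation is applied independently here so that one input phrase yields several variants; combining them, as the example variants in the text appear to do, would be an equally plausible reading.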

The learning data augmentation apparatus 100 may input the modified phrase into the ethical model to calculate a plurality of probability values that the modified phrase belongs to each of the plurality of preset unethical classes (S730).

For example, the learning data augmentation apparatus 100 may input the modified phrases '유해 환경을 배제하자', '환경 유해을 차단하자' and '유해을 차단하자' into the ethical model to calculate, for each phrase, the probability value of belonging to each of the 13 unethical classes.

The learning data augmentation apparatus 100 may compare the sum of a preset number (e.g., three) of the highest unethical class probability values for the modified phrase with a preset threshold (e.g., 0.3) (S740).

For example, for the modified phrase '유해 환경을 배제하자', the learning data augmentation apparatus 100 may compare the sum (e.g., 0.5) of the top three probability values, namely those of the crime promotion class (e.g., 0.2), the violence threat class (e.g., 0.1) and the other hate class (e.g., 0.2), with the preset threshold (e.g., 0.3).

When the sum of the preset number (for example, the top three) of unethical class probability values for the modified phrase is greater than the preset threshold, for example 0.3 (S741), the learning data augmentation apparatus 100 may build a harmful phrase dictionary in which, for each phrase, the modified phrase and the probability value of at least one unethical class into which the modified phrase is classified are mapped and stored (S750).

For example, when the sum (e.g., 0.5) of the top three probability values for the modified phrase '유해 환경을 배제하자' ("Let's exclude harmful environments"), namely those of the crime promotion class (e.g., 0.2), the violence threat class (e.g., 0.1) and the other hate class (e.g., 0.2), is greater than the preset threshold (e.g., 0.3), the learning data augmentation apparatus 100 may build a harmful phrase dictionary in which the modified phrase '유해 환경을 배제하자' and its probability values for the crime promotion class (e.g., 0.2), the violence threat class (e.g., 0.1) and the other hate class (e.g., 0.2) are mapped to the harmful phrase '유해 환경을 차단하자' ("Let's block harmful environments") and stored.
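Steps S730 to S750 amount to re-scoring each modified phrase and keeping only those that still exceed the threshold. The Python sketch below illustrates this under assumptions: the ethical model is represented by a caller-supplied function `score_fn` (a stand-in, since the model interface is not described here), and the nested-dictionary layout of the harmful phrase dictionary is one hypothetical way to realize the mapping.

```python
# Illustrative sketch; score_fn stands in for the ethical model and the
# dictionary layout is an assumption.
from typing import Callable, Dict, List

Scores = Dict[str, float]


def build_harmful_phrase_dictionary(
    phrase: str,
    variants: List[str],
    score_fn: Callable[[str], Scores],
    k: int = 3,
    threshold: float = 0.3,
) -> Dict[str, Dict[str, Scores]]:
    """Map a harmful phrase to its retained variants and their top-k scores."""
    entries: Dict[str, Scores] = {}
    for variant in variants:
        scores = score_fn(variant)                        # S730: re-score variant
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if sum(p for _, p in top) > threshold:            # S740/S741: gate
            entries[variant] = dict(top)                  # S750: store mapping
    return {phrase: entries}
```

Variants whose top-k sum falls at or below the threshold are simply dropped in this sketch, on the assumption that they no longer read as harmful to the model.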

FIG. 8 is a flowchart of a learning data augmentation method according to an embodiment of the present invention. The learning data augmentation method illustrated in FIG. 8 includes steps processed in time series according to the embodiments illustrated in FIGS. 1 to 7. Therefore, even where omitted below, the description given for the embodiments of FIGS. 1 to 7 also applies to the method by which the learning data augmentation apparatus augments learning data for harmful words.

In step S810, the learning data augmentation apparatus may input a specific utterance included in the learning data into a pre-trained ethical model to calculate a plurality of probability values that the vocabulary and phrases included in the specific utterance belong to each of a plurality of preset unethical classes.

In step S820, the learning data augmentation apparatus may detect a harmful word included in the specific utterance based on the plurality of probability values.

In step S830, the learning data augmentation apparatus may classify the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word.

In step S840, the learning data augmentation apparatus may generate new learning data by augmenting the harmful word, applying a predefined syllable transformation method to the harmful word based on the probability value of at least one unethical class into which the harmful word is classified.
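The four steps S810 to S840 can be read as a simple pipeline. The Python sketch below is one possible arrangement under assumptions: `score_fn` stands in for the ethical model, `augment_fn` stands in for any of the syllable transformation methods, whitespace splitting stands in for the division into words, and the 0.5 cut-off follows the example value given earlier. None of these names come from this document.

```python
# Illustrative sketch; score_fn, augment_fn and the output format are assumptions.
from typing import Callable, Dict, List, Tuple


def augment_utterance(
    utterance: str,
    score_fn: Callable[[str], Dict[str, float]],
    augment_fn: Callable[[str], List[str]],
    min_prob: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return (new utterance, unethical class) pairs as new learning data."""
    new_samples: List[Tuple[str, str]] = []
    for word in utterance.split():
        scores = score_fn(word)                                   # S810: score
        label, prob = max(scores.items(), key=lambda kv: kv[1])   # S830: classify
        if prob > min_prob:                                       # S820: detect
            for variant in augment_fn(word):                      # S840: augment
                new_samples.append((utterance.replace(word, variant, 1), label))
    return new_samples
```

Here detection and classification are folded into a single pass over the scores for brevity; an implementation that keeps them as separate units, as in the apparatus description, would behave the same way.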

In the above description, steps S810 to S840 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as needed, and the order of the steps may be changed.

The method by which the learning data augmentation apparatus described with reference to FIGS. 1 to 8 augments learning data for harmful words may also be implemented in the form of a computer program stored in a computer-readable recording medium and executed by a computer, or in the form of a recording medium containing computer-executable instructions.

A computer-readable recording medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable recording medium may also include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data.

The foregoing description of the present invention is illustrative, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims below rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: learning data augmentation apparatus
110: harmful word detection unit
120: harmful word classification unit
130: harmful word augmentation unit
140: learning unit

Claims

An apparatus for augmenting learning data for harmful words, comprising:
a harmful word detection unit that inputs a specific utterance included in the learning data into a pre-trained ethical model to calculate a plurality of probability values that the vocabulary and phrases included in the specific utterance belong to each of a plurality of preset unethical classes, and detects a harmful word included in the specific utterance based on the plurality of probability values;
a harmful word classification unit that classifies the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word; and
a harmful word augmentation unit that generates new learning data by augmenting the harmful word, applying a predefined syllable transformation method to the harmful word based on the probability value of at least one unethical class into which the harmful word is classified,
wherein the predefined syllable transformation method includes at least one of:
a first method of separating the initial consonant, medial vowel, or final consonant included in the harmful word; a second method of replacing a vowel of the harmful word with a similar vowel; a third method of replacing consonants and vowels of the harmful word with similar characters; a fourth method of inserting a character between the harmful words; and a fifth method of changing the arrangement order of the harmful words,
wherein the harmful word augmentation unit:
when the detected harmful word is a word and the highest probability value among the probability values of the at least one unethical class into which the word is classified is greater than or equal to a preset probability value, augments the word by applying at least one of the first to fifth methods;
builds a harmful word dictionary in which, for each word, the augmented word and the probability value of the at least one unethical class into which the word is classified are mapped and stored;
when the detected harmful word is a phrase and the phrase is not included in the harmful word dictionary, augments the phrase by applying at least one of the first to fifth methods to the phrase; and
when the sum of a preset number of the highest probability values among the probability values of the at least one unethical class into which the phrase is classified is greater than or equal to a preset threshold, augments the phrase by applying a predefined phrase transformation method to the phrase.

The apparatus of claim 1,
wherein the harmful word detection unit divides the specific utterance into words, inputs each divided word into the pre-trained ethical model to calculate the plurality of probability values for at least one word, and detects the harmful word included in the specific utterance.

(Deleted)

The apparatus of claim 1,
wherein the harmful word augmentation unit applies all of the first to fifth methods when every character of the word consists only of an initial consonant and a medial vowel, applies the first method, the fourth method and the fifth method when every character of the word consists of an initial consonant, a medial vowel and a final consonant, and applies the first to third methods when the characters of the word include at least one character consisting only of an initial consonant and a medial vowel.

(Deleted)

The apparatus of claim 1,
wherein the predefined phrase transformation method includes at least one of:
a sixth method of replacing a word included in the phrase with a similar word; a seventh method of changing the arrangement order of the words within the phrase; and an eighth method of deleting a word included in the phrase.

The apparatus of claim 9,
wherein the harmful word augmentation unit:
inputs a phrase modified according to the predefined phrase transformation method into the ethical model to calculate a plurality of probability values that the modified phrase belongs to each of the plurality of preset unethical classes; and
when the sum of the preset number of the highest unethical class probability values is greater than or equal to a preset threshold, builds a harmful phrase dictionary in which, for each phrase, the modified phrase and the probability value of at least one unethical class into which the modified phrase is classified are mapped and stored.

The apparatus of claim 10,
further comprising a learning unit that trains the ethical model,
wherein the learning unit pre-trains a generation model (Generator) to transform at least one of a plurality of tokens generated from the specific utterance into a fake token, and pre-trains a discrimination model (Discriminator) to discriminate the fake token included in the plurality of tokens.

The apparatus of claim 11,
wherein the ethical model includes the discrimination model and a feed forward neural network, and
the learning unit trains the ethical model by updating the recurrent neural network based on the harmful word dictionary and the harmful phrase dictionary.

A method of augmenting learning data for harmful words, performed by a learning data augmentation apparatus, the method comprising:
inputting a specific utterance included in the learning data into a pre-trained ethical model to calculate a plurality of probability values that the vocabulary and phrases included in the specific utterance belong to each of a plurality of preset unethical classes;
detecting a harmful word included in the specific utterance based on the plurality of probability values;
classifying the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word; and
generating new learning data by augmenting the harmful word, applying a predefined syllable transformation method to the harmful word based on the probability value of at least one unethical class into which the harmful word is classified,
wherein the predefined syllable transformation method includes at least one of:
a first method of separating the initial consonant, medial vowel, or final consonant included in the harmful word; a second method of replacing a vowel of the harmful word with a similar vowel; a third method of replacing consonants and vowels of the harmful word with similar characters; a fourth method of inserting a character between the harmful words; and a fifth method of changing the arrangement order of the harmful words,
wherein the generating of the new learning data by augmenting the harmful word comprises:
when the detected harmful word is a word and the highest probability value among the probability values of the at least one unethical class into which the word is classified is greater than or equal to a preset probability value, augmenting the word by applying at least one of the first to fifth methods;
building a harmful word dictionary in which, for each word, the augmented word and the probability value of the at least one unethical class into which the word is classified are mapped and stored;
when the detected harmful word is a phrase and the phrase is not included in the harmful word dictionary, augmenting the phrase by applying at least one of the first to fifth methods to the phrase; and
when the sum of a preset number of the highest probability values among the probability values of the at least one unethical class into which the phrase is classified is greater than or equal to a preset threshold, augmenting the phrase by applying a predefined phrase transformation method to the phrase.

The method of claim 13,
wherein the detecting of the harmful word comprises:
dividing the specific utterance into words;
inputting each divided word into the pre-trained ethical model to calculate the plurality of probability values for at least one word; and
detecting the harmful word included in the specific utterance.

(Deleted)

The method of claim 13,
wherein the generating of the new learning data by augmenting the harmful word further comprises:
applying all of the first to fifth methods when every character of the word consists only of an initial consonant and a medial vowel, applying the first method, the fourth method and the fifth method when every character of the word consists of an initial consonant, a medial vowel and a final consonant, and applying the first to third methods when the characters of the word include at least one character consisting only of an initial consonant and a medial vowel.

(Deleted)

The method of claim 13,
wherein the predefined phrase transformation method includes at least one of:
a sixth method of replacing a word included in the phrase with a similar word; a seventh method of changing the arrangement order of the words within the phrase; and an eighth method of deleting a word included in the phrase.

The method of claim 21,
wherein the generating of the new learning data by augmenting the harmful word further comprises:
inputting a phrase modified according to the predefined phrase transformation method into the ethical model to calculate a plurality of probability values that the modified phrase belongs to each of the plurality of preset unethical classes; and
when the sum of the preset number of the highest unethical class probability values is greater than or equal to a preset threshold, building a harmful phrase dictionary in which, for each phrase, the modified phrase and the probability value of at least one unethical class into which the modified phrase is classified are mapped and stored.

The method of claim 22,
further comprising training the ethical model,
wherein the training of the ethical model comprises:
pre-training a generation model (Generator) to transform at least one of a plurality of tokens generated from the specific utterance into a fake token; and
pre-training a discrimination model (Discriminator) to discriminate the fake token included in the plurality of tokens.

The method of claim 23,
wherein the ethical model includes the discrimination model and a feed forward neural network, and
the training of the ethical model further comprises:
training the ethical model by updating the recurrent neural network based on the harmful word dictionary and the harmful phrase dictionary.

A computer program stored in a computer-readable recording medium and including a sequence of instructions for augmenting learning data for harmful words,
wherein the computer program, when executed by a computing device, includes a sequence of instructions that cause the computing device to:
input a specific utterance included in the learning data into a pre-trained ethical model to calculate a plurality of probability values that the vocabulary and phrases included in the specific utterance belong to each of a plurality of preset unethical classes;
detect a harmful word included in the specific utterance based on the plurality of probability values;
classify the harmful word into at least one of the plurality of unethical classes based on the plurality of probability values for the harmful word; and
generate new learning data by augmenting the harmful word, applying a predefined syllable transformation method to the harmful word based on the probability value of at least one unethical class into which the harmful word is classified,
wherein the predefined syllable transformation method includes at least one of:
a first method of separating the initial consonant, medial vowel, or final consonant included in the harmful word; a second method of replacing a vowel of the harmful word with a similar vowel; a third method of replacing consonants and vowels of the harmful word with similar characters; a fourth method of inserting a character between the harmful words; and a fifth method of changing the arrangement order of the harmful words,
wherein, in generating the new learning data by augmenting the harmful word, the computing device is caused to:
when the detected harmful word is a word and the highest probability value among the probability values of the at least one unethical class into which the word is classified is greater than or equal to a preset probability value, augment the word by applying at least one of the first to fifth methods;
build a harmful word dictionary in which, for each word, the augmented word and the probability value of the at least one unethical class into which the word is classified are mapped and stored;
when the detected harmful word is a phrase and the phrase is not included in the harmful word dictionary, augment the phrase by applying at least one of the first to fifth methods to the phrase; and
when the sum of a preset number of the highest probability values among the probability values of the at least one unethical class into which the phrase is classified is greater than or equal to a preset threshold, augment the phrase by applying a predefined phrase transformation method to the phrase.