KR20220109938A

KR20220109938A - Apparatus and method for validating propagated wrong text

Info

Publication number: KR20220109938A
Application number: KR1020210013550A
Authority: KR
Inventors: 이세중
Original assignee: 이세중
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2022-08-05

Abstract

The present invention provides an apparatus and method for checking the conformity of learned text. The apparatus includes: a text acquisition unit that acquires a plurality of learning texts generated by a learning scheme from training texts whose ethical or inappropriate status has been verified and labeled in advance; a dictionary-based determination unit that receives a learning text and determines whether it is inappropriate by searching it for words that are similar, at or above a predetermined level, to profanity listed in a profanity dictionary obtained in advance; a learning model determination unit that receives the learning text, vectorizes it word by word, extracts a sentence feature vector from the vectorized words according to a pre-learned pattern estimation scheme, and determines whether the learning text is inappropriate; an original-text-based determination unit that searches for the training text most similar to the learning text and determines whether the learning text is inappropriate according to the label of the training text found; and a determination result comparison unit that obtains a final determination result for the learning text by combining the results determined by the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit. The apparatus can thereby determine whether inappropriate text generated for training is valid.

Description

Apparatus and method for checking the conformity of learned inappropriate text {APPARATUS AND METHOD FOR VALIDATING PROPAGATED WRONG TEXT}

The present invention relates to a text checking apparatus and method, and more particularly to an apparatus and method for checking the conformity of learned inappropriate text.

The current online environment provides many users with a variety of means of communication, but because of online anonymity, profanity and inappropriate words are used frequently. Online service providers therefore try to filter out and remove such words, but filtering is difficult because profanity and inappropriate words are also used in many modified forms.

Accordingly, attempts have recently been made to detect profanity and inappropriate words with an inappropriate-text detection device built from an artificial neural network. An artificial neural network must be trained before use, however, and training requires a large amount of training text. Here, the training data is text that has been verified and labeled in advance as to whether it contains profanity or inappropriate words.

Conventionally, text was verified directly by a person before being used as training text, so it was very difficult to obtain enough training text to train an artificial neural network. To overcome this difficulty, a method has been proposed that obtains a large amount of training text by using an inappropriate-text learning device, implemented as an artificial neural network trained in advance to learn and propagate a small amount of training text whose ethical or inappropriate status has been verified in advance. Because an artificial neural network is used to propagate the training text, a large amount of training text can be obtained easily from a small amount, and training text containing profanity or inappropriate words in various modified forms can also be obtained.

However, it is also necessary to determine whether training data generated in this way is labeled correctly. If the generated training data is labeled incorrectly, the inappropriate-text detection device is trained inaccurately and consequently fails to properly filter text containing profanity or inappropriate words.

Korean Patent Publication No. 10-2019-0108958 (published September 25, 2019)

An object of the present invention is to provide an apparatus and method for checking text conformity that can determine whether inappropriate text generated for training is valid.

Another object of the present invention is to provide an apparatus and method for checking the conformity of learned text that can verify the conformity of an inappropriate-text learning device that generates inappropriate text for training.

To achieve the above object, an apparatus for checking the conformity of learned text according to an embodiment of the present invention includes: a text acquisition unit that acquires a plurality of learning texts generated by a learning scheme using training texts whose ethical or inappropriate status has been verified and labeled in advance; a dictionary-based determination unit that receives a learning text and searches it for words similar, at or above a predetermined level, to profanity listed in a profanity dictionary acquired in advance, to determine whether the learning text is inappropriate; a learning model determination unit that receives the learning text, vectorizes it word by word, and extracts a sentence feature vector from the vectorized words according to a pre-learned pattern estimation scheme to determine whether the learning text is inappropriate; an original-text-based determination unit that searches for the training text most similar to the learning text and determines whether the learning text is inappropriate according to the label of the training text found; and a determination result comparison unit that obtains a final determination result for the learning text by combining the determination results of the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit.

The dictionary-based determination unit may perform an N-gram similarity analysis between the profanity listed in the profanity dictionary and each word of the learning text to determine whether the learning text contains profanity, and may determine the learning text to be inappropriate if profanity is found.

The learning model determination unit may include: a vector conversion unit that obtains a plurality of word vectors by embedding and vectorizing each word of the learning text; a sentence feature selection unit that obtains the sentence feature vector by cumulatively extracting features of the plurality of word vectors according to a pre-learned pattern estimation scheme; and a feature classification unit that classifies the sentence feature vector according to a pre-learned pattern classification scheme to determine whether the learning text is inappropriate.

The sentence feature selection unit may be implemented as a Long Short-Term Memory (LSTM) network.

The determination result comparison unit may obtain the final determination result by applying majority rule to the determination results for the learning text produced by the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit.

The determination result comparison unit may instead assign different predetermined weights to the determination results of the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit, and may take as the final determination result whichever of ethical or inappropriate receives the higher total weight.

The apparatus may further include a label comparison unit that compares the final determination result with the ethical or inappropriate label assigned to the learning text when it was generated, determines that the label of the learning text is valid if the two are the same, and determines that it is invalid if they differ.

The label comparison unit may calculate the reliability of the learning texts from the validity determination results for the labels of a plurality of learning texts.

The apparatus may further include a finishing unit that removes additional components from the learning text acquired by the text acquisition unit, divides the text into sentences, and passes the sentences to each of the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit.

To achieve the above object, a method for verifying the conformity of learned text according to another embodiment of the present invention includes: a learning text acquisition step of acquiring a plurality of learning texts generated by a learning scheme using training texts whose ethical or inappropriate status has been verified and labeled in advance; a dictionary-based determination step of searching a learning text for words similar, at or above a predetermined level, to profanity listed in a profanity dictionary acquired in advance, to determine whether the learning text is inappropriate; a learning-model-based determination step of receiving the learning text, vectorizing it word by word, extracting a sentence feature vector from the vectorized words using a learning model in which a pattern estimation scheme has been learned in advance, and determining whether the learning text is inappropriate based on the extracted sentence feature; an original-text-based determination step of searching for the training text most similar to the learning text and determining whether the learning text is inappropriate according to the label of the training text found; and a final determination step of obtaining a final determination result for the learning text by combining the determination results of the dictionary-based determination step, the learning-model-based determination step, and the original-text-based determination step.

Accordingly, the apparatus and method for checking the conformity of learned text according to embodiments of the present invention verify the labels of a large amount of training text generated and labeled by a learning scheme, and can thereby accurately verify the conformity of training text obtained in that way. This can greatly increase the reliability of the training text used to train a detection device that is implemented as an artificial neural network and detects profanity or inappropriate words.

FIG. 1 shows a schematic structure of an apparatus for checking the conformity of learned text according to an embodiment of the present invention.
FIG. 2 shows a detailed configuration of the learning model determination unit of FIG. 1.
FIG. 3 shows a method for verifying the conformity of learned text according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described. Parts irrelevant to the description are omitted for clarity, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise. In addition, terms such as "... unit", "...er", "module", and "block" used in the specification denote a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 shows a schematic structure of an apparatus for checking the conformity of learned text according to an embodiment of the present invention, and FIG. 2 shows a detailed configuration of the learning model determination unit of FIG. 1.

Referring to FIG. 1, the apparatus for checking the conformity of learned text according to this embodiment may include a text acquisition unit 100, a finishing unit 200, a dictionary-based determination unit 300, a learning model determination unit 400, an original-text-based determination unit 500, a determination result comparison unit 600, and a label comparison unit 700.

The text acquisition unit 100 acquires a plurality of learning texts generated by a learning scheme from training texts whose ethical or inappropriate status has been verified and labeled in advance. Here, a learning text is one of many texts generated by a pre-trained learning device that receives training texts in order to produce a large amount of training data from a small number of training texts. Like the input training text, each learning text is labeled ethical or inappropriate. When a text contains multiple sentences, each sentence may be labeled ethical or inappropriate. Depending on how it was trained, the learning device may generate a learning text labeled ethical from a training text labeled inappropriate, or a learning text labeled inappropriate from a training text labeled ethical.

That is, the text acquisition unit 100 acquires the plurality of learning texts generated by the learning device, and may be implemented as a storage device or a database that stores those learning texts.

The text acquisition unit 100 may also store, together with each learning text, the original training text that was used to generate it.

The finishing unit 200 receives the learning text acquired by the text acquisition unit 100 and performs a predetermined preprocessing operation. The finishing unit 200 may be configured to receive the label together with the learning text, or to receive only the text without the label.

The finishing unit 200 removes from the learning text all additional components, that is, everything other than the sentence components that make up a sentence, such as letters, spaces, and punctuation marks. The additional components include special characters, URLs, and SNS-specific characters (# hashtags, @ mentions). Such components are very unlikely to be used as profanity or inappropriate words, so they are excluded from detection.

The learning device may, however, generate learning texts that contain no additional components, in which case the removal step may be omitted.

The finishing unit 200 then separates the sentences contained in the text from which the additional components have been removed, and passes them to each of the dictionary-based determination unit 300, the learning model determination unit 400, and the original-text-based determination unit 500.
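The preprocessing performed by the finishing unit 200 could be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expressions, the set of punctuation marks kept, and the function name `preprocess` are hypothetical choices, since the specification names only the categories of components to remove.

```python
import re

# Hypothetical patterns: the specification lists the categories
# (URLs, hashtags, mentions, special characters) but not concrete rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[#@]\w+")
# Keep word characters (Unicode-aware, so Hangul survives), whitespace,
# and a small assumed set of punctuation marks; drop everything else.
EXTRA_RE = re.compile(r"[^\w\s.,!?'\"]")
# Assumed sentence boundary: whitespace that follows '.', '!' or '?'.
SENT_RE = re.compile(r"(?<=[.!?])\s+")


def preprocess(text: str) -> list[str]:
    """Remove additional components, then split the text into sentences."""
    text = URL_RE.sub(" ", text)    # URLs first, before their symbols are stripped
    text = TAG_RE.sub(" ", text)    # hashtags and mentions
    text = EXTRA_RE.sub(" ", text)  # remaining special characters
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in SENT_RE.split(text) if s]


sentences = preprocess("Check this out! http://example.com/x #fun @bob You are $$ great.")
# sentences == ["Check this out!", "You are great."]
```

Each returned sentence would then be handed to the three determination units in turn.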

The dictionary-based determination unit 300, the learning model determination unit 400, and the original-text-based determination unit 500 each determine whether the learning text is ethical or inappropriate, using a different designated scheme.

First, considering that a sentence containing profanity is likely to be a hateful or inappropriate sentence, the dictionary-based determination unit 300 analyzes whether each sentence received from the finishing unit 200 contains profanity and determines on that basis whether it is ethical or inappropriate.

For example, the dictionary-based determination unit 300 may use a profanity dictionary to analyze whether a sentence contains profanity. The profanity dictionary may be an already published dictionary or one compiled in advance. In some cases, a profanity dictionary may be generated by classifying profanity from the original training texts or the learning texts according to a pre-learned scheme. Since profanity dictionaries have already been compiled and published, and methods of generating them are also known, they are not described in detail here.

To cope with profanity that is modified in various ways, the dictionary-based determination unit 300 does not search only for exact matches. Instead, it may perform an N-gram similarity analysis between the profanity listed in the profanity dictionary and each word of the received sentence to determine whether the sentence contains profanity. For example, the dictionary-based determination unit 300 may compare each word listed in the profanity dictionary with the words in the received sentence, decide whether they correspond based on the number of matching characters, and determine from that result whether the sentence is inappropriate.
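A minimal sketch of such a dictionary check is shown below. The specification does not fix the similarity measure, the value of N, or the threshold, so the character-bigram Dice coefficient, `n = 2`, and `threshold = 0.5` used here are assumptions, as are the function names and the placeholder dictionary entry.

```python
def char_ngrams(word: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of a word."""
    if len(word) < n:
        return {word} if word else set()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams (an assumed measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def dictionary_judgment(sentence: str, dictionary: set[str],
                        threshold: float = 0.5) -> bool:
    """True (inappropriate) if any word is close enough to a dictionary entry."""
    words = sentence.lower().split()
    return any(ngram_similarity(word, entry) >= threshold
               for word in words for entry in dictionary)


profanity = {"darn"}  # placeholder entry standing in for a real dictionary
print(dictionary_judgment("you daarn fool", profanity))   # True
print(dictionary_judgment("have a nice day", profanity))  # False
```

The first sentence is flagged even though "daarn" is not an exact match, which is the behavior the N-gram analysis is meant to provide. A lower threshold catches more variants at the cost of more false positives.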

The learning model determination unit 400 embeds and vectorizes each word of the received sentence, extracts sentence features from the vectorized words, and determines whether the sentence is inappropriate based on the extracted sentence feature vector.

Referring to FIG. 2, the learning model determination unit 400 may include a vector conversion unit 410, a sentence feature selection unit 420, and a feature classification unit 430.

The vector conversion unit 410 obtains a plurality of word vectors by embedding and vectorizing each word in the received sentence. The vector conversion unit 410 may convert each word into a word vector using a pre-trained embedding model, for example a publicly available embedding model such as Word2Vec or fastText.

The sentence feature selection unit 420 receives the word vectors from the vector conversion unit 410 and obtains a sentence feature vector by cumulatively extracting the features of the received word vectors.

The sentence feature selection unit 420 may be implemented as an artificial neural network in which a pattern estimation scheme has been learned in advance, and in particular as a Long Short-Term Memory (LSTM) network. An LSTM is a neural network whose structure improves on the recurrent neural network (RNN) so that long-term features can be reflected. Because the features extracted from earlier word vectors are cumulatively reflected when later word vectors are processed, an LSTM makes it easy to obtain sentence-level features.

The feature classification unit 430 receives the sentence feature vector obtained by the sentence feature selection unit 420 and classifies it according to a pre-learned pattern classification scheme to determine whether the sentence is ethical or inappropriate. The feature classification unit 430 may be implemented as a fully connected layer of an artificial neural network and may make this determination by binary classification of the sentence feature vector.
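The data flow through units 410, 420, and 430 can be illustrated with a small forward pass written with the standard library only. This is a sketch under heavy assumptions: the class `TinyLSTMClassifier`, its dimensions, the bias-free gates, and the randomly initialized weights and embeddings are all hypothetical stand-ins for a trained embedding model (such as Word2Vec or fastText) and a trained LSTM and classifier, so the probability it outputs carries no meaning.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMClassifier:
    """Untrained illustration of embedding -> LSTM -> fully connected layer."""

    def __init__(self, embed_dim=4, hidden=3, seed=0):
        self.rng = random.Random(seed)
        self.embed_dim, self.hidden = embed_dim, hidden
        self.embeddings = {}  # word -> vector; stand-in for unit 410's model
        # One bias-free weight matrix per gate, applied to [x; h]:
        # i = input, f = forget, g = cell candidate, o = output.
        self.gates = {g: rand_matrix(hidden, embed_dim + hidden, self.rng)
                      for g in "ifgo"}
        self.fc = [self.rng.uniform(-0.1, 0.1) for _ in range(hidden)]

    def embed(self, word):
        """Unit 410: map a word to a (random, assumed) word vector."""
        if word not in self.embeddings:
            self.embeddings[word] = [self.rng.uniform(-1, 1)
                                     for _ in range(self.embed_dim)]
        return self.embeddings[word]

    def sentence_feature(self, sentence):
        """Unit 420: accumulate word features into one sentence vector."""
        h = [0.0] * self.hidden
        c = [0.0] * self.hidden
        for word in sentence.split():
            xh = self.embed(word) + h  # concatenate input and previous state
            i = [sigmoid(v) for v in matvec(self.gates["i"], xh)]
            f = [sigmoid(v) for v in matvec(self.gates["f"], xh)]
            g = [math.tanh(v) for v in matvec(self.gates["g"], xh)]
            o = [sigmoid(v) for v in matvec(self.gates["o"], xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h  # final hidden state serves as the sentence feature vector

    def inappropriate_probability(self, sentence):
        """Unit 430: fully connected layer with sigmoid for binary output."""
        h = self.sentence_feature(sentence)
        return sigmoid(sum(w * x for w, x in zip(self.fc, h)))


model = TinyLSTMClassifier()
p = model.inappropriate_probability("this is a test")
is_inappropriate = p >= 0.5  # assumed decision threshold
```

The cell state `c` is what lets features from earlier words persist into later steps. In practice the same structure would be built and trained with a deep learning framework rather than written by hand.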

Meanwhile, the original-text-based determination unit 500 receives a sentence from the finishing unit 200, compares it with the original training texts that the learning device used to generate the learning texts, and searches for the most similar original training text. It then determines the sentence to be ethical or inappropriate according to the label of the original training text found. The original-text-based determination unit 500 may also use N-gram similarity analysis to identify the most similar original training text.
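This nearest-original lookup might be sketched as follows. The sentence-level character-bigram Dice similarity, the function names, and the label strings "ethical" and "inappropriate" are assumptions made for illustration; the specification says only that N-gram similarity analysis may be used.

```python
def text_ngrams(text: str, n: int = 2) -> set[str]:
    """Character n-grams of a whole sentence, lowercased."""
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams (an assumed measure)."""
    ga, gb = text_ngrams(a, n), text_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def original_text_judgment(sentence: str,
                           originals: list[tuple[str, str]]) -> str:
    """Return the label of the original training text closest to the sentence."""
    _, label = max(originals, key=lambda item: similarity(sentence, item[0]))
    return label


# Hypothetical labeled originals used to generate the learning texts.
originals = [
    ("you are a wonderful person", "ethical"),
    ("you are a stupid fool", "inappropriate"),
]
print(original_text_judgment("you are a stuupid fool", originals))  # inappropriate
```

Because the generated sentence inherits the label of its nearest original, this unit acts as a check on whether the learning device drifted away from the source meaning.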

The determination result comparison unit 600 makes the final determination of whether the learning text is ethical or inappropriate, based on the results that the dictionary-based determination unit 300, the learning model determination unit 400, and the original-text-based determination unit 500 each produced for that text.

The determination result comparison unit 600 may simply apply majority rule to the results of the dictionary-based determination unit 300, the learning model determination unit 400, and the original-text-based determination unit 500.

In some cases, however, the determination result comparison unit 600 may apply different weights to the results of the three units before making the determination. That is, different weights may be applied to the results of the dictionary-based determination unit 300, the learning model determination unit 400, and the original-text-based determination unit 500 according to a preset importance for each result.

For example, because the dictionary-based determination unit 300 makes its determination from an already verified profanity dictionary, it may be given a higher weight than the learning model determination unit 400 or the original-text-based determination unit 500. After the weights are applied to the results of the three units, whichever of ethical or inappropriate has the higher total weight is selected.
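Both combination rules are simple enough to sketch directly. The function names, the dictionary keys, the weight values, and the tie-breaking choice (a tie counts as ethical) are assumptions; the specification states only that majority rule or predetermined, differing weights may be used.

```python
def majority_vote(votes: dict[str, bool]) -> bool:
    """True (inappropriate) if more than half of the units voted inappropriate."""
    return sum(votes.values()) > len(votes) / 2


def weighted_vote(votes: dict[str, bool], weights: dict[str, float]) -> bool:
    """True (inappropriate) if that side carries the larger total weight.

    A tie is resolved as ethical here, which is an assumption.
    """
    inappropriate = sum(weights[unit] for unit, vote in votes.items() if vote)
    ethical = sum(weights[unit] for unit, vote in votes.items() if not vote)
    return inappropriate > ethical


# True means the unit judged the sentence inappropriate.
votes = {"dictionary": True, "model": False, "original": False}
# Hypothetical weights favoring the dictionary-based unit.
weights = {"dictionary": 0.5, "model": 0.25, "original": 0.2}

print(majority_vote(votes))           # False: two of three units say ethical
print(weighted_vote(votes, weights))  # True: 0.5 outweighs 0.25 + 0.2
```

The example shows how the two rules can disagree: a single highly weighted unit can overturn the other two, so the weights should reflect how much each unit is actually trusted.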

Meanwhile, the label comparison unit 700 compares the final result from the determination result comparison unit 600 with the label of the corresponding learning text stored in the text acquisition unit 100. If the two are the same, it determines that the label of the learning text is valid; if not, it determines that the label is invalid.

The label comparison unit 700 may accumulate these determinations over the labels of many learning texts to calculate the reliability of the learning texts generated by the learning device. For example, the reliability may be calculated as the proportion of learning texts whose labels were determined to be valid among all learning texts.
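The validity check and the reliability ratio can be sketched as below. The function names, the label strings, and the choice to return 0.0 for an empty input are assumptions made for illustration.

```python
def label_is_valid(generated_label: str, final_label: str) -> bool:
    """A label is valid when it matches the final determination result."""
    return generated_label == final_label


def reliability(pairs: list[tuple[str, str]]) -> float:
    """Share of learning texts whose generated label was judged valid.

    Each pair is (label assigned at generation, final determination result).
    Returns 0.0 for an empty list, which is an assumed convention.
    """
    if not pairs:
        return 0.0
    valid = sum(1 for generated, final in pairs if label_is_valid(generated, final))
    return valid / len(pairs)


pairs = [
    ("inappropriate", "inappropriate"),
    ("ethical", "ethical"),
    ("inappropriate", "ethical"),  # label disagrees with the final result
    ("ethical", "ethical"),
]
print(reliability(pairs))  # 0.75
```

A low ratio would suggest that the learning device itself, not just individual texts, needs to be retrained or adjusted.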

도 3은 본 발명의 일 실시예에 따른 학습된 텍스트 부합성 검증 방법을 나타낸다.3 illustrates a method for verifying learned text correspondence according to an embodiment of the present invention.

도 1 및 도 2를 참조하여, 도 3의 학습된 텍스트 부합성 검증 방법을 설명하면, 우선 윤리 또는 부적합가 미리 검증되어 레이블된 학습용 텍스트를 기반으로 학습 방식으로 생성된 다수의 학습 텍스트를 획득한다(S11). 여기서 레이블은 학습 텍스트 내의 문장 단위로 레이블링 될 수 있다.1 and 2, when the learned text conformity verification method of FIG. 3 is described, first, ethics or nonconformity are verified in advance and a plurality of training texts generated by a learning method are obtained based on the labeled training text ( S11). Here, the label may be labeled in units of sentences within the training text.

Then, predetermined preprocessing, such as removing additional components and dividing the text into sentences, is performed on the acquired training texts (S12).
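
As a non-limiting illustration, the preprocessing of step S12 may be sketched as follows. The description does not say what the "additional components" are; this sketch assumes they are HTML tags and URLs, and it assumes that sentences end with a period, exclamation mark, or question mark.

```python
import re
from typing import List


def preprocess(text: str) -> List[str]:
    """Strip assumed additional components and split the text into sentences."""
    text = re.sub(r"<[^>]+>", " ", text)        # assumed component: HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # assumed component: URLs
    text = re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace
    # Split after sentence-final punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if s]


sentences = preprocess("<p>Hello there.</p> See https://example.com now! Bye?")
```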

Once the training texts are preprocessed, whether each preprocessed training text is ethical or inappropriate is determined by several different designated methods.

First, the determination is made using a profanity dictionary. In this case, each sentence of the training text is first searched for words whose similarity to a profanity listed in the profanity dictionary is at or above a predetermined level (S13). Whether the sentence is ethical or inappropriate is then determined from the search result (S14). That is, the sentence may be determined to be inappropriate if a word judged similar to a profanity is found, and ethical if no such word is found.
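
As a non-limiting illustration, steps S13 and S14 may be sketched as follows. The claims mention an N-gram similarity analysis but do not fix its details; this sketch assumes character bigrams compared by Jaccard similarity, a threshold of 0.5, whitespace tokenization, and a mild placeholder dictionary.

```python
from typing import Iterable, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of a word (the whole word if it is shorter than n)."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_inappropriate(sentence: str, dictionary: Iterable[str], threshold: float = 0.5) -> bool:
    """Return True if any word is similar to a dictionary entry at or above the threshold."""
    entries = [entry.lower() for entry in dictionary]
    for word in sentence.lower().split():
        if any(ngram_similarity(word, entry) >= threshold for entry in entries):
            return True
    return False


# Placeholder entries standing in for a verified profanity dictionary.
profanity_dictionary = {"darn", "heck"}
flagged = is_inappropriate("that is so daarn annoying", profanity_dictionary)
clean = is_inappropriate("have a nice day", profanity_dictionary)
```

Matching on similarity rather than exact equality lets a deliberately misspelled variant such as "daarn" still be caught, which is the apparent purpose of searching for similar words rather than identical ones.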

Meanwhile, whether the training text is ethical or inappropriate is also determined on the basis of a pre-trained learning model.

To this end, the words included in each sentence of the training text are first vectorized using a pre-trained embedding model to obtain a plurality of word vectors (S15). The obtained word vectors are then input to an artificial neural network that has been trained in advance in a pattern estimation method, yielding a sentence feature vector that represents the features of the sentence (S16). Here, the artificial neural network may be implemented as an LSTM.

Once the sentence feature vector is obtained, it is classified according to a pre-trained pattern classification method to determine whether the sentence is ethical or inappropriate (S17).
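
As a non-limiting illustration, steps S15 to S17 may be written compactly as below for a sentence of words $w_1, \dots, w_T$. The description states only that the network may be an LSTM and that the sentence feature vector is then classified. Taking the final hidden state as the sentence feature vector $\mathbf{s}$, using a logistic classifier with parameters $\mathbf{w}$ and $b$, and thresholding at 0.5 are all assumptions made for this sketch.

```latex
\begin{aligned}
\mathbf{e}_t &= \mathrm{Embed}(w_t), \qquad t = 1, \dots, T
  && \text{(S15: word vectors)} \\
(\mathbf{h}_t, \mathbf{c}_t) &= \mathrm{LSTM}(\mathbf{e}_t, \mathbf{h}_{t-1}, \mathbf{c}_{t-1}),
  \qquad \mathbf{h}_0 = \mathbf{c}_0 = \mathbf{0}
  && \text{(S16: accumulate features)} \\
\mathbf{s} &= \mathbf{h}_T
  && \text{(S16: sentence feature vector)} \\
p &= \sigma\!\left(\mathbf{w}^{\top}\mathbf{s} + b\right), \qquad
\hat{y} =
  \begin{cases}
    \text{inappropriate} & \text{if } p \ge 0.5 \\
    \text{ethical} & \text{otherwise}
  \end{cases}
  && \text{(S17: classification)}
\end{aligned}
```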

In addition, whether the training text is ethical or inappropriate is determined using the original training texts that were used to generate it.

That is, the original training text most similar to the input training text is searched for among the original training texts (S18). Once the most similar original training text is found, the sentence is determined to be ethical or inappropriate according to the label of the retrieved original training text (S19).
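
As a non-limiting illustration, steps S18 and S19 may be sketched as follows. The description does not name a similarity measure; this sketch assumes the character-level ratio of `difflib.SequenceMatcher` from the Python standard library, and the example texts and labels are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def judge_by_original(sentence: str, originals: List[Tuple[str, str]]) -> str:
    """Return the label of the original training text most similar to the sentence."""
    if not originals:
        raise ValueError("at least one labeled original training text is required")
    # S18: search for the most similar original training text.
    _, best_label = max(
        originals,
        key=lambda item: SequenceMatcher(None, sentence, item[0]).ratio(),
    )
    # S19: adopt the label of the retrieved original training text.
    return best_label


# Hypothetical labeled original training texts.
originals = [
    ("you are a wonderful person", "ethical"),
    ("you are a darn fool", "inappropriate"),
]
label = judge_by_original("you are such a darn fool", originals)
```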

Thereafter, the determination result obtained using the profanity dictionary, the result based on the learning model, and the result obtained using the original training texts are combined in a predetermined manner to obtain the final determination of whether the training text is ethical or inappropriate (S20). Here, the final determination may be made by majority rule, or a predetermined weight may be assigned to each determination result and whichever of "ethical" or "inappropriate" carries the higher weight may be taken as the final determination result.
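
As a non-limiting illustration, the majority-rule option of step S20 may be sketched as follows. The label strings and the function name are assumptions made for this sketch.

```python
from collections import Counter
from typing import Sequence


def majority_vote(results: Sequence[str]) -> str:
    """Return the label chosen by most of the determination results."""
    if not results:
        raise ValueError("at least one determination result is required")
    # With three results and two possible labels, a tie cannot occur.
    return Counter(results).most_common(1)[0][0]


# Results of the dictionary-based, model-based, and original-text-based determinations.
final_result = majority_vote(["ethical", "inappropriate", "inappropriate"])
```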

Once the final determination result is obtained, it is compared with the label of the training text to determine the conformity of the training text (S21). The conformity determinations for a plurality of training texts are then accumulated to calculate the reliability of the training texts generated by the learning method (S22).

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include read-only memory (ROM), random access memory (RAM), compact disc (CD)-ROM, digital video disc (DVD)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Accordingly, the true scope of technical protection of the present invention should be defined by the technical spirit of the appended claims.

Claims

An apparatus for checking the conformity of learned text, the apparatus comprising:
a text acquisition unit configured to acquire a plurality of training texts generated by a learning method using training texts that have been verified in advance and labeled as ethical or inappropriate;
a dictionary-based determination unit that receives the training text and determines whether the training text is inappropriate by searching the input training text for words similar, at or above a predetermined level, to a profanity registered in a profanity dictionary obtained in advance;
a learning model determination unit that receives the training text, vectorizes it word by word, extracts a sentence feature vector from the vectorized words according to a pre-learned pattern estimation method, and determines whether the training text is inappropriate;
an original-text-based determination unit that searches for the original training text most similar to the training text and determines whether the training text is inappropriate according to the label of the retrieved original training text; and
a determination result comparison unit that obtains a final determination result for the training text by combining the determinations of inappropriateness made by each of the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit,
wherein the dictionary-based determination unit performs an N-gram similarity analysis between each word of the training text and the profanities registered in the profanity dictionary to determine whether the training text includes a profanity, and determines the training text to be inappropriate if a profanity is determined to be included, and
wherein the learning model determination unit comprises:
a vector converter configured to obtain a plurality of word vectors by embedding and vectorizing each word of the training text;
a sentence feature selector configured to acquire the sentence feature vector by accumulating and extracting features of the plurality of word vectors according to a pre-learned pattern estimation method; and
a feature classifier configured to classify the sentence feature vector according to a pre-learned pattern classification method to determine whether the training text is inappropriate.

The apparatus of claim 1, wherein the sentence feature selector is implemented as a long short-term memory (LSTM) network.

The apparatus of claim 1, wherein the determination result comparison unit
obtains the final determination result by applying majority rule to the determinations of inappropriateness made by each of the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit.

The apparatus of claim 1, wherein the determination result comparison unit
assigns different predetermined weights to the determinations made by each of the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit, and obtains, as the final determination result, whichever of ethical or inappropriate is assigned the higher weight.

The apparatus of claim 1, further comprising
a label comparison unit that compares the final determination result with the label, ethical or inappropriate, assigned when the training text was generated, determines that the label of the training text is valid if the two are the same, and determines that the label of the training text is not valid if they are not the same.

The apparatus of claim 5, wherein the label comparison unit
calculates the reliability of the training texts from the validity determinations for the labels of a plurality of training texts.

The apparatus of claim 1,
further comprising a preprocessing unit that removes additional components from the training text acquired by the text acquisition unit, divides the training text into sentences, and delivers the sentences to each of the dictionary-based determination unit, the learning model determination unit, and the original-text-based determination unit.