KR102049660B1

KR102049660B1 - Method and Apparatus for Explicit Lyrics Classification Using Automated Explicit Lexicon Generation and Machine Learning

Info

Publication number: KR102049660B1
Application number: KR1020180030680A
Authority: KR
Inventors: 이문용; 김자용
Original assignee: 한국과학기술원
Priority date: 2018-03-16
Filing date: 2018-03-16
Publication date: 2019-11-28
Also published as: KR20190108958A

Abstract

A method and apparatus are presented for automatically classifying lyrics that are harmful to youth, using automatic generation of a harmful-word vocabulary list and machine learning. The proposed method comprises: for all words appearing in lyrics labeled as harmful lyrics or acceptable lyrics, counting for each word the number of times it appears in harmful lyrics and the number of times it appears in acceptable lyrics, and using these counts to generate an automatic harmful-word vocabulary list; generating a new harmful-word vocabulary list that allows for mistakes and errors with respect to words that also appear in acceptable lyrics; generating a harmful-word identification vector based on the automatic harmful-word vocabulary list and the new harmful-word vocabulary list; generating a context identification vector using a sequential data processing model, so that the context before and after each word is taken into account; and making a final prediction of whether the lyrics are harmful through a hybrid harmful-lyrics classification model that uses the harmful-word identification vector and the context identification vector.

Description

Method and Apparatus for Explicit Lyrics Classification Using Automated Explicit Lexicon Generation and Machine Learning

The present invention relates to a method and apparatus for automatically classifying lyrics harmful to youth, using automatic generation of a harmful-word vocabulary list and machine learning.

According to an existing study on the automatic classification of recordings harmful to youth [Chin, Kim 2018], lyrics harmful to youth (hereinafter, harmful lyrics) fall into two broad types: lyrics in which words harmful to youth (hereinafter, harmful words) appear, and lyrics in which contextually harmful expressions appear. However, [Chin, Kim, 2018] does not design its machine learning model around these two types; it uses a machine learning model of generic design.

To classify the first type, lyrics in which harmful words appear, one must check whether or not harmful words appear in the document (the lyrics). Most existing studies and patents do this with previously defined profanity or slang dictionaries. This approach is difficult to use for languages that lack well-defined profanity or slang dictionaries. Moreover, such dictionaries were built for general purposes and may not suit the task at hand. Finally, because only the presence of a word is checked, the context of the word cannot be captured.

To classify the second type, lyrics in which contextually harmful expressions appear, the meaning of a word must be determined from context by examining the words before and after it. Models such as CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) have been proposed for this purpose. However, because these models compress the overall content of the lyrics into a low-dimensional vector, the information on whether harmful words appear can be lost, so they may be weak at identifying harmful words. What is needed, therefore, is first to generate a harmful-word vocabulary list automatically for identifying harmful words, and second to design a hybrid model that combines a 'harmful-word identification model' with a 'context identification model'.

The technical problem addressed by the present invention is to provide a method and apparatus that overcome two drawbacks of the existing review system: it incurs labor costs because many experts must be hired to review harmful lyrics, and it is subject to time constraints because the review is carried out by people. In addition, because the Ministry of Gender Equality and Family reviews harmful recordings only after release, the longer the review takes, the more likely youth are to be exposed to harmful recordings. The invention therefore aims to provide an automated harmful-lyrics classification method and apparatus that can carry out the review in real time.

In one aspect, the method proposed in the present invention for automatically classifying lyrics harmful to youth, using automatic generation of a harmful-word vocabulary list and machine learning, comprises: for all words appearing in lyrics labeled as harmful lyrics or acceptable lyrics, counting for each word the number of times it appears in harmful lyrics and the number of times it appears in acceptable lyrics, and using these counts to generate an automatic harmful-word vocabulary list; generating a new harmful-word vocabulary list that allows for mistakes and errors with respect to words that also appear in acceptable lyrics; generating a harmful-word identification vector based on the automatic harmful-word vocabulary list and the new harmful-word vocabulary list; generating a context identification vector using a sequential data processing model, so that the context before and after each word is taken into account; and making a final prediction of whether the lyrics are harmful through a hybrid harmful-lyrics classification model that uses the harmful-word identification vector and the context identification vector.

The step of generating the automatic harmful-word vocabulary list assigns each word a score through a feature selection technique, according to the number of times the word appears in harmful lyrics and the number of times it appears in acceptable lyrics, and selects harmful words on the basis of these scores.

The automatic harmful-word vocabulary list is then generated from the selected harmful words.

The step of generating the new harmful-word vocabulary list, which allows for mistakes and errors with respect to words that also appear in acceptable lyrics, generates the list using the following equation:

[Equation: logCDα (equation image not reproduced)]

Here, α is a factor that accounts for mistakes and errors regarding words appearing in acceptable lyrics.

The step of generating the harmful-word identification vector determines, for all words appearing in the lyrics, whether at least one word from the automatic harmful-word vocabulary list and the new harmful-word vocabulary list appears, and generates filtering prediction results based on the two lists.

The step of generating the context identification vector, which uses a sequential data processing model to take into account the context before and after each word, generates the vector by vectorizing the lyrics with sequential data processing models including CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks).

The step of making the final prediction concatenates the harmful-word identification vector and the context identification vector; trains classification models including Bagging, AdaBoost, and K-Nearest Neighbors on the concatenated vectors together with the harmful and acceptable labels; and, when lyrics whose status as harmful or acceptable is unknown are input to the classification model, makes the final prediction of whether they are harmful through the hybrid harmful-lyrics classification model.

In another aspect, the apparatus proposed in the present invention for automatically classifying lyrics harmful to youth, using automatic generation of a harmful-word vocabulary list and machine learning, comprises: an automatic harmful-word vocabulary list generation unit that, for all words appearing in lyrics labeled as harmful lyrics or acceptable lyrics, counts for each word the number of times it appears in harmful lyrics and the number of times it appears in acceptable lyrics, and uses these counts to generate an automatic harmful-word vocabulary list; a new harmful-word vocabulary list generation unit that generates a new harmful-word vocabulary list allowing for mistakes and errors with respect to words that also appear in acceptable lyrics; a harmful-word identification vector generation unit that generates a harmful-word identification vector based on the automatic harmful-word vocabulary list and the new harmful-word vocabulary list; a context identification vector generation unit that generates a context identification vector using a sequential data processing model, so that the context before and after each word is taken into account; and a final harmful-lyrics prediction unit that makes a final prediction of whether the lyrics are harmful through a hybrid harmful-lyrics classification model that uses the harmful-word identification vector and the context identification vector.

The automatic harmful-word vocabulary list generation unit assigns each word a score through a feature selection technique, according to the number of times the word appears in harmful lyrics and the number of times it appears in acceptable lyrics, selects harmful words on the basis of these scores, and generates the automatic harmful-word vocabulary list from the selected harmful words.

The harmful-word identification vector generation unit determines, for all words appearing in the lyrics, whether at least one word from the automatic harmful-word vocabulary list and the new harmful-word vocabulary list appears, and generates filtering prediction results based on the two lists.

The context identification vector generation unit generates the context identification vector by vectorizing the lyrics with sequential data processing models including CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks).

The final harmful-lyrics prediction unit concatenates the harmful-word identification vector and the context identification vector; trains classification models including Bagging, AdaBoost, and K-Nearest Neighbors on the concatenated vectors together with the harmful and acceptable labels; and, when lyrics whose status as harmful or acceptable is unknown are input to the classification model, makes the final prediction of whether they are harmful through the hybrid harmful-lyrics classification model.

The existing review system has the drawback of incurring labor costs, because many experts must be hired to review harmful lyrics. According to embodiments of the present invention, labor costs can be saved by automating the review through machine learning. The current review system is also subject to time constraints because it is carried out by people. Because the Ministry of Gender Equality and Family reviews harmful recordings only after release, the longer the review takes, the more likely youth are to be exposed to harmful recordings. An automated machine according to the present invention has few time constraints and can carry out the review in near real time, which minimizes the exposure of youth to harmful recordings.

FIG. 1 is a schematic diagram of a hybrid harmful-lyrics classification model according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for automatically classifying lyrics harmful to youth, using automatic generation of a harmful-word vocabulary list and machine learning, according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of automatic harmful-word vocabulary list generation and a vocabulary-list-based filtering model according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the process of generating a harmful-word vocabulary list according to an embodiment of the present invention.
FIG. 5 is a contour plot of the harmful words selected by logCDα for different values of α, according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the process of generating a harmful-word identification vector according to an embodiment of the present invention.
FIG. 7 is a diagram showing the configuration of an apparatus for automatically classifying lyrics harmful to youth, using automatic generation of a harmful-word vocabulary list and machine learning, according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a hybrid harmful-lyrics classification model according to an embodiment of the present invention.

In Korea, under the Youth Protection Act, recordings harmful to youth are reviewed under the supervision of the Ministry of Gender Equality and Family, a government agency. Broadcasters also review songs that are unsuitable for youth. These reviews are currently carried out by people; automating them would reduce the human, time, and material resources required.

Most harmful recordings are classified as such because their lyrics contain expressions harmful to youth. The present invention therefore treats the automatic review of harmful recordings as a harmful-lyrics classification problem. It covers a method for designing a machine learning model for harmful-lyrics classification and a method for automatically generating a vocabulary list of words harmful to youth.

The proposed hybrid harmful-lyrics classification model, which classifies lyrics harmful to youth using automatic generation of a harmful-word vocabulary list and machine learning, can be represented as shown in FIG. 1.

The hybrid harmful-lyrics classification model according to an embodiment of the present invention consists of five main parts:

(1) Automatic generation of a harmful-word vocabulary list

(2) Generation of a new harmful-word vocabulary list using the logCDα technique

(3) Application of a vocabulary-list-based filtering model to the lyrics to generate a prediction result vector, that is, the harmful-word identification vector

(4) Application of a context identification model to the lyrics to generate a prediction result vector, that is, the context identification vector

(5) Final prediction of whether or not the lyrics are harmful, through the hybrid harmful-lyrics classification model that uses the harmful-word identification vector and the context identification vector

Referring to FIG. 1, for all words appearing in the lyrics labeled as harmful lyrics or acceptable lyrics (110), the number of times each word appears in harmful lyrics and the number of times it appears in acceptable lyrics are counted, and these counts are used to generate the automatic harmful-word vocabulary list (120).

Next, a new harmful-word vocabulary list is generated that allows for mistakes and errors with respect to words that also appear in acceptable lyrics, and whether harmful words appear is determined on the basis of the automatic harmful-word vocabulary list and the new harmful-word vocabulary list (121).

Applying the harmful-word identification model (130), it is determined, for all words appearing in the lyrics, whether at least one word from the automatic harmful-word vocabulary list and the new harmful-word vocabulary list appears, and filtering prediction results based on the two lists are generated.

To take into account the context before and after each word, a context identification vector is generated using a sequential data processing model (141), which serves as the context identification model (140).

Using the harmful-word identification vector and the context identification vector generated in this way, the classifier (150), that is, the hybrid harmful-lyrics classification model, makes the final prediction of whether the lyrics are harmful.
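The final prediction step can be sketched as follows. The source names Bagging, AdaBoost, and K-Nearest Neighbors as candidate classifiers; this minimal sketch uses a hand-written k-nearest-neighbors vote so that it needs only the standard library. The layout of the two vectors (two vocabulary-list hits followed by one context score), the toy training data, and the function names are all assumptions made for illustration, not details taken from the patent.

```python
import math
from collections import Counter

def concatenate(word_vec, context_vec):
    """Join the harmful-word identification vector and the context vector."""
    return list(word_vec) + list(context_vec)

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: ([auto-list hit, new-list hit], [context score], label)
# Label 1 = harmful lyrics, 0 = acceptable lyrics.
samples = [
    ([1, 1], [0.9], 1),
    ([1, 0], [0.7], 1),
    ([0, 1], [0.8], 1),
    ([0, 0], [0.1], 0),
    ([0, 0], [0.2], 0),
    ([0, 0], [0.3], 0),
]
train_x = [concatenate(w, c) for w, c, _ in samples]
train_y = [label for _, _, label in samples]

# Lyrics with unknown status are vectorized the same way, then classified.
flagged = knn_predict(train_x, train_y, concatenate([1, 1], [0.8]))
clean = knn_predict(train_x, train_y, concatenate([0, 0], [0.15]))
print(flagged, clean)
```

Keeping the vocabulary-list hits as explicit entries in the concatenated vector is what preserves the word-presence signal that a purely low-dimensional context vector can lose.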

In addition to the Ministry of Gender Equality and Family's review of recordings harmful to youth, the present invention can be used by TV and radio stations and music streaming sites that need to check lyrics for harmful expressions. In particular, the invention can find harmful expressions without a predefined harmful-word vocabulary list such as a profanity dictionary. It can therefore be applied readily to other fields that need to check text data for harmful expressions, such as classifying harmful social media posts or detecting abusive chat messages.

FIG. 2 is a flowchart illustrating a method for automatically classifying lyrics harmful to youth, using automatic generation of a harmful-word vocabulary list and machine learning, according to an embodiment of the present invention.

The proposed method comprises: a step (210) of counting, for all words appearing in lyrics labeled as harmful lyrics or acceptable lyrics, the number of times each word appears in harmful lyrics and the number of times it appears in acceptable lyrics, and using these counts to generate an automatic harmful-word vocabulary list; a step (220) of generating a new harmful-word vocabulary list that allows for mistakes and errors with respect to words that also appear in acceptable lyrics; a step (230) of generating a harmful-word identification vector based on the automatic harmful-word vocabulary list and the new harmful-word vocabulary list; a step (240) of generating a context identification vector using a sequential data processing model, so that the context before and after each word is taken into account; and a step (250) of making a final prediction of whether the lyrics are harmful through a hybrid harmful-lyrics classification model that uses the harmful-word identification vector and the context identification vector.

In step 210, for all words appearing in lyrics labeled as harmful lyrics or acceptable lyrics, the number of times each word appears in harmful lyrics and the number of times it appears in acceptable lyrics are counted, and these counts are used to generate the automatic harmful-word vocabulary list.

Here, each word is assigned a score through a feature selection technique, according to the number of times it appears in harmful lyrics and the number of times it appears in acceptable lyrics, and harmful words are selected on the basis of these scores. The automatic harmful-word vocabulary list is then generated from the selected harmful words. The process of generating the automatic harmful-word vocabulary list is described in more detail with reference to FIGS. 3 and 4.

FIG. 3 is a schematic diagram of automatic harmful-word vocabulary list generation and a vocabulary-list-based filtering model according to an embodiment of the present invention.

First, for automatic generation of the harmful-word vocabulary list (310), the number of times each word appears in harmful lyrics and the number of times it appears in acceptable lyrics are counted for all words (313) appearing in lyrics labeled as acceptable lyrics (311) or harmful lyrics (312). For example, words such as 'fool' (바보), 'love' (사랑), 'dumb' (멍청), 'hip-hop' (힙합), 'apple' (사과), 'tears' (눈물), 'dog' (개), and 'clock' (시계) are each assigned a score through a feature selection technique (314), according to the number of times they appear in harmful lyrics and in acceptable lyrics. Once every word has a score, the N words most likely to be harmful are selected as harmful words, and the automatic harmful-word vocabulary list is generated from them.

Vocabulary-list-based filtering (320) is then performed. For example, suppose that 'dumb' (멍청), 'fool' (바보), and 'dog' (개) have been selected as harmful words (321). It is then determined (323) whether any harmful word appears among all the words in the lyrics (322). If a harmful word appears, the word is determined to be a harmful word (324); if none appears, the lyrics are determined to be acceptable lyrics (325).
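The filtering step described above amounts to a membership test. The following is a minimal sketch, assuming whitespace tokenization and English stand-ins for the example words; the function name is hypothetical.

```python
def contains_harmful_word(lyrics, lexicon):
    """Return True if at least one token of the lyrics is in the lexicon."""
    # Whitespace splitting is a simplification; real lyrics would need
    # language-appropriate tokenization.
    return any(token in lexicon for token in lyrics.split())

# English stand-ins for the example harmful words in the text.
lexicon = {"dumb", "fool", "dog"}

print(contains_harmful_word("you are such a fool", lexicon))
print(contains_harmful_word("tears on an apple", lexicon))
```

Storing the list as a set keeps each lookup constant-time, so the filter stays cheap even for a list of several hundred words.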

FIG. 4 is a diagram illustrating the process of generating a harmful-word vocabulary list according to an embodiment of the present invention.

First, to generate the harmful-word vocabulary list, all words appearing in lyrics labeled as harmful lyrics or acceptable lyrics (lyrics that are not harmful) are collected. The per-class occurrence information is then tabulated for each word as the number of times it appears in harmful lyrics (X) and the number of times it appears in acceptable lyrics (Y). For example, a word that appears three times in harmful lyrics and once in acceptable lyrics has X = 3 and Y = 1: (X, Y) = (3, 1).
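The (X, Y) tabulation can be sketched as below. This assumes that every occurrence of a word is counted (rather than the number of songs containing it) and that tokens are separated by whitespace; the toy corpus and the function name are hypothetical.

```python
from collections import Counter

def count_by_class(labeled_lyrics):
    """Return {word: (X, Y)} where X counts occurrences in harmful lyrics
    and Y counts occurrences in acceptable lyrics."""
    harmful, acceptable = Counter(), Counter()
    for text, is_harmful in labeled_lyrics:
        target = harmful if is_harmful else acceptable
        target.update(text.split())
    vocabulary = set(harmful) | set(acceptable)
    # Counter returns 0 for words it has never seen.
    return {word: (harmful[word], acceptable[word]) for word in vocabulary}

# Hypothetical toy corpus: (lyrics, labeled harmful?)
corpus = [
    ("fool fool love", True),
    ("fool tears", True),
    ("love fool apple", False),
]
counts = count_by_class(corpus)
print(counts["fool"])  # (3, 1)
```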

Each word can be scored from its X and Y values using a filter method, one family of feature selection methods. Filter methods include information gain, odds ratio, and relevance frequency. For example, if (X, Y) = (3, 1) and relevance frequency is used, the score of the word according to the relevance frequency formula is log₂(2 + X/Y) = log₂(2 + 3/1) = log₂ 5 ≈ 2.32. The scores assigned in this way can be shown as a contour plot of relevance frequency (410).
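The worked example can be checked with a few lines of code. The base-2 logarithm is what reproduces the value 2.32 quoted above; the `max(1, y)` guard against division by zero is an assumption added here and is not stated in the text.

```python
import math

def relevance_frequency(x, y):
    """Score a word from X (harmful count) and Y (acceptable count)."""
    # max(1, y) avoids dividing by zero for words never seen in
    # acceptable lyrics; this guard is an assumption of the sketch.
    return math.log2(2 + x / max(1, y))

print(round(relevance_frequency(3, 1), 2))  # 2.32
```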

In this way, every word is scored and the N highest-scoring words are selected as harmful words, so the harmful-word vocabulary list is generated automatically. The resulting list can then be used to check whether or not harmful words appear in a set of lyrics.
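Selecting the N highest-scoring words can be sketched as follows. The scoring function is passed in as a parameter; relevance frequency is used here only as an example, and the counts, the alphabetical tie-break, and the function names are hypothetical.

```python
import math

def build_lexicon(counts, n, score):
    """Return the n words with the highest score(X, Y), best first."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda word: (-score(*counts[word]), word))
    return ranked[:n]

def relevance_frequency(x, y):
    return math.log2(2 + x / max(1, y))

# Hypothetical (X, Y) counts per word.
counts = {"fool": (3, 1), "love": (1, 1), "apple": (0, 1), "dumb": (4, 0)}
lexicon = build_lexicon(counts, n=2, score=relevance_frequency)
print(lexicon)  # ['dumb', 'fool']
```

Because the scorer is a parameter, the same selection step works unchanged if a different filter method is substituted.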

The proposed method of automatically generating a harmful-word vocabulary list can be used even for languages in which predefined vocabulary lists are scarce. A filtering model based on such a list can be used for automatic harmful-lyrics classification, as shown in FIG. 4. Predefined vocabulary lists are mostly profanity or slang dictionaries built for general purposes, whereas the list produced by this method is generated automatically from the data and is therefore tailored to it, so it performs better than a predefined list. In addition, because the method is based on feature selection, its structure is simple: it only needs to check whether any of the N harmful words appear in the lyrics. It is therefore easier to implement and apply than other machine learning models. Beyond harmful-lyrics classification, it is applicable to any field that needs to check for the presence of harmful words.

Referring again to FIG. 2, in step 220, a new harmful word vocabulary list is generated that also takes into account mistakes and errors for words appearing in qualified lyrics.

In step 210, an existing filter model may be used as the feature selection method, but the present invention proposes and uses logCDα, which is better suited to selecting harmful words than existing methods. The following equation is the one applied to the filter model according to the prior art:

Equation 1

The new harmful word vocabulary list proposed in the present invention is generated using the following equation:

Equation 2

Here, α is a factor that accounts for mistakes and errors regarding words that appear in qualified lyrics. The process of generating the new harmful word vocabulary list is described in more detail with reference to FIG. 5.

FIG. 5 is a diagram showing, as contours, the harmful words selected by logCDα according to α, in accordance with an embodiment of the present invention.

logCDα [Equation 2] is a supervised word weighting scheme obtained by modifying the prior-art logCD [Equation 1] [Fattah, 2015]. α is added to both the numerator and the denominator of the existing formula, which shifts the X and Y axes by α. As shown in FIG. 5, this allows the way the score changes with the X and Y values to be adjusted through α.

FIGS. 5A to 5C are graphs showing the harmful words selected by logCDα for different values of α. FIG. 5A shows the harmful words for α = 0 and topN = 200, FIG. 5B for α = 10 and topN = 200, and FIG. 5C for α = 50 and topN = 200.

As α increases, words with a small X + Y value are selected less often, and words located somewhat away from the X axis are also selected as harmful words. In other words, as α increases, rarely occurring words are selected less, and frequently occurring words can be selected as harmful words even if they appear in qualified lyrics to some extent, on the assumption that those appearances reflect mistakes or errors.
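The qualitative effect of α can be illustrated with a small experiment. Equation 2 is not reproduced in this text, so the sketch below does not implement logCDα. It uses a hypothetical stand-in score, log((X + α) / (Y + α)), chosen only because it also adds α to a numerator and a denominator. The function name, the two example words, and their counts are all assumptions.

```python
import math

def stand_in_score(x, y, alpha):
    """Hypothetical smoothed log-ratio, NOT the patent's Equation 2.

    It only mimics the idea of adding alpha to the numerator and the
    denominator. This stand-in needs alpha > 0 so that y == 0 is safe.
    """
    if alpha <= 0:
        raise ValueError("this stand-in requires alpha > 0")
    return math.log((x + alpha) / (y + alpha))

# (X, Y) = (count in harmful lyrics, count in qualified lyrics), made-up values
words = {
    "rare": (5, 0),         # rare word, never seen in qualified lyrics
    "frequent": (200, 40),  # frequent word, sometimes seen in qualified lyrics
}

def top_word(alpha):
    """Return the word with the highest stand-in score for a given alpha."""
    return max(words, key=lambda w: stand_in_score(*words[w], alpha))

for alpha in (1, 50):
    print(alpha, top_word(alpha))
```

With the small α the rare word ranks first, and with the large α the frequent word overtakes it, which matches the behavior described above: rare words are selected less, and frequent words are tolerated even with some occurrences in qualified lyrics.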

Referring again to FIG. 2, in step 230, a harmful word identification vector is generated based on the automatic harmful word vocabulary list and the new harmful word vocabulary list.

For all the words appearing in the lyrics, it is determined whether at least one word from the generated automatic harmful word vocabulary list and new harmful word vocabulary list appears, and filtering prediction results based on those lists are thereby generated. The process of generating the harmful word identification vector is described in more detail with reference to FIG. 6.

FIG. 6 is a diagram for explaining the process of generating a harmful word identification vector according to an embodiment of the present invention.

As described above, using logCDα as the feature selection method and varying its α and the number of words N, n different harmful word vocabulary lists (621, 622, 623) are generated from all the words (610) of the lyrics. The following process is then performed for each set of lyrics that must be classified as harmful or qualified. First, it is checked whether at least one word from each harmful word vocabulary list appears in the lyrics (630). If at least one word from a harmful word vocabulary list appears, the lyrics are predicted to be harmful lyrics (1); if none appears, they are predicted to be qualified lyrics (0). After this process has been carried out for all lyrics and all harmful word vocabulary lists, only the k prediction results that performed best on the development data set are collected. These k prediction results form the harmful word identification vector (640). Rather than relying on a single vocabulary list, the harmful word identification vector generation process according to an embodiment of the present invention builds a vector of filtering prediction results based on several vocabulary lists, which can yield more robust performance.
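The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: lyrics are pre-tokenized word lists, each vocabulary list is a set, and "performance" on the development data is measured by accuracy (the text does not name the metric). All function names and the toy data are hypothetical.

```python
def predict_with_lexicon(tokens, lexicon):
    """Return 1 (harmful) if any lexicon word appears in the lyrics, else 0."""
    return 1 if any(tok in lexicon for tok in tokens) else 0

def select_best_lexicons(lexicons, dev_lyrics, dev_labels, k):
    """Keep the k lexicons whose predictions score best on development data."""
    def accuracy(lexicon):
        hits = sum(
            predict_with_lexicon(tokens, lexicon) == label
            for tokens, label in zip(dev_lyrics, dev_labels)
        )
        return hits / len(dev_labels)
    return sorted(lexicons, key=accuracy, reverse=True)[:k]

def identification_vector(tokens, selected_lexicons):
    """One 0/1 filtering prediction per selected lexicon."""
    return [predict_with_lexicon(tokens, lex) for lex in selected_lexicons]

# Toy lexicons, standing in for lists generated with different alpha and N
lexicons = [{"w1", "w2"}, {"w2", "w3", "w9"}, {"w7"}]
dev_lyrics = [["w1", "w5"], ["w5", "w6"], ["w3", "w9"], ["w9", "w7"]]
dev_labels = [1, 0, 1, 0]

selected = select_best_lexicons(lexicons, dev_lyrics, dev_labels, k=2)
vector = identification_vector(["w2", "w4"], selected)
print(vector)
```

Each element of the resulting vector is the verdict of one vocabulary list, so a single poorly chosen list cannot decide the outcome on its own.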

In step 240, a context identification vector is generated using a sequential data processing model so that the context before and after each word is taken into account.

To take the words preceding and following each word into account, the context identification vector is generated by vectorizing the lyrics with sequential data processing models, including convolutional neural networks (CNN) and recurrent neural networks (RNN).

For example, a model such as the Hierarchical Attention model of [Yang, 2016] is trained, and the output of the layer immediately before the softmax layer is used as the lyrics vector.
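The idea of reusing the layer just before the softmax as a feature vector can be shown with a toy network. The sketch below is not a Hierarchical Attention model: it is a tiny feed-forward stand-in with fixed, made-up weights, meant only to show which activations would be kept as the lyrics vector. All names, dimensions, and weight values are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense(weights, inputs):
    """Matrix-vector product: one output per weight row (no bias)."""
    return [sum(w * v for w, v in zip(row, inputs)) for row in weights]

def forward(features, w_hidden, w_out):
    """Return (penultimate activations, class probabilities)."""
    hidden = [math.tanh(v) for v in dense(w_hidden, features)]  # layer before softmax
    probs = softmax(dense(w_out, hidden))                       # softmax output layer
    return hidden, probs

# Hypothetical fixed weights standing in for a trained model
W_HIDDEN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # 3 inputs -> 2 hidden units
W_OUT = [[1.0, -1.0], [-1.0, 1.0]]               # 2 hidden units -> 2 classes

# The hidden activations, not the probabilities, serve as the lyrics vector
lyric_vector, class_probs = forward([1.0, 0.0, 2.0], W_HIDDEN, W_OUT)
print(lyric_vector)
```

In a real setting the weights would come from training the sequential model on labeled lyrics, and only the forward pass up to the penultimate layer would be run when producing vectors.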

In this way, the second type of harmful lyrics according to an embodiment of the present invention, namely "lyrics in which contextually harmful expressions appear", can be classified as harmful lyrics. For example, a case such as "strangle the neck" (목을 졸라), in which no harmful word appears but the words carry a harmful meaning in context, can be identified and classified as harmful lyrics.

In step 250, whether the lyrics are harmful is finally predicted through a hybrid harmful lyrics classification model using the harmful word identification vector and the context identification vector.

The harmful word identification vector and the context identification vector are concatenated, and a classification model, including Bagging, AdaBoost, and K-Nearest Neighbors, is trained on the concatenated vectors together with the harmful and qualified labels of the lyrics. Thereafter, when lyrics whose label is unknown, that is, lyrics not known to be harmful or qualified, are input to the classification model, the hybrid harmful lyrics classification model makes the final prediction of whether they are harmful.
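The concatenate-then-classify step can be sketched with a hand-written K-Nearest Neighbors classifier, one of the model families named above. This is an illustrative sketch only: the vectors and labels are made-up toy values, Euclidean distance and majority voting with k = 3 are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def concatenate(word_vec, context_vec):
    """Join the harmful word identification vector and the context vector."""
    return list(word_vec) + list(context_vec)

def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    dists = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training data: (harmful word vector, context vector, label)
samples = [
    ([1, 1], [0.9, 0.1], 1),
    ([1, 0], [0.8, 0.2], 1),
    ([0, 0], [0.9, 0.2], 1),  # no lexicon hit, but harmful in context
    ([0, 0], [0.8, 0.1], 1),
    ([0, 0], [0.1, 0.8], 0),
    ([0, 0], [0.2, 0.9], 0),
]
train_vectors = [concatenate(w, c) for w, c, _ in samples]
train_labels = [label for _, _, label in samples]

# Lyrics with an unknown label: no lexicon hit, context vector near harmful ones
query = concatenate([0, 0], [0.85, 0.15])
print(knn_predict(train_vectors, train_labels, query))
```

Because both vectors sit in the same feature space after concatenation, lyrics with no vocabulary list hit can still be predicted as harmful when their context vector lies close to harmful training examples.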

If separate models are built on the harmful word identification vector and on the context identification vector, the model based on the harmful word identification vector detects harmful words but cannot grasp the context of words, while the model based on the context identification vector grasps the context of words but may be weak at detecting harmful words. The hybrid harmful lyrics classification model according to an embodiment of the present invention combines the two so that each compensates for the other's shortcomings.

The Ministry of Gender Equality and Family's review of recordings harmful to youth must be carried out continuously under the Youth Protection Act. The review consists broadly of three stages: routine preliminary work by monitoring staff carried out on an ongoing basis, a first review by the recordings committee held every two weeks, and the main review by the Youth Protection Committee held once a month. Because many experts must be employed for this, funding is required. In addition, TV and radio broadcasters similarly review popular songs every week. If the review of lyrics harmful to youth were automated through machine learning, the labor costs required for review could be reduced.

FIG. 7 is a diagram showing the configuration of an apparatus for automatically classifying lyrics harmful to youth using automatic generation of a harmful word vocabulary list and machine learning, according to an embodiment of the present invention.

The apparatus 700 for automatically classifying lyrics harmful to youth using automatic generation of a harmful word vocabulary list and machine learning includes an automatic harmful word vocabulary list generation unit 710, a new harmful word vocabulary list generation unit 720, a harmful word identification vector generation unit 730, a context identification vector generation unit 740, and a final harmful lyrics prediction unit 750.

The automatic harmful word vocabulary list generation unit 710 obtains, for every word appearing in lyrics classified as harmful lyrics or qualified lyrics, the number of times the word appears in the harmful lyrics and the number of times it appears in the qualified lyrics, and automatically generates the automatic harmful word vocabulary list using those counts.

Here, each word is assigned a score through a feature selection method according to the number of times it appears in the harmful lyrics and the number of times it appears in the qualified lyrics, and harmful words are selected based on the scores. The automatic harmful word vocabulary list is then generated automatically using the selected harmful words.

The new harmful word vocabulary list generation unit 720 generates a new harmful word vocabulary list that also takes into account mistakes and errors for words appearing in qualified lyrics.

The automatic harmful word vocabulary list generation unit 710 may use an existing filter model as the feature selection method, but the present invention proposes and uses logCDα, which is better suited to selecting harmful words than existing methods. Equation 1 described above is the equation applied to the filter model according to the prior art, and the new harmful word vocabulary list proposed in the present invention is generated using Equation 2 above.

The harmful word identification vector generation unit 730 generates a harmful word identification vector based on the generated automatic harmful word vocabulary list and the new harmful word vocabulary list.

For all the words appearing in the lyrics, it is determined whether at least one word from the generated automatic harmful word vocabulary list and new harmful word vocabulary list appears, and filtering prediction results based on those lists are thereby generated.

The context identification vector generation unit 740 generates a context identification vector using a sequential data processing model so that the context before and after each word is taken into account.

In this way, the second type of harmful lyrics according to an embodiment of the present invention, namely "lyrics in which contextually harmful expressions appear", can be classified as harmful lyrics. For example, a case such as "strangle the neck" (목을 졸라), in which no harmful word appears but the words carry a harmful meaning in context, can be identified and classified as harmful lyrics.

The final harmful lyrics prediction unit 750 finally predicts whether the lyrics are harmful through a hybrid harmful lyrics classification model using the harmful word identification vector and the context identification vector.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method for automatically classifying harmful lyrics, the method comprising:
obtaining, for every word appearing in lyrics classified as harmful lyrics or qualified lyrics, the number of times the word appears in the harmful lyrics and the number of times it appears in the qualified lyrics, and automatically generating an automatic harmful word vocabulary list using those counts;
generating a new harmful word vocabulary list that also takes into account mistakes and errors for words appearing in the qualified lyrics;
generating a harmful word identification vector based on the generated automatic harmful word vocabulary list and the new harmful word vocabulary list;
generating a context identification vector using a sequential data processing model in order to consider the context before and after a word; and
finally predicting whether lyrics are harmful through a hybrid harmful lyrics classification model using the harmful word identification vector and the context identification vector.

The method of claim 1,
wherein the automatically generating of the automatic harmful word vocabulary list comprises
assigning each word a score according to the number of times the word appears in the harmful lyrics and the number of times it appears in the qualified lyrics, and selecting harmful words based on the scores.

The method of claim 2,
wherein the automatic harmful word vocabulary list is automatically generated using the selected harmful words.

The method of claim 1,
wherein the generating of the new harmful word vocabulary list that takes into account mistakes and errors for words appearing in the qualified lyrics comprises
generating the new harmful word vocabulary list using the following equation,

where α is a factor that accounts for mistakes and errors regarding words appearing in the qualified lyrics.

The method of claim 1,
wherein the generating of the harmful word identification vector based on the generated automatic harmful word vocabulary list and the new harmful word vocabulary list comprises
determining, for all the words appearing in the lyrics, whether at least one word from the generated automatic harmful word vocabulary list and the new harmful word vocabulary list appears, and thereby generating filtering prediction results based on the automatic harmful word vocabulary list and the new harmful word vocabulary list.

The method of claim 1,
wherein the generating of the context identification vector using a sequential data processing model in order to consider the context before and after a word comprises
generating the context identification vector by vectorizing the lyrics using sequential data processing models including convolutional neural networks (CNN) and recurrent neural networks (RNN).

The method of claim 1,
wherein the finally predicting of whether lyrics are harmful through the hybrid harmful lyrics classification model using the harmful word identification vector and the context identification vector comprises
concatenating the harmful word identification vector and the context identification vector, training a classification model including Bagging, AdaBoost, and K-Nearest Neighbors on the concatenated vectors together with the harmful lyrics and qualified lyrics, and, when lyrics not known to be harmful or qualified are input to the classification model, finally predicting whether those lyrics are harmful through the hybrid harmful lyrics classification model.

An apparatus for automatically classifying harmful lyrics, the apparatus comprising:
an automatic harmful word vocabulary list generation unit that obtains, for every word appearing in lyrics classified as harmful lyrics or qualified lyrics, the number of times the word appears in the harmful lyrics and the number of times it appears in the qualified lyrics, and automatically generates an automatic harmful word vocabulary list using those counts;
a new harmful word vocabulary list generation unit that generates a new harmful word vocabulary list that also takes into account mistakes and errors for words appearing in the qualified lyrics;
a harmful word identification vector generation unit that generates a harmful word identification vector based on the generated automatic harmful word vocabulary list and the new harmful word vocabulary list;
a context identification vector generation unit that generates a context identification vector using a sequential data processing model in order to consider the context before and after a word; and
a final harmful lyrics prediction unit that finally predicts whether lyrics are harmful through a hybrid harmful lyrics classification model using the harmful word identification vector and the context identification vector.

The apparatus of claim 8,
wherein the automatic harmful word vocabulary list generation unit
assigns each word a score through a feature selection method according to the number of times the word appears in the harmful lyrics and the number of times it appears in the qualified lyrics, selects harmful words based on the scores, and automatically generates the automatic harmful word vocabulary list using the selected harmful words.

The apparatus of claim 8,
wherein the harmful word identification vector generation unit
determines, for all the words appearing in the lyrics, whether at least one word from the generated automatic harmful word vocabulary list and the new harmful word vocabulary list appears, and thereby generates filtering prediction results based on the automatic harmful word vocabulary list and the new harmful word vocabulary list.

The apparatus of claim 8,
wherein the context identification vector generation unit
generates the context identification vector by vectorizing the lyrics using sequential data processing models including convolutional neural networks (CNN) and recurrent neural networks (RNN).

The apparatus of claim 8,
wherein the final harmful lyrics prediction unit
concatenates the harmful word identification vector and the context identification vector, trains a classification model including Bagging, AdaBoost, and K-Nearest Neighbors on the concatenated vectors together with the harmful lyrics and qualified lyrics, and, when lyrics not known to be harmful or qualified are input to the classification model, finally predicts whether those lyrics are harmful through the hybrid harmful lyrics classification model.