KR20230051354A

KR20230051354A - Method for evaluating malicious comments

Info

Publication number: KR20230051354A
Application number: KR1020210134012A
Authority: KR
Inventors: 문종민
Original assignee: 문종민
Priority date: 2021-10-08
Filing date: 2021-10-08
Publication date: 2023-04-18

Abstract

The present invention relates to a method for determining malicious comments. Comment data is sorted into morphemes by part of speech, and malicious comments are determined using only words belonging to five parts of speech (nouns, particles, verbs, adjectives, and adverbs), excluding suffixes, punctuation marks, and the like, which do not affect the determination of maliciousness; this improves determination performance. In particular, performance can be further improved by building a determination model that extracts nouns in combination with adjectives rather than nouns alone, and that removes verbs rather than other parts of speech.

Description

Malicious Comment Detection Method {METHOD FOR EVALUATING MALICIOUS COMMENTS}

The present invention relates to a method for determining malicious comments and, more particularly, to a method that extracts features by part of speech from comments through machine learning and morpheme analysis, filters the data by a combination of specific parts of speech, and determines whether a comment is malicious, thereby improving the performance of malicious comment determination.

In general, with the spread of information and communication technologies such as the Internet, smartphones, and social networking services (SNS), people can acquire more information more easily than before. As a side effect of this convenience, however, a flood of harmful Internet content such as malicious comments, pornography, and fake news has given rise to various social problems.

Among these, malicious comments on the Internet have recently emerged as a social problem and have drawn renewed attention following the suicides of well-known celebrities.

Internet users do not merely consume content; they actively express opinions on it through comments. For content posted on a web page visited by many users, a single piece of content may attract hundreds or even thousands of comments. Comments shape public opinion, and content providers use them as a major means of receiving user feedback.

However, as the number of commenters has grown, so has the number of users who use comments to vent their dissatisfaction or to attack others maliciously. A new word, 'akpeul' (악플, malicious reply), has even been coined for comments that exploit the anonymity of the Internet to slander others or spread false information using vulgar language such as profanity or obscenities. Some users write comments containing hate speech, and because of the nature of the Internet such comments are exposed to many users, including teenagers, and give rise to cyber violence.

Methods for blocking malicious comments exist, but existing methods focus on filtering spam comments. In addition, because existing methods filter hate comments using a registered database, they cannot filter new hate expressions that are not registered in the database.

Furthermore, a conventional comment analysis method has been proposed that uses a support vector machine to determine whether each word of a comment entered by a user is positive or negative and analyzes the reputation of the comment using the words so determined. However, because this conventional method must determine whether every word of the input comment is positive or negative, determining whether a comment is malicious takes a long time and is inefficient.

Korean Patent Publication No. 10-2009-0103171 (Method for automatically filtering comments)

The present invention is intended to solve the above problems of the prior art. An object of the present invention is to provide a malicious comment determination method that can improve the performance of malicious comment determination by extracting features by part of speech from comments through machine learning and morpheme analysis and filtering the data by a combination of specific parts of speech.

To achieve the above object, the malicious comment determination method according to the present invention comprises: a data collection step of collecting comment data using a data collector; a step of extracting morphemes from the collected comment data and preprocessing them; a step of sorting the preprocessed morphemes by part of speech using a morpheme analyzer; a step of extracting the sorted morphemes as a combination of specific parts of speech; a step of vectorizing the extracted morphemes; a step of training on the vectorized morphemes with a Support Vector Machine algorithm, determining whether the morphemes include a word stored in a sentiment type dictionary, and determining through modeling whether the comment is malicious; and a step of, when the comment includes a word stored in the sentiment type dictionary, applying a preset weight to derive a result value and determining whether the comment is malicious.

Here, the parts of speech of the sorted morphemes include nouns, particles, verbs, adjectives, and adverbs, and the remaining parts of speech are removed.

Here, the combination of the sorted morphemes is a combination that includes nouns and adjectives and excludes verbs.

Here, the parts of speech are sorted using the OKT (Open Korean Text) analyzer, and the morphemes are vectorized using TF-IDF.

Here, the sentiment type dictionary is built in advance as a plurality of sentiment dictionaries according to the type of comment.

Here, the preprocessing comprises: a step of cleansing the comment content so that only Hangul remains; a step of removing e-mail addresses, URLs, HTML tags, and special symbols; a step of separating the characters of the comment into consonants and vowels and performing a spell check; a step of correcting the comment according to the spelling rules after the spell check; and a step of recombining the corrected comment into sentences.

The malicious comment determination method according to the present invention, configured as described above, sorts comment data into morphemes by part of speech and determines malicious comments using only words belonging to five parts of speech (nouns, particles, verbs, adjectives, and adverbs), excluding suffixes, punctuation marks, and the like, which do not affect the determination of maliciousness, thereby improving determination performance.

In addition, the method can further improve determination performance by building a determination model that, when extracting words from a sentence, extracts nouns in combination with adjectives rather than nouns alone, and that removes verbs rather than other parts of speech.

FIG. 1 is a flowchart of the malicious comment determination method according to the present invention.
FIG. 2 is a diagram showing an example of word vectorization after nouns and adjectives are extracted and verbs are removed.

Hereinafter, the malicious comment determination method according to the present invention will be described in detail by way of an embodiment with reference to the accompanying drawings.

In the malicious comment determination method according to the present invention, comment data is first collected using a data collector (S1).

The comment data collection step (S1) collects online comments for various purposes. In this embodiment, comments included in posts on portal sites are collected using Python web-crawling source code. The comments are collected by crawling bulletin board and comment information through the APIs (Application Programming Interfaces) of portal sites such as Naver and Google.

Next, the comments collected in the comment data collection step (S1) undergo a morpheme extraction and preprocessing step (S2).

The preprocessing step includes a step of cleansing the comment content so that only Hangul remains (S2-1), a step of removing e-mail addresses, URLs, HTML tags, and special symbols (S2-2), and a step of separating the characters of the comment into consonants and vowels and performing a spell check (S2-3).

An example of the preprocessing step is as follows (the Korean sample text begins "Please visit my blog as well").

input : 제 블로그에도 방문해주세요. http://blognavercom/whdals0 <h1> 서이추 환영 </h2> ㅋㅋㅋㅋ ㄱㅅㄱㅅ

output : 제 블로그에도 방문해주세요 서이추 환영
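The cleansing in steps S2-1 and S2-2 can be sketched with regular expressions. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name `clean_comment` and the specific patterns are hypothetical, and URLs, e-mail addresses, and tags are stripped before the Hangul-only filter so that Hangul embedded in a URL or tag does not survive.

```python
import re

# Hypothetical patterns; the source does not specify how each item is matched.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is neither a complete Hangul syllable (U+AC00..U+D7A3) nor
# whitespace. Standalone jamo such as "ㅋㅋㅋㅋ" fall outside this range.
NON_SYLLABLE_RE = re.compile(r"[^가-힣\s]")
SPACE_RE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Remove URLs, e-mails, HTML tags, and symbols, keeping Hangul syllables."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_SYLLABLE_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


raw = "제 블로그에도 방문해주세요. http://blognavercom/whdals0 <h1> 서이추 환영 </h2> ㅋㅋㅋㅋ ㄱㅅㄱㅅ"
print(clean_comment(raw))  # 제 블로그에도 방문해주세요 서이추 환영
```

Applied to the sample input above, the sketch reproduces the sample output; real comment data would likely need broader URL and e-mail patterns.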

Next, after the spell check, the comment is corrected according to the spelling rules: the comment is split on whitespace, each piece is separated into consonants and vowels, and the consonant and vowel sequence is compared with a spelling dictionary to check the spelling.

An example of step S2-4 is as follows.

input : 제 블로그에도 방문해주세요 서이추 환영

output : 제 / 블로그에도 / 방문해주세요 / 서이추 / 환영

input : 제 / 블로구에도 / 방문해주세요 / 서이추 / 환영

output : {ㅈ,ㅔ}, {ㅂ,ㅡ,ㄹ,ㄹ,ㅗ,ㄱ,ㅜ,ㅇ,ㅔ,ㄷ,ㅗ}, {ㅂ,ㅏ,ㅇ,ㅁ,ㅜ,ㄴ,ㅎ,ㅐ,ㅈ,ㅜ,ㅅ,ㅔ,ㅇ,ㅛ}

output : {ㅈ,ㅔ}, {ㅂ,ㅡ,ㄹ,ㄹ,ㅗ,ㄱ,ㅡ,ㅇ,ㅔ,ㄷ,ㅗ}, {ㅂ,ㅏ,ㅇ,ㅁ,ㅜ,ㄴ,ㅎ,ㅐ,ㅈ,ㅜ,ㅅ,ㅔ,ㅇ,ㅛ} (spelling corrected by comparison with the spelling dictionary)

Next, the corrected comment is recombined into sentences (S2-5).

input : {ㅈ,ㅔ}, {ㅂ,ㅡ,ㄹ,ㄹ,ㅗ,ㄱ,ㅡ,ㅇ,ㅔ,ㄷ,ㅗ}, {ㅂ,ㅏ,ㅇ,ㅁ,ㅜ,ㄴ,ㅎ,ㅐ,ㅈ,ㅜ,ㅅ,ㅔ,ㅇ,ㅛ}

output : 제 블로그에도 방문해주세요
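The consonant and vowel separation, dictionary comparison, and recombination can be sketched with Unicode arithmetic on precomposed Hangul syllables. This is a minimal sketch, assuming a tiny hypothetical spelling dictionary and using `difflib` similarity as a stand-in for whatever comparison the actual method performs; the function names are illustrative only.

```python
import difflib

# Jamo tables in Unicode order for precomposed syllables (U+AC00..U+D7A3).
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ",
             "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ",
             "ㅋ", "ㅌ", "ㅍ", "ㅎ"]


def decompose(word):
    """Return one (initial, medial, final) tuple per Hangul syllable."""
    syllables = []
    for ch in word:
        code = ord(ch) - 0xAC00
        if not 0 <= code < 11172:
            raise ValueError(f"not a Hangul syllable: {ch!r}")
        syllables.append((CHOSEONG[code // 588],
                          JUNGSEONG[code % 588 // 28],
                          JONGSEONG[code % 28]))
    return syllables


def compose(syllables):
    """Recombine (initial, medial, final) tuples into a Hangul string."""
    return "".join(
        chr(0xAC00 + CHOSEONG.index(i) * 588 + JUNGSEONG.index(m) * 28
            + JONGSEONG.index(f))
        for i, m, f in syllables)


def flat(word):
    """Flatten a word into its jamo sequence, e.g. '제' -> 'ㅈㅔ'."""
    return "".join(i + m + f for i, m, f in decompose(word))


def correct(word, dictionary, cutoff=0.8):
    """Return the dictionary entry whose jamo sequence is closest, else word."""
    by_jamo = {flat(entry): entry for entry in dictionary}
    hits = difflib.get_close_matches(flat(word), list(by_jamo), n=1, cutoff=cutoff)
    return by_jamo[hits[0]] if hits else word


# Hypothetical spelling dictionary; a real one would be far larger.
dictionary = ["블로그에도", "방문해주세요"]
tokens = "제 블로구에도 방문해주세요".split()
print(" ".join(correct(tok, dictionary) for tok in tokens))
```

Comparing at the jamo level means a one-vowel slip such as 구 for 그 changes a single element of the sequence, so the misspelled token stays close to its dictionary entry. The `cutoff` value is an assumption and would need tuning on real data.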

Next, the preprocessed morphemes are sorted by part of speech using a morpheme analyzer (S3).

In this embodiment, a morpheme-based feature extraction method is used for analysis in order to determine malicious comments. In this specification, 'feature' extraction means processing data composed of sentences or words into a form that a machine learning model can use.

In this embodiment, the OKT (Open Korean Text) analyzer provided by KoNLPy is used as the morpheme analyzer. Using the OKT morpheme analyzer, eight frequently occurring morpheme types that can affect malicious comment determination are selected from the 28 extractable morpheme types.

In this embodiment, the frequency of occurrence of morpheme types was checked in 10,000 malicious comments, 10,000 non-malicious comments, and 10,000 mixed malicious and non-malicious comments, and words were extracted on the basis of those morpheme types and used as features. [Table 1] shows the frequency of occurrence by morpheme type in the 10,000 malicious comments.

[Table 1] Frequency of occurrence by morpheme type in 10,000 malicious comments

Morpheme type    Occurrences   Frequency rank
Noun             238,641       1
Josa             91,709        2
Verb             71,971        3
Punctuation      30,401        4
Adjective        28,831        5
Foreign          23,834        6
Suffix           15,718        7
Adverb           9,499         8
Number           7,126         9
KoreanParticle   6,796         10
Determiner       4,385         11
Alpha            3,905         12
Conjunction      1,009         13
Exclamation      820           14
URL              560           15
ScreenName       281           16
Eomi             200           17
HashTag          45            18
Email            2             19
Unknown          0             20
Proeomi          0             20

In the experiment, nouns, particles, verbs, punctuation, adjectives, foreign words, and suffixes each appeared 10,000 times or more in the malicious comments; nouns, particles, verbs, adjectives, punctuation, and suffixes each appeared 10,000 times or more in the non-malicious comments; and nouns, particles, verbs, punctuation, adjectives, suffixes, foreign words, and adverbs each appeared 10,000 times or more in the mixed data.

In this embodiment, among the frequently occurring morpheme types, punctuation, suffixes, and foreign words consist mostly of data that does not help classification, so they are excluded; only the five morpheme types (parts of speech) of nouns, particles, verbs, adjectives, and adverbs are retained, and the remaining parts of speech are removed.
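The part-of-speech filter described above can be sketched as a simple tag lookup. This is a minimal sketch: the tagged sample is hand-written for illustration and is not actual analyzer output, and `select_by_pos` is a hypothetical helper. With KoNLPy installed, pairs of this shape would typically come from `Okt().pos(text)`, which is not called here so that the sketch runs without the analyzer's Java dependency.

```python
# The five part-of-speech tags retained in this embodiment (OKT tag names,
# as listed in Table 1).
KEEP_TAGS = frozenset({"Noun", "Josa", "Verb", "Adjective", "Adverb"})


def select_by_pos(tagged, keep=KEEP_TAGS):
    """Keep only the words whose part-of-speech tag is in `keep`."""
    return [word for word, tag in tagged if tag in keep]


# Hand-written (word, tag) pairs, for illustration only.
sample = [
    ("블로그", "Noun"),
    ("에도", "Josa"),
    ("방문", "Noun"),
    ("해주세요", "Verb"),
    (".", "Punctuation"),
    ("ㅋㅋ", "KoreanParticle"),
    ("blog", "Alpha"),
]

five_pos = select_by_pos(sample)                              # five retained tags
noun_adj = select_by_pos(sample, keep={"Noun", "Adjective"})  # verbs excluded
print(five_pos)
print(noun_adj)
```

The second call illustrates the narrower combination discussed later in the description, which keeps nouns and adjectives and drops verbs.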

Next, the sorted morphemes are extracted as a combination of specific parts of speech, and the extracted morphemes are vectorized (S4).

Next, the vectorized morphemes are used for training with a Support Vector Machine algorithm, it is determined whether the morphemes include a word stored in the sentiment type dictionary, and whether the comment is malicious is determined through modeling (S5). When the comment includes a word stored in the sentiment type dictionary, a preset weight is applied to derive a result value, and whether the comment is malicious is determined (S6).

In this embodiment, the words extracted on the basis of part of speech are vectorized using CountVectorizer and TF-IDF, and malicious comment determination is then performed using an SVM.

In this embodiment, CountVectorizer is configured with analyzer = word and min_df = 1.
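To make the count-then-weight arithmetic concrete, the sketch below reimplements it with the standard library only. It is not the embodiment's code: it assumes the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which is the documented default behaviour of scikit-learn's TF-IDF transformer. Whether the embodiment keeps those defaults is an assumption, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights.

    `docs` is a list of token lists, e.g. the words kept after the
    part-of-speech filter.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts (the count-vector step)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


docs = [["블로그", "방문"], ["블로그", "환영"], ["블로그"]]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

A token that appears in every comment (here 블로그) receives the minimum IDF of 1, so rarer tokens in the same comment end up with larger weights. The resulting rows are the feature vectors that would be passed to the SVM.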

In this embodiment, the effect of part-of-speech-based features on the malicious comment determination model was examined through several kinds of experiments using features extracted on the basis of the five parts of speech.

In the first experiment, 100 features each were extracted using only the words belonging to the five parts of speech (nouns, particles, verbs, adjectives, and adverbs), and malicious comment determination was performed. In the second experiment, 100 features were extracted for each part of speech and then added cumulatively in the order of frequency of occurrence shown in [Table 1]. In the third experiment, the maximum number of features was fixed at 500 and part-of-speech-based features were added in equal proportions. In the fourth experiment, the features were configured so that words were added according to the frequency of occurrence of the words of each part of speech. In the fifth experiment, with the maximum number of features fixed at 500, performance with all five part-of-speech-based features included was compared with performance when each part-of-speech-based feature was removed.

The experimental results are shown in [Table 2].

[Table 2] Malicious comment determination experiments with part-of-speech-based features

Part-of-speech types showing good performance (rank 1 = highest accuracy)

Experiment   Features     1           2           3           4           5
1-1          100          noun        adjective   verb        adverb      particle
1-2          200          adjective   noun        verb        particle    adverb
2            100 to 500   noun        adjective   verb        adverb      particle
3            500          adjective   adverb      verb        noun        particle
4-1          500          adverb      adjective   verb        particle    noun
             1000         verb        particle    adjective   adverb      noun
             1500         adjective   adverb      verb        particle    noun
             2000         particle    adjective   noun        adverb      verb
             2500         noun        adjective   adverb      particle    verb
4-2          500          adjective   noun        verb        adverb      particle
             1000         verb        particle    adverb      noun        adjective
             1500         adjective   particle    verb        adverb      noun
             2000         adjective   adverb      noun        particle    verb
             2500         adverb      adjective   verb        particle    noun

Performance ranking when each part of speech is removed

Experiment   Features     1           2           3           4           5
5            500          verb        adverb      adjective   noun        particle

As shown in [Table 2], when extracting words from a sentence, extracting nouns in combination with adjectives gave higher accuracy than extracting nouns alone. In addition, performance was higher when verbs were removed than when other parts of speech were removed.

According to the malicious comment determination method of the present invention, as described above, building the combination of sorted morphemes so that it includes nouns and adjectives and excludes verbs improves accuracy even when only a small number of morphemes are extracted.

FIG. 3 is a diagram showing an example of word vectorization after nouns and adjectives are extracted and verbs are removed.

In the maliciousness determination step (S5), training is performed with a Support Vector Machine algorithm on the basis of the preprocessed data, and whether a comment is malicious is determined through modeling.

SVM is a machine learning technique that, given a set of data items each belonging to one of two categories, builds a classification model that determines which category a new data item belongs to. The resulting model is represented as a boundary in the space onto which the data is mapped, and the SVM algorithm finds the boundary with the widest margin.
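The widest-margin boundary can be written in the usual textbook form, which the source does not spell out. The symbols are assumptions introduced here: $x_i$ is the feature vector of the $i$-th training comment, $y_i \in \{-1, +1\}$ is its label (malicious or not), and $w$, $b$ define the separating hyperplane.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n

\text{margin width} = \frac{2}{\lVert w \rVert},
\qquad
\hat{y}(x) = \operatorname{sign}\left( w^{\top} x + b \right)
```

Minimizing $\lVert w \rVert$ is equivalent to maximizing the margin width $2 / \lVert w \rVert$, and a new comment with feature vector $x$ is assigned to a category by the sign of $w^{\top} x + b$. Practical implementations generally use a soft-margin variant that tolerates some misclassified training points; which variant the embodiment uses is not stated.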

The maliciousness determination checks whether the comment includes a word stored in the sentiment type dictionary.

The sentiment type dictionary is built in advance as a plurality of sentiment dictionaries according to the type of comment. The sentiment dictionaries consist of hate-expression words and profanity. The profanity dictionary may include, for example, words classed as "Korean slang". New hate-expression words discovered by a dictionary construction device are stored in the sentiment dictionaries, which are continuously updated.

In this embodiment, the sentiment type dictionary is divided into seven types (verbal abuse, exposure, ID theft, fraud, stalking, bullying, and sexual insult), each built as its own sentiment dictionary.

Weights may be applied to the words included in each sentiment dictionary, and a separate weight may be set for each word. When a comment includes a word stored in the sentiment type dictionary, the preset weight is applied.

The sentiment dictionary weights may differ depending on the content of the comment. For example, a comment that receives weights such as verbal abuse 1, exposure 1, and sexual insult 1 is determined to be malicious by the malicious comment determination, and when the weights from each type's sentiment dictionary are applied and a soundness level is derived, the comment is judged to be a dangerous malicious comment with 'soundness: at risk' and can be blocked.
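The weighting step can be sketched as a per-type dictionary lookup. Everything specific here is an assumption made for illustration: the placeholder dictionary entries, the weights, the `RISK_THRESHOLD` value, and the rule that combines the classifier's verdict with the summed weights. The source does not state how the soundness level is computed.

```python
# Hypothetical per-type sentiment dictionaries mapping word -> weight.
# Placeholder tokens stand in for real dictionary entries.
SENTIMENT_DICTIONARIES = {
    "verbal_abuse": {"욕설예시": 1.0},
    "exposure": {"폭로예시": 1.0},
    "sexual_insult": {"모욕예시": 1.0},
    "fraud": {"사기예시": 1.0},
}

RISK_THRESHOLD = 3.0  # hypothetical cut-off for "soundness: at risk"


def type_scores(tokens, dictionaries=SENTIMENT_DICTIONARIES):
    """Sum the weights of matched words for each sentiment type."""
    scores = {}
    for type_name, words in dictionaries.items():
        total = sum(words.get(tok, 0.0) for tok in tokens)
        if total > 0:
            scores[type_name] = total
    return scores


def decide(classifier_says_malicious, tokens):
    """Return ('block' or 'allow', per-type scores) for one comment."""
    scores = type_scores(tokens)
    at_risk = sum(scores.values()) >= RISK_THRESHOLD
    action = "block" if classifier_says_malicious and at_risk else "allow"
    return action, scores


tokens = ["욕설예시", "폭로예시", "모욕예시", "블로그"]
print(decide(True, tokens))
```

With the example from the text (verbal abuse 1, exposure 1, sexual insult 1), the summed weight reaches the assumed threshold, so a comment already classified as malicious is blocked.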

This embodiment merely illustrates part of the technical idea of the present invention, and it is evident that all modifications and specific embodiments that those skilled in the art can readily infer within the scope of the technical idea contained in the specification are included in the technical idea of the present invention.

S1 : Comment data collection step
S2 : Preprocessing step
S3 : Morpheme analysis step
S4 : Morpheme vectorization step
S5 : Maliciousness determination step
S6 : Weight application and maliciousness determination step

Claims

A malicious comment determination method comprising:
a data collection step of collecting comment data using a data collector;
a step of extracting morphemes from the collected comment data and preprocessing them;
a step of sorting the preprocessed morphemes by part of speech using a morpheme analyzer;
a step of extracting the sorted morphemes as a combination of specific parts of speech and vectorizing the extracted morphemes;
a step of training on the vectorized morphemes with a Support Vector Machine algorithm, determining whether the morphemes include a word stored in a sentiment type dictionary, and determining through modeling whether a comment is malicious; and
a step of, when the comment includes a word stored in the sentiment type dictionary, applying a preset weight to derive a result value and determining whether the comment is malicious.

The malicious comment determination method according to claim 1,
wherein the parts of speech of the sorted morphemes include nouns, particles, verbs, adjectives, and adverbs, and the remaining parts of speech are removed.

The malicious comment determination method according to claim 2,
wherein the combination of the sorted morphemes is a combination that includes nouns and adjectives and excludes verbs.

The malicious comment determination method according to claim 1,
wherein the parts of speech are sorted using an OKT (Open Korean Text) analyzer, and
the morphemes are vectorized using TF-IDF.

The malicious comment determination method according to claim 1,
wherein the sentiment type dictionary is built in advance as a plurality of sentiment dictionaries according to the type of comment.

The malicious comment determination method according to claim 1,
wherein the preprocessing step comprises:
a step of cleansing the comment content so that only Hangul remains;
a step of removing e-mail addresses, URLs, HTML tags, and special symbols;
a step of separating the characters of the comment into consonants and vowels and performing a spell check;
a step of correcting the comment according to the spelling rules after the spell check; and
a step of recombining the corrected comment into sentences.