KR20150054355A - Apparatus and method for building up sentiment dictionary

Info

Publication number: KR20150054355A
Application number: KR1020130136710A
Authority: KR
Inventors: 조희련; 이종석
Original assignee: 연세대학교 산학협력단
Priority date: 2013-11-12
Filing date: 2013-11-12
Publication date: 2015-05-20
Also published as: KR101555039B1

Abstract

The present invention relates to a sentiment dictionary building apparatus and a method thereof. The sentiment dictionary building apparatus includes a word frequency calculation unit, a comparison determination unit, and a score conversion unit. The word frequency calculation unit calculates how frequently a given word occurs in a positive text group, made up of texts classified as positive opinions, and in a negative text group, made up of texts classified as negative opinions. The comparison determination unit determines whether the sentiment score set for that word in a sentiment dictionary is consistent with the ratio between the word's frequency values in the positive and negative text groups. The score conversion unit selectively changes the sentiment score set for the word in the sentiment dictionary according to the result from the comparison determination unit.

Description

APPARATUS AND METHOD FOR BUILDING UP SENTIMENT DICTIONARY

The present invention relates to an apparatus and a method for building a sentiment dictionary.

Research on sentiment analysis and opinion mining has been increasing in recent years. These techniques can analyze conversations on social media about stocks, elections, movies, and the like. They can also be used to model public sentiment after a particular event, and can provide deep insight that is difficult to obtain from unemployment statistics alone. Sentiment analysis and opinion mining techniques can be broadly divided into approaches based on machine learning and approaches that use a sentiment dictionary. Of these, opinion classification based on machine learning can perform poorly on social issues, because a classifier trained on early data is used to classify data generated later.

Various lexical resources that define sentiment scores or sentiment categories have been built and made available. Each lexical resource can be used to classify opinions expressed in text, such as Internet comments, product reviews, blogs, and tweets, as positive or negative. For example, a company may want to analyze online product reviews for use in product marketing. A product review can be classified as a positive or negative opinion by using a sentiment dictionary to compute the average of the sentiment scores of the words contained in the review.
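The averaging approach described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `classify_review`, the whitespace tokenization, the zero decision boundary, and the toy lexicon are all assumptions made for the example.

```python
def classify_review(text, lexicon):
    """Label a review by the mean sentiment score of its known words.

    `lexicon` maps a word to a signed score (assumed: > 0 positive, < 0 negative).
    Words missing from the lexicon are ignored.
    """
    words = text.lower().split()  # naive tokenization, an assumption
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return "unknown"  # no lexicon word appears in the review
    mean = sum(scores) / len(scores)
    if mean > 0:
        return "positive"
    if mean < 0:
        return "negative"
    return "neutral"


# Hypothetical toy lexicon for illustration only.
toy_lexicon = {"good": 1.0, "bad": -1.0, "warm": 0.5}
label = classify_review("good warm gloves", toy_lexicon)
```

Because the label depends only on the scores stored in the lexicon, a single mis-scored frequent word can shift the mean, which is the weakness the following paragraphs discuss.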

Different sentiment dictionaries have different formats and vocabulary sizes. That is, different sentiment dictionaries assign different sentiment scores or different sentiment categories to the same word, and the list of words each dictionary contains may also differ. As a result, the classification of the same text opinion can vary depending on which sentiment dictionary is used. In addition, a conventional sentiment dictionary may contain many words that have little actual effect on opinion classification or that harm its accuracy, which can degrade classification performance.

Moreover, because a conventional sentiment dictionary does not accurately reflect the characteristics specific to each domain, opinion classification performance can degrade for a particular domain. For example, in product reviews of gloves the word 'warm' should have a positive sentiment score, whereas in product reviews of beer the word 'warm' should have a negative sentiment score. Conventional sentiment dictionaries, however, assign sentiment scores to words uniformly, regardless of domain, which can lead to incorrect classification results in some domains.

An object of the present invention is to provide a sentiment dictionary building apparatus and a sentiment dictionary building method that can improve opinion classification performance by changing the sentiment score values of words in a sentiment dictionary that harm accurate opinion classification.

Another object of the present invention is to provide a sentiment dictionary building apparatus and a sentiment dictionary building method that can improve the accuracy and efficiency of opinion classification by removing from the sentiment dictionary words that have little effect on opinion classification.

Yet another object of the present invention is to provide a sentiment dictionary building apparatus and a sentiment dictionary building method that yield more accurate opinion analysis results by building an integrated sentiment dictionary that merges different sentiment dictionaries.

Yet another object of the present invention is to provide a sentiment dictionary building apparatus and a sentiment dictionary building method that maximize opinion classification performance for each domain.

The problems to be solved by the present invention are not limited to those mentioned above. Other technical problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention belongs.

A sentiment dictionary building apparatus according to one aspect of the present invention for solving the above problems includes: a word frequency calculation unit that calculates a frequency value with which a given word appears in each of a positive text group, which is a collection of positive texts classified as positive opinions, and a negative text group, which is a collection of negative texts classified as negative opinions; a comparison determination unit that determines whether the sentiment score set for the word in a sentiment dictionary is consistent with the ratio between the frequency values for the positive text group and the negative text group; and a score conversion unit that selectively changes the sentiment score set for the word in the sentiment dictionary according to the determination result of the comparison determination unit.

In one embodiment of the present invention, the score conversion unit may change the sentiment score when the sentiment score is determined to be inconsistent with the ratio between the frequency values.

In one embodiment of the present invention, the comparison determination unit may calculate a text count ratio between the positive texts and the negative texts, calculate a difference value between the ratio between the frequency values and the text count ratio, and compare the polarity of the sentiment score with the polarity of the difference value to determine whether to change the sentiment score.

In one embodiment of the present invention, the score conversion unit may switch the polarity of the sentiment score when the polarity of the sentiment score and the polarity of the difference value do not match.

In one embodiment of the present invention, the sentiment dictionary building apparatus may further include a word removal unit that removes the word from the sentiment dictionary when the difference value between the ratio between the frequency values and the text count ratio falls within a predetermined threshold range that includes 0.

In one embodiment of the present invention, the sentiment dictionary building apparatus may further include a sentiment dictionary merging unit that builds the sentiment dictionary by merging different sentiment dictionary resources.

In one embodiment of the present invention, the sentiment dictionary merging unit may standardize the different sentiment scores and different sentiment categories that the sentiment dictionary resources assign to their words, and may calculate, as the sentiment score for each word, the average of that word's standardized sentiment scores across the different sentiment dictionary resources.

In one embodiment of the present invention, the sentiment score for the word in the sentiment dictionary is set to a different value for each domain, and the score conversion unit may selectively change, among the sentiment scores for the word in the sentiment dictionary, only the sentiment score corresponding to the domain of the positive text group and the negative text group.

According to another aspect of the present invention for solving the above problems, there is provided a sentiment dictionary building method including: calculating a frequency value with which a given word appears in each of a positive text group, which is a collection of positive texts classified as positive opinions, and a negative text group, which is a collection of negative texts classified as negative opinions; determining whether the sentiment score set for the word in a sentiment dictionary is consistent with the ratio between the frequency values for the positive text group and the negative text group; and changing the sentiment score if the sentiment score is inconsistent with the ratio between the frequency values.

In one embodiment of the present invention, the determining step may include: calculating a text count ratio between the positive texts and the negative texts; calculating a difference value between the ratio between the frequency values and the text count ratio; and comparing the polarity of the sentiment score with the polarity of the difference value to determine whether to switch the polarity of the sentiment score.

In one embodiment of the present invention, the sentiment dictionary building method may further include removing the word from the sentiment dictionary when the difference value between the ratio between the frequency values and the text count ratio falls within a predetermined threshold range that includes 0.

In one embodiment of the present invention, the sentiment dictionary building method further includes building the sentiment dictionary by merging different sentiment dictionary resources, and building the sentiment dictionary may include: standardizing the different sentiment scores and different sentiment categories that the sentiment dictionary resources assign to their words; and calculating, as the sentiment score for each word, the average of that word's standardized sentiment scores across the different sentiment dictionary resources.

In one embodiment of the present invention, the sentiment score for the word in the sentiment dictionary is set to a different value for each domain, and the step of changing the score may selectively change, among the sentiment scores for the word in the sentiment dictionary, only the sentiment score corresponding to the domain of the positive text group and the negative text group.

According to an embodiment of the present invention, opinion classification performance can be improved by changing the sentiment scores of words in the sentiment dictionary that harm accurate opinion classification.

In addition, according to an embodiment of the present invention, the accuracy and efficiency of opinion classification can be improved by removing from the sentiment dictionary words that have little effect on opinion classification.

In addition, according to an embodiment of the present invention, more accurate opinion analysis results can be obtained by building an integrated sentiment dictionary that merges different sentiment dictionaries.

In addition, according to an embodiment of the present invention, opinion classification performance can be maximized for each domain.

The effects of the present invention are not limited to those described above. Effects not mentioned will be clearly understood from this specification and the accompanying drawings by those of ordinary skill in the art to which the present invention belongs.

FIG. 1 is a configuration diagram showing a sentiment dictionary building system that includes a sentiment dictionary building apparatus according to an embodiment of the present invention.
FIG. 2 is a configuration diagram showing a sentiment dictionary building apparatus according to an embodiment of the present invention.
FIG. 3 is a configuration diagram showing the sentiment dictionary building unit shown in FIG. 2.
FIG. 4 is a configuration diagram showing the sentiment dictionary updating unit shown in FIG. 3.
FIGS. 5 and 6 are diagrams for explaining a sentiment dictionary building apparatus according to an embodiment of the present invention.
FIGS. 7 to 12 are diagrams for explaining the effects of a sentiment dictionary building apparatus according to an embodiment of the present invention.

Other advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not limited to the embodiments presented below and may be implemented in various forms not presented here; the scope of the present invention is defined only by the claims. Unless otherwise defined, all terms used herein (including technical or scientific terms) have the meaning generally accepted by those of ordinary skill in the art to which this invention belongs. General descriptions of well-known configurations may be omitted so as not to obscure the gist of the present invention. In the drawings, the same reference numerals are used wherever possible for identical or corresponding components.

A sentiment dictionary building apparatus according to an embodiment of the present invention includes: a word frequency calculation unit that calculates a frequency value with which a given word appears in each of a positive text group, which is a collection of positive texts, and a negative text group, which is a collection of negative texts; a comparison determination unit that calculates the ratio between the frequency values for the positive text group and the negative text group, calculates a text count ratio between the positive texts and the negative texts, calculates a difference value between the ratio between the frequency values and the text count ratio, and compares the polarity of the word's sentiment score set in a sentiment dictionary with the polarity of the difference value to determine whether to change the sentiment score; and a score conversion unit that switches the polarity of the sentiment score when the polarity of the sentiment score and the polarity of the difference value do not match. According to an embodiment of the present invention, opinion classification performance can be improved by switching the sentiment scores of words that harm accurate opinion classification.

A sentiment dictionary building apparatus according to an embodiment of the present invention may further include a word removal unit that removes the word from the sentiment dictionary when the difference value between the ratio between the frequency values and the text count ratio falls within a predetermined threshold range that includes 0. According to an embodiment of the present invention, the accuracy and efficiency of opinion classification can be improved by removing from the sentiment dictionary words that have little effect on opinion classification.

A sentiment dictionary building apparatus according to an embodiment of the present invention may further include a sentiment dictionary merging unit that builds a sentiment dictionary by merging different sentiment dictionary resources. The sentiment dictionary merging unit may standardize the different sentiment scores and different sentiment categories that the sentiment dictionary resources assign to their words, and may calculate, as the sentiment score for each word, the average of that word's standardized sentiment scores across the different sentiment dictionary resources. According to an embodiment of the present invention, more accurate opinion analysis results can be obtained by building an integrated sentiment dictionary that merges different sentiment dictionaries.

In one embodiment of the present invention, the sentiment score for a word in the sentiment dictionary is set to have a different value (or a different polarity) for each domain, and the score conversion unit may selectively change, among the sentiment scores for the word in the sentiment dictionary, only the sentiment score corresponding to the domain of the positive text group and the negative text group. According to an embodiment of the present invention, opinion classification performance can be maximized for each domain.
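A domain-specific update of this kind might look like the sketch below. The nested-dictionary layout (word, then domain, then score), the function name `update_domain_score`, and the sign convention for the difference value (positive when the word is over-represented in positive texts) are assumptions for illustration; the patent text does not fix a data structure.

```python
def update_domain_score(lexicon, word, domain, diff):
    """Flip the word's score for one domain only, leaving other domains untouched.

    lexicon: {word: {domain: score}}  (assumed layout)
    diff:    difference value computed from training texts of `domain`
             (assumed: > 0 means the word leans positive in that domain).
    """
    score = lexicon[word][domain]
    if score * diff < 0:  # polarities disagree
        lexicon[word][domain] = -score
    return lexicon


# Hypothetical example: 'warm' starts out positive in every domain.
lexicon = {"warm": {"gloves": 0.5, "beer": 0.5}}
# Suppose beer reviews yield a negative difference value for 'warm'.
update_domain_score(lexicon, "warm", "beer", diff=-0.3)
```

After the call, only the `beer` entry for `warm` is negative; the `gloves` entry keeps its original score, which matches the glove and beer example given earlier in the description.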

FIG. 1 is a configuration diagram showing a sentiment dictionary building system that includes a sentiment dictionary building apparatus according to an embodiment of the present invention. Referring to FIG. 1, the sentiment dictionary building apparatus (100) may, by way of example, receive different sentiment dictionary resources over a network from a plurality of sentiment dictionary providing servers (10, 20) and merge them to build a sentiment dictionary, that is, a merged or integrated sentiment dictionary. Merging different sentiment dictionaries yields an integrated sentiment dictionary with a broader word spectrum than any individual sentiment dictionary resource, and using the integrated sentiment dictionary can give more accurate opinion classification results than using an individual resource.

The sentiment dictionary building apparatus (100) may, by way of example, receive training data over the network from a training data providing server (30) and update the integrated sentiment dictionary using the training data. The training data providing server (30) may be, for example, a server such as Amazon. The network may be, for example, the Internet. The training data may include a positive text group, which is a collection of positive texts classified as positive opinions, and a negative text group, which is a collection of negative texts classified as negative opinions. The training data may be, for example, product reviews provided by an Amazon server. By way of example, the training data may be divided into positive texts and negative texts by a scheme such as star ratings. The sentiment dictionary building apparatus (100) analyzes how frequently words appear in the positive texts and the negative texts, and may switch the sentiment scores of words in the sentiment dictionary that harm accurate opinion classification, or remove from the sentiment dictionary words that have little effect on opinion classification. This can improve the accuracy and efficiency of opinion classification.
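As an illustration of splitting training data by star rating, the sketch below assumes reviews arrive as `(stars, text)` pairs on a 1 to 5 scale, that 4 or more stars counts as positive and 2 or fewer as negative, and that 3-star reviews are discarded. None of these cut-offs come from the patent text; they are placeholders.

```python
def split_by_stars(reviews, pos_min=4, neg_max=2):
    """Split (stars, text) pairs into positive and negative text groups.

    Reviews whose rating falls strictly between neg_max and pos_min are dropped
    (an assumed policy for ambiguous ratings).
    """
    positive = [text for stars, text in reviews if stars >= pos_min]
    negative = [text for stars, text in reviews if stars <= neg_max]
    return positive, negative


# Hypothetical reviews for illustration only.
reviews = [
    (5, "warm and comfortable gloves"),
    (1, "fell apart after a week"),
    (3, "okay for the price"),
]
positive_texts, negative_texts = split_by_stars(reviews)
```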

FIG. 2 is a configuration diagram showing a sentiment dictionary building apparatus according to an embodiment of the present invention. Referring to FIGS. 1 and 2, the sentiment dictionary building apparatus (100) includes a user interface unit (110), an input unit (120), a sentiment dictionary building unit (130), a memory (140), a display unit (150), and a control unit (160). The user interface unit (110) may be an input interface such as a keyboard, a mouse, or a touch pad. By way of example, by operating the user interface unit (110) a user may receive sentiment dictionary resources from the sentiment dictionary providing servers (10, 20), receive training data from the training data providing server (30), execute the functions or operations provided by the sentiment dictionary building unit (130), or store the sentiment dictionary resources or training data in the memory (140). The input unit (120) operates as a communication interface and receives the sentiment dictionary resources or training data over the network.

The sentiment dictionary building unit (130) performs functions such as merging sentiment dictionary resources to build an integrated sentiment dictionary, or updating the integrated sentiment dictionary using training data, according to user input. The functions and operations of the sentiment dictionary building unit (130) may be performed by one or more processors. The sentiment dictionary building unit (130) may also be provided as an integral part of the control unit (160). The specific functions and operations of the sentiment dictionary building unit (130) are described in more detail below with reference to FIG. 3. According to user input, the memory (140) may store the sentiment dictionary resources, the training data, the integrated sentiment dictionary, or the algorithms needed to carry out the functions and operations of the sentiment dictionary building unit (130). The display unit (150) may display various information on screen, such as the sentiment dictionary resources, the training data, or the integrated sentiment dictionary. The display unit (150) may be provided as a screen such as an LCD (Liquid Crystal Display). The functions and operations of the components that make up the sentiment dictionary building apparatus (100) may be controlled by the control unit (160), which includes one or more processors.

FIG. 3 is a configuration diagram showing the sentiment dictionary building unit shown in FIG. 2. Referring to FIGS. 1 to 3, the sentiment dictionary building unit (130) includes a sentiment dictionary merging unit (131) and a sentiment dictionary updating unit (132). The sentiment dictionary merging unit (131) builds an integrated sentiment dictionary by merging different sentiment dictionary resources received, by way of example, from the sentiment dictionary providing servers (10, 20). The sentiment dictionary merging unit (131) may standardize the different sentiment scores and different sentiment categories that the sentiment dictionary resources assign to their words, and may calculate, as the sentiment score for each word, the average of that word's standardized sentiment score values across the different sentiment dictionary resources. The sentiment dictionary updating unit (132) may update the integrated sentiment dictionary using the training data received from the training data providing server (30), or may use the training data to update an individual sentiment dictionary resource.
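The standardize-then-average merge can be sketched as below. The patent text quoted here does not say how scores and categories are standardized, so the mapping of category labels to ±1 and the division of numeric scores by a per-resource `scale` are assumptions, as are the function names and the toy resources.

```python
from collections import defaultdict

# Assumed mapping from category labels to a common [-1, 1] scale.
CATEGORY_TO_SCORE = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}


def standardize(resource, scale=1.0):
    """Map one resource's entries onto a common [-1, 1] scale.

    String values are treated as category labels; numeric values are divided
    by `scale`, the largest absolute score the resource can assign (assumed).
    """
    standardized = {}
    for word, value in resource.items():
        if isinstance(value, str):
            standardized[word] = CATEGORY_TO_SCORE[value]
        else:
            standardized[word] = value / scale
    return standardized


def merge_lexicons(standardized_resources):
    """Average each word's standardized scores over the resources that list it."""
    collected = defaultdict(list)
    for resource in standardized_resources:
        for word, score in resource.items():
            collected[word].append(score)
    return {word: sum(scores) / len(scores) for word, scores in collected.items()}


# Hypothetical resources: one scored on a -5..5 scale, one using category labels.
scored = standardize({"good": 3, "bad": -5}, scale=5)
labeled = standardize({"good": "positive", "warm": "positive"})
merged = merge_lexicons([scored, labeled])
```

The merged dictionary covers the union of the words in both resources; a word listed in only one resource simply keeps that resource's standardized score.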

FIG. 4 is a configuration diagram showing the sentiment dictionary updating unit shown in FIG. 3. Referring to FIGS. 3 and 4, the sentiment dictionary updating unit (132) includes a word frequency calculation unit (1321), a comparison determination unit (1322), a word removal unit (1323), and a score conversion unit (1324). The word frequency calculation unit (1321) calculates the frequency value with which a given word appears in each of the positive text group, which is a collection of the positive texts in the training data, and the negative text group, which is a collection of the negative texts.
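A per-group frequency count can be sketched with the standard library as follows. Whether the patent counts every occurrence or only the number of texts containing the word is not stated in this passage; the sketch counts every occurrence and uses naive whitespace tokenization, both of which are assumptions.

```python
from collections import Counter


def group_frequencies(texts):
    """Count how often each word occurs across all texts in one group."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())  # naive tokenization, an assumption
    return counts


# Hypothetical text groups for illustration only.
positive_group = ["warm gloves", "very warm and soft"]
negative_group = ["warm beer", "flat beer"]

pos_freq = group_frequencies(positive_group)
neg_freq = group_frequencies(negative_group)
```

Here `pos_freq["warm"]` and `neg_freq["warm"]` give the two frequency values for the word `warm`; a `Counter` returns 0 for words that never appear.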

The comparison determination unit 1322 determines whether the sentiment score set for a word in the sentiment dictionary is consistent with the ratio between the word's frequency values in the positive text group and the negative text group. The word removal unit 1323 removes the word from the sentiment dictionary when the difference between the ratio of the frequency values and the text count ratio falls within a predetermined threshold range that includes 0. The score conversion unit 1324 selectively changes the sentiment score set for the word in the sentiment dictionary according to the determination result of the comparison determination unit 1322. That is, the score conversion unit 1324 may change the sentiment score when the sentiment score is determined to be inconsistent with the ratio of the frequency values.

In one embodiment of the present invention, the comparison determination unit 1322 may calculate the text count ratio between the positive texts and the negative texts, calculate the difference between the ratio of the frequency values and the text count ratio, and compare the polarity of the sentiment score with the polarity of the difference to determine whether the sentiment score should be changed. When the polarity of the sentiment score and the polarity of the difference do not match, the score conversion unit 1324 may switch the polarity of the sentiment score for the word from (-) to (+), or from (+) to (-).

In one embodiment of the present invention, the sentiment score for a word in the sentiment dictionary may be set to a different value or a different polarity for each domain. Examples of domains include stocks, elections, political orientation, movie reviews, book reviews, and product reviews for various goods, and a separate domain may be established for each product. The word removal unit 1323 may selectively remove, from among the sentiment scores for a word in the sentiment dictionary, only the sentiment score corresponding to the domain of the positive text group and the negative text group. In this case, the sentiment score of the word in other domains may be retained as is. Likewise, the score conversion unit 1324 may selectively change only the sentiment score corresponding to the domain of the positive text group and the negative text group, and the sentiment score of the word for other domains may remain unchanged. According to this embodiment, opinion classification performance can be maximized for each domain by using sentiment scores optimized per domain.

FIG. 5 is a diagram for explaining the word removing function of the sentiment dictionary updating unit of the sentiment dictionary building apparatus according to an embodiment of the present invention. Referring to FIGS. 3 to 5, the sentiment dictionary updating unit 132 removes a dictionary entry (word) for which the ratio between its occurrence frequencies in the positive text group and the negative text group is similar to the ratio between the numbers of positive texts and negative texts. As illustrated in FIG. 5, the positive text group (Positive Reviews) contains four positive texts, and the negative text group (Negative Reviews) contains one negative text. Every text contains the word 'interested'. The word 'interested' appears four times in the positive text group and once in the negative text group. The ratio between the frequencies of 'interested' in the positive and negative text groups is 4/1, and the text count ratio between the positive texts and the negative texts is also 4/1. That is, the two ratios are equal, and their difference is 0.

In the example of FIG. 5, the word 'interested' is contained in every text and therefore contributes little to actual opinion classification. Accordingly, so that other words that are more decisive for correct opinion classification can be emphasized, the word 'interested' is removed from the sentiment dictionary by the word removal unit 1323. Although the embodiment of FIG. 5 illustrates removal when the difference between the word's frequency ratio and the text count ratio is exactly 0, the word removal unit 1323 may also remove a word from the sentiment dictionary when that difference is not exactly 0 but falls within a preset threshold range that includes 0.
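By way of illustration only, the removal test described for FIG. 5 may be sketched as follows. The function name `should_remove` and the parameter `tolerance` are hypothetical, and treating the threshold range as symmetric about 0 is an assumption; the description states only that the range includes 0.

```python
def should_remove(pos_word_count, neg_word_count, n_pos_texts, n_neg_texts,
                  tolerance=0.0):
    """Return True when the word's frequency ratio is within `tolerance`
    of the positive/negative text count ratio (hypothetical helper)."""
    ratio_word = pos_word_count / neg_word_count    # e.g. 4/1 for 'interested'
    ratio_review = n_pos_texts / n_neg_texts        # e.g. 4/1 in FIG. 5
    # Assumption: the threshold range is the symmetric interval
    # [-tolerance, +tolerance] around 0.
    return abs(ratio_word - ratio_review) <= tolerance


# FIG. 5: 'interested' occurs 4 times in 4 positive texts, once in 1 negative text.
print(should_remove(4, 1, 4, 1))
```

With `tolerance=0.0` only an exact match between the two ratios triggers removal; a larger tolerance corresponds to the wider threshold range mentioned in the description.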

FIG. 6 is a diagram for explaining the sentiment score switching function of the sentiment dictionary updating unit of the sentiment dictionary building apparatus according to an embodiment of the present invention. Referring to FIGS. 3, 4, and 6, the sentiment dictionary updating unit 132 reverses the polarity of a word's sentiment score when the difference between the word's occurrence ratio across the positive and negative text groups and the text count ratio between the positive and negative texts has a polarity opposite to that of the sentiment score. As illustrated in FIG. 6, the positive text group (Positive Reviews) contains four positive texts, and the negative text group (Negative Reviews) contains two negative texts. The word 'betrayal' appears four times in the positive text group and once in the negative text group. The ratio between the frequencies of 'betrayal' in the positive and negative text groups is 4/1, and the text count ratio between the positive texts and the negative texts is 4/2. The occurrence ratio of 'betrayal' is therefore greater than the text count ratio, and the difference is a positive (+) value (4 - 2 = 2). In contrast, 'betrayal' is given a negative (-) value in the sentiment dictionary. That is, when the polarity (+) of the difference between the word occurrence ratio and the text count ratio is opposite to the polarity (-) of the sentiment score set for the word in the sentiment dictionary, the polarity of the word's sentiment score is switched. In the example of FIG. 6, the polarity of the word 'betrayal' is switched from a (-) value to a (+) value.
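The polarity comparison of FIG. 6 may be sketched as follows, purely as an illustration. The function name and the example score of -0.5 for 'betrayal' are hypothetical; the description states only that the dictionary value is negative.

```python
def switch_polarity_if_needed(score, pos_word_count, neg_word_count,
                              n_pos_texts, n_neg_texts):
    """Flip the sign of `score` when it disagrees with the sign of
    (word frequency ratio - text count ratio). Hypothetical helper."""
    difference = pos_word_count / neg_word_count - n_pos_texts / n_neg_texts
    # Opposite signs give a negative product; a zero difference or a
    # zero score leaves the entry unchanged.
    if score * difference < 0:
        return -score
    return score


# FIG. 6: 'betrayal' occurs 4 times in 4 positive texts and once in
# 2 negative texts, so the difference is 4/1 - 4/2 = 2 (positive).
print(switch_polarity_if_needed(-0.5, 4, 1, 4, 2))
```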

An experiment was conducted to evaluate the opinion classification performance of the sentiment dictionary building apparatus according to an embodiment of the present invention. Ten well-known sentiment dictionary resources publicly available on the Internet were collected. The ten resources, namely 'AFINN' (or 'AFN'), 'ANEW' (or 'ANW'), 'General Inquirer' (or 'GI'), 'Micro-WNOp' (or 'MWN' or 'MWO'), 'Opinion Lexicon' (or 'OPL' or 'OL'), 'SenticNet' (or 'STN'), 'SentiSense' (or 'STS' or 'SS'), 'SentiWordNet' (or 'SWN'), 'Subjectivity Lexicon' (or 'SBL' or 'SL'), and 'WordNet-Affect' (or 'WNA'), were collected from the following URL (Uniform Resource Locator) addresses, respectively: 'http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010', 'http://www.uvm.edu/~pdodds/files/papers/others/1999/bradley1999a.pdf', 'http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm', 'http://www-3.unipv.it/wnop/', 'http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon', 'http://sentic.net/senticnet-2.0.zip', 'http://nlp.uned.es/~jcalbornoz/resources.html', 'http://sentiwordnet.isti.cnr.it/', 'http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/', and 'http://wndomains.fbk.eu/wnaffect.html'.

In addition to the ten sentiment dictionaries, a sentiment dictionary resource 'WWW-Scratch' was built by scraping lists of positive and negative words from six web addresses: 'http://www.psychpage.com/learning/library/assess/feelings.html', 'http://www.destinies.com/en/?sv=&category=Goal+Success&title=Understanding+Emotional+Testing', 'http://www.enchantedlearning.com/wordlist/negativewords.shtml', 'http://eqi.org/fw neg.htm(http://eqi.org/cnfs.htm)', 'http://www.thehappinessblueprint.com/2013/01/emotional-essence-words.html', and 'http://www.slideshare.net/stevenkellyactor/positive-emotions-list'.

Some of the sentiment dictionary resources listed above define sentiment categories such as 'positive', 'negative', and 'emotions' instead of numerical sentiment scores. Even among the resources that do define sentiment scores, different resources assign different scores to the same word. The resources were therefore standardized so that they could be merged into an integrated sentiment dictionary. To this end, the formats of the resources were standardized so that, in each resource, the sentiment scores of single words, that is, words containing no hyphen, underscore, or blank space, take values between -1.0 and 1.0. For example, a sentiment score may be normalized from an interval [A, B] to an interval [C, D] using a conversion relation such as Equation 1 below. In this specification, '[a, b]' denotes the interval from a to b inclusive (b ≥ a).

[Equation 1]

In Equation 1, X denotes the sentiment score of a word as defined in an individual sentiment dictionary resource, and X' denotes the sentiment score of the word normalized for the integrated sentiment dictionary. In the sentiment dictionary resource 'AFN', the sentiment score of a word takes a value in [-5, 5]. For example, a sentiment score defined in 'AFN' may be converted to a sentiment score in [-1, 1] using the conversion relation "X' = 0.2X". The sentiment dictionary resource 'ANW' has sentiment scores in [1, 9]. For example, 'ANW' scores may be converted to sentiment scores in the range [-1, 1] using the conversion relation "X' = 0.25X - 1.25".
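Equation 1 itself is not reproduced in this text. As a sketch under the assumption that it is the usual linear interval mapping X' = (X - A)(D - C)/(B - A) + C, the two stated conversions can be checked numerically: that form reduces to X' = 0.2X for [-5, 5] and to X' = 0.25X - 1.25 for [1, 9]. The function name `rescale` is hypothetical.

```python
def rescale(x, a, b, c=-1.0, d=1.0):
    """Linearly map x from the interval [a, b] to [c, d].
    Assumed form of Equation 1; the original equation is not shown."""
    if b <= a:
        raise ValueError("interval [a, b] must satisfy b > a")
    return (x - a) * (d - c) / (b - a) + c


# 'AFN' scores lie in [-5, 5]; 'ANW' scores lie in [1, 9].
for x in (-5, 0, 3, 5):
    print("AFN", x, rescale(x, -5, 5), 0.2 * x)
for x in (1, 5, 9):
    print("ANW", x, rescale(x, 1, 9), 0.25 * x - 1.25)
```

Each printed row places the general mapping next to the closed-form relation given in the description, so any disagreement between the assumed form and the stated examples would be visible at once.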

In the sentiment dictionary resource 'GI', the sentiment categories 'Positive', 'Pstv', 'PosAff', 'Pleasur', 'Virtue', 'Complet', and 'Yes' are converted to a sentiment score of 1.0, and the sentiment categories 'Negativ', 'Ngtv', 'NegAff', 'Pain', 'Vice', 'Fail', 'No', and 'Negate' are converted to a sentiment score of -1.0. As a result, the 1,738 words in positive sentiment categories and the 2,113 words in negative sentiment categories are assigned sentiment scores of 1.0 and -1.0, respectively, and the 71 words that belong to both positive and negative sentiment categories are assigned a sentiment score of 0.0.
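The category-to-score conversion described for 'GI' may be sketched as follows. The category names are those listed above; the function name `gi_score` and the choice to return `None` for a word with no listed category are assumptions made for illustration.

```python
POSITIVE_CATEGORIES = {"Positive", "Pstv", "PosAff", "Pleasur",
                       "Virtue", "Complet", "Yes"}
NEGATIVE_CATEGORIES = {"Negativ", "Ngtv", "NegAff", "Pain",
                       "Vice", "Fail", "No", "Negate"}


def gi_score(categories):
    """Convert a word's set of 'GI' categories to a sentiment score."""
    cats = set(categories)
    has_pos = bool(cats & POSITIVE_CATEGORIES)
    has_neg = bool(cats & NEGATIVE_CATEGORIES)
    if has_pos and has_neg:
        return 0.0      # word carries both positive and negative categories
    if has_pos:
        return 1.0
    if has_neg:
        return -1.0
    return None         # assumption: no listed category, no score assigned


print(gi_score({"Pstv", "Virtue"}), gi_score({"Pain"}), gi_score({"Yes", "No"}))
```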

The sentiment dictionary resource 'OPL' consists of 2,006 positive words and 4,783 negative words; positive words are assigned a sentiment score of 1.0, and negative words are assigned a sentiment score of -1.0. Ambiguous words that are excluded from both the positive and negative lists are assigned a sentiment score of 0.0. In the sentiment dictionary resource 'SBL', words with the polarities 'positive', 'negative', and 'neutral' are converted to sentiment scores of 1.0, -1.0, and 0.0, respectively. The remaining sentiment dictionary resources were likewise normalized so that the sentiment scores of their words take values in [-1, 1].

One integrated sentiment dictionary was built by merging the ten standardized sentiment dictionary resources other than 'SWN'. When the word lists of the different resources were merged, the sentiment score of a word appearing in more than one resource was obtained by averaging its standardized sentiment scores across those resources. As a result, an integrated sentiment dictionary with 15,599 word entries was built. Of these, 12,776 words were related to books, 11,633 to movies, and 6,521 to smartphones. Because the integrated sentiment dictionary has more word entries than any individual resource, more words are matched against the text reviews, and opinion classification accuracy can improve accordingly.
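The merging step, in which a word found in several resources receives the average of its standardized scores, may be sketched as follows. The function name and the toy dictionaries are hypothetical, and each input is assumed to be already normalized to [-1, 1].

```python
from collections import defaultdict


def merge_dictionaries(dictionaries):
    """Merge word -> score mappings, averaging scores of shared words."""
    collected = defaultdict(list)
    for dictionary in dictionaries:
        for word, score in dictionary.items():
            collected[word].append(score)
    return {word: sum(scores) / len(scores)
            for word, scores in collected.items()}


# Toy inputs standing in for two standardized resources.
first = {"good": 1.0, "bad": -1.0}
second = {"good": 0.5, "dull": -0.4}
print(merge_dictionaries([first, second]))
```

The merged dictionary contains the union of the word lists, which is why the integrated dictionary has more entries than any single resource.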

To evaluate the performance of the 11 individual sentiment dictionaries and the integrated sentiment dictionary, experiments were conducted on three kinds of product reviews. The training data were collected from Amazon product reviews of books, movies, and smartphones, yielding data sets of 90,000, 35,000, and 15,000 text reviews for the respective domains. From the text reviews of books, movies, and smartphones, 10,000, 5,000, and 5,000 reviews, respectively, were randomly selected as setting data for setting the threshold and the slider value. Each product review was tokenized and lemmatized.

For each of the individual sentiment dictionary resources and the dictionary obtained by merging them, the training data were divided into positive and negative reviews, and the positive/negative text count ratio (Ratio_Review) was calculated, in order to find words that do not affect classification and words that adversely affect classification. Each Amazon product review carries a star rating given by the reviewer. In Amazon product reviews, the most recommended products receive five stars and the least recommended products receive one star. Text reviews with five or four stars were classified as positive reviews, and text reviews with one or two stars were classified as negative reviews. Reviews with three stars were excluded from the training data because they are difficult to classify as either positive or negative.
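The labeling rule and the resulting Ratio_Review may be sketched as follows. The function names and the sample ratings are hypothetical; only the star thresholds come from the description.

```python
def label_from_stars(stars):
    """4 or 5 stars -> 'positive', 1 or 2 stars -> 'negative',
    3 stars -> None (excluded from the training data)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None


def text_count_ratio(star_ratings):
    """Ratio_Review: number of positive texts / number of negative texts."""
    labels = [label_from_stars(s) for s in star_ratings]
    return labels.count("positive") / labels.count("negative")


# Hypothetical ratings: four positive, two negative, one excluded.
print(text_count_ratio([5, 4, 3, 1, 5, 4, 2]))
```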

To obtain the word occurrence frequency ratio (Ratio_Word) for the positive and negative text groups, the occurrences of each dictionary word in the positive text group and in the negative text group are counted. Next, the difference (Difference_Ratio) is calculated by subtracting the positive/negative text count ratio (Ratio_Review) from the word occurrence frequency ratio (Ratio_Word), and the difference proportion (Proportion_RatioDiff) is then calculated as the ratio of the difference (Difference_Ratio) to the text count ratio (Ratio_Review). A slider value is then set to determine which word entries are removed and which sentiment polarities are switched. The slider value may be chosen between 0.0 and 0.5 in steps of 0.1. If the difference proportion (Proportion_RatioDiff) is greater than or equal to 0 and less than or equal to the slider value, the word may be removed from the sentiment dictionary. If the difference proportion (Proportion_RatioDiff) is greater than the slider value and, at the same time, the sentiment score defined in the sentiment dictionary and the difference (Difference_Ratio) have different polarities, the polarity of the word's sentiment score is switched. A modified sentiment dictionary can be built in this way.
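The remove-or-switch rule may be sketched as follows, under stated assumptions. The sketch uses the magnitude of Proportion_RatioDiff so that words over-represented in negative texts are handled too; the description gives conditions only for values greater than or equal to 0, so this reading is an assumption. Leaving a word unchanged when it never occurs in the negative text group is also an assumption, and the function name `update_entry` is hypothetical.

```python
def update_entry(score, pos_count, neg_count, ratio_review, slider):
    """Return None to remove the word, otherwise its (possibly
    sign-switched) sentiment score."""
    if neg_count == 0:
        return score    # assumption: Ratio_Word undefined, entry untouched
    ratio_word = pos_count / neg_count
    difference_ratio = ratio_word - ratio_review
    # Assumption: the magnitude of the proportion is compared with the slider.
    proportion = abs(difference_ratio) / ratio_review
    if proportion <= slider:
        return None     # word barely deviates from the text count ratio
    if score * difference_ratio < 0:
        return -score   # dictionary polarity contradicts the data
    return score


# FIG. 5 ('interested'): ratios 4/1 and 4/1 -> removed.
print(update_entry(0.5, 4, 1, 4.0, slider=0.1))
# FIG. 6 ('betrayal', hypothetical score -0.5): ratios 4/1 and 4/2 -> switched.
print(update_entry(-0.5, 4, 1, 2.0, slider=0.1))
```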

Classification accuracy was measured for each of the 11 individual sentiment dictionary resources and the integrated sentiment dictionary. First, the sentiment score 'D_j(w_i)' of each word (w_i, i = 1, ..., n) is looked up in each sentiment dictionary (D_j, j = 1, ..., 12). Then, as shown for example in Equation 2 below, the Review Sentiment Score (RSS) of a text review is calculated by averaging the sentiment scores of all words in the review that match the sentiment dictionary.

[Equation 2]

As shown in Equation 3 below, if the RSS value is greater than or equal to the threshold, the text review is determined to be positive; otherwise, it is determined to be negative.

[Equation 3]
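Equations 2 and 3 are not reproduced in this text. The following is a sketch of the behavior described for them: RSS as the mean score of the review's dictionary-matched words, followed by a threshold comparison. The function names, the toy dictionary, and the choice of an RSS of 0.0 for a review with no matched word are assumptions.

```python
def review_sentiment_score(tokens, dictionary):
    """RSS: average sentiment score of the tokens found in the dictionary."""
    scores = [dictionary[token] for token in tokens if token in dictionary]
    if not scores:
        return 0.0      # assumption: no matched word gives a neutral score
    return sum(scores) / len(scores)


def classify(tokens, dictionary, threshold):
    """Positive when RSS >= threshold, otherwise negative."""
    rss = review_sentiment_score(tokens, dictionary)
    return "positive" if rss >= threshold else "negative"


toy_dictionary = {"great": 1.0, "boring": -0.5}   # hypothetical entries
tokens = ["a", "great", "but", "boring", "plot"]
print(review_sentiment_score(tokens, toy_dictionary))
print(classify(tokens, toy_dictionary, threshold=0.13))
```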

For example, the book review data set is an imbalanced data set (approximately a 9:1 ratio) in which the number of positive reviews as classified by star rating (9,007) is far greater than the number of negative reviews (993). Therefore, to evaluate overall performance accurately, the balanced accuracy (AccBAL) shown in Equation 4 below was used.

[Equation 4]

In Equation 4, Recall_POS and Recall_NEG denote the accuracy (recall) for positive reviews and negative reviews, respectively. 'true positives' is the number of text reviews with four or more stars among those classified as positive using the sentiment dictionary; 'true negatives' is the number of text reviews with two or fewer stars among those classified as negative using the sentiment dictionary; 'false negatives' is the number of text reviews with four or more stars among those classified as negative using the sentiment dictionary; and 'false positives' is the number of text reviews with two or fewer stars among those classified as positive using the sentiment dictionary.
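Equation 4 is not reproduced in this text. Assuming it follows the common definition of balanced accuracy, namely the arithmetic mean of Recall_POS = TP / (TP + FN) and Recall_NEG = TN / (TN + FP), the measure may be sketched as follows. The function name is hypothetical.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of positive-class recall and negative-class recall
    (assumed form of AccBAL)."""
    recall_pos = tp / (tp + fn)   # share of 4- and 5-star reviews found
    recall_neg = tn / (tn + fp)   # share of 1- and 2-star reviews found
    return (recall_pos + recall_neg) / 2


# A classifier that labels every book review positive on the 9,007 / 993
# split: plain accuracy would be 0.9007, balanced accuracy is only 0.5.
print(balanced_accuracy(tp=9007, fn=0, tn=0, fp=993))
```

This is why a balanced measure suits the roughly 9:1 data set: a trivial all-positive classifier is not rewarded for the class imbalance.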

Meanwhile, for each of the 11 individual sentiment dictionaries and the integrated sentiment dictionary obtained by merging them, a modified sentiment dictionary was built by removing words that do not affect opinion classification and switching the sentiment scores of words that adversely affect opinion classification. In the experiment for each of the 12 sentiment dictionaries, the threshold, which is the reference value for separating positive and negative reviews, and the slider value, which is the reference value for deciding whether a sentiment score is switched when the modified dictionary is built, were set using some or all of the training data. The threshold and the slider value were set individually for each sentiment dictionary: experiments were repeated over various combinations of threshold and slider value, and the combination giving the highest opinion classification accuracy was selected.
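The parameter selection described above amounts to an exhaustive search over threshold and slider combinations. A minimal sketch follows; `evaluate` is a hypothetical callback that would build the modified dictionary for a given slider value and return its balanced accuracy at a given threshold on the setting data. The stand-in callback below is synthetic and is not experimental data.

```python
from itertools import product


def grid_search(thresholds, sliders, evaluate):
    """Return the (threshold, slider) pair with the highest accuracy."""
    return max(product(thresholds, sliders),
               key=lambda pair: evaluate(pair[0], pair[1]))


def toy_evaluate(threshold, slider):
    # Synthetic stand-in for "accuracy on the setting data"; it peaks
    # at threshold 0.13 and slider 0.1 by construction.
    return 1.0 - (threshold - 0.13) ** 2 - (slider - 0.1) ** 2


sliders = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # 0.0 to 0.5 in steps of 0.1
thresholds = [0.10, 0.13, 0.16, 0.19]      # hypothetical candidate values
print(grid_search(thresholds, sliders, toy_evaluate))
```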

Table 1 below shows the measured accuracy of classifying positive and negative opinions using each sentiment dictionary on book reviews collected from Amazon.

In Table 1, 'R & S' denotes the balanced accuracy of the modified sentiment dictionary obtained, for each sentiment dictionary, by removing (Remove) words that do not affect opinion classification and switching (Switch) the sentiment scores of words that adversely affect opinion classification, and 'Original' denotes the accuracy of the sentiment dictionary before modification. 'Integrated' denotes the sentiment dictionary obtained by merging the sentiment dictionaries. 'Training Data:10000/Test Data:80000' denotes the case in which 10000 book reviews (setting data) were used to build the modified sentiment dictionary and 80000 book reviews (test data) were used to measure the accuracy of each sentiment dictionary before and after modification. 'Training Data=Test Data:90000' denotes the case in which 90000 book reviews were used both to build the modified sentiment dictionary and to measure the accuracy of the sentiment dictionary before and after modification. 'R & S Dictionary's Threshold 10000/90000' denotes the threshold set using the 10000 setting data and the threshold set using the 90000 setting data, 'R & S Slider Value 10000/90000' denotes the slider value set using the 10000 setting data and the slider value set using the 90000 setting data, and 'Matched Word Size' denotes the number of words in each sentiment dictionary that match the training data.

Table 1 shows that, for every sentiment dictionary, the opinion classification accuracy of the modified sentiment dictionary is much higher than that of the sentiment dictionary before modification. The integrated sentiment dictionary (Integrated), obtained by merging the sentiment dictionary resources, has higher accuracy than the individual resources. In addition, the modified sentiment dictionary built by merging the individual resources and then removing words that do not affect opinion classification and switching the sentiment scores of words that adversely affect opinion classification shows a much higher opinion classification accuracy (83.8%) than the dictionary obtained by merging alone.

Table 2 below shows the measured accuracy of classifying positive and negative opinions using each sentiment dictionary on movie reviews collected from Amazon.

In Table 2, 'Training Data:5000/Test Data:30000' denotes the case in which 5000 movie reviews (setting data) were used to build the modified sentiment dictionary and 30000 movie reviews (test data) were used to measure the accuracy of each sentiment dictionary before and after modification. 'Training Data=Test Data:35000' denotes the case in which 35000 movie reviews were used both to build the modified sentiment dictionary and to measure the accuracy of the sentiment dictionary before and after modification. 'R & S Dictionary's Threshold 5000/35000' denotes the threshold set using the 5000 setting data and the threshold set using the 35000 setting data, and 'R & S Slider Value 5000/35000' denotes the slider value set using the 5000 setting data and the slider value set using the 35000 setting data.

FIG. 7 shows the accuracy of opinion classification for movies, measured using 5000 setting data and 30000 test data, as a function of the threshold and the slider value. FIG. 8 shows the accuracy of opinion classification for movies, measured using 35000 data, as a function of the threshold and the slider value. In FIGS. 7 and 8, 'dict_0.10', for example, is the curve showing the accuracy when the slider value is set to 0.10. As shown in FIGS. 7 and 8, the threshold and the slider value may be set to the values that maximize opinion classification accuracy. In the case of FIG. 7, the threshold is set to 0.13 and the slider value to 0.1; in the case of FIG. 8, the threshold is set to 0.19 and the slider value to 0.2.

As can be seen in Table 2, for every sentiment dictionary, the modified sentiment dictionary achieves higher opinion classification accuracy than the sentiment dictionary before modification. The sentiment dictionary obtained by merging the sentiment dictionary resources (Integrated) shows higher accuracy than any individual sentiment dictionary resource. Furthermore, the modified sentiment dictionary built by merging the individual sentiment dictionary resources, then removing the words that do not affect opinion classification and switching the sentiment scores of the words that adversely affect it, shows higher opinion classification accuracy (83.9%) than the sentiment dictionary obtained by merging alone.

Table 3 below shows the measured accuracy with which each sentiment dictionary classified positive and negative opinions in smartphone reviews collected from Amazon.

In Table 3, 'Training Data:5000/Test Data:10000' denotes the case in which 5,000 smartphone reviews (setting data) were used to build the modified sentiment dictionary and 10,000 smartphone reviews (test data) were used to measure the accuracy of each sentiment dictionary before and after modification. 'Training Data=Test Data:15000' denotes the case in which 15,000 smartphone reviews were used both to build the modified sentiment dictionary and to measure the classification accuracy of the sentiment dictionaries before and after modification. 'R & S Dictionary's Threshold 5000/15000' denotes the threshold set using the 5,000 pieces of setting data and the threshold set using the 15,000 pieces of setting data, and 'R & S Slider Value 5000/15000' denotes the slider value set using the 5,000 pieces of setting data and the slider value set using the 15,000 pieces of setting data.

FIG. 9 shows the classification accuracy for smartphone reviews, measured using 5,000 pieces of setting data and 15,000 pieces of test data, as a function of the threshold and the slider value. FIG. 10 shows the classification accuracy for smartphone reviews, measured using 15,000 pieces of data, as a function of the threshold and the slider value. In the case of FIG. 9, as shown in Table 3, the threshold was set to 0.165 and the slider value to 0.2. In the case of FIG. 10, as shown in Table 3, the threshold was set to 0.17 and the slider value to 0.3.

As can be seen in Table 3, for every sentiment dictionary, the opinion classification accuracy of the modified sentiment dictionary is much higher than that of the sentiment dictionary before modification. The sentiment dictionary obtained by merging the sentiment dictionary resources (Integrated) shows higher accuracy than any individual sentiment dictionary resource. Furthermore, classifying opinions with the modified sentiment dictionary, built by merging the individual sentiment dictionary resources, then removing the words that do not affect opinion classification and switching the sentiment scores of the words that adversely affect it, yields a much higher opinion classification accuracy (83.6%) than classifying opinions with the sentiment dictionary obtained by merging alone.

FIG. 11 is a graph comparing the classification accuracy of the individual sentiment dictionary resources with that of the sentiment dictionary obtained by merging them. As shown in FIG. 11, classifying text opinions with the merged sentiment dictionary (MRG) yields a higher maximum classification accuracy than using any individual sentiment dictionary resource. FIG. 12 is a graph comparing the classification accuracy of the modified sentiment dictionaries obtained, for each individual sentiment dictionary resource and for the merged sentiment dictionary, by removing the words that do not affect classification and switching the sentiment scores of the words that adversely affect it. A comparison of FIGS. 11 and 12 shows that using a modified sentiment dictionary improves the maximum classification accuracy over using the same dictionary before modification. Referring to FIG. 12, classifying text opinions with the sentiment dictionary that was merged and then modified (MRG) yields a higher maximum classification accuracy than the sentiment dictionaries obtained by modifying the individual sentiment dictionary resources.

The sentiment dictionary building method according to an embodiment of the present invention can, for example, be written as a program executable on a computer and implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium. The computer-readable recording medium may be a volatile memory such as SRAM (Static RAM), DRAM (Dynamic RAM), or SDRAM (Synchronous DRAM); a non-volatile memory such as ROM (Read Only Memory), PROM (Programmable ROM), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable and Programmable ROM), a flash memory device, PRAM (Phase-change RAM), MRAM (Magnetic RAM), RRAM (Resistive RAM), or FRAM (Ferroelectric RAM); a floppy disk; a hard disk; or an optically readable medium, for example a storage medium such as a CD-ROM or DVD, but is not limited thereto.

The above embodiments are presented to aid understanding of the present invention and do not limit its scope; it should be understood that various modified embodiments derived from them also fall within the scope of the present invention. The technical scope of protection of the present invention should be determined by the technical idea of the claims, and it should be understood that this scope is not limited to the literal wording of the claims but in substance extends to inventions of equivalent technical value.

10, 20: sentiment dictionary providing server 30: training data providing server
100: sentiment dictionary building apparatus 110: user interface unit
120: input unit 130: sentiment dictionary building unit
131: sentiment dictionary merging unit 132: sentiment dictionary updating unit
1321: word frequency calculation unit 1322: comparison determination unit
1323: word removal unit 1324: score conversion unit
140: memory 150: display unit
160: control unit

Claims

1. A sentiment dictionary building apparatus comprising:
a word frequency calculation unit that calculates a frequency value with which a given word appears in each of a positive text group, which is a set of positive texts classified as positive opinions, and a negative text group, which is a set of negative texts classified as negative opinions;
a comparison determination unit that determines whether a sentiment score set for the word in a sentiment dictionary matches the ratio between the frequency values for the positive text group and the negative text group; and
a score conversion unit that selectively changes the sentiment score set for the word in the sentiment dictionary according to the determination result of the comparison determination unit.

2. The apparatus according to claim 1,
wherein the score conversion unit changes the sentiment score when it is determined that the sentiment score does not match the ratio between the frequency values.

3. The apparatus according to claim 1,
wherein the comparison determination unit calculates a text-count ratio between the positive texts and the negative texts, calculates a difference value between the ratio between the frequency values and the text-count ratio, and compares the polarity of the sentiment score with the polarity of the difference value to determine whether to change the sentiment score.

4. The apparatus according to claim 3,
wherein the score conversion unit switches the polarity of the sentiment score when the polarity of the sentiment score and the polarity of the difference value do not match.

5. The apparatus according to claim 3,
further comprising a word removal unit that removes the word from the sentiment dictionary when the difference value between the ratio between the frequency values and the text-count ratio falls within a predetermined threshold range that includes zero.

6. The apparatus according to any one of claims 1 to 5,
further comprising a sentiment dictionary merging unit that constructs the sentiment dictionary by merging different sentiment dictionary data.

7. The apparatus according to claim 6,
wherein the sentiment dictionary merging unit standardizes the differing sentiment scores and sentiment categories for each word of the sentiment dictionary data, and calculates, for each word, the average of the sentiment scores of the different standardized sentiment dictionary data as the sentiment score of that word.

8. The apparatus according to any one of claims 1 to 5,
wherein the sentiment score for the word in the sentiment dictionary is set to a different value for each domain, and
wherein the score conversion unit selectively changes, among the sentiment scores for the word in the sentiment dictionary, only the sentiment score corresponding to the domain of the positive text group and the negative text group.

9. A sentiment dictionary building method comprising:
calculating a frequency value with which a given word appears in each of a positive text group, which is a set of positive texts classified as positive opinions, and a negative text group, which is a set of negative texts classified as negative opinions;
determining whether a sentiment score set for the word in a sentiment dictionary matches the ratio between the frequency values for the positive text group and the negative text group; and
changing the sentiment score if the sentiment score does not match the ratio between the frequency values.

10. The method of claim 9,
wherein the determining step comprises:
calculating a text-count ratio between the positive texts and the negative texts;
calculating a difference value between the ratio between the frequency values and the text-count ratio; and
comparing the polarity of the sentiment score with the polarity of the difference value to determine whether to change the polarity of the sentiment score.

11. The method of claim 10,
further comprising removing the word from the sentiment dictionary if the difference value between the ratio between the frequency values and the text-count ratio falls within a predetermined threshold range that includes zero.

12. The method according to any one of claims 9 to 11,
further comprising constructing the sentiment dictionary by merging different sentiment dictionary data,
wherein constructing the sentiment dictionary comprises:
standardizing the differing sentiment scores and sentiment categories for each word of the sentiment dictionary data; and
calculating, for each word, the average of the sentiment scores of the different standardized sentiment dictionary data as the sentiment score of that word.

13. The method according to any one of claims 9 to 11,
wherein the sentiment score for the word in the sentiment dictionary is set to a different value for each domain, and
wherein the step of changing the sentiment score selectively changes, among the sentiment scores for the word in the sentiment dictionary, only the sentiment score corresponding to the domain of the positive text group and the negative text group.
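
As a reading aid, the frequency comparison, polarity switching, and word removal recited in the method claims above might be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the names `word_counts` and `update_dictionary` are hypothetical, tokenization by whitespace is assumed, each ratio is assumed to be the positive share (positive count divided by the positive plus negative count), the threshold range is assumed to be symmetric about zero, and words that do not appear in either text group are assumed to be left unchanged.

```python
from collections import Counter


def word_counts(texts):
    """Count word occurrences over a text group (whitespace tokens, assumed)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def update_dictionary(scores, positive_texts, negative_texts, threshold):
    """Return a modified copy of `scores` (word -> sentiment score).

    Assumes both ratios are positive shares and that at least one text exists.
    """
    pos_freq = word_counts(positive_texts)
    neg_freq = word_counts(negative_texts)
    # Text-count ratio between the positive texts and the negative texts.
    text_ratio = len(positive_texts) / (len(positive_texts) + len(negative_texts))

    updated = {}
    for word, score in scores.items():
        total = pos_freq[word] + neg_freq[word]
        if total == 0:
            updated[word] = score  # unseen word: kept as-is (assumption)
            continue
        # Difference value between the frequency ratio and the text-count ratio.
        difference = pos_freq[word] / total - text_ratio
        if abs(difference) <= threshold:
            continue  # within the threshold range around zero: remove the word
        if (score > 0) != (difference > 0):
            score = -score  # polarities disagree: switch the score's polarity
        updated[word] = score
    return updated


positive = ["great phone great screen", "cheap and great"]
negative = ["terrible battery", "cheap and terrible", "the screen broke"]
original = {"great": 0.8, "terrible": 0.5, "cheap": -0.6, "screen": 0.3, "unseen": 0.1}
modified = update_dictionary(original, positive, negative, threshold=0.15)
```

In this toy example, 'great' keeps its score, 'terrible' has its polarity switched because it appears only in negative texts, and 'cheap' and 'screen' are removed because they appear about as often in both groups.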