KR20230118030A

KR20230118030A - Text mining method, text mining program, and text mining apparatus

Info

Publication number: KR20230118030A
Application number: KR1020230012981A
Authority: KR
Inventors: 징롱 저우; 야스노리 나카무라
Original assignee: 가부시키가이샤 스크린 홀딩스
Priority date: 2022-02-03
Filing date: 2023-01-31
Publication date: 2023-08-10
Also published as: TW202341003A; JP2023113268A; CN116541518A

Abstract

(Problem) To make it possible to compare emotional tendencies among a plurality of documents with a small amount of calculation, based on appropriate evaluation of the emotion words in the documents.
(Means for solving) In the text mining method disclosed herein, an instruction designating, as target documents, a plurality of documents whose tendencies of emotional polarity are to be compared is received together with an instruction designating the range of feature words to be extracted from the target documents and an instruction designating the range of the emotional index, which represents the intensity of emotional polarity. Based on these instructions, feature words within the designated range are extracted from the plurality of documents, and among the extracted feature words, each feature word that is registered in a predetermined emotion word dictionary as an emotion word having an emotional index within the designated range is assigned that emotional index. The extracted feature words and the assigned emotional indices are then displayed so that they can be compared among the plurality of documents. In this display, for example, a feature word to which an emotional index has been assigned is given a background color corresponding to that emotional index.

Description

Text mining method, text mining program, and text mining apparatus {TEXT MINING METHOD, TEXT MINING PROGRAM, AND TEXT MINING APPARATUS}

The present invention relates to text mining, and more particularly to a text mining method, a text mining program, and a text mining apparatus for comparing tendencies of emotional polarity among a plurality of documents.

In recent years, text mining, which analyzes free-form text data and derives useful information from the analysis results, has attracted attention. In this field, a technique is known for determining, from the text data of a document, the tendency of emotional polarity (hereinafter, "emotional tendency"), that is, whether the document is positive or negative toward a thing, person, content, or the like to which it relates.

For example, a method is known that uses an emotion word dictionary in which correspondences between words and the emotions they express (such as emotional polarity indicating positive or negative) are registered in advance. The method compares the number of words with positive emotional polarity in a document against the number of words with negative emotional polarity and, according to the result, determines the emotional polarity of the document (whether the document is positive, negative, or neutral) (see paragraph [0009] of Japanese Unexamined Patent Application Publication No. 2011-204226).
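The count-based conventional method described above can be sketched as follows. This is a minimal illustration only: the function name, the toy dictionary, and the pre-tokenized word list are assumptions made for the example and do not come from the cited publication.

```python
def classify_document_polarity(words, polarity_dict):
    """Count-based polarity: compare the numbers of positive and negative words."""
    positive = sum(1 for w in words if polarity_dict.get(w) == "positive")
    negative = sum(1 for w in words if polarity_dict.get(w) == "negative")
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


# Hypothetical two-class dictionary: every registered word is either
# positive or negative, with no notion of intensity.
toy_polarity = {"good": "positive", "excellent": "positive", "bad": "negative"}
print(classify_document_polarity(["good", "excellent", "bad", "screen"], toy_polarity))
```

Because each registered word contributes exactly one count, a weakly polar word weighs as much as a strongly polar one in this sketch.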

In the conventional method described above, each word in a document is checked against the emotion word dictionary and, if registered, is classified according to the dictionary into one of two types, positive or negative. As a result, no appropriate way of handling words with weak emotional intensity (emotional polarity), that is, words close to neutral, had been established. The emotional polarity of such words also needed to be adjusted according to the content of the documents being analyzed, but no simple technique for doing so was known.

Furthermore, when emotional tendencies are compared among a plurality of documents, the conventional method tallies all emotion words (words registered in the emotion word dictionary) contained in each document and compares the tallies. This increases the amount of calculation required for the comparison, and an emotion word that appears more often in one document than in the others may be undervalued.

There is therefore a need for a text mining method, a text mining apparatus, and the like that can compare emotional tendencies among a plurality of documents with a small amount of calculation, based on appropriate evaluation of the emotion words in the documents.

A first aspect of the present invention is a text mining method for comparing tendencies of emotional polarity among a plurality of documents, the method comprising:

an instruction input step of receiving an instruction designating, as target documents, a plurality of documents whose tendencies of emotional polarity are to be compared;

a feature word extraction step of extracting feature words from each of the plurality of documents, based on the text data of the plurality of documents designated as the target documents;

an emotional index acquisition step of assigning, to each feature word that is extracted in the feature word extraction step and is registered in a predetermined emotion word dictionary, the emotional index given to that feature word in the emotion word dictionary as a numerical value representing the intensity of emotional polarity; and

a display step of displaying, for the plurality of documents designated as the target documents, the feature words extracted in the feature word extraction step together with the emotional indices assigned in the emotional index acquisition step.
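The four steps of the first aspect can be sketched as a small pipeline. This is a hedged illustration, not the disclosed implementation: the extraction rule (the most frequent whitespace-separated tokens), all function names, and the sample data are assumptions, since this part of the text does not specify how feature words are selected.

```python
from collections import Counter


def extract_feature_words(text, top_k=3):
    """Stand-in extraction (an assumption): the top_k most frequent lowercase tokens."""
    counts = Counter(text.lower().split())
    return [word for word, _ in counts.most_common(top_k)]


def acquire_emotional_indices(feature_words, emotion_dict):
    """Attach the dictionary's emotional index to each registered feature word.

    Feature words that are not registered in the dictionary map to None.
    """
    return {word: emotion_dict.get(word) for word in feature_words}


def analyze(documents, emotion_dict, top_k=3):
    """Run extraction and index acquisition for every designated target document."""
    return {
        name: acquire_emotional_indices(extract_feature_words(text, top_k), emotion_dict)
        for name, text in documents.items()
    }


def display(results):
    """Print each document's feature words, with the index where one was assigned."""
    for name, words in results.items():
        cells = [w if idx is None else f"{w}({idx:+.2f})" for w, idx in words.items()]
        print(f"{name}: " + ", ".join(cells))


# Hypothetical target documents and dictionary entries.
documents = {
    "model A": "quiet quiet quiet sturdy sturdy noisy",
    "model B": "noisy noisy noisy fragile fragile quiet",
}
emotion_dict = {"quiet": 0.40, "noisy": -0.60, "fragile": -0.80}
display(analyze(documents, emotion_dict, top_k=2))
```

Keeping unregistered feature words in the result (with no index) mirrors the display step, which shows all extracted feature words and attaches indices only to those found in the dictionary.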

In a second aspect of the present invention, in the first aspect,

the instruction input step further includes a step of receiving an instruction designating the range of feature words to be extracted from the target documents, and

in the feature word extraction step, feature words within the range designated in the instruction input step are extracted.

In a third aspect of the present invention, in the first or second aspect,

the instruction input step further includes a step of receiving an instruction designating a range of the emotional index, which is an index representing the intensity of emotional polarity, and

in the emotional index acquisition step, among the feature words extracted in the feature word extraction step, each feature word registered in the emotion word dictionary as a word having an emotional index within the range designated in the instruction input step is assigned that emotional index.

In a fourth aspect of the present invention, in the third aspect,

the instruction input step further includes a step of receiving an instruction designating a change in the range of the emotional index while the extracted feature words are displayed together with the assigned emotional indices by the display step.

In a fifth aspect of the present invention, in any one of the first to fourth aspects,

the method further comprises a document emotional index calculation step of calculating, for each of the plurality of documents designated as the target documents, an emotional index of the document as a document emotional index, based on those feature words extracted from the document in the feature word extraction step to which an emotional index has been assigned in the emotional index acquisition step, and

in the display step, a display indicating the document emotional index calculated in the document emotional index calculation step is produced.
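As a rough illustration of the fifth aspect, the sketch below derives one document-level value from the word-level indices. The aggregation rule is an assumption: the actual formula is given later in the specification and is not reproduced in this part, so the arithmetic mean used here is only a placeholder, and the function name is hypothetical.

```python
def document_emotional_index(word_indices):
    """Placeholder aggregation (an assumption): mean of the assigned word indices.

    word_indices maps each extracted feature word to its emotional index,
    or to None when the word is not registered in the emotion word dictionary.
    Returns None when no feature word of the document has an index.
    """
    scored = [value for value in word_indices.values() if value is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)


# Hypothetical feature words of one document; "screen" carries no index.
print(document_emotional_index({"quiet": 0.5, "noisy": -0.25, "screen": None}))
```

Only feature words that actually received an index contribute, which matches the wording of the fifth aspect; how they are weighted is left open here.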

A sixth aspect of the present invention is a text mining program for comparing tendencies of emotional polarity among a plurality of documents,

the program causing a computer, by means of a CPU using a memory, to execute a display step of displaying, for the plurality of documents designated as the target documents, the feature words extracted in the feature word extraction step together with the emotional indices assigned in the emotional index acquisition step.

A seventh aspect of the present invention is a text mining apparatus for comparing tendencies of emotional polarity among a plurality of documents, the apparatus comprising:

an instruction input unit that receives an instruction designating, as target documents, a plurality of documents whose tendencies of emotional polarity are to be compared;

a feature word extraction unit that extracts feature words from each of the plurality of documents, based on the text data of the plurality of documents designated as the target documents;

an emotional index acquisition unit that assigns, to each feature word that is extracted by the feature word extraction unit and is registered in a predetermined emotion word dictionary, the emotional index given to that feature word in the emotion word dictionary as a numerical value representing the intensity of emotional polarity; and

a display unit that displays, for the plurality of documents designated as the target documents, the feature words extracted by the feature word extraction unit together with the emotional indices assigned by the emotional index acquisition unit.

Other aspects of the present invention are clear from the description of the above aspects and of the embodiment and its modifications described later, so their description is omitted.

According to the first, sixth, or seventh aspect, feature words are extracted from each of the plurality of documents designated as target documents, and among the extracted feature words (the target feature words), each one registered as an emotion word in the emotion word dictionary is assigned the emotional index given to it in that dictionary. The target feature words, and the emotional indices assigned to the emotion words among them, are then displayed for the plurality of documents as the result of the emotional tendency analysis. With such a display, even when the documents to be compared contain feature words with weak emotional polarity, viewing the extracted feature words together with their assigned emotional indices makes it possible to grasp the emotional tendencies of the documents accurately.

According to the second aspect, the range of feature words to be extracted from each of the target documents can be designated. By extracting only the more characteristic words as target feature words, emotional tendencies that reflect the characteristics of each document can be compared among the documents with less calculation than before. This also avoids the problem that a characteristic emotion word appearing more often in one document than in the others is undervalued.

According to the third aspect, designating the range of the emotional index assigned to the target feature words (the feature words extracted from each of the target documents) makes it possible to compare emotional tendencies accurately among documents that contain feature words with weak emotional polarity.

According to the fourth aspect, when an instruction designating a change in the emotional index range is received while the extracted feature words are displayed together with their assigned emotional indices as the result of the emotional tendency analysis of the target documents, the analysis is repeated with the changed range: feature words are extracted from each document as target feature words, an emotional index is assigned to each target feature word registered as an emotion word in the emotion word dictionary, and the target feature words and the emotional indices assigned to the emotion words among them are displayed as the analysis result. The user can thus display the analysis result once and then adjust the designated emotional index range while viewing it, and thereby compare the emotional tendencies of the documents more accurately.

According to the fifth aspect, a document emotional index is calculated for each of the target documents based on the target feature words to which emotional indices have been assigned. In addition to comparing the emotional indices assigned to feature words, the document emotional indices themselves can therefore be compared among the documents, so the emotional tendencies of the documents can be compared more accurately and more easily.

The effects of other aspects of the present invention are clear from the description of the effects of the above aspects and of the embodiment and its modifications below, so their description is omitted.

Fig. 1 is a block diagram showing the configuration of a text mining apparatus according to an embodiment of the present invention.
Fig. 2 is a block diagram showing the configuration of a computer that operates as the text mining apparatus according to the embodiment.
Fig. 3 is a diagram for explaining the emotion word dictionary with emotional indices.
Fig. 4 is a flowchart showing the procedure of the emotional tendency analysis process executed so that the computer operates as the text mining apparatus according to the embodiment.
Fig. 5 is a diagram showing an operation screen of the text mining apparatus according to the embodiment.
Fig. 6 is a diagram for explaining the extraction of feature words in the embodiment.
Fig. 7 is a diagram of a display example showing the emotional tendencies of the feature words extracted for each target document in the embodiment.
Fig. 8 is a diagram showing a display example of the result of the emotional tendency analysis in the text mining apparatus according to the embodiment.

Hereinafter, a text mining apparatus according to an embodiment of the present invention is described with reference to the drawings. This apparatus carries out a text mining method for comparing emotional tendencies (tendencies of emotional polarity) among a plurality of documents, and is realized by a computer executing the text mining program described later. In the following, "emotional polarity" refers to information indicating whether a document expresses a positive opinion or a negative opinion.

<1. Functional configuration of the text mining apparatus>

Fig. 1 is a block diagram showing the functional configuration of the text mining apparatus 10 according to the present embodiment. The text mining apparatus 10 includes a GUI unit 11 that functions as an instruction input unit and a display unit, a text data storage unit 12, a feature word extraction unit 13, a dictionary storage unit 14 that stores an emotion word dictionary with emotional indices, a feature word emotional index acquisition unit 15, a document emotional index calculation unit 16, and a display data processing unit 17. The text mining apparatus 10 may omit one or both of the text data storage unit 12 and the dictionary storage unit 14 and instead use, over a network, one or both of the text data and the emotion word dictionary with emotional indices stored in an external storage unit.

In the present embodiment, the text data of a large number of documents, including the plurality of documents that constitute the target documents whose emotional tendencies are to be compared, is assumed to be stored in the text data storage unit 12 in advance. When the emotional tendency analysis process is carried out, the GUI unit 11 receives the user's instructions for it (designation of the target documents and so on). Based on these instructions, the feature word extraction unit 13 first reads the text data of the plurality of documents designated as target documents from the text data storage unit 12 and extracts the feature words contained in each of them. The feature word emotional index acquisition unit 15 assigns, to each extracted feature word (target feature word) that is registered as an emotion word in the emotion word dictionary with emotional indices in the dictionary storage unit 14, the emotional index given to that feature word in the dictionary. This emotional index is a numerical value representing the intensity of emotional polarity and is called a "word emotional index" when it needs to be distinguished from the document emotional index described later. The document emotional index calculation unit 16 uses the feature words to which emotional indices have been assigned in this way to calculate an emotional index (document emotional index) for each of the plurality of documents by a formula described later. In this way, the target feature words, the emotional indices (word emotional indices) assigned to the emotion words among them, and the document emotional index are obtained for each of the documents designated as target documents. The display data processing unit 17 generates display data for displaying these target feature words, word emotional indices, and document emotional indices so that they can be compared among the documents. The GUI unit 11, as the display unit, produces a display for comparing the emotional tendencies of the documents based on this display data; this display shows the result of the emotional tendency analysis of the target documents. By viewing it, the user can grasp the differences in emotional tendency among the target documents and, if necessary, narrow the range of emotional indices to be assigned to feature words through the GUI unit 11 as the instruction input unit and run the emotional tendency analysis again.

<2. Hardware configuration of the text mining apparatus>

Fig. 2 is a block diagram showing the configuration of the computer 20 that operates as the text mining apparatus 10 under the text mining program described later, that is, the hardware configuration of the text mining apparatus 10 according to the present embodiment. The computer 20 shown in Fig. 2 includes a CPU 21, a main memory 22, an auxiliary storage device 23, an input operation unit 24, a display device 25, a communication interface device 26, and a recording medium reading device 27. The main memory 22 is, for example, DRAM. The auxiliary storage device 23 is, for example, a hard disk or a solid-state drive. The input operation unit 24 includes, for example, a keyboard 28 and a mouse 29. The display device 25 is, for example, a liquid crystal display. The communication interface device 26 is an interface circuit for wired or wireless communication. The recording medium reading device 27 is an interface circuit for a recording medium 30 that stores programs and the like. The recording medium 30 is, for example, a non-transitory recording medium such as a CD-ROM, DVD-ROM, or USB memory.

In the computer 20 configured as described above, the auxiliary storage device 23 stores, in addition to the text mining program 31 according to the present embodiment, the text data 32 of the target documents and the emotion word dictionary 34 (the emotion word dictionary with emotional indices), thereby realizing the text data storage unit 12 and the dictionary storage unit 14. The text mining program 31 and the text data 32 may, for example, be received from a server or another computer via the communication interface device 26, or read from the recording medium 30 by the recording medium reading device 27. The emotion word dictionary 34 may instead be stored on a server or another computer, in which case the computer 20 operating as the text mining apparatus 10 uses the emotion word dictionary 34 via a network and the communication interface device 26.

When the computer 20 executes the text mining program 31, the text mining program 31 and the text data 32 are loaded into the main memory 22. The CPU 21 uses the main memory 22 as working memory and executes the text mining program 31 stored there, thereby performing the emotional tendency analysis process on the target documents. In this process, feature words are extracted, the emotional indices of the feature words are acquired, the document emotional index is calculated, and so on, for each of the documents designated as target documents (details are given later). While the CPU 21 performs the emotional tendency analysis process, the computer 20 functions as the text mining apparatus 10. The configuration of the computer 20 described above is only an example, and the text mining apparatus 10 can be realized with various kinds of computers.

<3. Emotion word dictionary with emotional indices>

The emotional tendency analysis processing described above uses the emotion word dictionary 34, an emotion word dictionary with emotional indices that is stored in the auxiliary storage device 23. Fig. 3 is a diagram for explaining the emotion word dictionary with emotional indices used in the present embodiment. In this dictionary, words that express an emotional polarity, that is, whether they are positive or negative, are collected and registered as emotion words, and for each registered emotion word a numerical value representing the intensity of its emotional polarity is given as an emotional index. The emotional index is a value in the range of -1.00 to +1.00; positive emotion words are given positive values and negative emotion words are given negative values. For example, as shown in Fig. 3, the word (emotion word) "excellent", which has a strongly positive meaning, is given an emotional index of +1.00, and the word (emotion word) "atrocious", which has a strongly negative meaning, is given an emotional index of -1.00. Several methods are known for creating an emotion word dictionary with emotional indices, such as vectorizing (quantifying) words and then calculating their similarity to known emotion words. In the present embodiment, data of an emotion word dictionary with emotional indices created by any of these known methods is stored in advance in the auxiliary storage device 23 as the emotion word dictionary 34.
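As a rough illustration only, a dictionary of this kind can be pictured as a mapping from each emotion word to its index. In the sketch below, the names `EMOTION_DICT` and `lookup_index` are hypothetical, and so are all entries other than the two extreme examples from Fig. 3 ("excellent" at +1.00 and the strongly negative word, rendered here as "atrocious", at -1.00). The actual storage format of the emotion word dictionary 34 is not specified in the text.

```python
from typing import Optional

# Hypothetical in-memory form of an emotion word dictionary with indices.
EMOTION_DICT = {
    "excellent": 1.00,    # example given in the text (Fig. 3)
    "atrocious": -1.00,   # example given in the text (Fig. 3)
    "convenient": 0.45,   # hypothetical entry
    "noisy": -0.30,       # hypothetical entry
}


def lookup_index(word: str) -> Optional[float]:
    """Return the emotional index of a registered emotion word, else None."""
    index = EMOTION_DICT.get(word)
    if index is not None and not -1.0 <= index <= 1.0:
        # The text states that indices lie between -1.00 and +1.00.
        raise ValueError(f"index out of range for {word!r}: {index}")
    return index


print(lookup_index("excellent"))  # registered emotion word
print(lookup_index("battery"))    # not registered, so None
```

Returning `None` for unregistered words is a design choice of this sketch; it keeps "not an emotion word" distinct from any numerical index.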

<4. Emotional tendency analysis processing>

As described above, the emotional tendency analysis processing is performed on the target documents when the CPU 21 of the computer 20 executes the text mining program 31. Fig. 4 is a flowchart showing the procedure of this processing. In the present embodiment, the computer 20 operates as shown in Fig. 4 as a result of the CPU 21 executing the text mining program 31.

First, instructions designating the target documents, the range of feature words, and the range of the emotional index (word emotional index) are received (step S10). Specifically, an operation screen such as that shown in Fig. 5 is displayed on the display device 25, and the user, using the keyboard 28 or the mouse 29 of the input operation unit 24, designates the target documents, the range of feature words, and the range of the emotional index on this screen and then clicks the "OK" button 260. The computer 20, acting as the text mining apparatus 10, thereby receives input information indicating the designated target documents, range of feature words, and range of the emotional index. In the example shown in Fig. 5, the range of the emotional index can be designated by operating a slider 250 that has a first knob 251 and a second knob 252. That is, by setting the positions of the first knob 251 and the second knob 252 on the slider 250, the designated range of the emotional index can be made up of two ranges: a negative emotional index range extending from "-1.00" to the negative value indicated by the position of the first knob 251, and a positive emotional index range extending from the positive value indicated by the position of the second knob 252 to "+1.00". As the target documents, a plurality of documents whose emotional tendencies are to be compared are designated. The following description assumes that review documents for model A, model B, and model C of a certain product (documents recording users' impressions, criticisms, opinions, and the like for each model of the product) are designated as the target documents.
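The two-part range set with the slider can be sketched as a simple membership test. This is an illustrative sketch, not the patented implementation: the function name `in_designated_range` and its parameters are hypothetical, and treating the knob positions as inclusive endpoints is an assumption, since the text does not say whether the boundaries are included.

```python
def in_designated_range(index: float, first_knob: float, second_knob: float) -> bool:
    """Return True if index lies in [-1.00, first_knob] or [second_knob, +1.00].

    first_knob is the negative value set by the first knob, and second_knob is
    the positive value set by the second knob (inclusive endpoints assumed).
    """
    if not (-1.0 <= first_knob <= 0.0 <= second_knob <= 1.0):
        raise ValueError("expected -1 <= first_knob <= 0 <= second_knob <= 1")
    in_negative_range = -1.0 <= index <= first_knob
    in_positive_range = second_knob <= index <= 1.0
    return in_negative_range or in_positive_range


# With knobs at -0.50 and +0.50, weakly polar words fall outside both ranges.
print(in_designated_range(-0.80, -0.50, 0.50))  # strongly negative
print(in_designated_range(0.20, -0.50, 0.50))   # weakly positive
```

Moving the knobs toward the ends of the slider narrows both ranges, so only strongly polar words remain eligible for an index.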

The designation of the range of feature words in step S10 presupposes that a numerical value representing the feature degree of each word is used when feature words are extracted from each document designated as a target document (details are described later). The range of feature words is designated by specifying how many words are to be extracted as feature words from each target document, taken in descending order of feature degree.
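Selecting a designated number of words in descending order of feature degree can be sketched as follows. The function name `top_feature_words` and the sample values are hypothetical, and breaking ties alphabetically is an assumption made only so that the result is deterministic.

```python
from typing import Dict, List, Tuple


def top_feature_words(feature_degrees: Dict[str, float], count: int) -> List[Tuple[str, float]]:
    """Return the `count` words with the largest feature degree, largest first."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # Sort by descending feature degree; ties are broken alphabetically (assumption).
    ranked = sorted(feature_degrees.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:count]


# Hypothetical feature degrees for one document.
degrees = {"battery": 0.12, "screen": 0.31, "price": 0.08, "quiet": 0.27}
print(top_feature_words(degrees, 2))
```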

After the instructions designating the target documents, the range of feature words, and the range of the emotional index have been received in this way, the text data 32 of the plurality of documents designated as target documents is first read from the auxiliary storage device 23 into the main memory 22 (step S12). Next, using this text data 32, the feature words within the designated range are extracted as target feature words from each of the plurality of target documents (step S14).

Fig. 6 shows an example of feature word extraction in the case where review documents for model A, model B, and model C of a certain product are designated as the target documents and the top 10 words in descending order of feature degree are designated as the range of feature words. For each of the review documents for model A, model B, and model C, Fig. 6 shows the 10 feature words with the highest feature degrees together with the numerical values representing those feature degrees.

In the example shown in Fig. 6, the Jaccard coefficient is used as the numerical value representing the feature degree of a word. Denoting the review documents for model A, model B, and model C serving as the target documents by Da, Db, and Dc, respectively, the Jaccard coefficient Jxw of a word w in a document Dx (x = a, b, c) is calculated by the following steps (p1) to (p4).

(p1) Find the number Nw of sentences containing the word w among all sentences in the documents Da, Db, and Dc.

(p2) Find the number Nx of sentences in the document Dx.

(p3) Find the number Nxw of sentences containing the word w among the sentences in the document Dx.

(p4) Find the Jaccard coefficient Jxw of the word w in the document Dx by the following equation.

Jxw = Nxw / (Nw + Nx - Nxw) … (1)
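Steps (p1) to (p4) and equation (1) can be sketched in a few lines. This is a minimal sketch under stated assumptions: each document is assumed to be already split into sentences and each sentence into a set of words (the text does not describe the tokenizer), and the function name `jaccard_feature_degree` and the sample data are hypothetical.

```python
from typing import Dict, List, Set

Sentence = Set[str]  # a sentence represented as the set of words it contains


def jaccard_feature_degree(docs: Dict[str, List[Sentence]], x: str, w: str) -> float:
    """Jaccard coefficient Jxw of word w in document x, following (p1) to (p4)."""
    # (p1) Nw: sentences containing w across all target documents.
    n_w = sum(1 for sentences in docs.values() for s in sentences if w in s)
    # (p2) Nx: sentences in document x.
    n_x = len(docs[x])
    # (p3) Nxw: sentences in document x that contain w.
    n_xw = sum(1 for s in docs[x] if w in s)
    # (p4) Equation (1); an empty denominator is treated as 0.0 (assumption).
    denominator = n_w + n_x - n_xw
    return n_xw / denominator if denominator else 0.0


# Hypothetical pre-tokenized review documents.
docs = {
    "Da": [{"battery", "long"}, {"battery", "great"}, {"screen"}],
    "Db": [{"battery", "weak"}, {"price"}],
    "Dc": [{"price", "high"}],
}
print(jaccard_feature_degree(docs, "Da", "battery"))  # 2 / (3 + 3 - 2)
```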

More generally, when a plurality of documents D1, D2, …, Dn are designated as the target documents, the Jaccard coefficient Jkw of a word w in a document Dk (1 ≤ k ≤ n) among them is expressed by the following equation.

Jkw = |Sw ∩ Sk| / |Sw ∪ Sk| … (2)

Here, Sw denotes the set whose elements are the sentences containing the word w among all sentences in the documents D1, D2, …, Dn, and Sk denotes the set whose elements are all sentences in the document Dk.
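Equation (2) reduces to the count form of equation (1) by inclusion-exclusion. The short derivation below assumes that each sentence is a distinct element of these sets (identified by its position rather than by its text) and writes N_k and N_kw for the counts that the three-document example calls Nx and Nxw; this subscript notation is introduced here only for illustration.

```latex
\begin{align*}
|S_w \cap S_k| &= N_{kw}, \qquad |S_w| = N_w, \qquad |S_k| = N_k,\\
|S_w \cup S_k| &= |S_w| + |S_k| - |S_w \cap S_k| = N_w + N_k - N_{kw},\\
J_{kw} &= \frac{|S_w \cap S_k|}{|S_w \cup S_k|} = \frac{N_{kw}}{N_w + N_k - N_{kw}}.
\end{align*}
```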

Instead of the Jaccard coefficient, the number of sentences in the document Dk that contain a given word (hereinafter, the "in-document occurrence count") could conceivably be used as the numerical value representing the feature degree of that word in the document Dk (1 ≤ k ≤ n) among the target documents D1, D2, …, Dn. Using the in-document occurrence count, however, causes the following problems. A word wp that appears frequently in every one of the documents D1, D2, …, Dn cannot be said to be highly characteristic of any document Dk (1 ≤ k ≤ n), yet its in-document occurrence count is large. Conversely, a word wq that is contained in a document Dk (1 ≤ k ≤ n) but is almost entirely absent from the other documents Dj (j ≠ k, 1 ≤ j ≤ n) may be regarded as highly characteristic even if the number of sentences in Dk containing it is not large; but if that count falls below a certain level, wq cannot be called a feature word of the document Dk. When the Jaccard coefficient is used, by contrast, the Jaccard coefficients calculated by equation (2) for these two words wp and wq in the document Dk are sufficiently small, so neither wp nor wq is extracted as a feature word.
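The contrast between the two measures can be illustrated with invented numbers. In the sketch below, all counts are hypothetical (three documents of 100 sentences each), the helper `jaccard` is the count form of equation (2), and the third word "wc" is added only as a point of comparison; it does not appear in the text.

```python
def jaccard(n_kw: int, n_w: int, n_k: int) -> float:
    """Count form of equation (2): N_kw / (N_w + N_k - N_kw)."""
    union = n_w + n_k - n_kw
    return n_kw / union if union else 0.0


# Invented counts: document Dk has 100 sentences.
N_K = 100
cases = {
    # word: (sentences in Dk containing it, sentences in all documents containing it)
    "wp (frequent in every document)": (60, 180),
    "wq (rare, found only in Dk)": (2, 2),
    "wc (frequent mainly in Dk)": (40, 42),
}
for word, (n_kw, n_w) in cases.items():
    print(f"{word}: count={n_kw}, jaccard={jaccard(n_kw, n_w, N_K):.3f}")
```

With these numbers the occurrence count ranks wp first (60), whereas the Jaccard coefficient ranks wc first (about 0.392) ahead of wp (about 0.273) and wq (0.020). How small a coefficient must be to count as "sufficiently small" depends on the data and on the designated range of feature words.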

Once the target feature words have been extracted from each of the plurality of target documents as described above, one of the target documents that has not yet been selected is selected as the document of interest (step S15). When step S15 is executed for the first time after the start of the emotional tendency analysis processing, all of the documents designated as target documents are in the unselected state. When review documents for model A, model B, and model C of a certain product are designated as the target documents as described above, one of these review documents becomes the document of interest.

Next, each target feature word, that is, each feature word extracted from the document of interest, is looked up in the emotion word dictionary 34, and the emotional index is assigned to each target feature word that is registered in the emotion word dictionary 34 as a word (emotion word) having an emotional index within the designated range (step S16). Fig. 7 shows a display example of the emotional tendencies of the feature words extracted from each of the review documents for model A, model B, and model C designated as the target documents. This display example is given for convenience of explanation and forms the main part of the actual display example shown in Fig. 8, described later. In this display example, each feature word to which an emotional index has been assigned is given a background color whose hue depends on whether the feature word is positive or negative (whether the assigned emotional index is positive or negative) and whose density corresponds to the assigned emotional index. For example, a positive feature word is given a blue background color with a density corresponding to its emotional index, and a negative feature word is given a red background color with a density corresponding to its emotional index. When the review document for model A is the document of interest, the background colors of the feature words shown for "model A" in Fig. 7 represent the emotional indices assigned on the basis of the emotion word dictionary 34. No background color is given to target feature words that are not registered in the emotion word dictionary 34.
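Step S16 and the background coloring can be sketched together. This is a sketch under stated assumptions rather than the actual implementation: the names `assign_indices` and `background` are hypothetical, the dictionary entries are invented, range endpoints are treated as inclusive, and the color density is assumed to equal the absolute value of the index, since the text only says that the density corresponds to the index.

```python
from typing import Dict, List, Optional, Tuple


def assign_indices(feature_words: List[str], emotion_dict: Dict[str, float],
                   first_knob: float, second_knob: float) -> Dict[str, Optional[float]]:
    """Assign an index to each feature word registered with an in-range index."""
    assigned: Dict[str, Optional[float]] = {}
    for word in feature_words:
        index = emotion_dict.get(word)
        in_range = index is not None and (index <= first_knob or index >= second_knob)
        assigned[word] = index if in_range else None
    return assigned


def background(index: Optional[float]) -> Optional[Tuple[str, float]]:
    """Map an index to (hue, density): blue if positive, red if negative."""
    if index is None or index == 0:
        return None  # no background color
    return ("blue" if index > 0 else "red", abs(index))


# Hypothetical dictionary entries and feature words.
emotion_dict = {"excellent": 1.00, "noisy": -0.30, "fragile": -0.70}
words = ["excellent", "noisy", "fragile", "battery"]
assigned = assign_indices(words, emotion_dict, -0.50, 0.50)
for word, index in assigned.items():
    print(word, index, background(index))
```

Here "noisy" is registered but falls outside the designated range, and "battery" is not registered at all, so neither receives an index or a background color.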

Next, the emotional index of the document of interest is obtained on the basis of those target feature words extracted from the document of interest to which an emotional index has been assigned (step S18). That is, the document emotional index Ctx, which is the emotional index of the document of interest, is calculated by the following equation.

Ctx = (Naf - Nng) / (Naf + Nng) … (3)

Here, Naf is the number of feature words extracted from the document of interest to which a positive emotional index has been assigned, that is, the number of positive feature words appearing in the document of interest. Nng is the number of feature words extracted from the document of interest to which a negative emotional index has been assigned, that is, the number of negative feature words appearing in the document of interest. As can be seen from equation (3), the document emotional index Ctx takes a value in the range of -1 to +1.
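Equation (3) can be sketched as below. The function name `document_emotional_index` is hypothetical, and returning `None` when no feature word has an assigned index (Naf + Nng = 0) is an assumption; the text does not say how that case is handled.

```python
from typing import Iterable, Optional


def document_emotional_index(assigned: Iterable[Optional[float]]) -> Optional[float]:
    """Compute Ctx = (Naf - Nng) / (Naf + Nng) from assigned word indices.

    Entries that are None (no index assigned) are ignored.
    """
    values = [v for v in assigned if v is not None]
    n_af = sum(1 for v in values if v > 0)  # positive feature words
    n_ng = sum(1 for v in values if v < 0)  # negative feature words
    total = n_af + n_ng
    if total == 0:
        return None  # equation (3) is undefined here (handling is an assumption)
    return (n_af - n_ng) / total


# Two positive words, one negative word, one word without an index.
print(document_emotional_index([1.00, 0.60, -0.70, None]))  # (2 - 1) / (2 + 1)
```

Note that equation (3) counts words and does not weight them by the magnitude of their indices, so a word at +0.60 contributes as much as one at +1.00.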

It is then determined whether the target documents include a document that has not yet been selected (step S20); if so, the processing returns to step S15. Steps S15 to S20 are thereafter repeated until no unselected document remains among the target documents, and when it is determined in step S20 that the target documents include no unselected document, the processing proceeds to step S22. If, as already described, review documents for model A, model B, and model C of a certain product are designated as the target documents, steps S15 to S20 are executed for each of these review documents, after which the processing proceeds to step S22.

At the point where the processing reaches step S22, for each of the review documents for model A, model B, and model C designated as the target documents, the feature words within the designated range have been extracted as target feature words (see Fig. 6), the emotional index has been assigned to each target feature word registered in the emotion word dictionary 34 as an emotion word having an emotional index within the designated range (see Fig. 7), and the document emotional index Ctx has been calculated from the target feature words to which emotional indices were assigned. In step S22, display data is generated for displaying the target feature words obtained in this way for the plurality of target documents, the emotional indices assigned to the emotion words among them, and the document emotional indices Ctx so that they can be compared among the plurality of documents. That is, display data for comparing the emotional tendencies of the review documents for model A, model B, and model C is generated on the basis of the target feature words obtained for each review document, the emotional indices assigned to the emotion words among them, and the document emotional indices Ctx.

Next, using the display data generated in this way, the emotional tendencies of the plurality of documents designated as target documents are displayed on the display device 25 so that they can be compared (step S24). This display presents the result of the emotional tendency analysis of the target documents. For example, a display such as that shown in Fig. 8 is produced as the result of the emotional tendency analysis of target documents consisting of the review documents for model A, model B, and model C. In the display example of Fig. 8, each target feature word to which an emotional index (word emotional index) has been assigned is given a background color whose hue and density correspond to that emotional index; in addition, each of the labels "model A", "model B", and "model C", which are the names of the documents designated as target documents, is given a background color whose hue and density correspond to its document emotional index Ctx. This display example also shows, together with the result of the emotional tendency analysis of the target documents, a slider 250 identical to the slider 250 displayed in step S10 for designating the range of the emotional index (see Fig. 5).

The slider 250 shown in Fig. 8 is likewise configured so that the user can operate it with the mouse 29. By operating this slider 250 while viewing the display of the emotional tendency analysis result for the target documents as shown in Fig. 8, the user can change the range of the emotional index for the feature words to be extracted from the target documents. That is, while the result of the emotional tendency analysis of the target documents is being displayed, the computer 20 waits for the slider 250 to be operated (step S26); when the user operates the slider 250, the computer 20 returns all documents designated as target documents (all of the review documents for model A, model B, and model C) to the unselected state (step S28) and returns to step S12.

Thereafter, with the designated range of the emotional index changed, steps S15 to S20 are executed for each of the plurality of documents designated as target documents (here, the review documents for model A, model B, and model C) in the same manner as above, and the processing then proceeds to step S22. Once steps S22 and S24 have been executed, the result of the emotional tendency analysis of the target documents for the changed range of the emotional index is displayed in the same form as shown in Fig. 8. As before, the computer 20 waits in this display state for the slider 250 to be operated. If the range of the emotional index is changed again by operating the slider 250 during this wait, the processing returns to step S12 and the same processing as above is performed. If an end instruction is received through interrupt processing during this wait, the emotional tendency analysis processing shown in Fig. 4 ends.

As can be seen from the above description, in the present embodiment, steps S10, S24, and S26, which perform processing relating to the input operation unit 24 and the display device 25, realize the GUI unit 11 serving as an instruction input unit and a display unit; step S14 realizes the feature word extraction unit 13; step S16 realizes the feature word emotional index acquisition unit 15; and step S18 realizes the document emotional index calculation unit 16.

<5. Effects>

According to the present embodiment described above, feature words are extracted from each of a plurality of documents (target documents) whose emotional tendencies are to be compared, and each target feature word, that is, each extracted feature word, that is registered as an emotion word in the emotion word dictionary 34 is assigned the emotional index given to it in that dictionary. For each of the plurality of documents, the target feature words and the emotional indices assigned to the emotion words among them are then displayed as the result of the emotional tendency analysis of the documents designated as target documents. With such a display (see steps S22 and S24 in Fig. 4, Fig. 7, and Fig. 8), even when the documents to be compared contain feature words with weak emotional polarity, the emotional tendencies of the documents can be accurately grasped and compared by viewing the extracted feature words together with the emotional indices assigned to them, which are values in the range of -1 to +1.

Further, according to the present embodiment, the range of the emotional index to be assigned to the target feature words, that is, the feature words extracted from each of the plurality of documents, can be designated (see step S10 in Fig. 4 and Fig. 5). That is, by operating the slider 250 shown in Fig. 5, the designated range of the emotional index can be set to two ranges, a negative emotional index range and a positive emotional index range, as already described, so that no emotional index is assigned to neutral target feature words with weak emotional polarity. This makes it possible to compare accurately the emotional tendencies of a plurality of documents that contain feature words with weak emotional polarity. In the present embodiment, the document emotional index Ctx is calculated for each of the plurality of documents by equation (3) described above on the basis of the target feature words to which emotional indices within the designated range have been assigned (see step S16 in Fig. 4); therefore, in addition to comparing the emotional indices assigned to the feature words, the document emotional indices Ctx can also be compared among the plurality of documents (see Fig. 8). The emotional tendencies of the plurality of documents can thus be compared more accurately and more easily.

Furthermore, according to the present embodiment, the range of feature words to be extracted from each of the plurality of target documents can be designated (see steps S10 and S14 in Fig. 4 and Fig. 5). By extracting only the more characteristic words (for example, the top 10 words in descending order of feature degree) as target feature words, emotional tendencies that reflect the characteristics of each document can be compared among the plurality of documents with a smaller amount of calculation than before. This also avoids the problem that a characteristic emotion word appearing more often in one of the documents than in the others is underestimated.

In addition, according to the present embodiment, the slider 250 for designating the range of the emotional index is also displayed on the display device 25 showing the result of the emotional tendency analysis of the plurality of documents (Fig. 8). The user can therefore change the range of the emotional index after viewing the analysis result and have the result of the emotional tendency analysis displayed for the changed range (steps S26 → S28 → S15 → … → S24 in Fig. 4). Thus, after the result of the emotional tendency analysis of the plurality of documents has been displayed once, the user can compare the emotional tendencies of the documents still more accurately by adjusting the designated range of the emotional index while viewing the display.

<6. Modifications>

The present invention is not limited to the above embodiment, and various modifications can be made without departing from the scope of the present invention.

For example, in the above embodiment, when feature words are extracted from each of the plurality of documents designated as target documents whose emotional tendencies are to be compared, the range of feature words to be extracted is designated by a range (minimum and maximum values) of the Jaccard coefficient, the numerical value representing the feature degree. However, the feature degree is not limited to the Jaccard coefficient, and the range of feature words may be designated using another numerical value representing the feature degree of a feature word. For example, the Dice coefficient or the Simpson coefficient may be used instead of the Jaccard coefficient, or the range of feature words to be extracted may be designated by a numerical value representing a feature degree based on the TF-IDF (Term Frequency - Inverse Document Frequency) method.
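For reference, the Dice and Simpson (overlap) coefficients can be written in the same count form as the Jaccard coefficient. The sketch below assumes they are applied to the same sets Sw and Sk as equation (2); the text does not say how these alternatives would be applied, and the function names and sample counts are hypothetical.

```python
def dice(n_kw: int, n_w: int, n_k: int) -> float:
    """Dice coefficient 2|Sw ∩ Sk| / (|Sw| + |Sk|) in count form."""
    total = n_w + n_k
    return 2 * n_kw / total if total else 0.0


def simpson(n_kw: int, n_w: int, n_k: int) -> float:
    """Simpson (overlap) coefficient |Sw ∩ Sk| / min(|Sw|, |Sk|) in count form."""
    smaller = min(n_w, n_k)
    return n_kw / smaller if smaller else 0.0


# Invented counts: w occurs in 2 of the 3 sentences of Dk and in 3 sentences overall.
print(dice(2, 3, 3))     # 2 * 2 / (3 + 3)
print(simpson(2, 3, 3))  # 2 / min(3, 3)
```

One point to keep in mind when choosing among these measures: the Simpson coefficient equals 1 whenever every sentence containing the word lies in Dk, however few such sentences there are, so it treats rare words differently from the Jaccard coefficient.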

Also, in the above embodiment, the display of the result of the emotional tendency analysis of the plurality of documents designated as target documents shows, as in Fig. 8, the feature words extracted for each document and the numerical values representing their feature degrees, and indicates the emotional index of each feature word (word emotional index) and the emotional index of each document (document emotional index) Ctx by a background color whose hue and density correspond to that index. However, the display form of the analysis result is not limited to this, and the emotional indices of the feature words and of the documents may be displayed in other forms, such as numerical values or bar graphs.

10: text mining apparatus
11: GUI unit
12: text data storage unit
13: feature word extraction unit
14: dictionary storage unit
15: feature word emotional index acquisition unit
16: document emotional index calculation unit
17: display data processing unit
20: computer
21: CPU
22: main memory
23: auxiliary storage device
24: input operation unit
25: display device
30: recording medium
31: text mining program
32: text data
34: emotion word dictionary
250: slider

Claims

A text mining method for comparing tendencies in emotional polarity among a plurality of documents, comprising:
an instruction input step of receiving an instruction designating, as target documents, a plurality of documents whose tendencies in emotional polarity are to be compared;
a feature word extraction step of extracting feature words from each of the plurality of documents on the basis of text data of the plurality of documents designated as the target documents;
an emotional index acquisition step of assigning, to each feature word that is among the feature words extracted in the feature word extraction step and is registered in a predetermined emotion word dictionary, the emotional index given to that feature word in the emotion word dictionary as a numerical value representing the intensity of its emotional polarity; and
a display step of displaying, for the plurality of documents designated as the target documents, the feature words extracted in the feature word extraction step together with the emotional indices assigned in the emotional index acquisition step.

The text mining method according to claim 1, wherein
the instruction input step further includes a step of receiving an instruction specifying a range of feature words to be extracted from the target documents, and
in the feature word extraction step, feature words within the range specified in the instruction input step are extracted.

The text mining method according to claim 1, wherein
the instruction input step further includes a step of receiving an instruction specifying a range of the emotional index, which is an index representing the intensity of emotional polarity, and
in the emotional index acquisition step, among the feature words extracted in the feature word extraction step, each feature word registered in the emotion word dictionary as a word to which an emotional index within the range specified in the instruction input step is assigned is given that emotional index.
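
As an illustration of the range-restricted assignment recited in the claim above, the following minimal Python sketch looks up each extracted feature word in an emotion word dictionary and keeps only the entries whose emotional index lies within a specified range. The function name, the dictionary contents, and the interpretation of the range as a closed interval on the signed index are assumptions made for this example.

```python
from typing import Dict, Iterable, Mapping


def assign_emotional_indices(
    feature_words: Iterable[str],
    emotion_dictionary: Mapping[str, float],
    index_min: float,
    index_max: float,
) -> Dict[str, float]:
    """Assign dictionary indices to feature words, limited to a range.

    Words that are not registered in the dictionary, or whose index lies
    outside [index_min, index_max], receive no emotional index.
    """
    assigned: Dict[str, float] = {}
    for word in feature_words:
        index = emotion_dictionary.get(word)
        if index is not None and index_min <= index <= index_max:
            assigned[word] = index
    return assigned


# Hypothetical dictionary and feature words, for illustration only.
dictionary = {"excellent": 0.9, "good": 0.4, "poor": -0.6}
words = ["excellent", "good", "poor", "table"]
print(assign_emotional_indices(words, dictionary, 0.5, 1.0))
```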

The text mining method according to claim 3, wherein
the instruction input step further includes a step of receiving an instruction specifying a change in the range of the emotional index while the extracted feature words are displayed together with the assigned emotional indices in the display step.

The text mining method according to claim 1, further comprising
a document emotional index calculation step of calculating, for each of the plurality of documents designated as the target documents, an emotional index of the document as a document emotional index, based on those feature words, among the feature words extracted from the document in the feature word extraction step, to which an emotional index has been assigned in the emotional index acquisition step, wherein
in the display step, a display indicating the document emotional index calculated in the document emotional index calculation step is performed.

The text mining method according to claim 5, wherein
in the document emotional index calculation step, the document emotional index Ctx for each of the plurality of documents designated as the target documents is calculated by the following formula:
Ctx = (Naf - Nng) / (Naf + Nng)
Here, Naf is the number of appearances of positive feature words in the document, and Nng is the number of appearances of negative feature words in the document.
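
As an illustration of the formula above, the following minimal Python sketch computes Ctx for one document from the number of appearances of each feature word and the emotional index assigned to it. The function name, the input representation, the rule that a word counts as positive or negative according to the sign of its emotional index, and the choice to return `None` when the document contains no positive or negative feature words (where the formula would divide by zero) are assumptions made for this example.

```python
from typing import Mapping, Optional


def document_emotional_index(
    appearances: Mapping[str, int],
    word_index: Mapping[str, float],
) -> Optional[float]:
    """Compute Ctx = (Naf - Nng) / (Naf + Nng) for one document.

    appearances: number of appearances of each feature word in the document.
    word_index:  emotional index assigned to each feature word, if any.
    """
    # Naf: total appearances of feature words with a positive index.
    n_af = sum(n for w, n in appearances.items() if word_index.get(w, 0.0) > 0)
    # Nng: total appearances of feature words with a negative index.
    n_ng = sum(n for w, n in appearances.items() if word_index.get(w, 0.0) < 0)
    total = n_af + n_ng
    if total == 0:
        # Undefined by the formula; returning None is an assumption.
        return None
    return (n_af - n_ng) / total


# Hypothetical counts and indices, for illustration only.
counts = {"excellent": 3, "poor": 1, "table": 5}
indices = {"excellent": 0.8, "poor": -0.6}
print(document_emotional_index(counts, indices))  # (3 - 1) / (3 + 1) = 0.5
```

By construction Ctx lies between -1 (only negative feature words appear) and +1 (only positive feature words appear).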

The text mining method according to claim 5, wherein
in the display step, the name of each of the plurality of documents designated as the target documents is displayed with a background color whose hue differs depending on whether the emotional index of the document is positive or negative and whose density corresponds to the emotional index of the document.

The text mining method according to claim 1, wherein
in the display step, each feature word to which an emotional index has been assigned in the emotional index acquisition step is displayed with a background color whose hue differs depending on whether the emotional index of the feature word is positive or negative and whose density corresponds to the emotional index of the feature word.

A text mining program stored in a medium for comparing tendencies of emotional polarity among a plurality of documents, the program causing a CPU in a computer to execute, using a memory:
an instruction input step of receiving an instruction designating, as target documents, a plurality of documents whose tendencies of emotional polarity are to be compared;
a feature word extraction step of extracting feature words from each of the plurality of documents based on text data of the plurality of documents designated as the target documents;
an emotional index acquisition step of assigning, to each feature word that is registered in a predetermined emotion word dictionary among the feature words extracted in the feature word extraction step, the emotional index that is assigned to that feature word in the emotion word dictionary as a numerical value representing the intensity of emotional polarity; and
a display step of displaying, for the plurality of documents designated as the target documents, the feature words extracted in the feature word extraction step together with the emotional indices assigned in the emotional index acquisition step.

The text mining program stored in a medium according to claim 9, wherein
the instruction input step further includes a step of receiving an instruction specifying a range of feature words to be extracted from the target documents, and
in the feature word extraction step, feature words within the range specified in the instruction input step are extracted.

The text mining program stored in a medium according to claim 9, wherein
the instruction input step further includes a step of receiving an instruction specifying a range of the emotional index, which is an index representing the intensity of emotional polarity, and
in the emotional index acquisition step, among the feature words extracted in the feature word extraction step, each feature word registered in the emotion word dictionary as a word to which an emotional index within the range specified in the instruction input step is assigned is given that emotional index.

The text mining program stored in a medium according to claim 11, wherein
the instruction input step further includes a step of receiving an instruction specifying a change in the range of the emotional index while the extracted feature words are displayed together with the assigned emotional indices in the display step.

The text mining program stored in a medium according to claim 9, further comprising
a document emotional index calculation step of calculating, for each of the plurality of documents designated as the target documents, an emotional index of the document as a document emotional index, based on those feature words, among the feature words extracted from the document in the feature word extraction step, to which an emotional index has been assigned in the emotional index acquisition step, wherein
in the display step, a display indicating the document emotional index calculated in the document emotional index calculation step is performed.

A text mining apparatus for comparing tendencies of emotional polarity among a plurality of documents, comprising:
an instruction input unit that receives an instruction designating, as target documents, a plurality of documents whose tendencies of emotional polarity are to be compared;
a feature word extraction unit that extracts feature words from each of the plurality of documents based on text data of the plurality of documents designated as the target documents;
an emotional index acquisition unit that assigns, to each feature word that is registered in a predetermined emotion word dictionary among the feature words extracted by the feature word extraction unit, the emotional index that is assigned to that feature word in the emotion word dictionary as a numerical value representing the intensity of emotional polarity; and
a display unit that displays, for the plurality of documents designated as the target documents, the feature words extracted by the feature word extraction unit together with the emotional indices assigned by the emotional index acquisition unit.

The text mining apparatus according to claim 14, wherein
the instruction input unit further receives an instruction specifying a range of feature words to be extracted from the target documents, and
the feature word extraction unit extracts feature words within the range specified by the instruction received by the instruction input unit.

The text mining apparatus according to claim 14, wherein
the instruction input unit further receives an instruction specifying a range of the emotional index, which is an index representing the intensity of emotional polarity, and
the emotional index acquisition unit assigns, among the feature words extracted by the feature word extraction unit, to each feature word registered in the emotion word dictionary as a word to which an emotional index within the range specified by the instruction received by the instruction input unit is assigned, that emotional index.

The text mining apparatus according to claim 16, wherein
the instruction input unit further receives an instruction specifying a change in the range of the emotional index while the extracted feature words are displayed together with the assigned emotional indices by the display unit.

The text mining apparatus according to claim 14, further comprising
a document emotional index calculation unit that calculates, for each of the plurality of documents designated as the target documents, an emotional index of the document as a document emotional index, based on those feature words, among the feature words extracted from the document by the feature word extraction unit, to which an emotional index has been assigned by the emotional index acquisition unit, wherein
the display unit performs a display indicating the document emotional index calculated by the document emotional index calculation unit.