KR20200017568A

KR20200017568A - Sentence sentiment classification system and method based on sentiment dictionary construction by the price fluctuation and convolutional neural network

Info

Publication number: KR20200017568A
Application number: KR1020180085570A
Authority: KR
Inventors: 양형정; 김미선
Original assignee: 전남대학교산학협력단
Priority date: 2018-07-23
Filing date: 2018-07-23
Publication date: 2020-02-19
Also published as: KR102086642B1

Abstract

The present invention relates to a system and method for building a sentiment dictionary according to price fluctuations and classifying sentence sentiment based on a convolutional neural network. The system includes a data preprocessing unit, a sentiment dictionary construction unit, and a sentiment classification unit. The data preprocessing unit extracts sentences from a document set, performs morphological analysis, and handles stopwords by removing words with no discriminating power from the document set. The sentiment dictionary construction unit extracts words using their inner and outer boundary values and represents them as vectors in order to capture the meaning and frequency of words in the document set, collects price fluctuation information for a specific field, and creates a sentiment dictionary by classifying words as positive or negative based on the price fluctuation information. The sentiment classification unit classifies the sentiment of a sentence using a convolutional neural network (CNN) based on the sentiment dictionary.

Description

Sentence sentiment classification system and method based on sentiment dictionary construction by the price fluctuation and convolutional neural network

The present invention relates to a system and method for building a sentiment dictionary according to price fluctuations and classifying sentence sentiment based on a convolutional neural network. More particularly, it relates to a system and method that extract words from document sets in various fields to build a sentiment dictionary, classify the words as positive or negative according to price fluctuations, and use the classified results to classify the sentiment of sentences.

The problem of classifying sentiment by extracting words from documents has long been studied. However, there is no standardized Korean sentiment dictionary, and dictionaries are built and used separately for each field.

Writing is ordinarily formed when words come together into sentences and sentences come together into a complete text. The word, the basic unit of the sentences that make up a text, raises two questions: how one word relates to other words, and which words are chosen for the content being written. Resolving these two questions is what ordinarily happens in the process of writing, and by closely analyzing the second question, namely which words are chosen, the writer's sentiment can be derived.

That is, although each word is adapted within a sentence and may take on different meanings depending on context, each word has an inherent tendency of its own. A writer may therefore make heavy use of words with a particular inherent tendency. Approached from literary, cognitive-linguistic, and psychoanalytic perspectives, words can be analyzed against general human tendencies to establish criteria for classifying them, and classifying words by those criteria makes it possible to classify the writer's sentiment.

Recently, large amounts of structured and unstructured text information are generated over the Internet every day. As of 2012, each person used an average of three social networking service (SNS) accounts, and about 1.8 trillion gigabytes of data were generated per year. To process the data flooding the online world, techniques for collecting the necessary data and classifying its sentiment are important.

Korean Registered Patent No. 10-1855168 (published May 10, 2018)

Accordingly, the present invention is intended to overcome the shortcomings of the prior art. One object is to analyze data and classify sentences as positive or negative for application in fields such as marketing or opinion polling. Another object is to improve the performance of sentence sentiment classification for various kinds of text information.

To achieve these technical objects, a sentence sentiment classification system based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to one aspect of the present invention may include a data preprocessing unit, a sentiment dictionary construction unit, and a sentiment classification unit. The data preprocessing unit extracts sentences from a document set, performs morphological analysis, and handles stopwords by removing words with no discriminating power from the document set.

Preferably, the sentiment dictionary construction unit extracts and vectorizes words using their inner and outer boundary values in order to capture the meaning and frequency of words in the document set, collects price fluctuation information for a specific field, and generates a sentiment dictionary by classifying words as positive or negative based on the price fluctuation information. In addition, the sentiment classification unit may classify the sentiment of a sentence using a convolutional neural network (CNN) based on the sentiment dictionary.

In addition, a sentence sentiment classification method based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to another aspect of the present invention includes a data collection step (S10) of collecting news articles related to a target object using web crawling and collecting price fluctuation information for a specific field, and a data preprocessing step (S20) of extracting sentences from the collected document set, performing morphological analysis on the documents in units of eojeol (space-delimited word units), and handling stopwords by removing words with no discriminating power from the document set.

The method may further include a sentiment dictionary construction step (S30) of representing words as vectors using inner and outer boundary values in order to capture the meaning and frequency of the words, and generating a sentiment dictionary by classifying the words as positive or negative using the price fluctuation information, and a sentiment classification step (S40) of classifying the sentiment of a sentence using the sentiment dictionary as training data for a convolutional neural network (CNN).

As described above, the sentence sentiment classification system and method based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to the present invention classify large amounts of information or data so that they can be grasped at a glance, and can therefore be applied to fields such as marketing or opinion polling. In addition, by defining the positive or negative polarity of documents using price fluctuation information and applying deep learning, the invention can achieve high performance in sentence sentiment classification for various kinds of text information.

FIG. 1 is a conceptual diagram schematically illustrating a sentence sentiment classification system based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the sentence sentiment classification system based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a model that classifies sentences using a convolutional neural network according to an embodiment of the present invention.
FIG. 4 is a conceptual diagram schematically illustrating a sentence sentiment classification method based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating the sentence sentiment classification method based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the sentiment dictionary construction step of the sentence sentiment classification method based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice it. However, the present invention may be implemented in many different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted in order to describe the invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", and "... module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings.

Like reference numerals in the drawings denote like elements.

FIG. 1 is a conceptual diagram schematically illustrating a sentence sentiment classification system based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to an embodiment of the present invention, and FIG. 2 is a block diagram illustrating the same system.

The sentence sentiment classification system according to an embodiment of the present invention builds a sentiment dictionary by calculating rising and falling indices for words based on the price fluctuations of a target object, adjusted for consumer prices, and can classify sentences as positive or negative using a convolutional neural network (CNN). That is, the sentence sentiment classification system 10 extracts words using inner and outer boundary values, builds a sentiment dictionary based on price, and classifies the sentiment of sentences using a convolutional neural network. In addition, by building the sentiment dictionary based on the rise and fall of the price of a specific object, the system can classify a document set as positive or negative.

The sentence sentiment classification system 10 may include a data collection unit 100, a data preprocessing unit 200, a sentiment dictionary construction unit 300, a sentiment classification unit 400, and a storage unit 500. The data collection unit 100 may collect various sets of text documents, such as online news articles or newspaper articles.

The data preprocessing unit 200 extracts sentences from the document set, performs morphological analysis, and handles stopwords by removing words with no discriminating power from the document set. That is, after splitting unstructured text documents into sentences, it treats as stopwords any words that are unrelated to the analysis or that are mentioned frequently regardless of price fluctuations. In addition, among the extracted candidate words, words that occur relatively often in both positive and negative documents may also be judged to have no discriminating power and be treated as stopwords.

The data preprocessing unit 200 may include a morphological analysis module 210, which extracts sentences from the document set and analyzes their morphemes, and a stopword processing module 220, which removes words with no discriminating power from the document set.

That is, the data preprocessing unit 200 splits each document into individual words so that the sentiment dictionary construction unit 300, or a computer, can interpret the document. A document contains words that are important for classification, but it also contains words with no discriminating power. Such words are called stopwords, and it is preferable to delete them to improve classification performance. Accordingly, words that appear relatively often in the documents, punctuation marks, and the like are removed. In addition, after the document set is split into sentences, data unrelated to the topic may be deleted.
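The stopword rule described above (drop words that are frequent in both the positive and the negative documents) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `min_ratio` threshold, and the use of document frequency as the measure of "frequent" are all assumptions.

```python
from collections import Counter


def find_stopwords(rise_docs, fall_docs, min_ratio=0.5):
    """Return words present in at least `min_ratio` of the documents in BOTH
    groups. Such words carry no polarity signal. (Hypothetical helper; the
    threshold is an assumed parameter, not a value from the source.)"""

    def doc_freq(docs):
        counts = Counter()
        for tokens in docs:
            counts.update(set(tokens))  # count each word once per document
        return counts

    rise_df, fall_df = doc_freq(rise_docs), doc_freq(fall_docs)
    return {
        w
        for w in rise_df.keys() & fall_df.keys()
        if rise_df[w] / len(rise_docs) >= min_ratio
        and fall_df[w] / len(fall_docs) >= min_ratio
    }


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


# Toy tokenized documents, grouped by whether the price rose or fell.
rise_docs = [["onion", "price", "surge"], ["onion", "shortage"]]
fall_docs = [["onion", "price", "drop"], ["onion", "glut"]]
stopwords = find_stopwords(rise_docs, fall_docs, min_ratio=1.0)
filtered = remove_stopwords(rise_docs[0], stopwords)
```

Using document frequency rather than raw token counts keeps one long article from dominating the decision; other thresholds or measures would fit the description equally well.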

The sentiment dictionary construction unit 300 extracts words using their inner and outer boundary values and represents them as vectors in order to capture the meaning and frequency of words in the document set, collects price fluctuation information for a specific field, and classifies words as positive or negative based on that information. That is, the rise and fall of the price may be used as the criterion for classifying the document set as positive or negative.

The sentiment dictionary construction unit 300 may include a price fluctuation information module 310, which collects price fluctuation information for a specific field in order to classify the polarity of the sentiment dictionary, and a candidate keyword selection module 320, which selects candidate keywords consisting of meaningful words for the sentiment dictionary.

The inner and outer boundary values of words may be used to select candidate keywords for the sentiment dictionary. This algorithm may be an unsupervised learning method that recognizes words using both an inner boundary value, which extracts statistical information from the characters that make up a word, and an outer boundary value, which extracts statistical information from the other characters surrounding the word. The unsupervised learning method may extract keywords by calculating a ranking according to position within the eojeol (space-delimited word unit).

The outer boundary value refers to the likelihood that other words appear to the left and right of a given word, and the inner boundary value refers to the cohesion of the consecutive characters that make up the word. After a sentence is split into tokens at the spaces, words are extracted using the position information of each substring, and the extracted words can be divided into a set of words that carry meaning, such as nouns and roots, and a set of words that serve a grammatical function, such as endings and particles.
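To make the two boundary values concrete, the sketch below scores a candidate substring in two ways. The source does not give formulas, so both measures are assumptions chosen for illustration: the inner value is approximated by a prefix cohesion score (how consistently the first character extends into the full substring), and the outer value by the entropy of the character that follows the substring. All function names are hypothetical.

```python
import math
from collections import Counter


def prefix_counts(tokens):
    """Count how many tokens start with each prefix."""
    counts = Counter()
    for tok in tokens:
        for i in range(1, len(tok) + 1):
            counts[tok[:i]] += 1
    return counts


def cohesion(word, counts):
    """Assumed inner boundary value: geometric mean of the character-to-character
    transition probabilities inside `word`, estimated from prefix counts."""
    if len(word) < 2 or counts[word] == 0:
        return 0.0
    return (counts[word] / counts[word[0]]) ** (1 / (len(word) - 1))


def right_branching_entropy(word, tokens):
    """Assumed outer boundary value: entropy of the character that follows
    `word`. High entropy suggests that `word` ends at a real word boundary."""
    followers = Counter(
        tok[len(word)]
        for tok in tokens
        if tok.startswith(word) and len(tok) > len(word)
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in followers.values())


# Space-delimited tokens: the stem "양파" (onion) followed by different particles.
tokens = ["양파가", "양파를", "양파는", "양상추"]
counts = prefix_counts(tokens)
inner = cohesion("양파", counts)
outer = right_branching_entropy("양파", tokens)
```

In this toy corpus the stem scores high on both measures, while a stem-plus-particle string scores lower on cohesion and has no followers at all, which matches the intuition in the paragraph above.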

Price information for a preset field may be added in order to classify the polarity of the sentiment dictionary. An embodiment of the present invention is described as follows. In [Equation 1] below, the frequency (freq) may be calculated by summing the number of articles in which a given preset word appears.

[Equation 1]

In addition, using [Equation 2] below, the rise value (pos) may be calculated by summing the number of cases in which an article containing the given word falls in a month in which the monthly object price (Object month price, OMP) rose. A vocabulary dictionary specialized for the object's price may also be built, in which the value of a positive word associated with rising prices is expressed as a rising index and the value of a negative word associated with falling prices is expressed as a falling index.

[Equation 2]

Next, the sentiment dictionary is completed by calculating the rising and falling indices of the extracted words. The rising index is the rise value divided by the frequency, and may be expressed as [Equation 3] below.

[Equation 3]
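The equation images are not reproduced in this text. From the verbal definitions alone, one reading consistent with Equations 1 to 3 is sketched below. The notation is an assumption introduced here for illustration: $A$ is the set of collected articles, $m(a)$ is the month of article $a$, $M^{+}$ is the set of months in which the OMP rose, and $\mathbb{1}[\cdot]$ is the indicator function. The original equations may use different symbols.

```latex
\begin{align}
\mathrm{freq}(w) &= \sum_{a \in A} \mathbb{1}[\, w \in a \,] \\
\mathrm{pos}(w)  &= \sum_{a \in A} \mathbb{1}[\, w \in a \,]\,\mathbb{1}[\, m(a) \in M^{+} \,] \\
\mathrm{rise}(w) &= \frac{\mathrm{pos}(w)}{\mathrm{freq}(w)}
\end{align}
```

Under this reading the rising index always lies between 0 and 1, since $\mathrm{pos}(w) \le \mathrm{freq}(w)$ for every word $w$.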

Each word in the sentiment dictionary has a rising index or a falling index, and the completed sentiment dictionary may be used as training data for the classification model. When article data is fed into the classification model as sentences, the words are embedded as multidimensional row vectors. In addition, the rising and falling indices of the words in an article may be calculated to determine whether the overall content is positive or negative.
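The index calculation can be sketched in a few lines. This is a minimal illustration under stated assumptions: articles are `(month, tokens)` pairs, `rising_months` is the set of months in which the object price rose, the falling index is taken as one minus the rising index (the source does not define it explicitly), and the averaging rule in `classify_article` is a simple stand-in for illustration, not the CNN classifier the patent describes.

```python
from collections import defaultdict


def build_sentiment_dictionary(articles, rising_months):
    """articles: iterable of (month, tokens) pairs.
    rising_months: set of months in which the object price rose (assumed input)."""
    freq = defaultdict(int)  # number of articles containing the word
    pos = defaultdict(int)   # ... of which fall in a month with a rising price
    for month, tokens in articles:
        for word in set(tokens):  # count each word once per article
            freq[word] += 1
            if month in rising_months:
                pos[word] += 1
    # Falling index as 1 - rising index is an assumption made for this sketch.
    return {
        w: {"rise": pos[w] / freq[w], "fall": 1 - pos[w] / freq[w]}
        for w in freq
    }


def classify_article(tokens, dictionary, threshold=0.5):
    """Illustrative rule only: average the rising index of the known words."""
    scores = [dictionary[t]["rise"] for t in tokens if t in dictionary]
    if not scores:
        return None  # no dictionary word found in the article
    mean = sum(scores) / len(scores)
    return "positive" if mean > threshold else "negative"


articles = [
    ("2017-01", ["shortage", "onion"]),
    ("2017-02", ["glut", "onion"]),
    ("2017-03", ["shortage", "frost"]),
]
dictionary = build_sentiment_dictionary(articles, {"2017-01", "2017-03"})
label = classify_article(["shortage", "onion"], dictionary)
```

Note that "positive" here means "associated with rising prices", following the document's convention, not that the article is favorable in tone.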

The sentiment classification unit 400 classifies the polarity of sentences using a convolutional neural network (CNN). That is, the sentiment classification unit 400 may classify the sentiment of a sentence using the convolutional neural network based on the constructed sentiment dictionary. The sentiment classification unit 400 may include a convolutional neural network module 410 that runs the CNN to classify sentences as positive or negative.

Statistical inference methods such as Bayesian classification, nearest-neighbor techniques, and support vector machines (SVM) may also be used for natural language processing and sentiment classification.

FIG. 3 is a diagram illustrating a model that classifies sentences using a convolutional neural network according to an embodiment of the present invention. Convolutional neural networks are mainly used for image processing, but the filters of a text CNN can preserve the local information of text, that is, word order and context. The CNN filters used for image processing can thus be used to capture the order in which words appear and their contextual information.

When a sentence contains n words in total, each word is a k-dimensional vector. That is, an article of n words can be embedded with each word as a k-dimensional row vector. Here, the size of the filter window is h. According to an embodiment of the present invention, when the parameters are set, the word vectors may be initialized to random values and updated during training.
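The roles of n, k, and h can be illustrated with a single convolution filter in plain Python. This is a sketch, not the patent's network: the ReLU activation, the max-over-time pooling, and the uniform initialization range are assumptions borrowed from common sentence-CNN practice, and the function names are hypothetical.

```python
import random


def random_embeddings(vocab, k, seed=0):
    """Randomly initialized k-dimensional word vectors (updated during training
    in a real model; the range [-0.25, 0.25] is an assumed choice)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.25, 0.25) for _ in range(k)] for w in vocab}


def conv_max_pool(sentence, filt, bias=0.0):
    """sentence: n x k list of word vectors; filt: h x k filter weights.
    Slides the filter over every window of h consecutive words, applies ReLU,
    and returns the maximum response (max-over-time pooling)."""
    n, h = len(sentence), len(filt)
    features = []
    for i in range(n - h + 1):
        s = bias
        for j in range(h):
            s += sum(x * w for x, w in zip(sentence[i + j], filt[j]))
        features.append(max(0.0, s))  # ReLU
    return max(features) if features else 0.0


# n = 3 words, k = 2 dimensions, filter window h = 2.
sentence = [[1, 0], [0, 1], [1, 1]]
filt = [[1, 0], [0, 1]]
feature = conv_max_pool(sentence, filt)
```

Because the filter spans h consecutive rows, each response depends on word order within the window, which is the "local information" the description refers to. A full classifier would use many filters of several window sizes and feed the pooled features to an output layer.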

The storage unit 500 may include a document set storage module 510, a price fluctuation information storage module 520, and a sentiment dictionary storage module 530. The document set storage module 510 may store the various text document sets collected by the data collection unit 100. Here, the text document sets may include news articles or newspaper articles. The price fluctuation information storage module 520 may store the price fluctuation information for a specific field collected by the sentiment dictionary construction unit 300. The sentiment dictionary storage module 530 may store the sentiment dictionary data generated by the sentiment dictionary construction unit 300.

The sentence sentiment classification system 10 according to an embodiment of the present invention extracts and vectorizes words using inner and outer boundary values in order to capture word frequency and context. In addition, it can generate a positive/negative sentiment dictionary based on the rise and fall of prices and classify sentences using a convolutional neural network.

As a result, words can be extracted from various text document sets to build a sentiment dictionary, and the positive or negative polarity of sentences can be extracted, which makes sentiment classification easier. In addition, the sentence sentiment classification system 10 according to an embodiment of the present invention can be used to predict agricultural product prices by combining unstructured data such as press articles with time series analysis or weather data.

FIG. 4 is a conceptual diagram schematically illustrating a sentence sentiment classification method based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to an embodiment of the present invention, and FIG. 5 is a flowchart illustrating the same method.

The sentence sentiment classification method according to an embodiment of the present invention collects the needed information from today's flood of data and classifies its sentiment. To classify the sentiment of sentences, it vectorizes words and builds a sentiment dictionary.

The sentence sentiment classification method according to an embodiment of the present invention may include a data collection step (S10) of collecting news related to a target object using web crawling and collecting price fluctuation information, and a data preprocessing step (S20) of extracting sentences from the collected document set, performing morphological analysis, and handling stopwords by removing words with no discriminating power from the document set. The data preprocessing step (S20) performs morphological analysis on the documents in units of eojeol (space-delimited word units) and handles stopwords to improve discrimination.

For example, onion-related news may be collected using web crawling, and onion prices may be collected from an agricultural product distribution information site. In addition, to build the sentiment dictionary, the data preprocessing step (S20) may filter out content unrelated to onions and perform morphological analysis.

Preprocessing of the article data collected for building the sentiment dictionary may proceed as follows. First, null values and data that do not match the expected format are deleted. Next, duplicate articles are removed based on their titles, morphological analysis is performed, and stopwords are removed.
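The first two cleanup steps (dropping null or malformed records, then removing duplicates by title) can be sketched as follows. The record layout with `title` and `body` keys is an assumption made for this example, and morphological analysis is left out because the source does not name an analyzer.

```python
def clean_articles(articles):
    """Drop records with missing or non-text fields and keep only the first
    article for each title. (Hypothetical record layout: dicts with 'title'
    and 'body' keys.)"""
    seen_titles = set()
    cleaned = []
    for art in articles:
        title, body = art.get("title"), art.get("body")
        # Null values or values in the wrong format are discarded.
        if not isinstance(title, str) or not isinstance(body, str):
            continue
        title, body = title.strip(), body.strip()
        if not title or not body or title in seen_titles:
            continue
        seen_titles.add(title)
        cleaned.append({"title": title, "body": body})
    return cleaned


raw = [
    {"title": "Onion prices surge", "body": "Supply fell after frost."},
    {"title": "Onion prices surge", "body": "Duplicate of the first article."},
    {"title": "Harvest report", "body": None},
    {"title": "  ", "body": "Missing title."},
    {"title": "Glut expected", "body": "Growers report a large harvest."},
]
articles = clean_articles(raw)
```

Deduplicating on the stripped title keeps the first copy of a syndicated article; matching on normalized body text would be a stricter alternative.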

In addition, the sentence sentiment classification method according to an embodiment of the present invention may include a sentiment dictionary construction step (S30) of representing words as vectors using inner and outer boundary values in order to capture the meaning and frequency of the words, and classifying them as positive or negative using price fluctuation information for a specific field. The sentiment dictionary construction step (S30) extracts keywords using the inner and outer boundary values in order to capture word frequency and context, and builds the sentiment dictionary.

FIG. 6 is a diagram illustrating the sentiment dictionary construction step of the sentence sentiment classification method based on price-fluctuation sentiment dictionary construction and a convolutional neural network according to an embodiment of the present invention.

For example, a sentiment dictionary may be built for the segmented vocabulary based on the rise and fall of the price of onions (the target object). Building the sentiment dictionary on the basis of agricultural product prices in this way makes it applicable to positive/negative analysis of articles about agricultural products.

Methods for selecting candidate keywords from a given sentence can be divided into supervised learning-based methods, which estimate words from training data, and unsupervised learning-based methods, which estimate words from statistical information without prior knowledge.

According to an embodiment of the present invention, the KR-WordRank algorithm, an unsupervised learning method that can generate positive/negative sets without prior knowledge, may be used for keyword selection. The KR-WordRank algorithm can recognize words by using both an internal boundary value, which extracts statistical information from the relationships between the characters constituting a word, and an external boundary value, which extracts statistical information from the other characters surrounding the word.

That is, the external boundary value indicates the likelihood that other words appear to the left and right of a given word, and the internal boundary value indicates the cohesion of the consecutive characters that make up the word. After a sentence is split into tokens at whitespace, words are extracted using the position information of each substring, and the extracted words may be classified into a set of words that carry meaning, such as nouns and roots, and a set of words that serve a grammatical function, such as endings and particles.
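The text does not give formulas for the two boundary values, so the following is only one plausible way to quantify them, not the KR-WordRank implementation. It assumes the internal boundary value is measured as a cohesion score (the geometric mean of the probability of each next character given the preceding prefix) and the external boundary value as the entropy of the character that follows the candidate word. All function names are hypothetical.

```python
import math
from collections import Counter


def prefix_counts(tokens, max_len=6):
    """Count every left-anchored substring (prefix) of each whitespace token."""
    counts = Counter()
    for token in tokens:
        for n in range(1, min(len(token), max_len) + 1):
            counts[token[:n]] += 1
    return counts


def internal_cohesion(word, counts):
    """Assumed internal boundary value: geometric mean of P(next char | prefix)."""
    if len(word) < 2 or counts[word] == 0:
        return 0.0
    # The product of the conditional probabilities telescopes to
    # count(word) / count(first character).
    return (counts[word] / counts[word[0]]) ** (1 / (len(word) - 1))


def external_entropy(word, tokens):
    """Assumed external boundary value: entropy of what follows `word` in tokens."""
    followers = Counter(
        token[len(word)] if len(token) > len(word) else ""  # "" marks end of token
        for token in tokens
        if token.startswith(word)
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in followers.values())


# Toy tokens: "양파" (onion) followed by three different particles, plus "양상추".
tokens = ["양파가", "양파는", "양파를", "양상추"]
counts = prefix_counts(tokens)
cohesion = internal_cohesion("양파", counts)  # 3 / 4 = 0.75
entropy = external_entropy("양파", tokens)    # ln(3): three equally likely followers
```

Under these assumed measures, a substring that is both cohesive internally and followed by many different characters is a good word candidate, which matches the intuition that a noun such as "양파" is followed by varying particles.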

In addition, the sentence sentiment classification method based on sentiment dictionary construction by price fluctuation and a convolutional neural network according to an embodiment of the present invention may include a sentiment classification step (S40) of classifying the polarity of a sentence by using the sentiment dictionary as training data for a convolutional neural network. In the sentiment classification step (S40), the convolutional neural network is used to classify a sentence as positive or negative.
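The network architecture is not detailed in this section, so the following is only a minimal sketch of the forward pass of a typical sentence-classification CNN: embedding lookup, one-dimensional convolution, ReLU, max-over-time pooling, and a sigmoid output. The function names, the hand-set weights, and the toy embeddings are assumptions for illustration; a real system would learn these parameters from training data derived from the sentiment dictionary.

```python
import math


def conv_max_pool(embedded, kernel, bias=0.0):
    """Slide one filter over the sentence matrix, apply ReLU, keep the maximum."""
    width = len(kernel)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        activation = bias + sum(
            w * x
            for kernel_row, word_vec in zip(kernel, window)
            for w, x in zip(kernel_row, word_vec)
        )
        best = max(best, activation)
    return best


def classify(tokens, embeddings, kernels, out_weights, out_bias=0.0):
    """Return ("positive" | "negative", probability) for one tokenized sentence."""
    dim = len(next(iter(embeddings.values())))
    # Unknown tokens map to a zero vector (assumption).
    embedded = [embeddings.get(t, [0.0] * dim) for t in tokens]
    features = [conv_max_pool(embedded, k) for k in kernels]
    logit = out_bias + sum(w * f for w, f in zip(out_weights, features))
    prob = 1.0 / (1.0 + math.exp(-logit))
    return ("positive" if prob >= 0.5 else "negative"), prob


# Hand-set toy parameters (hypothetical, not trained).
embeddings = {"shortage": [1.0, 0.0], "surplus": [0.0, 1.0]}
kernels = [[[1.0, 0.0]], [[0.0, 1.0]]]  # two filters of width 1
out_weights = [1.0, -1.0]
label, prob = classify(["onion", "shortage"], embeddings, kernels, out_weights)
```

With these toy weights the first filter responds to "shortage" and the second to "surplus", so the sentence containing "shortage" is labeled positive; the example shows only the data flow, not a trained model.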

Although preferred embodiments of the present invention have been described above, the present invention is not limited to these embodiments and includes all modifications within the scope recognized as equivalent that can be readily made from the embodiments by a person having ordinary skill in the art to which the invention pertains.

10: sentence sentiment classification system 100: data collection unit
200: data preprocessing unit 210: morphological analysis module
220: stopword processing module 300: sentiment dictionary construction unit
310: price fluctuation information module 320: candidate keyword selection module
400: sentiment classification unit 410: convolutional neural network module
500: storage unit 510: document set storage module
520: price fluctuation information storage module 530: sentiment dictionary storage module

Claims

A sentence sentiment classification system based on sentiment dictionary construction by price fluctuation and a convolutional neural network, comprising:
a data preprocessing unit that extracts sentences from a document set, performs morphological analysis, and processes stopwords by removing words having no discriminating power from the document set;
a sentiment dictionary construction unit that, in order to identify the meaning and frequency of words in the document set, extracts words and expresses them as vectors using internal/external boundary values of the words, collects price fluctuation information for a specific field, and creates a sentiment dictionary by classifying the words as positive or negative based on the price fluctuation information; and
a sentiment classification unit that classifies the sentiment of a sentence using a convolutional neural network (CNN) based on the sentiment dictionary.

The system of claim 1,
wherein the sentiment dictionary construction unit selects candidate keywords for the words of the sentiment dictionary using an internal boundary value, which extracts statistical information from the relationships between the characters constituting a word, and an external boundary value, which extracts statistical information from the other characters surrounding the word.

The system of claim 2,
wherein, in order to use the internal boundary value and the external boundary value, the sentiment dictionary construction unit splits a sentence into tokens at whitespace, extracts words using the position information of each substring, and classifies the extracted words into a set of words that carry meaning, such as nouns and roots, and a set of words that serve a grammatical function, such as endings and particles.

The system of claim 1,
wherein the frequency is calculated by summing the number of articles in the document set in which a specific word appears.

The system of claim 4,
wherein the sentiment dictionary construction unit
calculates a rising value (pos) by summing the number of articles that contain the specific word among the documents belonging to months in which the monthly price of the object (object month price, OMP) rose, and
calculates a rising index of the specific word to generate the sentiment dictionary based on the price fluctuation information,
the rising index being obtained by dividing the rising value (pos) by the frequency.

A sentence sentiment classification method based on sentiment dictionary construction by price fluctuation and a convolutional neural network, comprising:
a data collection step (S10) of collecting news articles related to an object by web crawling and collecting price fluctuation information for a specific field;
a data preprocessing step (S20) of extracting sentences from the collected document set, performing morphological analysis on the documents in word units, and processing stopwords by removing words having no discriminating power from the document set;
a sentiment dictionary construction step (S30) of expressing words as vectors using internal/external boundary values in order to identify the meaning and frequency of the words, and creating a sentiment dictionary by classifying the words as positive or negative using the price fluctuation information; and
a sentiment classification step (S40) of using the sentiment dictionary as training data for a convolutional neural network (CNN).

The method of claim 6,
wherein the data preprocessing step (S20)
deletes null values and data that do not match the required format from the document set, and
removes duplicate articles based on their titles and processes the stopwords.