KR101473239B1

KR101473239B1 - Category and Sentiment Analysis System Using Word Patterns

Info

Publication number: KR101473239B1
Application number: KR1020130089058A
Authority: KR
Inventors: 이호준; 배태환; 김병태
Original assignee: 주식회사 알에스엔
Priority date: 2013-07-26
Filing date: 2013-07-26
Publication date: 2014-12-16

Abstract

The present invention relates to a category and sentiment analysis system using word patterns and, more specifically, to a system that provides data for which category and sentiment analysis has been completed, in order to increase the accuracy of the results a user wants when searching massive data (electronic documents) through a computer system and the Internet.

Description

Category and Sentiment Analysis System using Word Pattern.

The present invention relates to a category and sentiment analysis system using word patterns and, more particularly, to a category and sentiment analysis system using word patterns that provides data for which category and sentiment analysis has been completed, so that the results a user wants can be obtained quickly and with higher accuracy in a search environment for large volumes of data (electronic documents) accessed through computer systems and the Internet.

Today, owing to advances in computer systems, ever-faster Internet networks, and the growing use of wireless Internet on mobile devices, most information from daily life, such as corporate business activities and personal hobbies, is converted into data and accumulated.

To manage such data efficiently, search technology is essential.

Although search technology is thus growing in importance, conventional keyword-based information retrieval is failing to fulfill its role because of the exponential growth of information.

Many companies have developed and are offering, or are currently developing, services such as semantic search, sentiment analysis search, and user experience search.

Text mining is an essential technique in these search technologies, and many classification extraction methods are in use.

Among conventional category extraction methods, the first is for a person to read a document directly, select specific keywords, and designate them as classifications. The result can vary with the operator's subjective judgment, and the method is ill-suited to the present situation, in which data volumes are growing and classifications are frequently changed and added.

The second is to simply extract keywords from a document and use them as classifications. Because its accuracy is low, this method is likely to extract irrelevant classifications.

Accordingly, there is a growing need to provide data for which category and sentiment analysis has been completed, so that the results a user wants can be obtained quickly and with higher accuracy in a search environment for large volumes of data (electronic documents) accessed through computer systems and the Internet.

Korean Patent Laid-Open Publication No. 10-2009-0125559 (2009.12.07)

Accordingly, the present invention has been proposed in view of the above problems of the prior art, and an object of the present invention is to provide a technique that automatically classifies large volumes of data (electronic documents), thereby raising the accuracy of search and classification, and that classifies data (electronic documents) accurately without special manual work.

That is, to classify data (electronic documents), paragraphs are selected using the word dictionary database; predicates and topic words are extracted from the selected paragraphs through morphological analysis; the words of the extracted paragraphs are normalized; the normalized words are classified using word patterns, and the category and sentiment of each extracted paragraph are then identified; and new word patterns are proposed by combining words found in documents whose category and sentiment analysis is complete, thereby improving classification accuracy.

Another object of the present invention is, when several categories and sentiments are found in a single electronic document, to designate as first rank the category and sentiment that are exposed most often among the extracted paragraphs.

To achieve the object of the present invention,

a category and sentiment analysis system using word patterns according to an embodiment of the present invention comprises:

a word dictionary database (200) storing words and excluded words;

a topic word dictionary database (300) storing topic words;

a predicate dictionary database (400) storing predicates;

a word pattern database (500) storing topic-specific word pattern information and tendency word pattern information; and

category/sentiment analysis means (100), which in turn comprises:

a morphological analysis unit (110) for selecting paragraphs in an electronic document by referring to the word dictionary database and performing morphological analysis on the selected paragraphs;

a topic word/predicate extraction unit (120) for extracting topic words from the morphologically analyzed words and, when an extracted topic word matches an entry in the topic word dictionary database, determining by reference to the predicate dictionary database whether a predicate is present and, if so, extracting the text as one paragraph;

a normalization unit (130) for determining by reference to the word dictionary database whether a word is registered and, if it is, removing excluded words to normalize the words;

a category/sentiment analysis unit (140) for obtaining the normalized words, determining by reference to the word pattern database whether they form a topic word pattern, and, in the case of a topic (category) word pattern, determining whether they form a tendency (sentiment) word pattern; and

a word pattern recommendation unit (150) for recommending word patterns present in electronic documents whose category and sentiment analysis is complete.

With this configuration, the problem addressed by the present invention is solved.

The category and sentiment analysis system using word patterns according to the present invention, having the configuration and operation described above, provides the following effects.

A user can search large volumes of data (electronic documents) quickly and accurately, and can easily access data from which unnecessary content has been removed.

In addition, compared with the machine-learning-based analysis currently in use, the system can be built with fewer personnel and is easier to maintain.

In addition, newly coined words appearing on the Internet can be applied quickly, and when a new issue arises, a category can conveniently be added and applied to the system.

In addition, because category and sentiment are analyzed at the same time, analysis is faster than in other systems, accuracy rises, and a variety of classifications can be performed.

FIG. 1 is an overall configuration diagram of a category and sentiment analysis system using word patterns according to an embodiment of the present invention.
FIG. 2 is a block diagram of the category/sentiment analysis means of the category and sentiment analysis system using word patterns according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method of extracting paragraphs by topic in the category and sentiment analysis system using word patterns according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of morphologically analyzing paragraphs and normalizing words in the category and sentiment analysis system using word patterns according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method of classifying the category and sentiment of a document using word patterns in the category and sentiment analysis system using word patterns according to an embodiment of the present invention.

To achieve the above object, a category and sentiment analysis system using word patterns according to an embodiment of the present invention

comprises category/sentiment analysis means (100) that includes a word pattern recommendation unit (150) for recommending word patterns present in electronic documents whose category and sentiment analysis is complete.

Here, the normalization unit (130) outputs a word to the screen when it is an unregistered word.

Here, the category/sentiment analysis unit (140) matches the words, divided by topic and normalized through morphological analysis, against the word patterns for each category (topic) and tendency (sentiment).

It separates the normalized words into grams and matches them against each word pattern.

It raises the topic exposure score in the case of a topic word pattern, and raises the tendency exposure score in the case of a tendency word pattern.

For combination word patterns of two or three words, it determines whether the words appear in sequence in the morphological analysis result and, if they do, treats the pattern as matched and raises the exposure score.

Here, the word pattern recommendation unit (150) sorts the patterns of each category and sentiment in order of exposure, outputs them to the screen, and stores the word patterns selected by the user in the word pattern database.

Hereinafter, the category and sentiment analysis system using word patterns according to the present invention will be described in detail through embodiments.

FIG. 1 is an overall configuration diagram of a category and sentiment analysis system using word patterns according to an embodiment of the present invention.

As shown in FIG. 1, the category and sentiment analysis system using word patterns according to the present invention comprises category/sentiment analysis means (100) that performs category and sentiment analysis and recommends word patterns.

The system of the present invention can perform functions such as: extracting topic words and predicates from an electronic document; dividing the various topics present in a document into paragraphs by topic word and predicate; morphologically analyzing the paragraphs classified by topic; normalizing the morphologically analyzed words; managing an excluded-word dictionary for word normalization; recommending as newly coined words those words left unanalyzed after morphological analysis; classifying the category and sentiment of a document by applying word patterns to the normalized words; analyzing category and sentiment at the same time; recommending new word patterns using the words of documents whose category and sentiment analysis is complete; managing the topic word and predicate dictionaries separately; adding and recommending topic words and predicates from morphological analysis results; and discovering newly coined words appearing on the Internet and applying them to the dictionaries.

As its effect, the present invention automatically classifies large volumes of data (electronic documents), raising the accuracy of search and classification, and classifies data (electronic documents) accurately without special manual work.

In addition, to classify data (electronic documents), paragraphs are selected by referring to the word dictionary database, and predicates and topic words are extracted through morphological analysis.

The words of the extracted paragraphs are then normalized, the normalized words are classified using word patterns, and the category and sentiment of each extracted paragraph are identified.

In addition, when several categories and sentiments are found in a single electronic document, the category and sentiment exposed most often among the extracted paragraphs are designated as first rank.

In addition, new word patterns are proposed by combining words found in documents whose category and sentiment analysis is complete, improving classification accuracy.

As topic words of the present invention, words with a high exposure rank in each category are designated before classification into categories.

A word pattern of the present invention is a combination of morphologically analyzed topic words and predicates (topic word + topic word, topic word + predicate, and so on), and each combination has a category and a sentiment.

The word dictionary database (200) stores words and excluded words, the topic word dictionary database (300) stores topic words, the predicate dictionary database (400) stores predicates, and the word pattern database (500) stores topic-specific word pattern information and tendency word pattern information.
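
As a rough illustration of what the four databases (200, 300, 400, 500) hold, the sketch below models them as in-memory Python containers. The class and field names (`WordDictionaryDB`, `topic_patterns`, and so on) are hypothetical, and the use of sets and of dictionaries keyed by word tuples is an assumption; the specification does not describe a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# A pattern is one to three words in order (assumed representation).
Pattern = Tuple[str, ...]


@dataclass
class WordDictionaryDB:
    """Stand-in for database 200: registered words and excluded words."""
    words: Set[str] = field(default_factory=set)
    excluded_words: Set[str] = field(default_factory=set)


@dataclass
class WordPatternDB:
    """Stand-in for database 500: pattern -> label mappings."""
    topic_patterns: Dict[Pattern, str] = field(default_factory=dict)     # pattern -> category
    tendency_patterns: Dict[Pattern, str] = field(default_factory=dict)  # pattern -> sentiment


@dataclass
class Dictionaries:
    """Bundle of all four stores (hypothetical convenience wrapper)."""
    word_db: WordDictionaryDB = field(default_factory=WordDictionaryDB)
    topic_words: Set[str] = field(default_factory=set)  # database 300
    predicates: Set[str] = field(default_factory=set)   # database 400
    pattern_db: WordPatternDB = field(default_factory=WordPatternDB)


# Illustrative entries only; none of these values come from the patent.
dbs = Dictionaries()
dbs.word_db.words.update({"battery", "lasts", "long"})
dbs.topic_words.add("battery")
dbs.predicates.add("lasts")
dbs.pattern_db.topic_patterns[("battery",)] = "battery life"
dbs.pattern_db.tendency_patterns[("lasts", "long")] = "positive"
```

Keying patterns by tuples keeps one-, two-, and three-word patterns in a single lookup table, which is one simple way to honor the "each combination has a category and a sentiment" description.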

FIG. 2 is a block diagram of the category/sentiment analysis means of the category and sentiment analysis system using word patterns according to an embodiment of the present invention.

As shown in FIG. 2, the category/sentiment analysis means (100) includes a word pattern recommendation unit (150) for recommending word patterns present in electronic documents whose category and sentiment analysis is complete.

The morphological analysis unit (110) refers to the word dictionary database to extract paragraphs from an electronic document using topic words and predicates, and performs morphological analysis on the paragraphs.

The topic word/predicate extraction unit (120) extracts topic words from the morphologically analyzed words; when an extracted topic word matches an entry in the topic word dictionary database, it determines by reference to the predicate dictionary database whether a predicate is present and, if so, extracts the text as one paragraph.

To describe this in detail with reference to FIG. 3, paragraphs are selected from the collected documents and morphological analysis is first performed (S100).

Topic words are extracted from the morphologically analyzed words, and the topic word dictionary database (300) is consulted to extract them.

When a morphologically analyzed word matches a word stored in the topic word dictionary database (300) (S110), it is recognized as a topic word, and a search is made for a predicate following the topic word.

If a predicate is present within the next sentence (S120), the text is recognized as one paragraph, and the extracted paragraphs (S130) are managed as an array.

Alternatively, the paragraphs may be stored in a separately built database, or stored and managed in a separate data field of the word dictionary database.

The extracted paragraph is then deleted from the document, and the search for another topic resumes.

If no predicate is present, it is difficult to settle on a single topic even when several topics exist; therefore, when only topic words are present and no predicate is present, the text may be regarded as having a single topic.
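
The FIG. 3 flow (S100 to S130) can be sketched as follows. The sketch assumes that morphological analysis has already produced a list of sentences, each a list of word tokens; the function name `extract_topic_paragraphs` and the plain sets standing in for the dictionary databases (300) and (400) are hypothetical. It reads "within the next sentence" as the remainder of the current sentence or the sentence that follows, and it simply skips a topic word that has no predicate rather than implementing the single-topic fallback.

```python
from typing import List, Set


def extract_topic_paragraphs(
    sentences: List[List[str]],
    topic_words: Set[str],
    predicates: Set[str],
) -> List[List[str]]:
    """Return topic paragraphs as an array of token lists (S130)."""
    paragraphs: List[List[str]] = []
    i = 0
    while i < len(sentences):
        tokens = sentences[i]
        # S110: find the first token registered as a topic word.
        topic_pos = next(
            (k for k, tok in enumerate(tokens) if tok in topic_words), None
        )
        if topic_pos is None:
            i += 1
            continue
        # S120: look for a predicate after the topic word in this sentence...
        if any(tok in predicates for tok in tokens[topic_pos + 1:]):
            paragraphs.append(list(tokens))
            i += 1  # consumed sentences are not revisited
        # ...or anywhere in the following sentence.
        elif i + 1 < len(sentences) and any(
            tok in predicates for tok in sentences[i + 1]
        ):
            paragraphs.append(tokens + sentences[i + 1])
            i += 2
        else:
            i += 1  # topic word without predicate: skipped in this sketch
    return paragraphs


# Illustrative input; the English tokens are placeholders for morphemes.
sentences = [
    ["battery", "lasts", "long"],
    ["hello"],
    ["screen"],
    ["looks", "great"],
    ["camera"],
    ["nothing"],
]
paragraphs = extract_topic_paragraphs(
    sentences,
    topic_words={"battery", "screen", "camera"},
    predicates={"lasts", "great"},
)
```

Advancing the index past consumed sentences plays the role of "deleting the extracted paragraph from the document" before the search for the next topic.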

The topic word dictionary database (300) and the predicate dictionary database (400) can be populated by the user from morphological analysis results, and the more words each dictionary database stores, the more accurately topics can be identified.

A separate registration page or registration button may be provided for the user to register words.

The reason for dividing the text into paragraphs is to identify the highest-ranked topic among the several topics present in an electronic document, and thereby improve the accuracy of each category's word patterns and of the sentiment analysis.

Next, the normalization unit (130) will be described with reference to FIG. 4.

The normalization unit (130) determines by reference to the word dictionary database whether a word is registered and, if it is, removes excluded words to normalize the words.

Specifically, several steps are performed to normalize the morphologically analyzed words in the paragraphs classified by topic.

The process is fundamentally driven by the dictionary database: a word is used in the analysis only if the user has registered it.

A word that the user has registered but that is stored as an excluded word in the word dictionary database (200) cannot enter the normalized list, and a word not registered in the dictionary is treated as a newly coined word.

Among newly coined words, those the user does not want treated as such are removed, and the remainder is managed as unregistered words.

The unregistered words are collected together and recommended to the user.

The user can review the unregistered words and register them in the word dictionary database; as the dictionary grows, the number of unregistered words falls and analysis accuracy improves.

In operation, each word in a paragraph is checked to determine whether it is registered (S200); if it is, excluded words are removed (S210) and the result is stored as normalized words.

If a word is not registered, it is treated as an unregistered word; excluded words are removed (S220) and the unregistered word is then stored.

The unregistered words may be managed separately in a database of their own.
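
The FIG. 4 flow (S200 to S220) amounts to a three-way split of each paragraph's tokens. A minimal sketch follows, assuming the registered and excluded words of database (200) are available as Python sets; the function name `normalize_words` is hypothetical, and returning the unregistered words as a second list stands in for the separate store mentioned above.

```python
from typing import List, Set, Tuple


def normalize_words(
    tokens: List[str],
    registered: Set[str],
    excluded: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split tokens into (normalized words, unregistered words)."""
    normalized: List[str] = []
    unregistered: List[str] = []
    for tok in tokens:
        # S210 / S220: excluded words are dropped on both branches.
        if tok in excluded:
            continue
        # S200: is the word registered in the word dictionary?
        if tok in registered:
            normalized.append(tok)
        else:
            # Not in the dictionary: kept as a candidate new word
            # to be shown to the user for possible registration.
            unregistered.append(tok)
    return normalized, unregistered


# Illustrative input; the tokens are placeholders, not data from the patent.
normalized, unregistered = normalize_words(
    ["battery", "the", "lasts", "lol"],
    registered={"battery", "lasts", "the"},
    excluded={"the"},
)
```

Note that "the" is registered here yet still dropped, matching the rule that a registered word listed as an excluded word cannot enter the normalized list.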

Next, the category/sentiment analysis unit will be described with reference to FIG. 5.

The category/sentiment analysis unit (140) of the present invention obtains the normalized words, determines by reference to the word pattern database whether they form a topic word pattern, and, in the case of a topic (category) word pattern, determines whether they form a tendency (sentiment) word pattern.

In the present invention, a topic may be defined as a category, and a tendency as a sentiment.

In operation, the morphologically analyzed words for each topic are checked to determine whether they form a topic word pattern (S300).

In the case of a topic word pattern, the topic exposure score is raised (S310); it is then determined whether the words form a tendency word pattern (S320), and in that case the tendency exposure score is raised (S330).

The category/sentiment analysis unit (140) then matches the words, divided by topic and normalized through morphological analysis, against the word patterns for each category (topic) and sentiment.

For matching, the normalized words are separated into grams and matched against each word pattern. There are three kinds of gram, One, Two, and Three, and the word patterns likewise come in these three gram sizes.

One Gram means a single word; Two Gram and Three Gram mean two words and three words, respectively.

A combination word pattern of two or three words is recognized as matched only when its words appear in sequence in the morphological analysis result, and each match raises the corresponding exposure score.
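
The gram-based matching of FIG. 5 (S300 to S330) can be sketched as below. Two assumptions are made that the text does not settle: "in sequence" is read as adjacent words, so the grams are contiguous, and the topic and tendency checks are applied independently to every gram in one pass. The names `ngrams` and `score_paragraph` and the pattern dictionaries are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def ngrams(words: List[str], n: int) -> List[Pattern]:
    """Contiguous n-word grams, preserving word order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def score_paragraph(
    words: List[str],
    topic_patterns: Dict[Pattern, str],
    tendency_patterns: Dict[Pattern, str],
) -> Tuple[Counter, Counter]:
    """Return (topic exposure scores, tendency exposure scores)."""
    topic_scores: Counter = Counter()
    tendency_scores: Counter = Counter()
    for n in (1, 2, 3):  # One, Two, and Three Gram
        for gram in ngrams(words, n):
            # S300 / S310: topic pattern match raises the topic score.
            if gram in topic_patterns:
                topic_scores[topic_patterns[gram]] += 1
            # S320 / S330: tendency pattern match raises the tendency score.
            if gram in tendency_patterns:
                tendency_scores[tendency_patterns[gram]] += 1
    return topic_scores, tendency_scores


# Illustrative patterns; none of these values come from the patent.
topic_scores, tendency_scores = score_paragraph(
    ["battery", "lasts", "long"],
    topic_patterns={("battery",): "battery life"},
    tendency_patterns={
        ("lasts", "long"): "positive",
        ("long", "lasts"): "negative",  # wrong order, so it never matches
    },
)
```

Because both scores are updated inside the same loop over the same grams, category and sentiment are obtained from a single traversal of the normalized words.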

The topic and sentiment of each paragraph are detected in this way. When one of several topics must be selected, the category (topic) and sentiment (tendency) with the highest exposure score are accepted and reflected in the system.
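
Selecting the first-rank label can be illustrated by summing the exposure scores of a document's paragraphs and taking the label with the largest total. This is a minimal sketch: the function name `first_rank` is hypothetical, per-paragraph scores are assumed to be `collections.Counter` objects, and tie-breaking, which the text does not address, is left to whatever order `most_common` returns.

```python
from collections import Counter
from typing import Iterable, Optional


def first_rank(paragraph_scores: Iterable[Counter]) -> Optional[str]:
    """Return the label with the highest summed exposure score, if any."""
    total: Counter = Counter()
    for scores in paragraph_scores:
        total.update(scores)  # Counter.update adds counts label by label
    if not total:
        return None  # no pattern matched anywhere in the document
    return total.most_common(1)[0][0]


# Illustrative per-paragraph topic scores for one document.
doc_topic_scores = [
    Counter({"battery life": 2, "screen": 1}),
    Counter({"battery life": 1}),
]
top_topic = first_rank(doc_topic_scores)
```

The same function can be applied unchanged to the tendency scores to pick the document's first-rank sentiment.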

An advantage of the system of the present invention is that it can classify both kinds of word pattern (category and sentiment) at the same time.

Because the same method is applied to the normalized word list to determine the category and to determine the topic, category and sentiment can both be analyzed without separate additional work.

Simultaneous analysis can greatly reduce analysis time, so even large volumes of data can be classified easily.

The word pattern recommendation unit (150) recommends to the user word patterns present in electronic documents whose category and sentiment analysis is complete.

That is, word patterns are recommended for documents whose category and sentiment analysis is complete: they can be sorted within each category and sentiment in descending order of exposure score and recommended to the user, who can register them under the desired category and sentiment.

In addition, One, Two, and Three Gram words can be registered from the list of normalized words, and analysis accuracy improves as more word patterns are added.
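
One way to picture the recommendation step is to count the one-, two-, and three-word grams that occur in paragraphs already assigned to a given category or sentiment, drop the patterns that are already registered, and present the rest in descending order of count. The sketch below makes exactly those assumptions; the function name `recommend_patterns`, the `top_k` cutoff, and the use of raw occurrence counts as the exposure score are all hypothetical, and the user's selection and registration in database (500) is left out.

```python
from collections import Counter
from typing import List, Set, Tuple

Pattern = Tuple[str, ...]


def recommend_patterns(
    paragraphs: List[List[str]],
    known_patterns: Set[Pattern],
    top_k: int = 5,
) -> List[Tuple[Pattern, int]]:
    """Candidate patterns for one label, most frequent first."""
    counts: Counter = Counter()
    for words in paragraphs:
        for n in (1, 2, 3):  # One, Two, and Three Gram candidates
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                if gram not in known_patterns:  # skip registered patterns
                    counts[gram] += 1
    return counts.most_common(top_k)


# Illustrative normalized paragraphs already classified under one label.
recommendations = recommend_patterns(
    [["battery", "lasts", "long"], ["battery", "lasts"]],
    known_patterns={("battery",)},
    top_k=2,
)
```

In this example the two returned candidates are `("lasts",)` and `("battery", "lasts")`, each seen twice; a user would then choose which of them to store as new patterns.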

Through the configuration and operation described above, a user can search large volumes of data (electronic documents) quickly and accurately, and can easily access data from which unnecessary content has been removed.

Those skilled in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

The scope of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: category/sentiment analysis means
200: word dictionary database
300: topic word dictionary database
400: predicate dictionary database
500: word pattern database

Claims

A category and sentiment analysis system using word patterns, comprising:
a word dictionary database (200) storing words and excluded words;
a topic word dictionary database (300) storing topic words;
a predicate dictionary database (400) storing predicates;
a word pattern database (500) storing topic-specific word pattern information and tendency word pattern information; and
category/sentiment analysis means (100) comprising:
a morphological analysis unit (110) for selecting paragraphs in an electronic document by referring to the word dictionary database and performing morphological analysis on the selected paragraphs;
a topic word/predicate extraction unit (120) for extracting topic words from the morphologically analyzed words and, when an extracted topic word exists in the topic word dictionary database, determining by reference to the predicate dictionary database whether a predicate is present and, if so, extracting the text as one paragraph;
a normalization unit (130) for determining by reference to the word dictionary database whether a morphologically analyzed word is registered and, if it is, removing excluded words to normalize the words;
a category/sentiment analysis unit (140) for obtaining the normalized words, determining by reference to the word pattern database whether they form a topic word pattern, and, in the case of a topic (category) word pattern, determining whether they form a tendency (sentiment) word pattern; and
a word pattern recommendation unit (150) for recommending word patterns present in electronic documents whose category and sentiment analysis is complete.

The system according to claim 1,
wherein the normalization unit (130)
outputs a word to the screen when it is an unregistered word.

The system according to claim 1,
wherein the category/sentiment analysis unit (140)
matches the words, divided by topic and normalized through morphological analysis, against the word patterns for each category (topic) and tendency (sentiment).

The system according to claim 1,
wherein the category/sentiment analysis unit (140)
separates the normalized words into grams and matches them against each word pattern.

The system according to claim 1,
wherein the category/sentiment analysis unit (140)
raises the topic exposure score in the case of a topic word pattern, and raises the tendency exposure score in the case of a tendency word pattern.

The system according to claim 1,
wherein the category/sentiment analysis unit (140),
for combination word patterns of two or three words, determines whether the words appear in sequence in the morphological analysis result and, if they do, treats the pattern as matched and raises the exposure score.

The system according to claim 1,
wherein the word pattern recommendation unit (150)
sorts the patterns of each category and sentiment in order of exposure, outputs them to the screen, and stores the word patterns selected by the user in the word pattern database.