KR102008845B1

KR102008845B1 - Automatic classification method of unstructured data

Info

Publication number: KR102008845B1
Application number: KR1020170163188A
Authority: KR
Inventors: 박유경; 맹국재
Original assignee: 굿모니터링 주식회사; 맹국재
Priority date: 2017-11-30
Filing date: 2017-11-30
Publication date: 2019-10-21
Also published as: KR20190063978A

Abstract

The present invention relates to a method for automatically classifying the category of unstructured data. The method comprises: collecting strings of unstructured data from posts on an Internet bulletin board; preprocessing the collected strings, including information extraction and string processing and character tokenization, whereby the strings are converted into a vector representation; inputting the preprocessed vector representation into a CNN-LSTM classifier; and automatically classifying the category of each post with the CNN-LSTM classifier.

Description

Automatic classification of categories of unstructured data {AUTOMATIC CLASSIFICATION METHOD OF UNSTRUCTURED DATA}

The present invention relates to a method for automatically classifying the category of unstructured data and, more specifically, to a method that automatically analyzes a post entered in a particular area, such as a bulletin board on the Internet, and determines which category the post belongs to.

Unstructured data is data that does not follow a predefined structure, such as text, images, and video. It takes many forms, including news, comments, social media data, emails, and reports, and it arrives through many channels.

Companies, institutions, and individuals produce unstructured data every hour of every day. Most of it, however, goes unclassified and unused. Analysis is essential if such data is to become meaningful and valuable information.

One approach to analyzing unstructured data is classification analysis or clustering analysis; a second approach is categorizing the data into specific categories.

Both approaches have been carried out with manual as well as automated processing, but applying them to individual domains remains difficult.

In general, the performance of an automatic classification system for text documents tends to depend more on the feature selection algorithm than on the learning algorithm itself. Feature selection is the technique of picking out, from the features (or words) present in the training documents, only those that help distinguish one category from another.

As the volume of information and documents to be processed grows larger and more complex, this not only slows the delivery of news that must reach readers quickly but also raises costs because more human labor has to be committed. The need to automate document classification is therefore growing.

Earlier work on automating document classification either used statistical methods that assign a category simply from the frequency of the words appearing in a document, or extracted the key words needed for classification and applied data mining algorithms such as K-NN, decision trees, Bayesian networks, and artificial neural networks to them. More recently, as the convolutional neural network (CNN), a deep learning algorithm, became known to be effective for natural language processing, a sentence classification method was proposed that combines CNN with word2vec, which represents words as vectors, and it showed excellent results in practice.

Because its structure is simple, the sentence classification method using word2vec and CNN trains and predicts quickly. It also improved classification performance over the Support Vector Machine (SVM) and Logistic Regression (LR), which are regarded as high-performing machine learning tools in many areas of natural language processing and text mining.

However, the word2vec and CNN method was reported only with performance evaluation results on English sentences, so whether the model is valid for Korean document classification could not be confirmed.

The present invention was devised in view of the problem described above. Its object is to apply the sentence classification method using word2vec and CNN to Korean, verify whether it is valid for Korean document classification, and provide a method that classifies customer posts and similar text automatically and more accurately when applied to Korean documents.

According to an aspect of the present invention for solving the above problem, a method for automatically classifying the category of unstructured data is provided. Specifically, the method is characterized by comprising: collecting strings of unstructured data from posts on an Internet bulletin board; preprocessing the collected strings, including information extraction and string processing and character tokenization, whereby the strings are converted into a vector representation; inputting the preprocessed vector representation into a CNN-LSTM classifier; and automatically classifying the category of each post with the CNN-LSTM classifier.

In the above aspect, the information extraction and string processing comprises: parsing an Excel file to extract the body text and category information; processing newline characters and special characters in the string; applying automatic word spacing to the string; and applying WPM to the automatically spaced string.

The character tokenization comprises converting the characters and words of the WPM-processed string into a vector representation using the Word2Vec library.
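The newline and special-character step listed above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `clean_post` and the rule for what counts as a special character (anything other than Hangul syllables, Latin letters, digits, and spaces) are assumptions, since the specification does not state them.

```python
import re


def clean_post(text: str) -> str:
    """Replace newlines and special characters with spaces, then tidy whitespace."""
    # Newline characters and tabs become single spaces.
    text = re.sub(r"[\r\n\t]+", " ", text)
    # Hypothetical rule: keep Hangul syllables, Latin letters, digits and spaces;
    # every other character is treated as a "special character" and blanked.
    text = re.sub(r"[^0-9A-Za-z\uac00-\ud7a3 ]", " ", text)
    # Collapse runs of spaces left behind by the two substitutions.
    return re.sub(r" {2,}", " ", text).strip()


print(clean_post("퀵메뉴가\r\n안 보여요!!  (IE 11)"))
```

Automatic word spacing and WPM would then run on the cleaned string; both depend on trained models and are not shown here.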

Also in the above aspect, the vector representation input to the CNN-LSTM classifier is brought to a predetermined input length for the CNN by zero-padding, so the height of the matrix of vectors (which corresponds to the number of tokens input to the CNN) is fixed.
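A minimal sketch of zero-padding to a fixed matrix height is given below. The names `pad_tokens`, `max_len`, and `dim` are illustrative, and the truncation of posts longer than `max_len` is an assumption; the specification only mentions padding.

```python
def pad_tokens(vectors, max_len, dim):
    """Return exactly max_len rows of width dim, padding with zero vectors."""
    # Assumption: posts with more than max_len tokens are truncated.
    rows = [list(v) for v in vectors[:max_len]]
    # Append all-zero rows until the matrix height equals max_len.
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows


# One 2-dimensional token vector padded to a height of 3.
print(pad_tokens([[1.0, 2.0]], max_len=3, dim=2))
```

With a fixed height, every post yields a matrix of the same shape, so the same convolution filters apply to all inputs.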

Also in the above aspect, automatically classifying the category of the post with the CNN-LSTM classifier comprises: performing convolution with predetermined filters to generate a plurality of feature maps; performing a max pooling operation on each of the generated feature maps to obtain one feature from each feature map; concatenating all of the outputs to generate a top-level feature vector of fixed length; processing the generated top-level feature vector with a MultiRNNCell composed of three BasicLSTMCell layers; and producing the output through a fully connected + softmax layer.
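The convolution, max pooling, and concatenation steps can be illustrated with a small pure-Python sketch. It is a simplified stand-in under stated assumptions: the function names are hypothetical, the filter weights are hand-picked rather than learned, ReLU is assumed as the activation (the specification does not name one here), and the LSTM and output layers are omitted.

```python
def conv_feature_map(matrix, kernel, bias=0.0):
    """Slide a (k x dim) filter down the token axis and apply ReLU to each window."""
    k = len(kernel)
    feature_map = []
    for start in range(len(matrix) - k + 1):
        total = bias
        for i in range(k):
            for x, w in zip(matrix[start + i], kernel[i]):
                total += x * w
        feature_map.append(max(0.0, total))  # ReLU (assumed activation)
    return feature_map


def top_level_features(matrix, kernels):
    """Max-pool each feature map to one value and concatenate the results."""
    return [max(conv_feature_map(matrix, kernel)) for kernel in kernels]


tokens = [[1, 0], [0, 1], [1, 1], [0, 0]]          # 4 tokens, 2-dim vectors
filters = [
    [[1, 0], [0, 1]],                               # window of 2 tokens
    [[1, 1], [1, 1], [1, 1]],                       # window of 3 tokens
]
print(top_level_features(tokens, filters))          # one value per filter
```

Because each filter contributes exactly one pooled value, the length of the resulting vector depends only on the number of filters, not on the number of tokens.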

Also in the above aspect, the output of the fully connected + softmax layer is a probability distribution over the labels, and the label with the highest output value becomes the predicted label of the given sentence.
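The final step, turning the layer's raw scores into a probability distribution and picking the most probable label, can be sketched as below. The scores are made-up values for illustration and `predict_label` is a hypothetical helper; the three label names follow the categories used later in this specification.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(scores, labels):
    """Return the label with the highest probability and the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


label, probs = predict_label([0.2, 2.0, -1.0], ["service", "information", "system"])
print(label, probs)
```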

According to the present invention, a classification method for informal Korean text can be provided whose performance and effectiveness are improved over the existing CNN-based classification method.

FIG. 1 shows the WPM-based vocabulary generation algorithm;
FIG. 2 shows the CBOW and Skip-gram models of Word2Vec;
FIG. 3 is a structural diagram of the CNN published by LeCun;
FIG. 4 shows the structure of an LSTM block;
FIG. 5 shows the structure of an LSTM network;
FIG. 6 shows an LSTM diagram;
FIG. 7 shows the data preprocessing procedure in an embodiment of the present invention;
FIG. 8 shows an example of a CNN model using word2vec;
FIG. 9 is a simplified view of the CNN model using word2vec (the base model);
FIG. 10 shows the CNN-LSTM model using word2vec according to the present invention;
FIG. 11 shows the configuration of the model classifiers used for classifier performance analysis;
FIG. 12 shows the training configuration using softmax regression;
FIG. 13 shows the performance of the CNN-based classifier;
FIG. 14 shows the performance of the LSTM-based classifier;
FIG. 15 shows the performance of the CNN-LSTM-based classifier by cell type;
FIG. 16 shows the CNN-LSTM performance with the final parameters; and
FIG. 17 shows the performance of each model with its final parameters as a function of the number of training iterations.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in a variety of different forms.

The embodiments in this specification are provided so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention pertains. The invention is defined only by the scope of the claims. Accordingly, in some embodiments, well-known components, operations, and techniques are not described in detail, to avoid obscuring the invention.

Like reference numerals refer to like elements throughout the specification. The terms used in this specification serve to describe the embodiments and are not intended to limit the invention. In this specification, the singular form includes the plural unless the text specifically states otherwise. Components and operations described with the term "comprises (or includes)" do not exclude the presence or addition of one or more other components and operations.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless expressly defined.

The related technologies required by the present invention include Word2Vec, CNN, LSTM, and WPM, which are described below.

[Table 1] Technologies used in the present invention and the reasons for their use

○ Theoretical background

1. Analysis data (posts about inconveniences, improvements, and ideas)

- Analysis target: posts about 172 e-government websites

(posts collected from 95 information-providing websites, 43 government representative homepages, 31 civil-complaint-handling websites, and 3 public-participation websites)

- Collection period: October 2013 to December 2014

- Number analyzed: 3,195 posts, 18,343 opinions

- Analysis items: inconveniences (9,303), improvements/ideas (9,040)

- Analysis content: classification into 3 categories (service, information, system)

[Table 2] Example of a post

2. Natural language processing

2.1 The concept of natural language processing

Website posts are written in natural language that people can read, such as Korean, English, and numerals. For machine learning, this text has to be expressed in a form a computer can process; the body of techniques for doing so is called natural language processing (NLP).

Natural language is distinguished from artificial languages such as computer programming languages. An artificial language is built from rules that people defined so that a computer can understand it, so anyone who learns the rules can understand it. Natural language also began with rules set by people, but over a long time it has absorbed customary rules, informal rules, and many other changes, so it cannot be understood through rules alone. Natural language processing can be divided into four stages: morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis. None of these stages rests on a settled theory; each is still developing, with theories continually being proposed and tested.

2.2 Automatic word spacing system

The most basic task in text analysis is tokenization, which identifies and extracts words from text. Korean takes the eojeol (어절, a space-delimited word unit) as the token unit and, unlike languages such as Chinese or Japanese that have no marker for eojeol boundaries, requires a space between one eojeol and the next. Incorrect spacing in Korean causes ambiguity or introduces noise into text analysis, which hinders tokenization and reduces readability. Spacing in Korean is thus as important for machine readability as it is for human readability. Spacing errors within a sentence create many grammatical and semantic ambiguities and sometimes make morphological analysis impossible.

Accordingly, the present invention assumes that users in an Internet environment often write posts without regard to spacing, and prepares the documents by applying automatic word spacing. Table 3 below shows an Internet post, which is an informal document, and Table 4 shows an example of applying automatic word spacing to such a post.

[Table 3] Example of an Internet post, an informal document

[Table 4] Example of applying automatic word spacing to an informal post

2.3 WPM (Word Piece Model)

WPM is a method developed for building voice search systems. It generates a vocabulary automatically, without prior knowledge of the language, while minimizing perplexity. Conventional natural language processing proceeds through four analyses: morphological, syntactic, semantic, and pragmatic. WPM instead encodes units with a pronunciation set based on the International Phonetic Alphabet (IPA) and then uses statistical techniques to combine units according to their frequency of use, creating new units. The WPM vocabulary generation algorithm follows the process shown in FIG. 1.

An advantage of WPM is that it is language-independent and statistical, so it can be applied to a specific domain or even to a language whose meaning is not yet understood. In the present invention, the WPM algorithm of FIG. 1 was applied to the documents after automatic word spacing, as in Table 4 above.

[Table 5] below shows an example of applying WPM after automatic word spacing.

[Table 5] Example of applying WPM after automatic word spacing

In other words, WPM uses statistical techniques to merge syllables, starting from single syllables, into new vocabulary entries and builds a dictionary from the entries used most frequently. An example of a sentence tokenized with WPM is shown in [Table 5].

[Table 6] Comparison of sentences before and after applying WPM

[Table 6] shows that WPM tokenizes text differently from morphological analysis. In the phrase "퀵메뉴 중에 네탄", morphological analysis splits "퀵메뉴" ("quick menu") into "퀵" and "메뉴", whereas WPM splits it into "퀵", "메", and "뉴". This is because "퀵", "메", and "뉴" each occur more frequently in the corpus used than "퀵메뉴" does. Because WPM relies on statistics in this way, it is language-independent and can be applied to a specific domain or even when the meaning is not yet understood.
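The idea of building vocabulary units by merging frequent neighbours can be illustrated with the sketch below. It is a deliberately simplified, purely frequency-based merge in the style of byte-pair encoding, not the algorithm of FIG. 1; the function name `learn_merges` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def learn_merges(words, num_merges):
    """Greedily merge the most frequent adjacent pair of units, num_merges times."""
    corpus = [list(word) for word in words]     # start from single syllables
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for units in corpus:
            pairs.update(zip(units, units[1:]))
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]
        merges.append(left + right)
        # Rewrite every word so the chosen pair becomes a single unit.
        for units in corpus:
            i = 0
            while i < len(units) - 1:
                if units[i] == left and units[i + 1] == right:
                    units[i:i + 2] = [left + right]
                else:
                    i += 1
    return merges, corpus


merges, segmented = learn_merges(["메뉴", "메뉴", "메일", "퀵메뉴"], num_merges=1)
print(merges, segmented)
```

In this toy corpus "메" and "뉴" are adjacent three times, so they merge first. The corpus described above behaves differently because its frequencies differ, which is exactly the corpus dependence the paragraph points out.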

3. Vector representation of words and documents

Many machine learning algorithms cannot take a raw text string as training input; they require the input to be expressed numerically as a fixed-length feature vector that a computer can process. A way of representing documents or words in a vector space is therefore needed before document classification. For this purpose the present invention uses word2vec, a quantitative technique that represents words and documents as vectors that carry order and meaning. Vectors generated this way can serve as input to various machine learning and artificial neural network techniques such as LR, DNN, CNN, and LSTM.

3.1 Word2Vec

Word2Vec originated in artificial neural network research. It starts from the distributional hypothesis, the premise that words appearing in the same context have similar meanings. word2vec learns from text documents: for each word in a sentence, the other words that appear with it are presented to the neural network as related words. Because related words are likely to appear close together in a document, two words with similar neighbors end up close together in the vector space as training is repeated.

word2vec simply learns from whether the same information appears before and after a word. Very abstract verbs and adjectives can therefore be harder to learn than nouns. Even so, given a very large amount of data, the regularities in which objects verbs take can be captured, so semantic relationships between verbs can also be learned to some degree.

For example, break and broken take similar objects, so the model can learn that the two have similar meanings. With enough training, the distance between break and broken in the vector space can become equal to the distance between have and had. In other words, the model can learn the meaning of the past form.

The word2vec model is not a deep neural network (DNN). It is an artificial neural network made up of one hidden layer with no activation function and an output layer with a softmax function. It therefore trains much faster than a typical deep neural network, which is a major advantage because even very large datasets can be trained on easily. Internally, as shown in FIG. 2, the word2vec algorithm learns sentences with two neural network models, CBOW (Continuous Bag of Words) and SG (Skip-Gram), and places words of similar meaning close together in the vector space.

The CBOW model takes the surrounding words at positions t-2, t-1, t+1, and t+2 as input and predicts the t-th word as output. It is known to be several times faster than the SG model. Conversely, the SG model takes the single t-th word as input and predicts the surrounding words at positions t-2, t-1, t+1, and t+2. It is known to learn relatively infrequent words well and to perform slightly better than CBOW [16]. The SG model is therefore used here to generate the word vector representations.
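The difference between the two models lies in how training examples are formed from a token sequence, which the following sketch makes concrete for a window of 2. The function names are illustrative and the sketch builds only the (input, target) examples; it does not train any vectors.

```python
def context_of(tokens, t, window):
    """Tokens within `window` positions of index t, excluding the token at t."""
    lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != t]


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words are the input, the t-th word is the target."""
    return [(context_of(tokens, t, window), tokens[t]) for t in range(len(tokens))]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: the t-th word is the input, each surrounding word is a target."""
    return [(tokens[t], ctx)
            for t in range(len(tokens))
            for ctx in context_of(tokens, t, window)]


tokens = ["a", "b", "c", "d", "e"]
print(cbow_examples(tokens)[2])
print(skipgram_pairs(tokens)[:4])
```

Skip-gram produces one training pair per (center, neighbour) combination, so it yields more updates per sentence than CBOW, which is consistent with it being slower but better on infrequent words.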

4. Machine learning

4.1 Softmax regression

Softmax regression is the multiclass version of logistic regression. It normalizes the outputs so that they sum to 1 and assigns each output its share of that total. Because the sum is always held at 1, when one strong feature appears and its output converges toward 1, the remaining outputs are pushed toward 0, which accelerates learning.
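In the usual formulation, stated here as a reference and with notation that is not taken from the specification, the probability of class k among K classes for an input vector x is computed from per-class scores z_k:

```latex
z_k = \mathbf{w}_k^{\top}\mathbf{x} + b_k, \qquad
P(y = k \mid \mathbf{x}) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}, \qquad
\sum_{k=1}^{K} P(y = k \mid \mathbf{x}) = 1 .
```

Since the denominator is shared by all classes, raising one score z_k necessarily lowers the probabilities of the others. For K = 2 the expression reduces to the logistic (sigmoid) function of z_1 - z_2.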

The present invention uses softmax regression at the output because it is known to perform better there than the conventional sigmoid function.

4.2 Convolutional Neural Network (CNN)

A CNN is a type of artificial neural network designed to be readily applied to images. It was first proposed by LeCun in 1998 and, unlike the structure used in an ordinary multilayer perceptron, is built from convolution layers and pooling layers.

FIG. 3 is a structural diagram of the CNN published by LeCun.

Such a CNN generally consists of several layers, of three basic kinds.

- Convolution layer: the layer that extracts meaningful features by convolution.

- Pooling layer: CNNs are usually applied to images, which contain so many pixels that the features must be reduced; the subsampling that does this is called pooling.

- Fully connected layer: applied last, this layer performs classification using the features produced by the convolution and pooling layers. It behaves like an ordinary artificial neural network.

A typical CNN is structured as convolution layer → pooling layer → convolution layer → pooling layer → … → fully connected layer. That is, convolution and pooling layers alternate to extract features, and the fully connected layer at the end performs the classification.

There are three main reasons why CNNs have recently outperformed other algorithms in image classification and object detection.

The first is the introduction of the Rectified Linear Unit (ReLU) activation function, which largely removed the vanishing gradient problem seen with earlier activation functions such as sigmoid and tanh. Vanishing gradients arise in error backpropagation, the standard algorithm for training neural networks: the error propagated back shrinks toward the lower layers until the gradient barely changes and those layers stop learning. This made deep artificial neural networks hard to train; with ReLU, even the lower layers of a deep network can be trained.
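The effect can be seen numerically with a simplified sketch. It multiplies only the activation-function derivatives along a chain of layers and ignores the weight matrices, which is an assumption made to isolate the role of the activation; the function names are illustrative.

```python
import math


def sigmoid_grad(z):
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def backprop_factor(grad_fn, pre_activations):
    """Product of activation derivatives seen by the error on its way down."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor


# Ten layers: sigmoid at its steepest point (z = 0) versus active ReLU units.
print(backprop_factor(sigmoid_grad, [0.0] * 10))
print(backprop_factor(relu_grad, [1.0] * 10))
```

Even in the best case for the sigmoid, each layer scales the error by at most 0.25, so the factor shrinks geometrically with depth, whereas active ReLU units pass the error through unchanged.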

The second reason is the emergence of large databases such as ImageNet. Advances in hardware made mass storage commonplace, and crowdsourcing through services such as Amazon Mechanical Turk made it feasible to label large training sets by hand. Training multi-layer CNNs on image databases of more than one million images made it possible to overcome the overfitting problem.

An ordinary artificial neural network has a very large number of parameters to learn, so it overfits easily on a small training set. The arrival of large databases made it possible to train deep artificial neural networks without overfitting.

The last reason is regularization using dropout. To prevent overfitting, dropout trains the network while a given fraction of neurons, chosen at random, is deactivated. A different set of deactivated neurons is drawn at every iteration, which prevents the neurons from all learning the same information or from learning nothing at all.
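The random deactivation described above can be sketched in a few lines of Python. This is a minimal illustration, not the training code used in the experiments: the function name `dropout`, the `rate` parameter, and the rescaling of surviving activations by 1 / (1 - rate) (so-called inverted dropout) are assumptions that the text does not state.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (hypothetical helper).

    Surviving values are scaled by 1 / (1 - rate), an assumed convention
    (inverted dropout) that keeps the expected activation unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [0.5, 1.2, 0.3, 0.9, 2.0, 0.7]
# A fresh mask is drawn on every call, i.e. on every training iteration.
for iteration in range(3):
    print(iteration, dropout(layer_output, rate=0.5, rng=rng))
```

Because a new mask is drawn on each call, no neuron can rely on any particular neighbor being present, which is the effect the paragraph above attributes to dropout.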

For these reasons, CNNs perform image classification and object detection effectively when large amounts of image data are available, and they are reported to show the best performance among existing algorithms.

CNNs, although originally designed for computer vision, have recently been found to be effective for natural language processing and have shown excellent results in semantic parsing, search query retrieval, sentence modeling, and other traditional natural language processing tasks. Accordingly, in the present invention, experiments are conducted using a composite model that combines the CNN with the LSTM described next as the classifier for document classification.

4.3 Long Short-Term Memory (LSTM)

LSTM is an RNN architecture proposed by Hochreiter & Schmidhuber in 1997 and remains the most prominent RNN to date. An LSTM can be viewed as a conventional RNN in which the units of the hidden layer are replaced by LSTM blocks. FIG. 4 is a diagram illustrating the LSTM block structure. As shown in FIG. 4, LSTM blocks have a recurrent structure like conventional hidden units, and each LSTM block consists of a memory cell with a recurrent structure and three types of gate units: an input gate, a forget gate, and an output gate. Like a conventional RNN, an LSTM computes its final output through hidden variables, but while computing them it uses these gate units to regulate the flow of information.

Each hidden variable is derived as follows. First, the forget gate determines how much of the existing cell state stored in the memory cell is to be forgotten.

LSTM was designed to avoid the long-term dependency problem. Remembering information for long periods is effectively its default behavior. Every recurrent neural network takes the form of a chain of repeating neural network modules. In a standard recurrent neural network, this repeating module has a very simple structure, such as a single tanh layer.

FIG. 5 is a diagram illustrating the LSTM network structure. As shown in FIG. 5, an LSTM also has a chain-like structure, but its repeating module is built differently. Instead of a single neural network layer, the module has four layers that interact in a very particular way.

FIG. 6 is an LSTM block diagram. In the diagram shown in FIG. 6, each yellow box is a learned neural network layer. Each pink circle represents an element-wise operation such as vector addition. Each arrow carries an entire vector from the output of one node to the input of another. Merging arrows denote concatenation, and a forking arrow denotes that its content is copied and sent to different places. The core of the LSTM is the cell state, the horizontal line running across the top of the diagram. The cell state acts like a conveyor belt: it runs straight along the entire chain with only minor linear interactions, so information can flow along it unchanged. The LSTM can add information to or remove information from the cell state, and structures called gates regulate this process. A gate lets information through selectively and consists of a sigmoid neural network layer and an element-wise multiplication.
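The gating mechanism described above can be illustrated with a single LSTM step. The sketch below uses the commonly published LSTM update with a scalar state for readability; the parameter layout, the gate names used as dictionary keys, and all numeric values are hypothetical and are not taken from the patent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM (standard formulation, scalar state).

    `params` maps each gate name to hypothetical weights (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate values to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # update of the cell state
    h = o * math.tanh(c)              # new hidden state
    return h, c

# Illustrative weights: a wide-open forget gate and a nearly closed input
# gate let the cell state pass through almost unchanged ("conveyor belt").
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=0.3, h_prev=0.1, c_prev=0.8, params=params)
print(round(c, 3), round(h, 3))
```

With these illustrative weights the new cell state stays very close to the previous value of 0.8, which shows how the forget and input gates together decide whether stored information is kept or overwritten.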

○ Experimental Data Composition and Classification Model

1. Data Collection and Experimental Data Set Configuration

A post corpus was constructed by collecting posts about inconveniences and improvement suggestions or ideas for 172 e-government sites (95 information-providing websites, 43 government representative homepages, 31 civil-complaint-processing websites, and 3 citizen-participation websites), gathered through the National Happiness Customized Service Monitoring Group from October 2013 to December 2014.

[Table 7] Information and classification categories used in the experiment

As shown in [Table 7], the body text of each post, which describes inconveniences, improvements, or ideas concerning a site, is classified into three categories: "service", "information", and "system". Example text documents for each category are shown in Table 8 for "service", Table 9 for "information", and Table 10 for "system".

[Table 8] Example of service data

[Table 9] Example of information data

[Table 10] Example of system data

Because the CNN and LSTM models used in this study are supervised learners that classify categories, each post needs previously assigned category information to serve as its target class. The category information manually assigned to all collected posts was therefore used. In addition, because overfitting is likely when the training data set is small, the number of posts per category is reported in [Table 11].

[Table 11] Number of posts and label per category

2. Data Preprocessing

The preprocessing procedure for the experimental data is shown in FIG. 7.

Preprocessing proceeds in five steps, as shown in FIG. 7, and can be broadly divided into two processes: information extraction with string processing, and tokenization of the documents to generate vector representations of documents and words.

2.1 Information Extraction and Data Organization

For the experimental data set constructed above, the data produced through the monitoring group's activities was first converted to Excel format and parsed set by set to extract the body text and category information corresponding to the elements of [Table 7]. String preprocessing was applied to the extracted information as shown in [Table 12], and the results were stored in three files, one per category, according to the labels of [Table 11].

[Table 12] String preprocessing details

2.2 Selecting the Document Tokenization Method

The CNN-LSTM model used in the present invention cannot take raw text strings as training input; it requires the input to be expressed numerically at a fixed length. An additional preprocessing step is therefore needed to represent each string-processed document, or the words it contains, as fixed-size vectors. The library used in the present invention to convert documents or words into vectors is gensim's doc2vec. Because doc2vec takes the tokens of a document as input and generates vectors for the document and for the words it contains, the documents must be tokenized first. An experiment was therefore conducted to select the tokenization method that performs better for document classification.

To find the tokenization method that performs better for document classification, a basic document classification was carried out using the document vector representations generated with the doc2vec library as input features. The reasoning was that document vectors produced with a better-performing tokenization method capture the differences between documents of different categories more clearly and separate documents by category more cleanly, and can therefore contribute to better classification performance when applied to the proposed model.

Accordingly, the posts were tokenized in three ways: by eojeol (space-delimited word unit), with automatic word spacing applied, and with WPM applied. The document vectors generated from each set of tokens with the doc2vec library were passed to an LR classifier to compute the classification rate. Considering resource usage and performance, each document vector was generated with 300 dimensions.

In the tokenization experiment, the data set was divided 8:1:1, with 90% of the full data set used for training and the remaining 10% used for testing. As shown in [Table 13], applying WPM yielded the highest performance, with a classification rate of 66%.

[Table 13] Comparison of tokenization methods

[Table 13] shows that applying WPM produces the largest total number of tokens compared with eojeol-unit tokenization, but a smaller number of unique tokens. Indeed, the number of unique tokens can be presumed to be smallest when WPM is applied. Consequently, it is empirically confirmed that WPM yields a small number of unique tokens and that document vectors built from those tokens are useful for document classification.
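The two quantities compared above, the total number of tokens and the number of unique tokens, can be computed as in the following sketch. The helper name `token_stats` and the toy corpus are hypothetical; in particular, the sub-word split is written by hand and is not the output of an actual WPM tokenizer.

```python
def token_stats(tokenized_docs):
    """Return (total token count, unique token count) for a tokenized corpus."""
    total = sum(len(doc) for doc in tokenized_docs)
    unique = len({tok for doc in tokenized_docs for tok in doc})
    return total, unique

# Hand-made toy corpus; the sub-word split below is written by hand and is
# NOT produced by a real WPM tokenizer.
eojeol_docs = [
    ["홈페이지에서", "홈페이지를"],
    ["사이트에서", "사이트를"],
    ["메뉴에서", "메뉴를"],
]
subword_docs = [
    ["홈페이지", "에서", "홈페이지", "를"],
    ["사이트", "에서", "사이트", "를"],
    ["메뉴", "에서", "메뉴", "를"],
]
print("eojeol :", token_stats(eojeol_docs))
print("subword:", token_stats(subword_docs))
```

In this toy case the sub-word split doubles the total token count while reducing the vocabulary, because particles such as 에서 and 를 are shared across words. This is the same direction of effect that the text reports for WPM.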

In the model comparison experiments that follow, the tokens generated with WPM were used as input to the doc2vec library to generate vector representations of documents and words.

2.3 Vector Generation with Word2Vec

As the final preprocessing step, vector representations of the documents and words were generated for use as input to the CNN-LSTM model. Based on the preceding tokenization experiment, the tokens generated with WPM were fed to gensim's doc2vec library to produce the document and word vectors. gensim's doc2vec is an extension of word2vec, so generating document vectors with it also yields vectors for the words contained in the documents. Considering resource usage and performance, the vector size for each document and word was set to 300 dimensions; the parameter settings of the doc2vec class constructor are listed in [Table 14].

[Table 14] Parameter settings of the doc2vec class constructor

○ Experiments and Performance Analysis

1. CNN Model Using Word2Vec (Base Model)

CNN models, originally designed for computer vision, are known to be effective for natural language processing and have achieved excellent results in semantic parsing, search query retrieval, sentence modeling, and other traditional natural language processing tasks. Yoon Kim's study ("Convolutional Neural Networks for Sentence Classification," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014) proposed a CNN for sentence-level classification that uses word vectors pre-trained with the word2vec algorithm. It achieved a higher classification rate than LR and SVM classifiers, which are regarded as high-performing machine learning tools not only in document classification but in many areas of natural language processing and text mining. To run the experiments with 10-fold cross-validation, each collected corpus was split into 90% training data and 10% test data.

FIG. 8 is a diagram illustrating an example of a CNN model using word2vec.

The model of FIG. 8 begins by tokenizing a sentence. The tokenized sentence is converted into a sentence matrix in which each row is the word vector representation of one token. In FIG. 8, the sentence "경찰청 홈페이지 인데 요즘 피 싱 범죄 나" (roughly, "this is the National Police Agency homepage, and these days phishing crimes ...") is split into eight tokens (경찰청, 홈페이지, 인데, 요즘, 피, 싱, 범죄, 나), and each token is represented as a vector using the word2vec library. In FIG. 8, each token is converted into a 300-dimensional vector. If the dimension of the word vector is d and the length of the given sentence (its number of tokens) is s, the sentence matrix has dimension s × d, which in FIG. 8 gives a sentence matrix of shape 8 × 300. When sentences are converted to matrices, zero-padding is used to bring them to the CNN input length. With the sentence converted to a matrix, it can be treated like an image, as in an ordinary CNN. Next, convolution is performed with filters. Because each row of the sentence matrix represents one token, that is, one word, the filter width is set equal to the word vector dimension, 300 (as described above, words are generated as 300-dimensional vectors in the present invention, so the filter width is 300).
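The construction of a zero-padded sentence matrix can be sketched as follows. The helper `sentence_matrix`, the tiny embedding dimension, and the stand-in embedding table are hypothetical; the text uses 300-dimensional word2vec vectors, and mapping unknown tokens to a zero vector is an assumption made only for this illustration.

```python
def sentence_matrix(tokens, embeddings, dim, max_len):
    """Build a max_len x dim matrix: one row per token, zero rows as padding.

    Tokens beyond max_len are truncated; tokens missing from `embeddings`
    are mapped to a zero vector (an assumption for this sketch).
    """
    zero = [0.0] * dim
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:max_len]]
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows

dim = 4  # tiny dimension for display; the text uses 300
tokens = ["경찰청", "홈페이지", "인데"]
# Hypothetical stand-in for word2vec vectors.
embeddings = {tok: [float(i + 1)] * dim for i, tok in enumerate(tokens)}
matrix = sentence_matrix(tokens, embeddings, dim, max_len=5)
print(len(matrix), "x", len(matrix[0]))
for row in matrix:
    print(row)
```

Every sentence, whatever its token count, comes out with the same s × d shape, which is what allows the convolution filters to be applied uniformly.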

The filter width is therefore fixed, and only the filter height can vary. The filter height is the number of adjacent words considered together, that is, the size of the context window. Yoon Kim's study refers to the filter height h as the region size of the filter. Multiple filters can be applied for the same region size in order to learn complementary features from the same region. In FIG. 8, filters with three region sizes (3, 4, 5), that is, filters of size 3 × 300, 4 × 300, and 5 × 300, are used with 150 filters per size, 450 filters in total. A bias is added to each convolution output and an activation function is applied, producing 450 new feature maps. Because the window slides over the sentence only from top to bottom, the feature map produced by each filter is a vector of dimension s - h + 1, which varies with the filter region size h and the sentence length s.
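The relationship between sentence length s, region size h, and feature map length s - h + 1 can be checked with a small sketch. The function `conv_feature_map` is a hypothetical helper; the choice of ReLU as the activation function and the filter weights are assumptions, since the text only says that an activation function is applied.

```python
def conv_feature_map(matrix, filt, bias=0.0):
    """Slide an h x d filter down an s x d sentence matrix.

    Returns a feature map of length s - h + 1. ReLU is used as the
    activation here, which is an assumption for this sketch.
    """
    s, h = len(matrix), len(filt)
    feature_map = []
    for top in range(s - h + 1):
        total = bias
        for r in range(h):
            total += sum(w * x for w, x in zip(filt[r], matrix[top + r]))
        feature_map.append(max(0.0, total))
    return feature_map

s, d = 8, 6  # 8 tokens as in FIG. 8; d kept small here (the text uses 300)
matrix = [[0.1 * (row + col) for col in range(d)] for row in range(s)]
for h in (3, 4, 5):
    filt = [[0.01] * d for _ in range(h)]  # hypothetical filter weights
    print("region size", h, "-> feature map length", len(conv_feature_map(matrix, filt)))
```

Because the filter spans the full width d, it can only move vertically, so each filter yields a one-dimensional feature map rather than a two-dimensional one.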

Next, a 1-max pooling operation is performed on each feature map. By selecting the single largest value in each feature map, 1-max pooling retains the most important feature of each map. The outputs of the 1-max pooling operations applied to the feature maps are concatenated, and the resulting vector becomes the top-level feature vector. The size of the top-level feature vector does not depend on the sentence length; it depends only on the number of region sizes and the number of filters, and equals the number of region sizes × the number of filters per region size. Thus, in FIG. 8, the top-level feature vector has a fixed length of 3 × 150 = 450 regardless of the length of the sentence input to the model.
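The fixed length of the top-level feature vector can be verified with a short sketch. The helper `one_max_pool` and the dummy feature map values are hypothetical; only the counts (3 region sizes, 150 filters per size) come from the text.

```python
def one_max_pool(feature_maps):
    """Keep only the largest value of each feature map and concatenate them."""
    return [max(fm) for fm in feature_maps]

s = 8                      # sentence length (number of tokens)
region_sizes = (3, 4, 5)
filters_per_size = 150
# Dummy feature maps of length s - h + 1; the values are placeholders.
feature_maps = [
    [float((k + j) % 7) for j in range(s - h + 1)]
    for h in region_sizes
    for k in range(filters_per_size)
]
top_level = one_max_pool(feature_maps)
print(len(top_level))
```

Each feature map contributes exactly one value regardless of its length, so changing `s` changes the feature maps but not the length of `top_level`.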

Finally, the top-level feature vector is passed to a fully connected softmax layer (hereinafter the FC layer) for final classification, and the output of the FC layer is a probability distribution over the labels. The label with the highest output value becomes the predicted label for the given sentence. In FIG. 8, sentences are classified into three labels. Dropout may be applied in the FC layer as a means of regularization.
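The final step, turning FC-layer outputs into a probability distribution and picking the most probable label, can be sketched as follows. The score values are hypothetical; the three label names are the categories used in the text.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["service", "information", "system"]
scores = [1.2, 0.3, 2.1]  # hypothetical FC-layer outputs for one sentence
probs = softmax(scores)
predicted = labels[probs.index(max(probs))]
print([round(p, 3) for p in probs], predicted)
```

Subtracting the maximum score before exponentiation does not change the result but avoids overflow for large scores.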

FIG. 9 is a simplified schematic of the CNN model using word2vec described in the example so far.

2. CNN-LSTM Model Using Word2Vec (Embodiment Model)

The model proposed in Yoon Kim's study, described earlier, is a one-layer CNN consisting of a single convolution layer that applies multiple filters of various region sizes and a single 1-max pooling layer. Its FC layer likewise has no hidden layer and consists only of a softmax output layer, giving the simple structure shown in FIG. 9.

The input layer in the present invention is the same as in the base model described above, and the number of words input to the CNN, that is, the height of the matrix composed of word vector representations, is fixed. As described earlier, this is because, to remedy inefficiencies such as wasted resources and longer training time caused by dummy words, the document is assumed to be input with its length limited to a number of words that does not degrade performance, rather than with every word it contains.

Accordingly, the number of input words (tokens) that does not degrade model performance is selected in the experiments that follow. An LSTM layer is added after the convolution layer and the max-pooling layer of the base model. Here the LSTM model consists of three LSTM cell layers combined into a MultiRNNCell. To let the two models compensate for each other's shortcomings, the CNN-LSTM model, which uses both kinds of neural network together, was used for text classification of posts.

The CNN model extracts vectors that represent the features of the text well, and these vectors are fed to the LSTM model so that the classification model is trained to reflect long-term dependencies within the context of the post.

FIG. 10 shows an example of a composite model in which an LSTM model is added to the CNN model using word2vec, according to an embodiment of the present invention. The model of FIG. 10 begins by tokenizing a sentence. The tokenized sentence is converted into a sentence matrix in which each row is the word vector representation of one token. In FIG. 10, for example, the sentence "경찰청 홈페이지 인데 요즘 피 싱 범죄 나" (roughly, "this is the National Police Agency homepage, and these days phishing crimes ...") is split into eight tokens (경찰청, 홈페이지, 인데, 요즘, 피, 싱, 범죄, 나), and each token is represented as a vector using the word2vec library.

In FIG. 10, each token is converted into a 300-dimensional vector. If the dimension of the word vector is d and the length of the given sentence (its number of tokens) is s, the sentence matrix has dimension s × d, which in FIG. 10 gives a sentence matrix of shape 8 × 300. When sentences are converted to matrices, zero-padding is used to bring them to the CNN input length. Next, convolution is performed with filters. Because each row of the sentence matrix represents one token, that is, one word, the filter width is set equal to the word vector dimension, 300 (as described above, words are generated as 300-dimensional vectors in the present invention, so the filter width is 300). The filter width is therefore fixed, and only the filter height can vary. The filter height is the number of adjacent words considered together, that is, the size of the context window. Yoon Kim's study, described earlier, refers to the filter height h as the region size of the filter. Multiple filters can be applied for the same region size in order to learn complementary features from the same region.

In FIG. 10, filters with three region sizes (3, 4, 5), that is, filters of size 3 × 300, 4 × 300, and 5 × 300, are used with 150 filters per size, 450 filters in total. A bias is added to each convolution output and an activation function is applied, producing 300*150*3 new feature maps. Because the window slides over the sentence only from top to bottom, the feature map produced by each filter is a vector of dimension s - h + 1, which varies with the filter region size h and the sentence length s.

Next, a max pooling operation is performed on each feature map. There are several pooling methods for compressing information; the present invention uses max pooling, in which a max_pool_size-max pooling operation selects the largest value so that the most important features of each feature map are retained. The value of max_pool_size was set to 5 through experiments. Pooling is performed with ksize [1,5,1,1] and strides [1,5,1,1], the resulting 20*150*3 output values are concatenated, and the concatenated vector becomes the top-level feature vector. The size of the top-level feature vector does not depend on the sentence length; it depends only on the number of region sizes and the number of filters, and equals the number of region sizes × the number of filters per region size. Thus, in FIG. 10, the top-level vector has a fixed length of 3 × 150 = 450 regardless of the length of the sentence input to the model, and in this study 20*450 top-level feature values are produced after the max-pooling layer.
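Pooling with a window of 5 and a stride of 5, as given by the ksize and strides values above, can be sketched along the sequence axis as follows. The helper `max_pool_1d` is hypothetical, and the feature map length of 100 is an assumption chosen only so that the output has 20 entries, matching the 20*150*3 figure; the text does not state the input length explicitly.

```python
def max_pool_1d(feature_map, size, stride):
    """Max over consecutive windows; an incomplete trailing window is dropped."""
    return [
        max(feature_map[start:start + size])
        for start in range(0, len(feature_map) - size + 1, stride)
    ]

feature_map = [float(i % 9) for i in range(100)]  # assumed length of 100
pooled = max_pool_1d(feature_map, size=5, stride=5)
print(len(pooled))
```

Unlike 1-max pooling, which keeps one value per feature map, windowed pooling keeps one value per window, so the pooled output still has a sequence axis that a subsequent LSTM can step through.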

Finally, the top-level feature vector is passed through the LSTM model for final classification. The LSTM model is a MultiRNNCell built from three BasicLSTMCell layers, which performed slightly better than a single BasicLSTMCell. Its output is passed to a fully connected + softmax layer (hereinafter the FC layer), and the output of the FC layer is a probability distribution over the labels. The label with the highest output value becomes the predicted label for the given sentence.

3. Classifier Performance Analysis

FIG. 11 is a diagram showing the configuration of the model classifiers. In the present invention, automatic word spacing and WPM are applied to the posts before the corpus is generated. To determine the accuracy of the single classifiers and the composite classifier, the accuracy of each classifier was measured, and the models were built using Python and the Doc2Vec library of Gensim.

After the models were generated, 10-fold cross-validation was applied to increase the statistical reliability of the performance measurements, and training was repeated 20 to 500 times depending on the classification model. As single-model classifiers, two machine learning classifiers, CNN and LSTM, were applied and their accuracies computed, and these were compared with the proposed composite classifier, the CNN-LSTM model.
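The 10-fold cross-validation scheme mentioned above can be sketched with the standard library alone. The generator `k_fold_indices`, the fixed shuffle seed, and the sample count of 50 are hypothetical; the sketch only shows how each fold serves once as the 10% test split while the other nine folds form the 90% training split.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

for fold_no, (train_idx, test_idx) in enumerate(k_fold_indices(n_samples=50, k=10)):
    print(fold_no, len(train_idx), len(test_idx))
```

Averaging a metric over the ten test folds gives a more stable estimate than a single train/test split, which is the stated purpose of using cross-validation here.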

The classifier experiments of the present invention used TensorFlow, provided by Google, and scikit-learn (sklearn), a Python machine learning library. TensorFlow used multiple GPUs to process the 300-channel arrays applied in this experiment, and scikit-learn is a library used mainly for data mining and analysis.

As the basic classifier, a softmax regression model was constructed as shown in FIG. 12. The 300-dimensional Word2Vec representations were fed into an array to train on the training data, the machine learning model was generated after comparison against the test data with 10-fold cross-validation, and the accuracy was computed.

The results for the softmax regression chosen as the basic classifier in the present invention are described in more detail below. As shown in Table 16, when the post corpus, which comprised 658,579 items, was applied to the classifiers, CNN-LSTM performed best with a classification rate of 73.7%, followed by LSTM with 69.7% and CNN with 68.6%.

4. Performance experiment and analysis by model

In the present invention, performance accuracy is evaluated using the Precision, Recall, and F1 values of each class. Table 15 below shows the experimental environment, and Table 16 shows the final performance results for each model.
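The per-class Precision, Recall, and F1 values can be computed from the true and predicted labels as in the sketch below. The function name `per_class_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: {"precision", "recall", "f1"}} computed one class at a time."""
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

Precision is the fraction of posts assigned to a category that truly belong to it, Recall is the fraction of posts of a category that were found, and F1 is their harmonic mean.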

[Table 15] Performance evaluation experiment environment

[Table 16] Classification performance by model

4.1 CNN-based classifier performance analysis

FIG. 13 shows the performance of the CNN-based classifier according to the number of training iterations. As shown in FIG. 13, the CNN was tested at 10, 30, 50, 100, and 200 training iterations. Performance was observed to degrade as the number of training iterations increased.

4.2 LSTM-based classifier performance analysis

FIG. 14 shows the performance of the LSTM-based classifier according to the number of training iterations. As before, the LSTM was tested at 10, 30, 50, 100, and 200 training iterations. As with the CNN, performance was observed to degrade as the number of training iterations increased.

4.3 CNN-LSTM-based classifier performance analysis

In the CNN-LSTM configuration, performance differed according to the parameter settings. The present invention uses 300-dimensional Word2Vec as the default for word embedding, and performance was tested while varying the values of the other parameters. Table 17 below shows the parameters that were varied.

[Table 17] Types of variable parameters

A) Performance test by maximum sentence length (Max_Sentence_Length)

The minimum number of tokens in a sentence used in the present invention is 2 and the maximum number of tokens is 463, giving a range of 2 to 463 tokens.
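Because sentence lengths vary widely, the token vectors must be brought to a fixed length before being fed to the CNN; claim 3 states that this is done through zero-padding. The sketch below illustrates one way to do so. Truncating sentences longer than the chosen maximum is an assumption of this example, as are the function name `pad_or_truncate` and its parameters.

```python
def pad_or_truncate(token_vectors, max_len, dim):
    """Return exactly max_len vectors of size dim.

    Shorter sentences are zero-padded at the end; longer ones are cut off
    (truncation is an assumption, the source only mentions zero-padding).
    """
    clipped = [list(v) for v in token_vectors[:max_len]]
    padding = [[0.0] * dim for _ in range(max_len - len(clipped))]
    return clipped + padding
```

A larger `max_len` keeps more of the long sentences intact but adds more padding rows to the short ones, which is presumably why the maximum sentence length is treated as a tunable parameter.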

[Table 18] Performance measurement by maximum sentence length

B) Performance test by filter

The filter performance test was conducted by varying the number of filter types from two to five. Table 19 below shows the performance for each filter setting.
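The role of the filters can be illustrated with a small sketch of the convolution and max pooling steps named in claim 4: each filter slides over the token matrix to produce a feature map, one max pooling operation keeps a single feature per map, and the results are concatenated. The ReLU activation, the function names, and the example kernels are assumptions of this sketch; the input is assumed to be at least as long as the tallest kernel.

```python
def conv_feature_map(matrix, kernel):
    """Slide an (h x dim) kernel over an (n x dim) token matrix.

    Returns n - h + 1 values, one per window, passed through ReLU
    (the activation is an assumption of this sketch).
    """
    h = len(kernel)
    feature_map = []
    for start in range(len(matrix) - h + 1):
        s = 0.0
        for k_row, m_row in zip(kernel, matrix[start:start + h]):
            s += sum(k * m for k, m in zip(k_row, m_row))
        feature_map.append(max(0.0, s))
    return feature_map


def top_level_features(matrix, kernels):
    """Max-pool each feature map to one value and concatenate the results."""
    return [max(conv_feature_map(matrix, kernel)) for kernel in kernels]
```

The output length equals the number of kernels regardless of sentence length, which is what gives the top-level feature vector its fixed length.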

[Table 19] Performance measurement by filter

C) Performance test by LSTM hidden size

The LSTM hidden size was tested at settings of 50, 100, 200, 300, and 400. Table 20 below shows the performance for each LSTM hidden size.

[Table 20] Performance test by LSTM hidden size

D) Performance comparison between BasicLSTMCell and MultiRNNCell

When configuring the cell that feeds the output of the CNN model into the LSTM, BasicLSTMCell is used for a single layer, while MultiRNNCell is used for a multi-layer configuration and allows the number of cell layers to be set. Table 21 below shows the performance for each cell type, and FIG. 15 shows the corresponding performance measurement results. As can be seen from Table 21 and FIG. 15, an average accuracy improvement of 1.7% was observed over 50 runs.
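The difference between a single LSTM cell and a multi-layer stack can be sketched without TensorFlow. The classes below are hypothetical stand-ins written in plain Python: `TinyLSTMCell` plays the role of a single-layer cell and `StackedCell` the role of a multi-layer wrapper that feeds each layer's output into the next. The random weight initialization and all names are assumptions, and this is not the TensorFlow implementation of BasicLSTMCell or MultiRNNCell.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Hypothetical minimal LSTM cell (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # Rows are grouped by gate: input, forget, candidate, output.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                  for _ in range(4 * hidden_size)]
        self.b = [0.0] * (4 * hidden_size)

    def zero_state(self):
        return [0.0] * self.hidden_size, [0.0] * self.hidden_size

    def step(self, x, state):
        c_prev, h_prev = state
        xh = list(x) + list(h_prev)
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(self.w, self.b)]
        n = self.hidden_size
        i = [sigmoid(v) for v in z[0:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        g = [math.tanh(v) for v in z[2 * n:3 * n]]
        o = [sigmoid(v) for v in z[3 * n:4 * n]]
        c = [fi * cp + ii * gi for fi, cp, ii, gi in zip(f, c_prev, i, g)]
        h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
        return h, (c, h)


class StackedCell:
    """Chains several cells: the output of layer k is the input of layer k + 1."""

    def __init__(self, cells):
        self.cells = cells

    def zero_state(self):
        return [cell.zero_state() for cell in self.cells]

    def step(self, x, states):
        new_states = []
        for cell, state in zip(self.cells, states):
            x, new_state = cell.step(x, state)
            new_states.append(new_state)
        return x, new_states


def run(cell, sequence):
    """Feed a sequence through a cell and return the final output vector."""
    state = cell.zero_state()
    output = None
    for x in sequence:
        output, state = cell.step(x, state)
    return output
```

Both cell types expose the same `step` interface, so switching from one layer to several only changes how the cell is constructed, not how the sequence is processed.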

[Table 21] Performance measurement by cell type

E) Performance test by number of LSTM MultiRNNCell layers

The number of LSTM hidden layers was tested at settings of 1, 2, and 3. Table 22 below shows the performance for each number of LSTM MultiRNNCell layers.

[Table 22] Performance by number of LSTM MultiRNNCell layers

4.4 CNN-LSTM final performance measurement and comparison

The final performance was measured by combining the test results for each CNN-LSTM configuration method with the test results for each parameter. The CNN-LSTM configuration method and parameters used in the final performance measurement were chosen based on the performance measured in the unit tests. Table 23 shows the parameters used in the final test.

[Table 23] Final test parameters

FIG. 16 shows the performance analysis of the CNN-LSTM-based classifier using the final parameters of Table 23, and FIG. 17 shows the accuracy of each model when training was performed 10, 30, 50, 100, and 200 times. For the single CNN and LSTM models, performance degraded as the number of training iterations increased, whereas the CNN-LSTM model remained more stable than the single models.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains may make various modifications and changes without departing from its essential characteristics. Accordingly, the embodiments disclosed in the present invention are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments.

The scope of protection of the present invention should therefore be construed according to the claims below rather than being limited by the embodiments described above, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

Claims

A method for automatically classifying categories of unstructured data, the method comprising:
collecting a string of unstructured data from a post posted on an Internet bulletin board;
performing preprocessing on the collected string of unstructured data, the preprocessing including information extraction and string processing, and character tokenization processing, wherein the string is converted into a vector representation by the preprocessing;
inputting the preprocessed vector representation into a CNN-LSTM classifier; and
automatically classifying the category of the post by the CNN-LSTM classifier.

The method of claim 1,
wherein the information extraction and string processing includes:
parsing an Excel file to extract body and category information;
processing newline characters and special characters in the string;
automatically spacing the string; and
applying WPM to the automatically spaced string,
and wherein the character tokenization processing includes
converting the WPM-applied string into a vector representation by using the Word2Vec library.

The method of claim 2,
wherein the vector representation input to the CNN-LSTM classifier has a predetermined input length for the CNN through zero-padding, and the height of the matrix of vectors (corresponding to the number of tokens input to the CNN) is fixed.

The method of claim 3,
wherein the step of automatically classifying the category of the post by the CNN-LSTM classifier includes:
performing convolution through a predetermined filter to generate a plurality of feature maps;
performing one max pooling operation on each of the generated feature maps to obtain one feature from each feature map;
concatenating all output values to produce a top-level feature vector having a fixed length;
processing the generated top-level feature vector with a MultiRNNCell consisting of three layers of BasicLSTMCell; and
outputting through a fully connected + softmax layer.

The method of claim 4,
wherein the output of the fully connected + softmax layer is a probability distribution over the labels, and the label with the highest output value is classified as the predicted label of the given sentence.