KR20180120488A

KR20180120488A - Classification and prediction method of customer complaints using text mining techniques

Info

Publication number: KR20180120488A
Application number: KR1020170054520A
Authority: KR
Inventors: 배석주; 정륜선
Original assignee: 한양대학교 산학협력단
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2018-11-06

Abstract

Disclosed is a method for classifying and predicting customer complaints using text mining techniques. The method analyzes customer complaints in VOC (Voice of Customer) data, which contains a wide range of customer opinions, using a classification and prediction algorithm for customer complaints together with TF-IDF, one of the text mining techniques. To extract complaint data from data crawled from the Internet, a predefined algorithm is taken as prior information and the data is classified as positive or negative with a naive Bayesian classifier. An artificial neural network and an SVM method are then compared so that customer complaints can be predicted more accurately.

Description

Classification and prediction method of customer complaints using text mining techniques

The present invention relates to a method for classifying and predicting customer complaints using text mining techniques.

VOC (Voice of Customer) refers to the various inquiries, complaints, suggestions, and other responses that customers make to a company's services in the course of its business activities.

In general, a VOC analysis system collects call consultation records from a call consultation system; electronic complaints, written complaints, commendations, and customer ideas from a complaint management system; and chat consultation records from a chat consultation system, and analyzes them with advanced language processing techniques. VOC data analysis techniques based on text mining include ranking analysis, which measures how often meaningful terms (nouns and noun phrases) occur within a document group and presents the terms currently at issue as a ranking by period (hourly, daily, weekly, monthly, and so on); association analysis, which analyzes relationships between terms and presents related terms through visualizations such as Flash; trend analysis, which shows how a particular term changes over time; and reputation analysis, which analyzes customer opinions and automatically classifies them as good or bad.

Most VOC data consists of customers' subjective opinions expressed in an emotional state and without a fixed format, so neither its form nor its content is standardized, and with advances in information and communication technology the volume of customer complaint data is growing exponentially. Conventional VOC data analysis based on text mining does not take this customer sentiment into account. Although techniques such as association analysis, trend analysis, and reputation analysis exist for measuring how often particular words occur in a document and for identifying relationships between terms, there has been no method for analyzing big data efficiently.

An object of the present invention is to provide a method for classifying and predicting customer complaints using text mining techniques.

Another object of the present invention is to provide an algorithm that predicts customer complaints more accurately. The algorithm analyzes customer complaints in VOC data containing a wide range of customer opinions, using a classification and prediction algorithm for customer complaints and TF-IDF, one of the text mining techniques. To extract complaint data from data crawled from the Internet, it takes a predefined algorithm as prior information and classifies the data as positive or negative with a naive Bayesian classifier, and it compares an artificial neural network with an SVM method.

According to one aspect of the present invention, there is provided a method that analyzes customer complaints in VOC data containing a wide range of customer opinions, using a classification and prediction algorithm for customer complaints and TF-IDF, one of the text mining techniques; that, to extract complaint data from data crawled from the Internet, takes a predefined algorithm as prior information and classifies the data as positive or negative with a naive Bayesian classifier; and that compares an artificial neural network with an SVM method to predict customer complaints more accurately.

By providing the method for classifying and predicting customer complaints using text mining techniques according to an embodiment of the present invention, customer complaints can be analyzed in VOC data containing a wide range of customer opinions, using a classification and prediction algorithm for customer complaints and TF-IDF, one of the text mining techniques. To extract complaint data from data crawled from the Internet, a predefined algorithm is taken as prior information and the data is classified as positive or negative with a naive Bayesian classifier, and an artificial neural network and an SVM method are compared so that customer complaints can be predicted more accurately.

FIG. 1 is a diagram showing an example in which each sentence is divided into components such as adjectives and nouns, according to an embodiment of the present invention.
FIG. 2 is a diagram showing the result of classifying sentiment with a naive Bayes classifier, according to an embodiment of the present invention.
FIG. 3 is a diagram showing the customer complaint index calculated using the TF-IDF model, according to an embodiment of the present invention.

As used herein, singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "consists of" or "comprises" should not be construed as necessarily including all of the elements or steps described in the specification; some of those elements or steps may be omitted, or additional elements or steps may be included. In addition, terms such as "... unit" and "module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The present invention uses text mining techniques to analyze the voice of the customer, which is important information for improving customer satisfaction and corporate decision-making in modern business management. In order to anticipate and respond to customer complaints, sentiment analysis is performed on SNS and bulletin board data and customer complaints are clustered, so that a customer complaint prediction model can be provided.

That is, the present invention analyzes customer complaints in VOC data containing a wide range of customer opinions, using a classification and prediction algorithm for customer complaints and TF-IDF, one of the text mining techniques. To extract complaint data from data crawled from the Internet, a predefined algorithm is taken as prior information and the data is classified as positive or negative with a naive Bayesian classifier, and an artificial neural network and an SVM method are compared so that customer complaints can be predicted more accurately.

The TF-IDF weighting model is a document representation used in information retrieval and text mining to evaluate the relative importance of words within a document. Because it is based on word frequency, it has the limitation that it cannot determine the sentiment of a word. To address this, a naive Bayes classifier, which assigns classes based on Bayesian probability theory, was used to add a weight for negative words to the TF-IDF value used for extracting negative words from the given data. For sentiment classification, the positive/negative lexicon proposed by Riloff and Wiebe (2003) was used as the prior distribution and analyzed with a naive Bayesian classifier, and the posterior probability obtained from the classifier was defined as the negative index. The K-means algorithm and hierarchical cluster analysis were then carried out to develop a clustering technique for customer complaints. The number of clusters was determined using the Duda index, the CH index, and the C index. Customer complaints are then predicted using an artificial neural network and an SVM. The artificial neural network uses the RPROP algorithm and the SVM uses a radial basis kernel, so that the more accurate prediction model can be identified.
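The idea of adding a negativity weight to TF-IDF can be sketched as follows. This is a minimal illustration, not the formula of the specification: the source does not state how the weight is combined with TF-IDF, so the multiplicative form `(1 + negative_prob)`, the function names `tf_idf` and `complaint_index`, and the sample documents are all assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

def complaint_index(doc_scores, negative_prob):
    """Scale a document's summed TF-IDF by its negative posterior.

    ASSUMPTION: the source only says a weight for negative words is added
    to TF-IDF; the factor (1 + negative_prob) is chosen for illustration.
    """
    return sum(doc_scores.values()) * (1.0 + negative_prob)

# Hypothetical tokenized posts.
docs = [
    ["engine", "noise", "terrible"],
    ["gas", "mileage", "terrible"],
    ["gas", "mileage", "great"],
]
scores = tf_idf(docs)
index = complaint_index(scores[0], negative_prob=0.9)
```

Here `negative_prob` stands for the posterior probability of the negative class that a naive Bayes classifier would supply, so a document judged more negative receives a proportionally larger index.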

FIG. 1 is a diagram showing an example in which each sentence is divided into components such as adjectives and nouns according to an embodiment of the present invention, FIG. 2 is a diagram showing the result of classifying sentiment with a naive Bayes classifier according to an embodiment of the present invention, and FIG. 3 is a diagram showing the customer complaint index calculated using the TF-IDF model according to an embodiment of the present invention.

Vehicle-related complaint data was collected by crawling SNS and Internet community bulletin boards in the United States. Sentiment analysis was performed, complaint data was extracted, and models capable of predicting complaints were examined.

A dictionary was defined, and each sentence was divided into components such as adjectives and nouns and organized as shown in FIG. 1.
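A dictionary-based split of a sentence into components might look like the sketch below. The dictionary entries, the `"other"` fallback label, and the function name `group_by_pos` are hypothetical; the source does not list the dictionary it used.

```python
from collections import defaultdict

# HYPOTHETICAL dictionary: the source does not list its entries.
POS_DICTIONARY = {
    "engine": "noun",
    "mileage": "noun",
    "brake": "noun",
    "noisy": "adjective",
    "poor": "adjective",
}

def group_by_pos(sentence, dictionary=POS_DICTIONARY):
    """Group the words of a sentence by their dictionary part of speech."""
    groups = defaultdict(list)
    for word in sentence.lower().split():
        word = word.strip(".,!?")  # drop surrounding punctuation
        # Words missing from the dictionary fall into an "other" bucket.
        groups[dictionary.get(word, "other")].append(word)
    return dict(groups)

parts = group_by_pos("The engine is noisy, and mileage is poor.")
```

With this toy dictionary, `parts["noun"]` holds `engine` and `mileage`, and `parts["adjective"]` holds `noisy` and `poor`.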

For sentiment analysis of the data, a naive Bayes classifier was built using the positive/negative lexicon compiled by Riloff and Wiebe (2003) as the prior distribution. The classifier divided sentiment into three classes (positive, negative, and neutral), and the class with the largest of the three values was selected as the final classification. The result is shown in FIG. 2. The customer complaint index calculated by applying these values to the proposed Advanced TF-IDF model is shown in FIG. 3.
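The three-class decision described above can be illustrated with a small multinomial naive Bayes sketch. The miniature lexicon, the uniform class prior, and the add-one smoothing are assumptions made for the example; they stand in for the Riloff and Wiebe lexicon and whatever estimation details the inventors actually used.

```python
import math

# HYPOTHETICAL miniature lexicon standing in for the positive/negative dictionary.
LEXICON = {
    "positive": ["great", "smooth", "reliable"],
    "negative": ["terrible", "noisy", "broken"],
    "neutral": ["car", "engine", "dealer"],
}

def posteriors(tokens, lexicon=LEXICON, alpha=1.0):
    """Return P(class | tokens) under a multinomial naive Bayes model."""
    vocab = {w for words in lexicon.values() for w in words}
    log_scores = {}
    for label, words in lexicon.items():
        # ASSUMPTION: uniform class prior and add-alpha smoothed likelihoods.
        log_p = math.log(1.0 / len(lexicon))
        denom = len(words) + alpha * len(vocab)
        for tok in tokens:
            if tok in vocab:  # tokens outside the lexicon are ignored
                log_p += math.log((words.count(tok) + alpha) / denom)
        log_scores[label] = log_p
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(exp.values())
    return {k: v / total for k, v in exp.items()}

def classify(tokens):
    """Pick the class with the largest posterior, as the text describes."""
    post = posteriors(tokens)
    return max(post, key=post.get), post

label, post = classify(["engine", "noisy", "terrible"])
```

In this sketch `post["negative"]` plays the role of the negative index, that is, the posterior probability of the negative class.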

Clustering was then carried out using the K-means algorithm and hierarchical cluster analysis.
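For reference, the K-means step can be sketched with plain Lloyd iterations. The feature vectors, the choice of the first `k` points as initial centroids, and the fixed iteration count are assumptions for illustration; the source does not describe its features or initialization, and hierarchical clustering is not shown.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D document vectors forming two obvious groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels, centroids = kmeans(points, k=2)
```

In practice the points would be document vectors such as weighted TF-IDF rows, and several values of `k` would be compared using cluster validity indices.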

Although the optimal number of clusters could not be determined, the number of clusters was set to 2,361, which is given as 90% of the total of 2,486 documents. Through the clustering analysis, the complaint data was grouped by keyword, and the number of clusters containing three or more documents was 21. For example, if ten documents related to gas mileage fall into a single cluster, that cluster counts as one of the 21 clusters that contain multiple documents.
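Counting the clusters that hold three or more documents is a simple tally over the cluster labels. The sketch below uses made-up labels and a hypothetical helper name, `clusters_with_min_docs`, only to show the counting rule.

```python
from collections import Counter

def clusters_with_min_docs(labels, min_docs=3):
    """Return {cluster_id: size} for clusters holding at least min_docs documents."""
    sizes = Counter(labels)
    return {cid: n for cid, n in sizes.items() if n >= min_docs}

# HYPOTHETICAL labels: cluster 7 holds four documents, the rest hold one or two.
labels = [7, 7, 7, 7, 1, 2, 2, 3]
multi = clusters_with_min_docs(labels)
```

Applied to the real cluster assignments, the length of the returned dictionary would correspond to the count of 21 reported in the text.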

Next, customer complaints were predicted using an artificial neural network and an SVM. 80% of the data was used as training data to build the classification model, and 20% was used as test data to validate it. The artificial neural network achieved an accuracy of 80.77% and the SVM achieved 83.61%. The SVM is therefore the somewhat more accurate prediction model.
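The evaluation protocol (an 80/20 split followed by an accuracy comparison) can be sketched as below, together with the radial basis kernel that the SVM is said to use. The shuffling seed, the `gamma` value, and the helper names are assumptions; the classifiers themselves are not implemented here.

```python
import math
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle reproducibly and split into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis kernel exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical document ids standing in for labeled complaint records.
train, test = train_test_split(range(100))
```

Each model would be fitted on `train` and scored with `accuracy` on `test`, and the model with the higher test accuracy would be preferred.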

According to an embodiment of the present invention, customer complaints and needs can be identified quickly and efficiently from unstructured data such as text. The analyzed data can be used as information for improvements and for developing products or services, and better customer service can be provided by removing causes and improving operations before customer complaints arise.

The method for classifying and predicting customer complaints using text mining techniques according to the present invention described above can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording medium that stores data readable by a computer system, for example ROM (Read Only Memory), RAM (Random Access Memory), magnetic tape, magnetic disks, flash memory, and optical data storage devices. The computer-readable recording medium may also be distributed across computer systems connected by a computer network, so that the code is stored and executed in a distributed manner.

The embodiments of the present invention described above are disclosed for illustrative purposes. Those of ordinary skill in the art will be able to make various modifications, changes, and additions within the spirit and scope of the present invention, and such modifications, changes, and additions should be regarded as falling within the scope of the following claims.

Claims

A method for classifying and predicting customer complaints using text mining techniques, characterized in that customer complaints are analyzed in VOC data containing a wide range of customer opinions, using a classification and prediction algorithm for customer complaints and TF-IDF, one of the text mining techniques; in that, to extract complaint data from data crawled from the Internet, a predefined algorithm is taken as prior information and the data is classified as positive or negative using a naive Bayesian classifier; and in that an artificial neural network and an SVM method are compared and analyzed to predict customer complaints more accurately.