KR20220105792A

KR20220105792A - AI-based Decision Making Support System utilizing Dynamic Text Sources

Info

Publication number: KR20220105792A
Application number: KR1020210008591A
Authority: KR
Inventors: 장경희; 아자룰
Original assignee: 인하대학교 산학협력단
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2022-07-28
Also published as: KR102466559B1

Abstract

Disclosed are an AI-based decision-making support system utilizing dynamic text sources and a method thereof. The AI-based decision-making support system utilizing dynamic text sources, suggested by the present invention, comprises: a data mining and analysis unit analyzing raw data or scraping data based on user keywords and performing data mining and analysis when the data is scraping data that is unlabeled for unsupervised learning; a data categorization unit identifying raw data and scraping data by receiving labeled raw data and unlabeled scraping data from the data mining and analysis unit; a text classification and analysis unit extracting data from sources of unlabeled scraping data, converting the data into labeled data, and performing data-wrangling extraction and model evaluation on the labeled data; and a decision-making and classification unit predicting data after model evaluation performed by the text classification and analysis unit and providing prediction results through visualization of multiple decision-making graphs and information output by a chatbot application. According to the present invention, classification accuracy on datasets can be improved by using both machine learning and deep learning algorithms.

Description

AI-based Decision Making Support System utilizing Dynamic Text Sources

The present invention relates to an AI-based decision-making support system utilizing dynamic text sources.

Over the past decade, the Internet has been increasingly used by individuals as both producers and consumers of online data [1]. According to a 2008 survey of User-Generated Content (UGC) [2], 35% of Internet users in the United States had contributed UGC to the web at least once, with similar trends in Europe, Japan, and Korea. In text mining, Imran et al. [3] proposed AIDR (Artificial Intelligence for Disaster Response), a platform for automatic text classification of crisis-related communications. AIDR classifies messages that people post during a disaster into a set of user-defined information categories. Above all, the entire process must collect, process, and produce only reliable information in real time, with low latency [4].

Daud [5] reviewed topic models with soft clustering abilities for text corpora, examining the essential ideas and the existing models, classified into different categories, together with their parameter estimation (i.e., Gibbs sampling) and performance evaluation measures. Daud also introduced several applications of topic models for representing text corpora and discussed open issues and future directions.

Dang et al. [6] reviewed recent studies that adopted deep learning (DL) to solve sentiment analysis problems such as sentiment polarity. The models applied Term Frequency-Inverse Document Frequency (TF-IDF) and word embedding to a series of datasets. Sentiment analysis comprises language processing, text analysis, and computational linguistics for recognizing subjective sentiment [7]. In most cases, the categories of new input data samples are similar [7].

Skrlj et al. [8] presented a practical semantic content-based approach that transforms the semantic information identified in a given set of documents into many new features used for learning. The proposed Semantics-aware Recurrent Neural Architecture (SRNA) allows the system to learn from semantic vectors and raw text documents simultaneously. The proposed approach outperforms methods that use no semantic information, with the largest accuracy gains (up to 10%) obtained on short documents.

The technical problem to be solved by the present invention is to develop a model that classifies unstructured, mined data into labeled data and to build an information and decision-making support system application. The main goal of the present invention is to make robust decisions from unstructured sources from which a user's intent can be identified, even when processing a hazardous dataset.

In one aspect, the AI-based decision-making support system utilizing dynamic text sources proposed in the present invention comprises: a data mining and analysis unit that analyzes raw data or scraping data based on a user's keywords and performs data mining and analysis when the data is unlabeled scraping data for unsupervised learning; a data categorization unit that receives labeled raw data and unlabeled scraping data from the data mining and analysis unit and identifies the raw data and the scraping data; a text classification and analysis unit that extracts data from sources of the unlabeled scraping data, converts it into labeled data, and performs data-wrangling extraction and model evaluation on the labeled data; and a decision classification unit that predicts data after the model evaluation in the text classification and analysis unit and provides prediction results through visualization of a plurality of decision-making graphs and information output by a chatbot application.

The data mining and analysis unit extracts data based on the user's keywords through a scraping data classifier and applies Filter Cleaning Text (FCT), which cleans the data for sentiment analysis and decision-making by handling emoticons, emoji, regular expressions, and stop words. The sentiment analysis evaluates the subjectivity and polarity of the text, and after the data is analyzed, FCT provides structured columns.

The text classification and analysis unit performs learning using artificial intelligence (AI), machine learning (ML), and deep learning (DL), and generates labeled data through data-wrangling classification. When data extraction from a source of unlabeled data begins, the unit clusters documents by topic for topic modeling of multi-class labeled data, and it uses Latent Dirichlet Allocation (LDA), an unsupervised generative probabilistic method, to build ML and DL models for the unlabeled data and to analyze the text.

Using LDA, the text classification and analysis unit converts the document-term matrix into a lower-dimensional document-topic matrix and topic-term matrix through matrix factorization in order to trace back the topics that define each document. It calculates the proportion of words in a document that are currently assigned to a topic and the proportion of assignments to that topic over all documents containing each word, and it obtains words whose frequency is at or above a predetermined threshold through the Word Generative Probabilistic (WGP) method for topic modeling.
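The factorization described above can be written compactly in standard LDA notation. The symbols below (D documents, V vocabulary terms, K topics, and the matrices M, Θ, and Φ) are assumptions introduced here for illustration and do not appear in the specification; M is taken to be the row-normalized document-term matrix.

```latex
\underbrace{M}_{D \times V} \;\approx\; \underbrace{\Theta}_{D \times K}\;\underbrace{\Phi}_{K \times V},
\qquad
\Theta_{d,k} = p(t_k \mid d),
\qquad
\Phi_{k,w} = p(w \mid t_k)
```

Under this reading, the two proportions named in the text correspond to p(t_k | d), the share of words in document d currently assigned to topic t_k, and p(w | t_k), the share of assignments to topic t_k that come from word w across all documents. In the usual sampling-based description of LDA, a word is reassigned to a topic with weight proportional to the product of these two quantities.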

In another aspect, the AI-based decision-making support method utilizing dynamic text sources proposed in the present invention comprises: analyzing, by a data mining and analysis unit, raw data or scraping data based on a user's keywords, and performing data mining and analysis when the data is unlabeled scraping data for unsupervised learning; receiving, by a data categorization unit, labeled raw data and unlabeled scraping data from the data mining and analysis unit, and identifying the raw data and the scraping data; extracting, by a text classification and analysis unit, data from sources of the unlabeled scraping data, converting it into labeled data, and performing data-wrangling extraction and model evaluation on the labeled data; and providing, by a decision classification unit after the model evaluation in the text classification and analysis unit, prediction results through visualization of a plurality of decision-making graphs and information output by a chatbot application.

According to embodiments of the present invention, a model that classifies unstructured, mined data into labeled data can be provided, and an information and decision-making support system application can be built. According to embodiments of the present invention, robust decisions can be made from unstructured sources from which a user's intent can be identified, and classification accuracy on datasets can be improved by using both machine learning and deep learning algorithms.

FIG. 1 is a diagram showing the configuration of an AI-based decision-making support system utilizing dynamic text sources according to an embodiment of the present invention.
FIG. 2 is a diagram showing data extracted into specific columns based on user keywords according to an embodiment of the present invention.
FIG. 3 is a diagram showing columns of a clean dataset in which information is measured by subjectivity and polarity for analysis according to an embodiment of the present invention.
FIG. 4 is a diagram showing the words in the data visualized using a word cloud according to an embodiment of the present invention.
FIG. 5 shows a scatter plot and a bar graph of sentiment analysis data according to an embodiment of the present invention.
FIG. 6 is a diagram showing a sentence distribution graph of documents according to an embodiment of the present invention.
FIG. 7 is a diagram showing the LDA word frequency per document sentence according to an embodiment of the present invention.
FIG. 8 is a diagram showing the result of a similarity check between text classes and labels according to an embodiment of the present invention.
FIG. 9 is a diagram showing a neural network-based context encoder with a Seq2Seq model for a chatbot application according to an embodiment of the present invention.
FIG. 10 is a diagram showing an information decision of the chatbot application according to an embodiment of the present invention.
FIG. 11 is a flowchart illustrating an AI-based decision-making support method utilizing dynamic text sources according to an embodiment of the present invention.

Supervised learning requires that a model first be trained on a previously labeled dataset, because imbalances and similarities in the data need to be uncovered. In contrast, unsupervised learning involves learning and prediction without a pre-labeled dataset.

The present invention proposes a Real-time AI-based Decision Support System (RADSS) model that processes data for unsupervised learning and makes decisions on a topic. The proposed RADSS model has two kinds of information input strategies: one is labeled data, and the other is a data mining process. The user can enter either type of information. Text classification is therefore one of the most important steps for characterizing the information given by the user, and it selects the data path for unsupervised or supervised learning.

If the data contains labeled information, the text classifier and preprocessor perform data-wrangling extraction. After model evaluation is complete, the application retrieves the data for service and prediction.

In contrast, if the information comes from scraping various sources on the web, a number of tasks such as data mining and analysis must be handled. The goal of the classifier is to record clean information for the user and return the output the user wants. To find the clean information segments, sentiment analysis is needed to measure how the information behaves and to visualize it. Topic modeling through Latent Dirichlet Allocation (LDA) and raw data transformation converts this unsupervised data, after the information is assembled, into a labeled dataset, providing structured performance and results.

For the decision-making system, the RADSS model covers: data mining for topic analysis (for example, the novel coronavirus); use of Twitter data (i.e., tweets) for sentiment analysis; unsupervised and supervised topic labeling (multi-class text classification); hyper-tuning of the data to provide robust application efficiency; and data visualization through various graphs, including a comparison of text classification methods. Finally, the chatbot provides information decisions across the supervised and unsupervised processes.

The present invention develops a model that classifies unstructured, mined data into labeled data and builds an information and decision-making support system application. The main goal of the present invention is to make robust decisions from unstructured sources from which a user's intent can be identified, even when processing a hazardous dataset. Natural Language Processing (NLP) plays an important role in text classification by providing the various text preparation steps, and a robust classifier is required because of the inconsistency and non-standard noise in digital messages. In the present invention, a significant improvement in classification accuracy on the dataset is observed using both machine learning and deep learning algorithms. The highest classification accuracy (88%) was achieved on a short corpus with deep learning using the Long Short-Term Memory (LSTM) method. The machine learning algorithm Random Forest (RF) also provides a reasonable accuracy of 84%.

In the proposed RADSS model, data can be extracted from various sources, and preprocessing delivers the accurate user intent through the Decision Making Support (DMS) system. Topic modeling according to an embodiment of the present invention uses multi-class text classification to label the relevant corpus into categories. The proposed model has an automated process that analyzes text data and classifies it as positive, negative, or neutral sentiment. A semantic text mining approach is important for text classification. In addition, the application model shows useful semantic content derived from unstructured data. Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram showing the configuration of an AI-based decision-making support system utilizing dynamic text sources according to an embodiment of the present invention.

The classification model of the proposed Real-time AI-based Decision Support System (RADSS) is shown in FIG. 1. Here, the user input 120 provides a specific keyword or topic for extracting information from the web, or provides specific labels for a dataset, in order to obtain application results. After web scraping, the data must be categorized for the classification model. The information is identified as either labeled raw data from the raw data classifier 111 or unlabeled web scraping data from the scraping data classifier 112, and data mining and analysis are performed on the unlabeled web scraping data by the data mining and analysis unit 130.

The data categorization unit 140 identifies whether the data is raw data or web scraping data.
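The routing between labeled raw data and unlabeled scraping data can be illustrated with a minimal Python sketch. The `Record` type, the `categorize` function, and the sample records are hypothetical names introduced here for illustration; the specification does not disclose an implementation of the data categorization unit 140.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    """One text item; `label` is None when the item arrived unlabeled."""
    text: str
    label: Optional[str] = None


def categorize(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split records into (labeled raw data, unlabeled scraping data)."""
    labeled = [r for r in records if r.label is not None]
    unlabeled = [r for r in records if r.label is None]
    return labeled, unlabeled


# Hypothetical sample input: one labeled record and one scraped record.
records = [
    Record("confirmed cases rise", label="health"),
    Record("people stay home this week"),
]
labeled, unlabeled = categorize(records)
```

In this sketch the labeled list would go to the supervised path and the unlabeled list to data mining and topic modeling, mirroring the two input strategies described for the RADSS model.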

The text classification and analysis unit 150 formats this data for further analysis, data preparation, and model evaluation. The application is then built, and its performance predictions and results are evaluated.

The purpose of the text classification and analysis unit 150 is to route information to supervised or unsupervised learning so that a given data sample yields the desired output, and to show the relationship between the input and the visual information of the output. After information containing raw data is scraped, the text classification and analysis unit 150 receives the information as either a specified dataset or mined data. The RADSS model then performs the evaluation and provides a prediction or output. The most important tasks in unsupervised learning are clustering, representation learning, and density estimation. Here, however, the dataset is prepared by topic modeling with multi-class text classification, in which the data-wrangling classification 151 first produces the labeled data 152 and then proceeds to model evaluation. If the user input contains a labeled corpus, the text classifier sends it for supervised processing. The model has initial information that determines what the samples should look like in the output. The text can therefore be learned, and binary or multi-class labeled data 153 classification needs to be applied. The goal of classification is to infer the natural structure or hierarchy that the data points represent [9].

After topic modeling 155 is performed on the multi-class labeled data 153, feature engineering 154 extracts features together with the labeled data 152. Model evaluation 157 is then performed after dimensionality reduction 156. After model evaluation, the user obtains the desired predictions and results 170 from the decision classification unit 160 through several decision-making graph visualizations 162 and the information output 161 of the chatbot application. Hereinafter, each component of the AI-based decision-making support system utilizing dynamic text sources will be described in more detail.

The AI-based decision-making support system utilizing dynamic text sources according to an embodiment of the present invention proposes the RADSS model for information and DMS systems in text classification. The features of the present invention are as follows: a Filter Cleaning Text (FCT) methodology for grouping scraped data; a Word Generative Probabilistic (WGP) method for highest word frequency label selection; and a context-based chatbot application built on the scraped dataset.

A data mining process distinguishes patterns and connections in information according to the data that users request or provide. Businesses use this process to turn raw data into useful information. The data mining process is divided into five steps. First, an organization collects data and loads it into a data store. Next, the data is stored and managed on an internal server or in the cloud. Business analysts, management teams, and information technology professionals then access the data and determine how to organize it. Application software then sorts the data according to the results the user wants. Finally, the end user presents the data in an easy-to-share format such as a graph or table [10].

The data mining and analysis unit 130 according to an embodiment of the present invention performs the same task, but through a different process. In the system according to an embodiment of the present invention, the data mining and analysis unit 130 extracts raw data based on the user's keywords. Decision classification then produces a clean dataset in which the information is measured by subjectivity and polarity for positive, negative, or neutral analysis. Finally, Filter Cleaning Text (FCT) produces the clean data in which the information has been organized.

FIG. 2 is a diagram showing data extracted into specific columns based on user keywords according to an embodiment of the present invention.

In the RADSS model evaluation shown in FIG. 2, the scraping data classifier illustrates the extraction of Twitter data by the user. According to an embodiment of the present invention, the Twitter API allows complex queries, such as pulling every tweet about a specific keyword or a keyword mentioned by a user within the last 20 minutes, months, or years, or pulling a specific user's non-retweeted tweets [11]. The scraping data classifier of the present invention analyzes tweets to determine how the general public receives a topic. The scraping data classifier collects the last 2,000 tweets that mention a particular topic.
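The selection rule described above (the most recent tweets that mention a topic, optionally without retweets) can be sketched in Python over records that have already been downloaded. This is a minimal sketch: it deliberately makes no call to the Twitter API, and the function name `latest_matching`, the dictionary keys, and the sample tweets are hypothetical.

```python
def latest_matching(tweets, keyword, limit=2000, include_retweets=False):
    """Return up to `limit` of the most recent tweets whose text mentions `keyword`."""
    keyword = keyword.lower()
    hits = [
        t for t in tweets
        if keyword in t["text"].lower()
        and (include_retweets or not t["is_retweet"])
    ]
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    hits.sort(key=lambda t: t["created_at"], reverse=True)
    return hits[:limit]


# Hypothetical, already-downloaded records.
tweets = [
    {"id": 1, "created_at": "2020-03-01T10:00:00", "text": "COVID-19 cases confirmed", "is_retweet": False},
    {"id": 2, "created_at": "2020-03-01T11:00:00", "text": "RT: covid-19 update", "is_retweet": True},
    {"id": 3, "created_at": "2020-03-01T12:00:00", "text": "Weather is nice", "is_retweet": False},
    {"id": 4, "created_at": "2020-03-02T09:00:00", "text": "New covid-19 guidance", "is_retweet": False},
]
recent = latest_matching(tweets, "covid-19", limit=2000)
```

With the sample data, the retweet and the off-topic tweet are dropped and the remaining two are returned newest first.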

For example, for this dataset, COVID-19 data was extracted from users with data fields that include columns for ID, creation time, source, original text, favorite_count, retweet_count, original_author, hashtags, and user mentions, and a sentiment analysis algorithm was then run on the data. Users residing in a specific location can also be targeted, which is known as spatial data. Another application could be to map the areas of the world where a topic is mentioned most. Twitter data can be a gateway to how the general public receives information on a topic, and this, combined with the openness and generous rate limits of the Twitter API, can produce powerful results [12].

FIG. 3 is a diagram showing columns of a clean dataset in which information is measured by subjectivity and polarity for analysis according to an embodiment of the present invention.

The text extracted by the scraping data classifier is noisy data. The specific columns most needed for analysis must therefore contain clean data. In the proposed RADSS, Filter Cleaning Text (FCT) according to an embodiment of the present invention cleans the data by handling emoticons and emoji as well as many regular expressions and stop words. Sentiment analysis is an automated process that identifies and extracts the information underlying a text [7]. This may be an opinion, judgment, or feeling about a particular topic or subject. The most common type of sentiment analysis is polarity detection, which classifies a statement as positive, negative, or neutral. The program has two functions. One scores a tweet's subjectivity (how subjective or opinionated the text is: a score of 0 indicates fact, and a score of +1 indicates a highly subjective opinion), and the other scores a tweet's polarity (how positive or negative the text is: a score of -1 is the most negative, +1 is the most positive, and 0 is a neutral statement). After the data is analyzed, FCT provides structured columns that are used further in model evaluation and results.
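A cleaning step of the kind attributed to FCT can be sketched with Python's standard `re` module. The specification does not disclose the actual patterns or stop-word list, so the regular expressions, the tiny `STOP_WORDS` set, and the function name `filter_clean_text` below are all assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"[@#]\w+")      # drops @mentions and #hashtags entirely
NON_ALPHA_RE = re.compile(r"[^a-z\s]")   # drops digits, punctuation, emoticons, emoji


def filter_clean_text(text):
    """Lowercase the text, strip URLs, mentions, hashtags and symbols, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = filter_clean_text("The cases are rising :( 😷 @who https://example.com #covid")
```

Removing hashtags outright is a design choice of this sketch; a pipeline that treats hashtags as topical signal might instead keep the word and strip only the `#` character.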

FIG. 4 is a diagram showing the words in the data visualized using a word cloud according to an embodiment of the present invention.

The best way to carry out model evaluation is to understand the common words from the displayed word cloud. A word cloud (also called a text cloud or tag cloud) is a type of visualization. The more often a specific word appears in the text, the larger and bolder it appears in the word cloud [13]. With this type of visualization, the RADSS model can determine the words that occur most frequently in the corpus. FIG. 4 shows that the most frequently used words are China, cases, people, confirmed, coronavirus, and so on, which indicates a model from which the information was extracted perfectly. Table 1 shows the neutral, positive, and negative data identified by the scraping data analyzer.
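The quantity that a word cloud visualizes is the per-word frequency across the corpus, which can be computed with the standard library alone. The following is a minimal sketch; the function name `top_words` and the three sample documents are hypothetical and are not data from the specification.

```python
from collections import Counter


def top_words(documents, n=5):
    """Count whitespace-separated words across all documents and return the n most common."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts.most_common(n)


# Hypothetical, already-cleaned documents.
docs = [
    "china cases confirmed",
    "people confirmed cases",
    "coronavirus cases china",
]
ranking = top_words(docs, n=3)
```

A word cloud renderer would then map each count to a font size, so that the highest-ranked words are drawn largest.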

<Table 1>

Table 1 gives the counts of positive, negative, and neutral news items in the data.
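Counts of this kind can be produced by mapping each polarity score to a label and tallying the labels. The sketch below assumes the simple sign rule implied by the polarity scale (negative below 0, neutral at 0, positive above 0); the scores are made-up values, not the data behind Table 1.

```python
from collections import Counter


def label_polarity(score: float) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


# Hypothetical polarity scores for five news items.
scores = [0.0, 0.25, -0.4, 0.0, 0.8]
counts = Counter(label_polarity(s) for s in scores)
print(counts["Neutral"], counts["Positive"], counts["Negative"])
```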

FIG. 5 shows a scatter plot and a bar graph of the sentiment analysis data according to an embodiment of the present invention.

FIG. 5(a) is a scatter plot of polarity versus subjectivity, and FIG. 5(b) is a bar graph showing the sentiment analysis results.

Most of the data appear neutral because most points cluster around the middle polarity value of 0.00. The overall sentiment distribution takes its values from this analysis.

The text classification and analysis unit 150 according to an embodiment of the present invention uses artificial intelligence (AI), machine learning (ML), and deep learning (DL) to provide machines with data for statistical pattern recognition. Without a learning model algorithm, a machine cannot analyze performance or carry out the evaluation process. The text classification proposed in the present invention uses both ML and DL approaches and builds an application together with an evaluation of the results. In the proposed approach, data is extracted from a source that generates unlabeled data. Data extraction turns unlabeled corpus data, which carries no pre-recorded label information, into labeled data. The raw data is classified to determine the intent of the dataset. Once data extraction has begun, the algorithm learns from the labeled data [14]. After understanding the intent, the algorithm looks for a way to associate new data with the learned patterns. For this reason, several specialized techniques are used to produce clean data from the raw dataset.

In the data wrangling process, Natural Language Processing (NLP) offers several operations for processing text, such as word and sentence tokenization, removal of stop words and capitalization, noise removal, spelling correction, stemming, and lemmatization.

FIG. 6 is a diagram showing a sentence distribution graph of the documents according to an embodiment of the present invention.

Multi-class labeling according to an embodiment of the present invention clusters documents into topics for topic modeling so that a large amount of text can be analyzed efficiently. Because the corpus is a large body of unlabeled text data, the labeling approaches described earlier cannot be applied directly to build ML or DL models for such a dataset. When the data is unlabeled, the labels must be discovered. For text data, document clusters are grouped by topic. Latent Dirichlet Allocation (LDA), an unsupervised generative probabilistic method for modeling a corpus, is the most commonly used topic modeling method [15]. It assumes that each document can be represented as a probability distribution over latent topics and that the topic distributions of all documents share a common Dirichlet prior. Each latent topic in the LDA model is in turn represented as a probability distribution over words, and the word distributions of the topics also share a common Dirichlet prior. Given a corpus D consisting of L documents, with document d having $N_d$ words ($d \in \{1, \dots, L\}$), the LDA model models D according to the following generative process:

(1) Choose a multinomial distribution $\varphi_t$ for each topic t from a Dirichlet distribution with parameter $\beta$,

(2) Choose a multinomial distribution $\theta_d$ for each document d ($d \in \{1, \dots, L\}$) from a Dirichlet distribution with parameter $\alpha$, and

(3) For each word $w_n$ ($n \in \{1, \dots, N_d\}$) in document d, select a topic $z_n$ from $\theta_d$ and then select the word $w_n$ from $\varphi_{z_n}$.

In the above generative process, the words in the documents are the only observed variables; the others are latent variables ($\varphi$ and $\theta$) and hyperparameters ($\alpha$ and $\beta$). To infer the latent variables and hyperparameters, the probability of the observed data D is computed and maximized as follows.

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{L} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d \qquad (1)$$
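Steps (1) to (3) of the generative process can be sketched as a small sampler. The code below is an illustration using only the Python standard library; the function names, the symmetric Dirichlet priors, the fixed document length, and the toy vocabulary are all assumptions, and it generates synthetic documents rather than fitting a model to data.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, doc_len, num_topics, vocab, alpha, beta, seed=0):
    """Sample a toy corpus by following steps (1) to (3) of the generative process."""
    rng = random.Random(seed)
    # Step (1): one word distribution per topic, drawn from Dirichlet(beta).
    topic_word = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Step (2): one topic distribution per document, drawn from Dirichlet(alpha).
        doc_topic = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_len):
            # Step (3): pick a topic for this position, then a word from that topic.
            z = rng.choices(range(num_topics), weights=doc_topic)[0]
            words.append(rng.choices(vocab, weights=topic_word[z])[0])
        corpus.append(words)
    return corpus


vocab = ["china", "cases", "test", "spread"]
toy = generate_corpus(num_docs=3, doc_len=5, num_topics=2, vocab=vocab, alpha=0.5, beta=0.5)
for doc in toy:
    print(" ".join(doc))
```

Inference runs this process in reverse: given only the words, it estimates the per-topic and per-document distributions that make the observed corpus most probable.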

In the present invention, the Covid-19 dataset was divided into seven topic classes based on the document similarity of the unstructured raw data.

In FIG. 6, the sentence distribution graph of the documents shows that topic 5 contains the most sentences from the entire corpus. In contrast, topic 6 has the least data among the classes.

In the Word Generative Probabilistic (WGP) method for topic modeling according to an embodiment of the present invention, LDA assumes that documents are produced from a mixture of topics [16]. Those topics then generate words according to their probability distributions. Given a dataset of documents, LDA backtracks and tries to work out which topics define those documents. LDA is a matrix factorization technique. In vector space, a corpus can be represented as a document-term matrix. The following matrix shows a corpus of O documents $D_1, D_2, D_3, \dots, D_O$ with a vocabulary of F words $W_1, W_2, \dots, W_F$. The value of cell (i, j) gives the frequency count of word $W_j$ in document $D_i$. LDA converts this document-term matrix into two lower-dimensional matrices, F1 and F2. F1 is the document-topic matrix and F2 is the topic-term matrix, with dimensions (O, G) and (G, F) respectively, where O is the number of documents, G is the number of topics, and F is the vocabulary size, as shown in Table 2.

<Table 2>
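To make the document-term matrix concrete, the sketch below builds one from a toy corpus and prints the shapes that the document-topic and topic-term matrices would have. The function name, the three example documents, and the choice of two topics are assumptions made only for illustration.

```python
def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for word in doc.split():
            matrix[i][index[word]] += 1
    return vocabulary, matrix


docs = ["china cases cases", "test spread", "china test"]
terms, dtm = document_term_matrix(docs)
num_topics = 2  # number of topics, chosen arbitrarily for this illustration

print(terms)
for row in dtm:
    print(row)
# The document-topic matrix has one row per document and one column per topic;
# the topic-term matrix has one row per topic and one column per vocabulary word.
print("document-topic shape:", (len(docs), num_topics))
print("topic-term shape:", (num_topics, len(terms)))
```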

LDA iterates over each word w in each document d and tries to replace the current topic-word assignment with a new one. A new topic is assigned to word w with a probability P that is the product of two probabilities, p1 and p2. For every topic t, the probabilities p1 and p2 are calculated as follows [17]:

p1 = p(t | d) = the proportion of words in document d that are currently assigned to topic t.

p2 = p(w | t) = the proportion of assignments to topic t, over all documents, that come from word w.
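The two proportions can be computed directly from a table of current word-topic assignments. The sketch below follows the plain definitions of p1 and p2 given above; practical samplers usually also add smoothing terms and exclude the word being reassigned, which is omitted here. The function name and the toy assignments are hypothetical.

```python
def topic_scores(assignments, doc_index, word, num_topics):
    """Return normalized p1 * p2 over topics for `word` in document `doc_index`.

    `assignments` is a list of documents, each a list of (word, topic) pairs.
    """
    doc = assignments[doc_index]
    scores = []
    for t in range(num_topics):
        # p1 = p(t | d): share of words in this document assigned to topic t.
        p1 = sum(1 for _, topic in doc if topic == t) / len(doc)
        # p2 = p(w | t): share of all assignments to topic t that come from `word`.
        in_topic = [w for d in assignments for w, topic in d if topic == t]
        p2 = in_topic.count(word) / len(in_topic) if in_topic else 0.0
        scores.append(p1 * p2)
    total = sum(scores)
    return [s / total for s in scores] if total else scores


# Toy state: two documents with current (word, topic) assignments.
state = [
    [("china", 0), ("cases", 0), ("test", 1)],
    [("test", 1), ("spread", 1), ("china", 1)],
]
print(topic_scores(state, doc_index=0, word="china", num_topics=2))
```

For the word "china" in the first document, topic 0 scores (2/3)(1/2) and topic 1 scores (1/3)(1/4), so after normalization topic 0 is four times as likely to be chosen.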

FIG. 7 is a diagram showing LDA word frequencies per document sentence according to an embodiment of the present invention.

Referring to FIG. 7, LDA classifies the text into seven topics, and the highest-frequency words in each topic can be used to choose label names for the unlabeled dataset. The Word Generative Probabilistic (WGP) method thus yields the higher-frequency words. Table 3 lists the highest-frequency words under their assigned class names, which makes it more convenient to select the raw data for prediction.

<Table 3>

The scraped dataset contains 1735 sentences. The dataset is marked with topic names (for example, Place, Case, Media, China, Spread, Test, and Live) and topic numbers.

FIG. 8 is a diagram showing the result of a similarity check between text classes and labels according to an embodiment of the present invention.

In FIG. 8, the FCT column shows the sentences belonging to each label.

In model evaluation according to an embodiment of the present invention, text and documents are unstructured datasets. This unstructured data must be converted into a structured feature space when mathematical modeling is used as part of classification. First, the data must be cleaned of unnecessary characters and words. After cleaning, a formal feature extraction method is applied. The techniques most often used for feature extraction are TF-IDF and Word2Vec.
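As a minimal sketch of TF-IDF feature extraction, the function below weights each word by its relative frequency in a document times the logarithm of the inverse document frequency. This is the textbook unsmoothed variant, chosen here as an assumption; library vectorizers often use smoothed or normalized forms, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


features = tf_idf(["china cases cases", "china test"])
print(features[0])
```

A word that appears in every document, such as "china" in this toy corpus, receives a weight of zero, which is why TF-IDF tends to suppress uninformative common words.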

For dimensionality reduction according to an embodiment of the present invention, stop words are removed and thresholds are applied in the TF-IDF vectorizer, but a large number of unique words still remain, most of which are unnecessary and some of which are redundant. Latent Semantic Analysis (LSA), a dimensionality reduction technique, is therefore also applied [18]. LSA uses Singular Value Decomposition (SVD), specifically truncated SVD, to reduce the number of dimensions and select the most informative ones.
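The idea behind truncated SVD can be illustrated with the simplest case, keeping only the single largest singular direction. The sketch below approximates that direction by power iteration in plain Python; it is a rank-1 illustration under the assumption of a small, nonzero, non-negative matrix, not the multi-component routine a numerical library would provide. The function name and the toy matrix are hypothetical.

```python
import math


def top_singular_triplet(matrix, iterations=100):
    """Approximate the largest singular value and its vectors by power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # Multiply by the matrix, then by its transpose, and renormalize.
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v


# Toy weight matrix: 3 documents x 2 terms (hypothetical values).
weights = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
u, sigma, v = top_singular_triplet(weights)
# One reduced coordinate per document: its projection onto the leading direction.
doc_coords = [sum(row[j] * v[j] for j in range(len(v))) for row in weights]
print(round(sigma, 6), [round(x, 6) for x in doc_coords])
```

Each document is reduced from one value per term to a single coordinate; keeping k directions instead of one gives the k-dimensional representation used in LSA.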

For ML model selection, a variety of algorithms were chosen and compared using their default parameters [19]. The main caveat is that an algorithm may perform poorly out of the box yet perform well with the right hyperparameters. Even so, this process gives a reasonable first understanding of which kinds of algorithms (for example, Random Forest, AdaBoost, stochastic gradient descent, KNN, Gaussian Naive Bayes, and decision trees) naturally work better on the data [20]. Six separate algorithms were chosen and tried alongside the dummy classifier from Sklearn (a Python library), which predicts at chance level and serves only as a baseline. Accuracy, precision, recall, and F1 score are the metrics used to evaluate the algorithms.
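The four metrics can be computed from true and predicted labels as sketched below. Macro averaging over classes is an assumption, since the text does not say how per-class scores are combined, and the majority-class predictor is only a simple stand-in for a dummy baseline. The labels in the example are invented.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def majority_baseline(y_train, n_items):
    """Predict the most frequent training label for every item."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n_items


y_true = ["China", "Case", "China", "Test"]
y_pred = ["China", "China", "China", "Test"]
print(macro_scores(y_true, y_pred))
print(macro_scores(y_true, majority_baseline(y_true, len(y_true))))
```

A candidate algorithm is only worth tuning further if its scores clearly exceed those of the baseline on the same data.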

In the DL approach according to an embodiment of the present invention, various methods must be explored to see how they perform on the dataset. The data source here is a relatively small dataset, which is the reason for choosing a Recurrent Neural Network (RNN) with an LSTM architecture [21]. For very large datasets, there are many other approaches, such as TextCNN and bidirectional RNNs (LSTM/GRU). LSTM was designed to overcome the problems of the basic RNN by allowing the network to store information in a memory that it can access later. It is a special kind of RNN that can learn long-term dependencies. The key to the LSTM is the cell state, the horizontal line running through the top of the cell diagram [22]. The cell state is updated twice, with few computations, which helps keep the gradients stable. The LSTM also has a hidden state, which acts like a short-term memory.
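The roles of the cell state and the hidden state can be seen in a single LSTM step. The sketch below uses scalar inputs and states to keep the arithmetic visible; real layers use vectors and weight matrices, and the parameter layout (`params` as a dict of `(w, u, b)` triples) is an assumption made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step with scalar input, hidden state, and cell state."""

    def gate(name, activation):
        w, u, b = params[name]
        return activation(w * x + u * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much of the candidate to write
    g = gate("candidate", math.tanh)  # proposed new content
    o = gate("output", sigmoid)       # how much of the cell state to expose
    # The cell state is touched twice: scaled by the forget gate, then added to.
    c = f * c_prev + i * g
    # The hidden state is the short-term output read from the cell state.
    h = o * math.tanh(c)
    return h, c


zero_params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, params=zero_params)
print(h, c)
```

Because the cell state passes through only one multiplication and one addition per step, information and gradients can flow across many time steps more easily than in a basic RNN.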

FIG. 9 is a diagram of the RADSS chatbot functional model, showing the context in which an Averaging Context Encoder (ACE) encodes the input Xs and aggregates the output Yt. The training input Hs layers of the RNN and the ACE are multiplied element-wise just before being fed to the attention Ht layer, and the result is finally decoded into the output Yt layer. A finite state machine uses the intent model input to select a specific generative model for text generation. Each model generates text according to the intent, and the process repeats until the conversation stops.

In providing decision-making information through a chatbot according to an embodiment of the present invention, the chatbot is a practical deployment of the information decision support system. A context-based chatbot relies on the conditions of a hyper-tuned dataset that make up a memory of the setting, description, or idea of an event and, as far as it can be fully understood, essentially all data about the user. The memory holding previous data about the user is updated progressively as the conversation proceeds. Therefore, to obtain context, states and transitions are treated as the important operations. Users turn to the chatbot to carry out actions with some intent in mind, and the chatbot recognizes these actions through intent classification. The chatbot is placed in a particular state according to the user's intent. A transition changes the intent mode of the chatbot. Transitions move the chatbot from one state to the next; they characterize the conversation and define the design of the chatbot. For a transition, the chatbot needs a large amount of data belonging to the same state, and a shortage of data makes the model harder to train. Neural networks are highly effective at this stage, where the context is learned from the injected states.
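The interplay of states, transitions, intents, and memory can be sketched as a small finite state machine. In the illustration below the intent is passed in directly, standing in for the output of an intent classifier; the class name, state names, and intent names are all hypothetical.

```python
class ContextChatbot:
    """Finite-state chatbot: classified intents drive transitions between states."""

    def __init__(self, transitions, start="idle"):
        self.transitions = transitions  # {(state, intent): next_state}
        self.state = start
        self.memory = []  # running record of (state, intent) pairs

    def step(self, intent):
        """Record the intent and move to the next state if a transition exists."""
        self.memory.append((self.state, intent))
        self.state = self.transitions.get((self.state, intent), self.state)
        return self.state


# Hypothetical states and intents for a Covid-19 information bot.
TRANSITIONS = {
    ("idle", "greet"): "greeting",
    ("greeting", "ask_cases"): "cases",
    ("cases", "ask_spread"): "spread",
    ("spread", "goodbye"): "idle",
}

bot = ContextChatbot(TRANSITIONS)
for intent in ["greet", "ask_cases", "unknown", "ask_spread"]:
    print(intent, "->", bot.step(intent))
```

An unrecognized intent leaves the state unchanged, while the memory still records it, so later turns can draw on the full conversation history.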

FIG. 10 is a diagram showing information decisions made by the chatbot application according to an embodiment of the present invention.

FIG. 10(a) shows the information decisions of a context-based chatbot with supervised learning, and FIG. 10(b) shows those of a context-based chatbot with unsupervised learning.

In FIG. 10, the Covid-19 scraping data and the Covid-19 labeled data are both tested. Both datasets produce information decisions. Because of its data length and the information it carries, the labeled data yields more meaningful information than the scraping data.

FIG. 11 is a flowchart illustrating an AI-based decision support method using dynamic text sources according to an embodiment of the present invention.

The proposed AI-based decision support method using dynamic text sources includes: a step 1110 of analyzing, through the data mining and analysis unit, raw data or scraping data based on the user's keywords and performing data mining and analysis when the data is unlabeled scraping data for unsupervised learning; a step 1120 of receiving, through the data categorization unit, the labeled raw data and the unlabeled scraping data from the data mining and analysis unit and identifying the raw data and the scraping data; a step 1130 of extracting, through the text classification and analysis unit, data from the source of the unlabeled scraping data, converting it into labeled data, and performing data-wrangling extraction and model evaluation on the labeled data; and a step 1140 of providing, through the decision-making and classification unit after the model evaluation in the text classification and analysis unit, prediction results by visualizing a plurality of decision-making graphs and outputting information through a chatbot application.

In step 1110, the data mining and analysis unit analyzes raw data or scraping data based on the user's keywords and, when the data is unlabeled scraping data for unsupervised learning, performs data mining and analysis.

At this time, the scraping data classifier extracts data based on the user's keywords, Filter Cleaning Text (FCT) cleans the data by handling emoticons, emoji signs, regular expressions, and stop words for sentiment analysis and decision-making, and sentiment analysis evaluates the subjectivity and polarity of the text. After the data is analyzed, FCT provides structured columns.

In step 1120, the data categorization unit receives the labeled raw data and the unlabeled scraping data from the data mining and analysis unit and identifies the raw data and the scraping data.

In step 1130, the text classification and analysis unit extracts data from the source of the unlabeled scraping data, converts it into labeled data, and performs data-wrangling extraction and model evaluation on the labeled data. In this step, learning is performed using artificial intelligence, machine learning (ML), and deep learning (DL), and labeled data is generated through data wrangling and classification. When data extraction from the source of unlabeled data begins, documents are clustered by topic for topic modeling of the multi-class labeling data, and Latent Dirichlet Allocation (LDA), an unsupervised generative probabilistic method, is used to build ML and DL models for the unlabeled data and to analyze the text. Here, LDA uses matrix factorization to trace back the topics that define each document, converting the document-term matrix into a document-topic matrix and a topic-term matrix, and calculates the proportion of words in each document assigned to the current topic and the proportion of assignments to each topic, over all documents, that come from each word. Then, words whose frequency is at or above a predetermined threshold are obtained through the Word Generative Probabilistic (WGP) method for topic modeling.

In step 1140, after the model evaluation in the text classification and analysis unit, the decision-making and classification unit provides prediction results by visualizing a plurality of decision-making graphs and outputting information through a chatbot application.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

<참고 문헌><References>

[1] Umar, S.; Maryam, M.; Azhar, F.; Malik, S.; Samdani, G. Sentiment Analysis Approaches and Applications: A Survey. International Journal of Computer Applications, IJCA: 2018, volume: 181, pp. 1-9, doi: 10.5120/ijca2018916630.[1] Umar, S.; Maryam, M.; Azhar, F.; Malik, S.; Samdani, G. Sentiment Analysis Approaches and Applications: A Survey. International Journal of Computer Applications, IJCA: 2018, volume: 181, pp. 1-9, doi: 10.5120/ijca2018916630.

[2] Ochoa, X.; Duval, E.; Quantitative Analysis of User-Generated Content on the Web. First International Workshop on Understanding Web Evolution (WebEvolve2008), China 2008.[2] Ochoa, X.; Duval, E.; Quantitative Analysis of User-Generated Content on the Web. First International Workshop on Understanding Web Evolution (WebEvolve2008), China 2008.

[3] Imran, M.; Castillo, C.; Lucas, J.; Meier, P.; Vieweg, S.; AIDR: artificial intelligence for disaster response. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14 Companion), Association for Computing Machinery, NY, USA, 2014, pp. 159-162.[3] Imran, M.; Castillo, C.; Lucas, J.; Meier, P.; Vieweg, S.; AIDR: artificial intelligence for disaster response. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14 Companion), Association for Computing Machinery, NY, USA, 2014, pp. 159-162.

[4] Imran, M.; Lykourentzou, I.; Castillo,C. Engineering crowdsourced stream processing systems. arXiv 2013, arXiv:1310.5463.[4] Imran, M.; Lykourentzou, I.; Castillo, C. Engineering crowdsourced stream processing systems. arXiv 2013, arXiv:1310.5463.

[5] Daud, A. Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of computer science in China, 2010, pp. 280-301.[5] Daud, A. Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of computer science in China, 2010, pp. 280-301.

[6] Dang, C.; Moreno, G.; Maria, N.; Fernando, D. L. P. Sentiment Analysis Based on Deep Learning: A Comparative Study. 2020, Electronics. 9. 483. 10.3390/electronics9030483. [6] Dang, C.; Moreno, G.; Maria, N.; Fernando, D. L. P. Sentiment Analysis Based on Deep Learning: A Comparative Study. 2020, Electronics. 9. 483. 10.3390/electronics9030483.

[7] Twitter Sentiment Analysis with Machine Learning. Available online: https://monkeylearn.com/blog/ sentiment-analysis-of-twitter/ (accessed on 07052020).[7] Twitter Sentiment Analysis with Machine Learning. Available online: https://monkeylearn.com/blog/ sentiment-analysis-of-twitter/ (accessed on 07052020).

[8] Skrlj, B.; Kralj, J.; Lavrac, N.; Pollak, S.; Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture. Machine Learning and Knowledge Extraction 2019, pp. 575-589. doi:10.3390/make1020034.[8] Skrlj, B.; Kralj, J.; Lavrac, N.; Pollak, S.; Towards Robust Text Classification with Semantics-Aware Recurrent Neural Architecture. Machine Learning and Knowledge Extraction 2019, pp. 575-589. doi:10.3390/make1020034.

[9] Krendzelak, M.; Jakab, F. Text categorization with machine learning and hierarchical structures.　2015 13th International Conference on Emerging eLearning Technologies and Applications (ICETA), Stary Smokovec, 2015, pp. 1-5, doi: 10.1109/ICETA.2015.7558486.[9] Krendzelak, M.; Jakab, F. Text categorization with machine learning and hierarchical structures. 2015 13th International Conference on Emerging eLearning Technologies and Applications (ICETA), Stary Smokovec, 2015, pp. 1-5, doi: 10.1109/ICETA.2015.7558486.

[10] Sahu, H.; Shrma, S.; Gondhalakar, S. A Brief Overview on Data Mining Survey. 2011.[10] Sahu, H.; Shrma, S.; Gondhalakar, S. A Brief Overview on Data Mining Survey. 2011.

[11] Cuesta, A.; Barrero, D. F.; María, D. R-M. A framework for massive twitter data extraction and analysis. Malaysian Journal of Computer Science, 27, pp. 50-67.

[12] Twitter Data mining: A guide to Big Data Analytics using python. Available online: https://www.toptal.com/python/twitter-data-mining-using-python (accessed on 06072020).

[13] Heimerl, F.; Lohmann, S.; Lange, S.; Ertl, T. Word Cloud Explorer: Text Analytics Based on Word Clouds. 2014 47th Hawaii International Conference on System Sciences, IEEE: Waikoloa, HI, 2014, pp. 1833-1842, doi: 10.1109/HICSS.2014.231.

[14] Shang, W.; Dong, H. Z.; Wang, Y. A novel feature weight algorithm for text categorization. 2008 International Conference on Natural Language Processing and Knowledge Engineering, IEEE: Beijing, 2008, pp. 1-7, doi: 10.1109/NLPKE.2008.4906817.

[15] Blei, D. M.; Ng, A. Y.; Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

[16] Arun, R.; Suresh, V.; Veni, M. C. E.; Narasimha Murthy, M. N. On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations. In Advances in Knowledge Discovery and Data Mining, Zaki M. J., Yu J. X., Ravindran B.; Publisher: Springer, Berlin, Heidelberg, 2010; volume 6118, https://doi.org/10.1007/978-3-642-13657-3_43.

[17] Beginners guide to topic modeling in python. Available online: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/ (accessed on 21072020).

[18] Christopher, D. M.; Prabhakar, R.; Hinrich S. Matrix decompositions & latent semantic indexing. Introduction to Information Retrieval, Cambridge University Press, chapter 18: pp. 403-417, 2008.

[19] Kumari, S.; Saquib, Z.; Pawar, S. Machine Learning Approach for Text Classification in Cybercrime. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 2018, pp. 1-6, doi: 10.1109/ICCUBEA.2018.8697442.

[20] Bhumika; Sukhjit, S. S.; Nayyar, A. A review Paper on algorithms used for text classifications. 2013.

[21] Staudemeyer, R. C.; Morris, E. R. Understanding LSTM - a tutorial into Long Short-Term Memory Recurrent Neural Networks, 2019, arXiv, abs/1909.09586.

[22] Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras. https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ (accessed on 05042020).

[23] Decision Support System. Available online: https://www.journals.elsevier.com/decision-support-systems (accessed on 06102020).

Claims

1. An AI-based decision support system comprising:
a data mining and analysis unit that analyzes raw data or scraping data based on a user's keywords and performs data mining and analysis when the data is unlabeled scraping data for unsupervised learning;
a data categorization unit that receives the labeled raw data and the unlabeled scraping data from the data mining and analysis unit and identifies the raw data and the scraping data;
a text classification and analysis unit that extracts data from a source of the unlabeled scraping data, converts it into labeled data, and performs data-wrangling extraction and model evaluation on the labeled data; and
a decision-making and classification unit that predicts data after the model evaluation in the text classification and analysis unit and provides prediction results through visualization of multiple decision-making graphs and information output by a chatbot application.

2. The system of claim 1,
wherein the data mining and analysis unit
extracts data based on the user's keywords through a scraping data classifier, cleans emoticons and emoji symbols and applies Filter Cleaning Text (FCT) to handle regular expressions and negations for sentiment analysis and decision-making, analyzes the data by evaluating the subjectivity and polarity of the text, and then provides the analysis as structured columns through the FCT.
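
As an illustration only, the cleaning and scoring described in the claim above can be sketched in Python. The claims do not disclose the actual FCT rules or scoring model, so the regular expressions, the toy lexicon, and the names `clean_text` and `score_text` below are all assumptions made for this sketch.

```python
import re

# Hypothetical patterns; the claims do not specify the actual FCT rules.
URL_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"[:;=][\-o*']?[)\](\[dDpP/\\|]")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
NEGATION_RE = re.compile(r"\b(not|no|never)\s+(\w+)")

# Toy lexicon (assumption): word -> (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "good": (0.7, 0.6),
    "great": (0.8, 0.75),
    "bad": (-0.7, 0.67),
    "terrible": (-1.0, 1.0),
}


def clean_text(text: str) -> str:
    """Strip URLs, emoticons, emoji and punctuation, then mark negated words."""
    text = URL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Join a negation word to the word it modifies, e.g. "not good" -> "not_good".
    text = NEGATION_RE.sub(r"\1_\2", text)
    return " ".join(text.split())


def score_text(cleaned: str) -> dict:
    """Return one structured row: cleaned text, mean polarity, mean subjectivity."""
    polarities, subjectivities = [], []
    for token in cleaned.split():
        negated = "_" in token
        word = token.split("_")[-1]
        if word in LEXICON:
            polarity, subjectivity = LEXICON[word]
            polarities.append(-polarity if negated else polarity)
            subjectivities.append(subjectivity)
    n = len(polarities)
    return {
        "text": cleaned,
        "polarity": sum(polarities) / n if n else 0.0,
        "subjectivity": sum(subjectivities) / n if n else 0.0,
    }
```

Calling `score_text(clean_text(post))` on each scraped post yields rows with the columns `text`, `polarity`, and `subjectivity`, which is one possible reading of the "structured columns" recited above.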

3. The system of claim 1,
wherein the text classification and analysis unit
performs learning using artificial intelligence, namely machine learning (ML) and deep learning (DL), and generates labeled data through data-wrangling classification, and
when extraction of data from the source of the unlabeled data begins, clusters the documents by item for topic modeling to obtain multi-class labeled data, analyzes the unlabeled data using latent Dirichlet allocation (LDA), an unsupervised probabilistic method, and creates ML and DL models to analyze the text.

4. The system of claim 3,
wherein the text classification and analysis unit
uses LDA to convert, through matrix factorization, the document-term matrix into a document-topic matrix and a topic-term matrix so as to trace back the topics that define a document, calculates the proportion of words in each document that are currently assigned to each topic and, over all documents containing a given word, the proportion of assignments to each topic, and obtains words whose frequency is greater than or equal to a predetermined threshold through the WGP (Word Generative Function) method for topic modeling.
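
The two proportions recited in the claim above can be illustrated with a small Python sketch. It is not a full LDA implementation and not the patented method: it takes a fixed, hypothetical word-to-topic assignment, builds the document-topic and topic-term counts, and computes the proportions from them. The function name `topic_proportions` and the sample data are assumptions for illustration.

```python
from collections import Counter


def topic_proportions(docs, assignments, num_topics):
    """Compute p(topic | document) and p(word | topic) from current assignments.

    docs: list of token lists.
    assignments: same shape as docs, holding one topic index per token.
    """
    doc_topic = [Counter() for _ in docs]                # document-topic counts
    topic_term = [Counter() for _ in range(num_topics)]  # topic-term counts
    for d, (tokens, topics) in enumerate(zip(docs, assignments)):
        for word, t in zip(tokens, topics):
            doc_topic[d][t] += 1
            topic_term[t][word] += 1

    # Share of each document's words currently assigned to each topic.
    p_topic_given_doc = [
        [doc_topic[d][t] / (len(docs[d]) or 1) for t in range(num_topics)]
        for d in range(len(docs))
    ]

    # Share of each topic's assignments, over all documents, that come from each word.
    p_word_given_topic = []
    for t in range(num_topics):
        total = sum(topic_term[t].values())
        p_word_given_topic.append(
            {w: c / total for w, c in topic_term[t].items()} if total else {}
        )
    return p_topic_given_doc, p_word_given_topic


# Hypothetical two-document corpus with a hand-picked topic assignment.
docs = [["flood", "rescue", "flood"], ["match", "goal", "rescue"]]
assignments = [[0, 0, 0], [1, 1, 0]]
p_td, p_wt = topic_proportions(docs, assignments, num_topics=2)
```

In the commonly described sampling formulation of LDA, these two quantities are multiplied to decide which topic a word is reassigned to on each pass; the sketch stops at computing them for a single fixed assignment.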

5. An AI-based decision support method comprising:
analyzing, through a data mining and analysis unit, raw data or scraping data based on a user's keywords, and performing data mining and analysis when the data is unlabeled scraping data for unsupervised learning;
receiving, through a data categorization unit, the labeled raw data and the unlabeled scraping data from the data mining and analysis unit and identifying the raw data and the scraping data;
extracting, through a text classification and analysis unit, data from a source of the unlabeled scraping data, converting it into labeled data, and performing data-wrangling extraction and model evaluation on the labeled data; and
predicting, through a decision-making and classification unit, data after the model evaluation in the text classification and analysis unit, and providing prediction results through visualization of multiple decision-making graphs and information output by a chatbot application.

6. The method of claim 5,
wherein the step of analyzing raw data or scraping data based on the user's keywords through the data mining and analysis unit and performing data mining and analysis when the data is unlabeled scraping data for unsupervised learning comprises:
extracting data based on the user's keywords through a scraping data classifier, cleaning emoticons and emoji symbols and applying Filter Cleaning Text (FCT) to handle regular expressions and negations for sentiment analysis and decision-making, analyzing the data by evaluating the subjectivity and polarity of the text, and then providing the analysis as structured columns through the FCT.

7. The method of claim 5,
wherein the step of extracting data from the source of the unlabeled scraping data through the text classification and analysis unit, converting it into labeled data, and performing data-wrangling extraction and model evaluation on the labeled data comprises:
performing learning using artificial intelligence, namely machine learning (ML) and deep learning (DL), and generating labeled data through data-wrangling classification; and
when extraction of data from the source of the unlabeled data begins, clustering the documents by item for topic modeling to obtain multi-class labeled data, analyzing the unlabeled data using latent Dirichlet allocation (LDA), an unsupervised probabilistic method, and creating ML and DL models to analyze the text.

8. The method of claim 7,
wherein LDA is used to convert, through matrix factorization, the document-term matrix into a document-topic matrix and a topic-term matrix so as to trace back the topics that define a document, and the proportion of words in each document that are currently assigned to each topic and, over all documents containing a given word, the proportion of assignments to each topic are calculated.