KR20200075120A

KR20200075120A - Business default prediction system and operation method thereof

Info

Publication number: KR20200075120A
Application number: KR1020180159763A
Authority: KR
Inventors: 윤덕찬; 천숙연; 데히야 와순다라; 서원영
Original assignee: 지속가능발전소 주식회사
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2020-06-26
Also published as: KR102168198B1; CN111557011A; US11481707B2; JP2020095693A; US20200193340A1; WO2020122487A1; JP6783002B2; EP3726441A1; EP3726441A4

Abstract

Disclosed are a corporate default prediction system and an operation method thereof. According to various embodiments of the present invention, a method of predicting the default risk of a company may comprise the steps of: collecting a plurality of news articles on the Internet; selecting a company to be analyzed; classifying, among the plurality of collected news articles, news articles related to the analysis target company as analysis target articles; calculating a risk level for each of the analysis target articles; grouping the analysis target articles based on the calculated risk levels and generating feature vectors representing each group; and calculating the default risk of the analysis target company based on the generated feature vectors.

Description

Business default prediction system and operation method thereof {BUSINESS DEFAULT PREDICTION SYSTEM AND OPERATION METHOD THEREOF}

Various embodiments of the present invention relate to a corporate default prediction system and a method of operating the same, and more particularly to a system that evaluates a company's default risk based on the analysis of news data.

In modern society, the largest and most important economic actors are companies. Every year, numerous companies are created and dissolved, with significant economic effects on the associated individuals, other companies, and even the nation. Analyzing the rise and fall of companies is therefore fundamental to analyzing not only the industry to which a company belongs but also industry as a whole. Companies cease to exist for various reasons; when a company disappears through default, its employees, investors, and trading partners are all severely affected.

Accordingly, research on predicting the likelihood of default for individual companies has been conducted steadily. In general, a company's likelihood of default has been predicted through indicators such as the debt ratio and the interest coverage ratio, the latter showing how well a company can cover its interest expenses from its operating profit. Quantitatively obtainable financial data have thus been the main input for analyzing and predicting the likelihood of default, but methodologies that analyze a company's risk based on non-financial data have recently come to the fore.

Financial data disclosed by companies may fail to reflect information that is unfavorable to the company, and the reliability of financial reports provided by companies has at times been called into question. Accordingly, various methodologies for analyzing non-financial data have emerged, and news article data, which among non-financial data are available in quantities sufficient for analysis, have come into primary use.

News article data are plentiful enough to be analyzed, but it has been difficult to identify which company a given news article concerns and to decide how to relate that article to default risk for analysis.

Korean Registered Patent No. 10-1599675

Various embodiments of the present invention aim to provide a method for predicting the default risk of a specific company based on a plurality of news articles.

Various embodiments of the present invention aim to improve the accuracy of default risk prediction through various machine learning algorithms and analysis methods.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method of predicting the default risk of a company according to various embodiments of the present invention for solving the above-mentioned problems may comprise the steps of: collecting a plurality of news articles on the Internet; selecting a company to be analyzed; classifying, among the plurality of collected news articles, news articles related to the analysis target company as analysis target articles; calculating a risk level for each of the analysis target articles; grouping the analysis target articles based on the calculated risk levels and generating feature vectors representing each group; and calculating the default risk of the analysis target company based on the generated feature vectors.

The step of calculating the risk level for each of the analysis target articles may comprise adopting a specific machine learning algorithm, performing regression or categorization analysis on the collected news articles using the adopted machine learning algorithm, and then calculating the risk level for each of the analysis target articles using a risk calculation algorithm derived through the regression or categorization analysis.

The method of predicting the default risk of a company may further comprise, in performing the regression or categorization analysis on the collected news articles, selecting as analysis subjects only those news articles that, among the news articles related to a company that has defaulted, were published within a certain period of time from that company's default.
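The time-window selection described above can be illustrated with a minimal sketch. The article dictionary layout, the `select_training_articles` name, the 365-day window, and the choice to count the window backwards from the default date are all assumptions made for illustration; the text only says "within a certain period of time".

```python
from datetime import date, timedelta


def select_training_articles(articles, default_dates, window_days=365):
    """Keep articles about defaulted companies published shortly before the default.

    articles: list of dicts with hypothetical keys "company" and "published".
    default_dates: maps a company name to the date on which it defaulted.
    window_days: assumed window length; the source does not specify one.
    """
    window = timedelta(days=window_days)
    selected = []
    for article in articles:
        default_date = default_dates.get(article["company"])
        if default_date is None:
            continue  # company has no recorded default
        if default_date - window <= article["published"] <= default_date:
            selected.append(article)
    return selected


articles = [
    {"company": "A", "published": date(2018, 3, 1)},
    {"company": "A", "published": date(2015, 1, 1)},
    {"company": "B", "published": date(2018, 5, 1)},
]
default_dates = {"A": date(2018, 6, 30)}
recent = select_training_articles(articles, default_dates)
```

Here only the March 2018 article about company A survives: the 2015 article is too old, and company B has no recorded default.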

The step of classifying, among the plurality of collected news articles, news articles related to the analysis target company as analysis target articles may further comprise: selecting news articles that contain the name of the analysis target company; and determining, for each of the selected news articles, whether the article is related to the analysis target company.

The step of determining, for each of the selected news articles, whether the article is related to the analysis target company may further comprise identifying the context or topic of each selected news article and determining whether the identified context or topic is associated with information on the analysis target company.

The step of grouping the analysis target articles based on the calculated risk levels and generating feature vectors representing each group may further comprise dividing the risk level into a plurality of intervals, based on the risk levels calculated for the analysis target articles, and grouping the analysis target articles accordingly.

The step of dividing the risk level into a plurality of intervals may be characterized in that the division is performed differently depending on the type of industry to which the analysis target company belongs.
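As a rough illustration of industry-dependent intervals, the sketch below looks up a list of cut points per industry and returns the index of the interval into which a risk level falls. The industry names, the cut-point values, and the assumption that risk levels lie between 0 and 1 are hypothetical; the text does not give concrete values.

```python
from bisect import bisect_right

# Hypothetical cut points per industry (not taken from the source).
INDUSTRY_BOUNDARIES = {
    "construction": [0.3, 0.6, 0.8],
    "retail": [0.25, 0.5, 0.75],
}
DEFAULT_BOUNDARIES = [0.25, 0.5, 0.75]


def risk_interval(risk_level, industry):
    """Return the 0-based index of the interval containing risk_level."""
    boundaries = INDUSTRY_BOUNDARIES.get(industry, DEFAULT_BOUNDARIES)
    return bisect_right(boundaries, risk_level)


same_score_construction = risk_interval(0.55, "construction")  # interval 1
same_score_retail = risk_interval(0.55, "retail")  # interval 2
```

The same article-level score therefore lands in different groups depending on the industry of the company being analyzed.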

The step of calculating the default risk of the analysis target company based on the generated feature vectors may further comprise classifying the feature vectors of the grouped analysis target articles through a decision tree algorithm and then calculating a default risk prediction value for the analysis target company based on the classification result.

The step of classifying the feature vectors of the grouped analysis target articles through the decision tree algorithm may further comprise dividing the data containing the feature vectors generated for the groups of analysis target articles into n data sets and then classifying the feature vectors by applying the decision tree algorithm to the n data sets using n-fold cross-validation.
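The n-fold procedure can be sketched as follows. To stay self-contained, the sketch swaps in a depth-1 decision tree (a "stump") in place of a full decision tree and uses a single made-up feature per company; the function names, the data, and the assumption that a larger feature value points toward default are illustrative only. It also assumes `n_folds` does not exceed the number of samples.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_stump(X, y):
    """Fit a depth-1 tree: the (feature, threshold) split with the fewest training errors."""
    majority = 1 if 2 * sum(y) >= len(y) else 0
    best = None  # (errors, feature index, threshold)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2  # candidate threshold between two observed values
            errors = sum((1 if row[f] > t else 0) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    if best is None:  # no feature varies, so fall back to the majority label
        return lambda row: majority
    _, f, t = best
    return lambda row: 1 if row[f] > t else 0


def cross_validate(X, y, n_folds):
    """Train on n_folds - 1 folds, test on the held-out fold, and repeat for every fold."""
    accuracies = []
    for fold in k_fold_indices(len(X), n_folds):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        model = fit_stump(train_X, train_y)
        correct = sum(model(X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return accuracies


# Hypothetical feature: share of a company's articles in the highest-risk group.
X = [[0.1], [0.9], [0.2], [0.8], [0.15], [0.85]]
y = [0, 1, 0, 1, 0, 1]  # 1 = company defaulted
fold_accuracies = cross_validate(X, y, n_folds=3)
```

Each sample is used for testing exactly once, so the per-fold accuracies give an estimate of how the classifier behaves on feature vectors it was not trained on.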

According to another embodiment of the present invention, a computing system for predicting the default risk of a company based on news articles may comprise: a news article collection unit that collects a plurality of news articles on the Internet; an analysis company selection unit that selects a company to be analyzed; an analysis target article classification unit that classifies, among the plurality of collected news articles, news articles related to the analysis target company as analysis target articles; a risk level calculation unit that calculates a risk level for each of the analysis target articles; a feature vector generation unit that groups the analysis target articles based on the calculated risk levels and generates a feature vector representing each group; and a default risk prediction unit that calculates the default risk of the analysis target company based on the generated feature vectors.

According to embodiments of the present invention, news articles directly related to a specific company can be selected, and the default risk of that company can be predicted based only on those articles.

According to embodiments of the present invention, a machine learning algorithm that is effective for default risk prediction can be adopted and used. In addition to the risk level being calculated independently for each article, analysis through the grouping of news articles can improve the accuracy of default risk prediction.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a diagram schematically showing an environment in which a corporate default prediction system according to an embodiment of the present invention operates.
FIG. 2 is a flowchart schematically illustrating how a corporate default prediction system according to an embodiment of the present invention operates.
FIG. 3 is a diagram illustrating an operation in which a default prediction apparatus according to an embodiment of the present invention collects news articles.
FIG. 4 is a diagram illustrating how a default prediction apparatus according to an embodiment of the present invention uses an NER algorithm to classify news articles related to an analysis target company as analysis target articles.
FIG. 5 is a diagram illustrating a process of selecting the independent variables that a default prediction apparatus according to an embodiment of the present invention uses in calculating risk levels for analysis target articles.
FIG. 6 is a diagram illustrating a process in which a default prediction apparatus according to an embodiment of the present invention selects the machine learning algorithm to be used in calculating risk levels for analysis target articles.
FIG. 7 is a diagram illustrating a process in which a default prediction apparatus according to an embodiment of the present invention predicts the default of a company based on the risk levels of analysis target articles.
FIG. 8 is a diagram illustrating how a default prediction apparatus according to an embodiment of the present invention groups analysis target articles related to a specific company.
FIG. 9 is a diagram for explaining how a default prediction apparatus according to an embodiment of the present invention calculates the default risk of an analysis target company from the classified feature vectors.
FIG. 10 is a diagram for explaining the cross-validation method that a default prediction apparatus according to an embodiment of the present invention uses to classify feature vectors through a decision tree algorithm.
FIG. 11 is a block diagram showing the configuration of a default prediction apparatus according to an embodiment of the present invention.
FIG. 12 is a flowchart illustrating a process by which a default prediction apparatus according to an embodiment of the present invention derives a default risk prediction value for an analysis target company.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the phrase specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those mentioned. Throughout the specification, the same reference numerals refer to the same components, and "and/or" includes each of the mentioned components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are of course not limited by these terms. These terms are used only to distinguish one component from another. Therefore, a first component mentioned below may of course be a second component within the technical spirit of the present invention.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit" and "module" used in the specification mean a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 is a diagram schematically showing an environment in which a corporate default prediction system according to an embodiment of the present invention operates.

Referring to FIG. 1, the corporate default prediction system may include a default prediction apparatus 100, a user terminal 200, and an external server 300.

The default prediction apparatus 100 may collect news articles and predict the default risk of a specific company by analyzing the collected news articles.

The default prediction apparatus 100 may be configured as a computing system having memory means and a processing unit. That is, the default prediction apparatus 100 may be configured as a server with intensive processing capability or, alternatively, as one of various digital devices that have memory means and a microprocessor providing computing capability, such as a personal computer (for example, a desktop or notebook computer), a workstation, a PDA, or a web pad. Software that implements functions related to default prediction may be stored or installed in the memory means of the default prediction apparatus 100.

The user terminal 200 may be a terminal used by a user who wishes to obtain information related to the default risk of a specific company by communicating with the default prediction apparatus 100.

The user terminal 200 according to an embodiment of the present invention may include any kind of handheld wireless communication device that can connect to a web server through a network, such as a mobile phone, a smartphone, a PDA (Personal Digital Assistant), a PMP (Portable Multimedia Player), or a tablet PC. It may also be one of various digital devices that have memory means and a microprocessor providing computing capability, such as a personal computer (for example, a desktop or notebook computer), a workstation, a PDA, or a web pad.

According to an embodiment, the user may use the user terminal 200 to transmit identification information on a company whose default risk the user wishes to know to the default prediction apparatus 100, and may then receive and check, through the user terminal 200, the default risk prediction for that company as analyzed by the default prediction apparatus 100.

The external server 300 is a server that is not managed by the entity managing the default prediction apparatus 100 and, according to an embodiment, may be a server holding news data. The default prediction apparatus 100 may access the external server 300 to collect news data. The collection of news data by the default prediction apparatus 100 may be performed by crawling various kinds of news data published on the web. In this case, the default prediction apparatus 100 may collect news data from a plurality of external servers 300. The number of external servers 300 is therefore of course not limited to any specific number.

The default prediction apparatus 100 may communicate with the user terminal 200 and the external servers 300 through a communication network implemented in various ways.

The communication network may be implemented as one of a wired communication network, a wireless communication network, and a hybrid communication network. For example, the communication network may include a mobile communication network such as 3G, LTE (Long Term Evolution), or LTE-A. The communication network may include a wired or wireless communication network such as Wi-Fi, UMTS (Universal Mobile Telecommunications System)/GPRS (General Packet Radio Service), or Ethernet. The communication network may include a short-range communication network such as Magnetic Secure Transmission (MST), Radio Frequency Identification (RFID), Near Field Communication (NFC), ZigBee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or infrared (IR) communication. The communication network may include a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), or the like.

FIG. 2 is a flowchart schematically illustrating how a corporate default prediction system according to an embodiment of the present invention operates.

Referring to FIG. 2, the default prediction apparatus 100 may ultimately predict the default risk of a company through, broadly, a three-step operation.

In FIG. 2, step (a) collects and preprocesses news data, step (b) calculates a risk level for each news article, and step (c) performs the final prediction of the company's default risk.

Beginning with step (a) of FIG. 2, the default prediction apparatus 100 may access the Internet and collect news articles through crawling. In this process, the default prediction apparatus 100 may collect only those news articles that contain the name of the company selected as the analysis target.

Subsequently, the default prediction apparatus 100 may determine whether the collected news articles are related to the analysis target company and thereby classify some of the news articles as analysis target articles. For example, even if a news article contains the name of the analysis target company as text, that name may be used in the article for a purpose other than referring to the analysis target company. The context or topic of an article may also be unrelated to the company being analyzed. In such cases, an article may not be classified as an analysis target article even though it contains the name of the analysis target company.

Turning to step (b) of FIG. 2, the default prediction apparatus 100 may calculate a risk level for each of the analysis target articles classified as being related to the company under analysis.

In step (b), the risk level for each of the analysis target articles may be calculated through machine learning. That is, the default prediction apparatus 100 may set at least some of the collected news articles as a training set and a test set for machine learning, adopt a specific machine learning algorithm, analyze the training set with that algorithm, and apply the result to the test set to evaluate it, thereby deriving the risk calculation algorithm used to calculate risk levels. The default prediction apparatus 100 may then calculate the risk levels of the analysis target articles using the derived risk calculation algorithm.

The default prediction apparatus 100 may perform training and testing in this manner for various kinds of machine learning algorithms, evaluate the results, and adopt the machine learning algorithm with the best evaluation result. The adoption of a machine learning algorithm according to the evaluation results may be performed automatically by the default prediction apparatus 100, or an administrator of the default prediction apparatus 100 may review the evaluation results and make the selection.

According to an embodiment, the training data include information on which companies have defaulted, and the default prediction apparatus 100 may use this information to evaluate the machine learning algorithms. For example, the machine learning algorithms may be evaluated according to the degree to which they assign high risk levels to news articles about companies that defaulted and low risk levels to news articles about companies that did not.
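One way to quantify that criterion is sketched below: each candidate is scored by the fraction of (defaulted, non-defaulted) article pairs in the test set that it ranks in the right order, and the highest-scoring candidate is adopted. The pairwise measure, the two toy scorers (which stand in for models already trained on the training set), and the example headlines are assumptions for illustration; the text does not name a specific metric.

```python
def separation_score(scores, labels):
    """Fraction of (defaulted, non-defaulted) pairs ranked correctly; ties count half."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pick_best(candidates, test_articles, test_labels):
    """Evaluate every candidate on the test set and return the best name plus all scores."""
    results = {
        name: separation_score([score(a) for a in test_articles], test_labels)
        for name, score in candidates.items()
    }
    return max(results, key=results.get), results


# Toy stand-ins for trained risk models (hypothetical).
RISK_WORDS = {"lawsuit", "delisting", "insolvency"}


def keyword_scorer(text):
    words = text.lower().split()
    return sum(w in RISK_WORDS for w in words) / max(len(words), 1)


def length_scorer(text):
    return len(text.split()) / 100.0


test_articles = [
    "Creditors file insolvency lawsuit",
    "Company opens new store",
    "Exchange warns of delisting",
    "Quarterly profit rises again this year",
]
test_labels = [1, 0, 1, 0]  # 1 = article about a company that later defaulted

best_name, results = pick_best(
    {"keyword": keyword_scorer, "length": length_scorer}, test_articles, test_labels
)
```

In this toy run the keyword-based scorer separates the two classes perfectly while the length-based scorer does not, so the former would be adopted.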

In step (b), the risk calculation algorithm derived by the finally adopted machine learning algorithm may be used to calculate the risk level for each of the analysis target articles.

Turning to step (c) of FIG. 2, the default prediction apparatus 100 may perform the final prediction of the company's default risk based on the risk levels of the news articles calculated in step (b).

The default prediction apparatus 100 may calculate the default risk of the analysis target company in numerical form, based on the risk levels of the analysis target articles classified for that company. The default prediction apparatus 100 may also determine, in yes-or-no form, whether the company is expected to default based on the calculated default risk, and may additionally calculate the reliability of the calculated default risk.

In calculating the default risk of the analysis target company from the risk levels of the analysis target articles, the default prediction apparatus 100 may group the analysis target articles according to their calculated risk levels, generate a feature vector for each group, and calculate the default risk on that basis.
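A minimal sketch of the grouping step follows. The contents of each feature vector (article count, share of all articles, and mean risk level per group) and the two cut points are assumptions chosen for illustration; this passage does not specify what the feature vectors contain.

```python
from bisect import bisect_right
from statistics import mean


def group_feature_vectors(risk_levels, boundaries):
    """Group article risk levels into intervals and summarize each group.

    Returns one (count, share, mean_risk) tuple per interval. These three
    components are a hypothetical choice of features.
    """
    groups = [[] for _ in range(len(boundaries) + 1)]
    for risk in risk_levels:
        groups[bisect_right(boundaries, risk)].append(risk)

    total = len(risk_levels)
    vectors = []
    for members in groups:
        share = len(members) / total if total else 0.0
        average = mean(members) if members else 0.0
        vectors.append((len(members), share, average))
    return vectors


# Risk levels of four analysis target articles for one company (made-up values).
vectors = group_feature_vectors([0.125, 0.25, 0.5, 0.875], boundaries=[0.33, 0.66])
```

The resulting per-group vectors could then be concatenated into a single row per company and passed to the classifier that computes the default risk.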

FIG. 3 is a diagram illustrating an operation in which the default prediction apparatus 100 according to an embodiment of the present invention collects news articles.

Referring to FIG. 3, the default prediction apparatus 100 may collect news articles by crawling the Internet. FIG. 3 shows only the news articles about a specific company among those collected by the default prediction apparatus 100.

According to an embodiment, the default prediction apparatus 100 may use a morphological analysis library such as Lucene to select, from all the news articles, only those that contain the name of a specific company.
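The name-based selection can be approximated without any search library. The sketch below is not Lucene; it swaps in a plain regular expression that matches the company name only as a whole word or phrase, and the company name "Hanbit Steel" and the headlines are hypothetical.

```python
import re


def mentions_company(text, company_name):
    """Return True if company_name appears in text as a whole word or phrase."""
    pattern = r"(?<!\w)" + re.escape(company_name) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def select_by_name(articles, company_name):
    """Keep only the articles whose text contains the company name."""
    return [a for a in articles if mentions_company(a, company_name)]


articles = [
    "Hanbit Steel misses bond payment",
    "Hanbit Steelworks union meets",
    "Weather forecast for Sunday",
]
selected = select_by_name(articles, "Hanbit Steel")
```

Because the match is exact at word boundaries, the second headline is rejected even though it begins with the same characters.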

FIG. 4 is a diagram illustrating how the default prediction apparatus 100 according to an embodiment of the present invention uses an NER algorithm to classify news articles related to the analysis target company as analysis target articles.

Referring to FIG. 4, the default prediction apparatus 100 may use a specialized NER (Named Entity Recognition) algorithm to classify articles related to the analysis target company. As described above with reference to FIG. 3, a morphological analysis library or the like can select only the news articles that contain a specific company's name, but several problems arose when the selected articles were used directly as analysis target articles. Some selected articles had nothing to do with the company. In other cases, the morphological analysis library accepted articles in which it judged the name to be slightly misspelled, so unrelated articles were selected. In addition, articles were selected in which the company's name was used with a different meaning, for example as the name of a product sold by the company rather than as the name of the company itself.

To solve these problems, the default prediction apparatus 100 may use the NER algorithm to determine whether each news article selected by the morphological analysis library is actually about the analysis target company, and may classify the articles that are as analysis target articles. That is, the apparatus selects news articles containing the name of the analysis target company using the morphological analysis library, and then uses the NER algorithm to determine, for each selected article, whether it is related to the analysis target company.

According to an embodiment, the NER algorithm may be implemented in the R programming language, using POS tagging and n-grams. POS (Part-Of-Speech) tagging is a method of classifying text according to grammatical function or form, and the classification can follow various criteria. An n-gram approach groups words: a sentence is divided into groups of n syllables or n words, and the analysis is performed on those groups.
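The n-gram grouping described above can be illustrated with a minimal sketch. Python is used here only for illustration (the embodiment mentions R), and the function names `word_ngrams` and `char_ngrams` are hypothetical. Whitespace tokenization is an assumption; character n-grams are shown as a rough stand-in for syllable groups, since each Hangul character is one syllable.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return every run of n consecutive characters (syllables in Hangul)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Illustrative input only; not taken from the patent figures.
bigrams = word_ngrams("Acme Corp files for court protection", 2)
print(bigrams)
```

A text shorter than n yields an empty list, so callers do not need a special case for short headlines.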

Referring to FIG. 4, the NER algorithm can determine whether the articles selected for the analysis target company are related to that company. The NER algorithm is applied based on whether text containing the company's name has been extracted, and it determines whether the article is related to the analysis target company. As shown in FIG. 4, the result may be produced in a yes-or-no form such as 'ACCEPT' or 'REJECT'.

As described above, the NER algorithm identifies the context or topic of each selected news article and determines whether that context or topic is related to information about the analysis target company. It can also determine whether the company's name is being used to mean something other than the company itself.

FIG. 5 is a diagram illustrating the process by which the default prediction apparatus 100 according to an embodiment of the present invention selects the independent variables used in calculating the risk levels of the analysis target articles.

The default prediction apparatus 100 may use a machine learning algorithm to analyze the analysis target articles classified in the manner described above with reference to FIGS. 3 and 4. The apparatus may adopt a specific machine learning algorithm, set some of the analysis target articles as training data, analyze them, and thereby derive a risk calculation algorithm. The machine learning algorithm used in this process may use regression analysis or classification analysis.

Referring to FIG. 5, the default prediction apparatus 100 may select the independent variables to be used in calculating risk levels in various ways. These independent variables may include at least some of the sentences or words contained in the news articles.

According to an embodiment, the default prediction apparatus 100 may select words contained in the news articles by classifying them in n-gram form, or may select words through various feature selection methods. Words to be used as independent variables in the risk level calculation may also be selected on the basis of their rarity, computed from how frequently they appear in news articles.
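One way to read rarity-based selection is sketched below. The text does not specify the rarity measure, so this sketch assumes inverse document frequency (IDF) and a minimum document count; both choices, the function name `select_terms_by_rarity`, and the parameters `min_df` and `top_k` are assumptions made for illustration.

```python
import math
from collections import Counter


def select_terms_by_rarity(docs, min_df=2, top_k=3):
    """Pick the top_k rarest terms that still occur in at least min_df documents.

    docs is a list of token lists, one per article. Rarity is measured as
    IDF = log(N / df), which is an assumed stand-in for the patent's measure.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per article
    scored = [
        (math.log(n_docs / count), term)
        for term, count in df.items()
        if count >= min_df
    ]
    # Highest IDF first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_k]]
```

The `min_df` floor drops terms seen in only one article, and `top_k` caps the number of independent variables, which ties in with the overfitting concern raised for FIG. 5.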

FIG. 5 shows the results obtained when the independent variables are selected in various ways. In the graph on the right side of FIG. 5, the horizontal axis represents the different variable selection methods, the bars represent the number of independent variables used in each method, and the line represents the evaluation score of each method.

Referring to FIG. 5, as methods 1-3 and 1-4 show, the evaluation score can rise as the number of independent variables grows. However, such methods can suffer from overfitting, and processing an excessive number of independent variables can demand excessive computing resources. Therefore, as methods 3-1 and 3-2 show, selecting a moderate number of independent variables can yield an adequate evaluation score without the overfitting problem.

FIG. 6 is a diagram illustrating the process by which the default prediction apparatus 100 according to an embodiment of the present invention selects the machine learning algorithm to be used in calculating risk levels for the analysis target articles.

Referring to FIG. 6, the default prediction apparatus 100 may adopt various types of machine learning algorithms to analyze the analysis target articles and derive a risk calculation algorithm from the analysis results. According to an embodiment, the machine learning algorithm may use regression analysis or classification analysis. As shown in FIG. 6, the types of machine learning algorithm may include MLP Regression, Logistic Regression, Decision Tree, Random Forest, Adaboost Classifier, SVM (Support Vector Machine), and the like. The types of machine learning algorithm that the default prediction apparatus 100 can adopt are, of course, not limited to these examples.

FIG. 6 shows the results of analyzing the analysis target articles with various machine learning algorithms, evaluated by AUC and the Gini coefficient (Gini value). AUC (Area Under Curve) is a measure used in statistics to evaluate the performance of a discriminant model. It is the area under the ROC (Receiver Operating Characteristics) curve, plotted on a graph whose x-axis is the False Positive Rate and whose y-axis is the True Positive Rate. The maximum AUC is 1, and a higher value can be interpreted as better classification performance. The Gini coefficient is another measure used in statistics to evaluate discriminant models. Let A be the area between the ROC curve and the diagonal that starts at the origin and divides the graph in half, and let B be the area under the ROC curve; the Gini coefficient can then be obtained by dividing A by (A+B). The AUC and the Gini coefficient may be related such that the Gini coefficient equals twice the AUC minus 1.
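The two evaluation measures can be sketched as follows, using the stated relationship that the Gini coefficient is twice the AUC minus 1. This sketch computes the AUC by pairwise comparison of positive and negative scores, which is equivalent to the area under the ROC curve; the function names and the sample labels and scores are illustrative assumptions, not values from FIG. 6.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (default) or 0 (no default); ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient from the relationship Gini = 2 * AUC - 1."""
    return 2 * auc(labels, scores) - 1


# Illustrative data only.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_gini = gini([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With this sample, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75 and the Gini coefficient is 0.5. A model that ranks at random would score an AUC near 0.5 and a Gini coefficient near 0.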

Based on these evaluation results, the default prediction apparatus 100 may adopt a specific machine learning algorithm and use the risk calculation algorithm derived through it to calculate a risk level for each analysis target article.

FIG. 7 is a diagram illustrating the process by which the default prediction apparatus 100 according to an embodiment of the present invention predicts a company's default from the risk levels of the analysis target articles.

Referring to FIG. 7, the default prediction apparatus 100 may determine the ratio of default articles from the risk levels of the analysis target articles related to the analysis target company, and may calculate the default risk accordingly. In FIG. 7, although there is some variation, the ratio of default articles gradually increases over time, and from M-4, four months before the default occurs, the default risk value (D.D: Distance to Default) increases along with the ratio of default articles. In the circled portion 701 of FIG. 7, the ratio of default articles rises temporarily, but the default risk value shows no large change. In this way, the default prediction apparatus 100 can keep the default risk value from following temporary increases or decreases in the ratio of default articles too closely.
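The text does not say how a default article is identified or how temporary spikes are damped. The sketch below assumes that an article counts as a default article when its risk level reaches a threshold, and it uses a trailing moving average as one possible damping step. Both are assumptions for illustration, as are the names `default_article_ratio` and `trailing_mean`.

```python
def default_article_ratio(levels, threshold):
    """Share of articles whose risk level is at or above threshold (assumed rule)."""
    if not levels:
        return 0.0
    return sum(1 for level in levels if level >= threshold) / len(levels)


def trailing_mean(values, window):
    """Average each value with up to window - 1 preceding values.

    This is a hypothetical smoothing step, not the patent's stated method.
    """
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Applied to a monthly series of ratios, a one-month spike is spread over the window instead of appearing at full height, which resembles the behavior shown in portion 701.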

FIG. 8 is a diagram illustrating how the default prediction apparatus 100 according to an embodiment of the present invention groups the analysis target articles related to a specific company.

Referring to FIG. 8, the risk levels of a plurality of analysis target articles are shown as numerical values, together with the grouping of those articles by risk level.

According to an embodiment of the present invention, the analysis target articles may have risk levels of various values, as in FIG. 8, and may be grouped into a plurality of groups according to any criterion based on the risk level. In FIG. 8, the analysis target articles with calculated risk levels of 0.7, 0.8, and 0.9 are grouped into a first group 801, those with risk levels of 1.2 and 1.3 into a second group 803, and those with risk levels of 0.3, 0.4, and 0.5 into a third group 805. Thus, according to an embodiment of the present invention, a plurality of risk level intervals may be set, and the analysis target articles may be grouped by the interval into which each falls.
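Interval-based grouping of this kind can be sketched as follows. The risk levels are the ones listed for FIG. 8, but the interval boundaries 0.6 and 1.0 are assumed for illustration, since the text gives no boundary values. The function name `group_by_risk_level` is hypothetical.

```python
from bisect import bisect_right


def group_by_risk_level(levels, boundaries):
    """Group risk levels by the interval they fall into.

    boundaries must be sorted ascending; k boundaries define k + 1 intervals,
    indexed from 0 (lowest risk) upward.
    """
    groups = {}
    for level in levels:
        index = bisect_right(boundaries, level)
        groups.setdefault(index, []).append(level)
    return groups


# Risk levels from FIG. 8; the boundaries are assumed values.
fig8_levels = [0.7, 0.8, 0.9, 1.2, 1.3, 0.3, 0.4, 0.5]
fig8_groups = group_by_risk_level(fig8_levels, [0.6, 1.0])
```

With these assumed boundaries, interval 1 holds the levels of the first group 801, interval 2 those of the second group 803, and interval 0 those of the third group 805.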

FIG. 9 is a diagram for explaining how the default prediction apparatus 100 according to an embodiment of the present invention calculates the default risk of the analysis target company from the classified feature vectors.

Referring to FIG. 9, after the analysis target articles have been grouped by risk level as described above, the default prediction apparatus 100 may generate feature vectors representing each group. A feature vector may include, as its elements, statistics calculated from the risk levels of the analysis target articles belonging to the group. These statistics may include the minimum, maximum, mean, median, mode, and the like.
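A feature vector built from the statistics named above might look like the following sketch. The element order and the function name `feature_vector` are assumptions; the text lists the statistics but not how they are arranged.

```python
import statistics


def feature_vector(levels):
    """Summarize one group's risk levels as [min, max, mean, median, mode].

    The ordering of elements is an assumed convention for this sketch.
    """
    if not levels:
        raise ValueError("a group must contain at least one risk level")
    return [
        min(levels),
        max(levels),
        statistics.mean(levels),
        statistics.median(levels),
        statistics.mode(levels),
    ]
```

On Python 3.8 and later, `statistics.mode` returns the first value encountered when several values are equally common, so a group with no repeated risk level still produces a vector.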

After generating the feature vectors for each group, the default prediction apparatus 100 may calculate the default risk of the analysis target company from the generated feature vectors. According to an embodiment, the apparatus may calculate the default risk and then make a final yes-or-no determination of the company's default risk.

In this process, the default prediction apparatus 100 may classify the feature vectors with a decision tree algorithm. That is, the feature vectors can be classified according to their various features, and from the classification result both the default risk prediction for the analysis target company and the reliability of that prediction can be calculated.

According to an embodiment, the default prediction apparatus 100 may base the default risk prediction computed through the decision tree algorithm on the probability distribution of each class. For example, the prediction can be calculated from the probability that the same class appears within the same branch of the decision tree, that is, the probability that the feature vectors used for classification are assigned to a specific class.
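The class probability within a branch can be illustrated with a minimal sketch that assumes the labels of the training samples reaching that branch are available. The function name `branch_class_probabilities` and the labels "default" and "healthy" are hypothetical.

```python
from collections import Counter


def branch_class_probabilities(branch_labels):
    """Return the share of each class among the samples in one tree branch."""
    counts = Counter(branch_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

If three of the four training samples in a branch belong to defaulted companies, a new feature vector that lands in the same branch would be given a default probability of 0.75 under this reading.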

The default prediction apparatus 100 may calculate, along with the default risk prediction, the reliability of that prediction. The reliability may be calculated so that it decreases when the decision tree algorithm produces the default risk prediction without having examined all the features of the feature vectors. For example, suppose there are 10 criteria by which the feature vectors can be classified in the decision tree algorithm. If the final class assignment is completed, and the default risk prediction calculated, after only some of those criteria have been used, the reliability may be calculated so that it falls as the number of criteria used falls.
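The text states only that reliability falls when fewer criteria are examined. The sketch below assumes the simplest such rule, reliability equal to the fraction of available criteria consulted on the path to the final branch. The nested-dict tree layout, the feature names, the thresholds, and the function name `predict_with_reliability` are all hypothetical.

```python
def predict_with_reliability(tree, features, total_criteria):
    """Walk a nested-dict decision tree; return (default probability, reliability).

    Internal nodes hold "feature", "threshold", "left", and "right";
    terminal nodes hold "leaf" (a default probability). Reliability is
    assumed to be (criteria consulted) / (criteria available).
    """
    node = tree
    used = set()
    while "leaf" not in node:
        used.add(node["feature"])
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"], len(used) / total_criteria


# Hypothetical two-level tree over statistics of one risk-level group.
toy_tree = {
    "feature": "group2_mean",
    "threshold": 1.0,
    "left": {"leaf": 0.1},
    "right": {
        "feature": "group2_max",
        "threshold": 1.25,
        "left": {"leaf": 0.6},
        "right": {"leaf": 0.9},
    },
}
```

With 10 criteria available, a path that consults two of them yields a reliability of 0.2, and a path that stops after one yields 0.1.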

FIG. 10 is a diagram for explaining the cross-validation method that the default prediction apparatus 100 according to an embodiment of the present invention uses when classifying feature vectors with the decision tree algorithm.

Referring to FIG. 10, when classifying feature vectors with the decision tree algorithm, the default prediction apparatus 100 may perform the classification using n-fold cross-validation (n-Cross Validation). Specifically, starting from data that include the risk level information calculated for the analysis target articles, the apparatus divides the data into n sets, uses one of the n sets as test data (test set) and the remaining sets as training data (training set or learning set), and repeats this n times. With cross-validation, the set used as test data changes at each iteration, so the analysis can be repeated up to n times with different test and training data.
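The rotation of test and training sets described above can be sketched as follows. How the data are divided into n sets is not specified, so assigning every n-th item to the same set is an assumption, as is the function name `n_fold_splits`.

```python
def n_fold_splits(items, n):
    """Yield (training, test) pairs so that each of n sets is the test set once."""
    folds = [items[i::n] for i in range(n)]  # item k goes to fold k % n
    for i in range(n):
        test = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test


# Six placeholder records split three ways.
example_splits = list(n_fold_splits(list(range(6)), 3))
```

Each record appears in exactly one test set across the n iterations, so every record contributes to both training and evaluation without being used for both in the same iteration.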

FIG. 11 is a block diagram showing the configuration of the default prediction apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 11, the default prediction apparatus 100 may include a news article collection unit 110, an analysis company selection unit 120, an analysis target article classification unit 130, a risk level calculation unit 140, a feature vector generation unit 150, a default risk prediction unit 160, a communication unit 170, a storage unit 180, and a control unit 190.

For convenience of description, the entities that perform the respective roles within the default prediction apparatus 100 are denoted as "units," but each may be a subprogram module operating within the apparatus or a functional division of the control unit 190. Such program modules encompass, but are not limited to, routines, subroutines, programs, objects, components, and data structures that perform particular operations or implement particular abstract data types.

The news article collection unit 110 may collect news articles on the Internet. To do so, it may connect to various types of external servers 300. The news article collection unit 110 may collect news articles by crawling news-related data published on the web.

According to an embodiment, once an analysis target company has been selected, the news article collection unit 110 may classify and collect only the articles related to that company. In performing this function, the news article collection unit 110 may use the functions of the analysis company selection unit 120 and the analysis target article classification unit 130, described below.

The analysis company selection unit 120 may select the company to be analyzed. As described above, the news article collection unit 110 may collect only the articles related to the company selected by the analysis company selection unit 120, and the analysis target article classification unit 130, described below, may likewise classify articles about the companies selected by the analysis company selection unit 120. According to an embodiment, the analysis company selection unit 120 may select a plurality of analysis target companies, and may select as an analysis target company the company that corresponds to company information received from the user terminal 200.

The analysis target article classification unit 130 may classify, from among the collected news articles, those related to the analysis target company as analysis target articles. To do so, it may first select the news articles that contain the name of the analysis target company and then determine, for each selected article, whether it is related to the analysis target company.

The analysis target article classification unit 130 may use a morphological analysis library to select the news articles that contain the name of the analysis target company. Once those articles have been selected, it may identify the context or topic of each one and determine whether that context or topic is related to information about the analysis target company. For example, if the analysis target company is a toy manufacturer and one of the articles selected for containing its name deals with entirely unrelated semiconductor content, the analysis target article classification unit 130 may judge that article to be unrelated to the analysis target company and exclude it from the analysis target articles.

The risk level calculation unit 140 may calculate a risk level for each of the analysis target articles classified by the analysis target article classification unit 130. In this process, it may adopt a specific machine learning algorithm and use it to perform regression analysis or classification analysis on the collected news articles. The risk level calculation unit 140 may then use the risk calculation algorithm derived from that analysis to calculate a risk level for each analysis target article.

The risk level calculation unit 140 may analyze not only news about the analysis target company but all news articles collected by the news article collection unit 110. Because the machine learning algorithm must be trained before it is used, the risk level calculation unit 140 may use such news articles as training data. According to an embodiment, the unit may manage corporate default information in addition to news articles. Corporate default information may include information on the companies that defaulted and on when the defaults occurred.

When analyzing the collected news articles with the machine learning algorithm, the risk level calculation unit 140 may restrict the analysis to news articles about companies that have defaulted, and may further select only those articles published within a certain period before the company's default. For example, when a particular company has defaulted, the risk level calculation unit 140 may analyze only the news articles published during the two years preceding the default date.
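The two-year window in the example can be sketched as a simple date filter. The record layout (a dict with a "published" date), the 730-day approximation of two years, and the function name `articles_before_default` are assumptions made for illustration.

```python
from datetime import date, timedelta


def articles_before_default(articles, default_date, window_days=730):
    """Keep articles published within window_days before the default date.

    Each article is assumed to be a dict with a "published" datetime.date.
    Articles published on or after the default date are excluded.
    """
    start = default_date - timedelta(days=window_days)
    return [a for a in articles if start <= a["published"] < default_date]


# Hypothetical records for a company that defaulted on 2018-06-01.
sample_articles = [
    {"id": 1, "published": date(2017, 1, 15)},
    {"id": 2, "published": date(2015, 1, 1)},
    {"id": 3, "published": date(2018, 7, 1)},
]
kept = articles_before_default(sample_articles, date(2018, 6, 1))
```

Only the first record survives the filter: the second is older than the window, and the third was published after the default.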

In this way, for companies that have defaulted, the risk level calculation unit 140 analyzes with the machine learning algorithm only the news articles published within a certain period before the default, and then uses the resulting risk calculation algorithm to calculate a risk level for each analysis target article, which can improve the accuracy of the risk level calculation. That is, in the training stage the risk level calculation unit 140 trains the machine learning algorithm on articles about companies known to have defaulted, and it then uses the derived risk calculation algorithm to calculate a risk level for each news article about an analysis target company that has not defaulted so far.

The feature vector generation unit 150 may group the analysis target articles by the risk level calculated for each of them and generate feature vectors representing each group.

The feature vector generation unit 150 may classify the analysis target articles into a plurality of groups according to the interval into which each calculated risk level falls, and may generate feature vectors representing each group from statistics of the risk levels of the articles in that group.

According to an embodiment, the risk level statistics that the feature vector generation unit 150 uses to generate feature vectors may include the mean, median, mode, minimum, maximum, and the like of the risk level values. The feature vector generation unit 150 may generate a feature vector whose elements are these statistics or other values derived from them.

According to an embodiment, the feature vector generation unit 150 may divide the risk level into intervals differently depending on the type of industry to which the analysis target company belongs. For example, it may classify industries into manufacturing, healthcare, finance, telecommunications, and so on, identify which one the analysis target company belongs to, and then divide the risk level into intervals tailored to that industry. As another example, it may classify the industry of the analysis target company as manufacturing or non-manufacturing and divide the risk level into intervals accordingly.
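Industry-specific interval division can be sketched as a lookup of per-industry boundaries followed by an interval search. The boundary values in `INDUSTRY_BOUNDARIES` are placeholders invented for this sketch; the text gives no actual figures. The function name `risk_interval` is also hypothetical.

```python
from bisect import bisect_right

# Placeholder boundaries per industry; real values are not given in the text.
INDUSTRY_BOUNDARIES = {
    "manufacturing": [0.6, 1.0],
    "non-manufacturing": [0.5, 0.9, 1.1],
}


def risk_interval(level, industry):
    """Return the index of the risk-level interval for the given industry."""
    boundaries = INDUSTRY_BOUNDARIES[industry]
    return bisect_right(boundaries, level)
```

The same risk level can therefore land in different intervals depending on the industry: with these placeholder values, 0.95 falls in interval 1 for manufacturing and interval 2 for non-manufacturing.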

The default risk prediction unit 160 may calculate a default risk prediction value for the company to be analyzed based on the plurality of feature vectors generated for that company, and may calculate the reliability associated with the prediction.

According to an embodiment, the default risk prediction unit 160 may calculate the default risk prediction value by classifying the feature vectors through a decision tree algorithm. The default risk prediction unit 160 may divide the data including the feature vectors generated for the groups of articles to be analyzed into n data sets, and then classify the feature vectors by applying a decision tree to the n data sets in an n-fold cross-validation manner.
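The n-fold partitioning step can be sketched as below. The sketch covers only the division into n data sets and the rotation of the validation set; training a decision tree on each training portion is left out, and the function name `n_fold_splits` and the round-robin assignment of samples to folds are assumptions.

```python
def n_fold_splits(samples, n):
    """Yield (training, validation) pairs for n-fold cross-validation."""
    if not 2 <= n <= len(samples):
        raise ValueError("n must be between 2 and the number of samples")
    # Round-robin assignment: sample i goes to fold i % n.
    folds = [samples[i::n] for i in range(n)]
    for i, validation in enumerate(folds):
        # Every fold except the i-th one forms the training portion.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, validation


# Ten placeholder samples standing in for feature vectors.
splits = list(n_fold_splits(list(range(10)), 5))
```

Each sample appears in exactly one validation set, so every feature vector is classified once by a tree that was not trained on it.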

The default risk prediction unit 160 may calculate the default risk prediction value based on the probability distribution of each class in the decision tree. In addition, the default risk prediction unit 160 may calculate the reliability of the default risk prediction based on the number of features examined by the decision tree algorithm in the course of calculating the prediction value.
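One way to read this paragraph is sketched below. The specification gives no formula, so everything here is an assumption made for illustration: the hand-built tree `TREE`, the use of the leaf's class proportions as the probability distribution, and the reliability defined as the share of features examined on the path to the leaf.

```python
# Hypothetical hand-built tree. Internal nodes test one feature against a
# threshold; leaves hold the class counts observed during training.
TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"counts": {"default": 1, "normal": 9}},
    "right": {
        "feature": 3, "threshold": 0.8,
        "left": {"counts": {"default": 4, "normal": 6}},
        "right": {"counts": {"default": 8, "normal": 2}},
    },
}


def predict(tree, vector):
    """Return (default risk, reliability) for one feature vector."""
    node, examined = tree, set()
    while "counts" not in node:
        examined.add(node["feature"])
        branch = "left" if vector[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    counts = node["counts"]
    # Default risk: proportion of the "default" class at the reached leaf.
    risk = counts["default"] / sum(counts.values())
    # Reliability (assumed definition): fraction of features examined.
    reliability = len(examined) / len(vector)
    return risk, reliability


risk, reliability = predict(TREE, [0.7, 0.6, 0.6, 0.9, 0.95])
```

Under this assumed definition, a prediction reached after examining more features is reported with a higher reliability than one decided at the first split.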

The communication unit 170 enables the default prediction apparatus 100 to communicate with the user terminal 200 and the external server 300. The communication network used by the communication unit 170 may be configured regardless of communication mode, whether wired or wireless, and may be implemented as any of various types of communication networks.

The storage unit 180 may store data used during operation of the default prediction apparatus 100. For example, the storage unit 180 may store and manage the collected news article data and the analysis data derived from it.

The storage unit 180 may include, for example, a memory, a cache, a buffer, and the like, and may be composed of software, firmware, hardware, or a combination of at least two of them. According to an embodiment, the storage unit 180 may be implemented in the form of a read only memory (ROM).

The control unit 190 may control the flow of data among the news article collection unit 110, the analysis company selection unit 120, the analysis target article classification unit 130, the risk level calculation unit 140, the feature vector generation unit 150, the default risk prediction unit 160, the communication unit 170, and the storage unit 180. That is, the control unit 190 according to the present invention may control the news article collection unit 110, the analysis company selection unit 120, the analysis target article classification unit 130, the risk level calculation unit 140, the feature vector generation unit 150, the default risk prediction unit 160, the communication unit 170, and the storage unit 180 so that each performs its own function.

In FIG. 11, the news article collection unit 110, the analysis company selection unit 120, the analysis target article classification unit 130, the risk level calculation unit 140, the feature vector generation unit 150, and the default risk prediction unit 160 are functional divisions of the control unit 190, and thus may be integrated into a single control unit 190.

FIG. 12 is a flowchart illustrating a process by which the default prediction apparatus 100 according to an embodiment of the present invention derives a default risk prediction value for a company to be analyzed.

Referring to FIG. 12, the default prediction apparatus 100 may collect news articles on the Internet (S1201). The news article data accumulated in this collection step may later be used as training data or test data for a machine learning algorithm, or only the news articles related to a company selected for analysis may be separately classified and used for default risk prediction.

The default prediction apparatus 100 may analyze the news articles through a machine learning algorithm and derive a risk calculation algorithm (S1203). In this process, the default prediction apparatus 100 may analyze the news articles with various types of machine learning algorithms, compare the results, and adopt a specific machine learning algorithm. Separately from the news article data, the default prediction apparatus 100 may collect and manage default information on companies and evaluate the news article analysis results against it.

The default prediction apparatus 100 may select a company to be analyzed and classify news articles related to that company as articles to be analyzed (S1205). The company may be selected by a user, in which case the selection information is transmitted from the user terminal 200 to the default prediction apparatus 100, or by an administrator of the default prediction apparatus 100. The default prediction apparatus 100 may classify the news articles related to the company from among news articles collected in advance, or may newly search for and collect news articles related to the company on the Internet. The default prediction apparatus 100 may use a morpheme library to collect only articles that contain the name of the company to be analyzed.

In selecting a company to be analyzed and classifying the related news articles as articles to be analyzed, the default prediction apparatus 100 may use a morpheme library to select, from all the collected news articles, only those containing the name of the company to be analyzed, and may then use a named entity recognition (NER) algorithm to classify, among the selected articles, only those actually related to the company as articles to be analyzed. Alternatively, when only news articles containing the name of the company to be analyzed were collected in the collection step, the default prediction apparatus 100 may apply the NER algorithm to those articles to classify only the news articles related to the company as articles to be analyzed.
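The two-stage selection described above can be sketched as a name filter followed by a pluggable relevance check. The article layout (dictionaries with `title` and `body` keys), the function names, and the toy relevance rule are all hypothetical; in particular, `toy_is_related` is only a stand-in and does not perform morpheme analysis or NER.

```python
def select_articles(articles, company, is_related):
    """Keep articles that mention the company and pass the relevance check."""
    # Stage 1: keep only articles whose text contains the company name.
    named = [a for a in articles if company in a["title"] or company in a["body"]]
    # Stage 2: keep only articles judged to be about the company.
    return [a for a in named if is_related(a, company)]


def toy_is_related(article, company):
    """Stand-in relevance check: the company name appears in the headline."""
    return company in article["title"]


# Hypothetical articles for a fictional company name.
articles = [
    {"title": "Acme misses bond payment", "body": "Acme failed to pay."},
    {"title": "Market roundup", "body": "Shares of Acme and others fell."},
    {"title": "Weather update", "body": "Rain expected on Friday."},
]
selected = select_articles(articles, "Acme", toy_is_related)
```

Passing the relevance check as a parameter keeps the name filter independent of whichever entity recognition method is eventually used.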

After classifying the articles to be analyzed, the default prediction apparatus 100 may calculate a risk level for each of the articles to be analyzed using the risk calculation algorithm derived in step S1203 (S1207).

Once the risk level for each of the articles to be analyzed has been calculated, the default prediction apparatus 100 may group the articles to be analyzed based on the calculated risk levels (S1209). According to an embodiment, the articles may be grouped according to which of a plurality of sections the risk level calculated for each article falls into.

Using the grouping result, the default prediction apparatus 100 may generate feature vectors for each group of articles to be analyzed (S1211). The elements of the feature vectors may be various statistics calculated from the risk levels of the articles included in each group, or values calculated from those statistics.

Finally, the default prediction apparatus 100 may calculate the default risk of the company to be analyzed based on the generated feature vectors (S1213). The default risk may be calculated by classifying the feature vectors through a decision tree algorithm. In addition, the default prediction apparatus 100 may separately calculate the reliability of the default risk prediction in the course of calculating the default risk prediction value.

As described above, according to various embodiments of the present invention, the default risk of a specific company can be calculated by analyzing news articles about that company. A bank can use this to evaluate the credit of companies and to strengthen its loan risk management for them. In addition, an algorithm suited to calculating default risk can be selected from among various machine learning algorithms, and not only is the risk level of each news article calculated independently, but the overall default risk can also be predicted by grouping the news articles.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. The software module may reside in random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the technical field to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

100: default prediction apparatus
200: user terminal
300: external server

Claims

A method for predicting a company's default risk based on news articles, performed by a computing system, the method comprising:
collecting a plurality of news articles on the Internet;
selecting a company to be analyzed;
classifying, among the plurality of collected news articles, news articles related to the company to be analyzed as articles to be analyzed;
calculating a risk level for each of the articles to be analyzed;
grouping the articles to be analyzed based on the calculated risk levels and generating feature vectors representing each group; and
calculating the default risk of the company to be analyzed based on the generated feature vectors.

The method of claim 1,
wherein calculating the risk level for each of the articles to be analyzed comprises:
adopting a specific machine learning algorithm, performing a regression or classification analysis on the collected news articles using the adopted machine learning algorithm, and calculating the risk level for each of the articles to be analyzed using a risk calculation algorithm derived through the regression or classification analysis.

The method of claim 2,
wherein performing the regression or classification analysis on the collected news articles further comprises selecting, among the news articles related to a defaulted company, only the news articles published within a certain period of time from the default of that company.

The method of claim 1,
wherein classifying the news articles related to the company to be analyzed as the articles to be analyzed comprises:
selecting news articles containing the name of the company to be analyzed; and
determining whether each of the selected news articles is related to the company to be analyzed.

The method of claim 4,
wherein determining whether each of the selected news articles is related to the company to be analyzed comprises:
identifying a context or subject of each of the selected news articles, and determining whether the identified context or subject is related to information on the company to be analyzed.

The method of claim 1,
wherein grouping the articles to be analyzed based on the calculated risk levels and generating the feature vectors representing each group comprises:
classifying the risk level into a plurality of sections based on the risk levels calculated for the articles to be analyzed, and grouping the articles to be analyzed according to the sections.

The method of claim 6,
wherein classifying the risk level into the plurality of sections
is performed in different ways according to the type of industry to which the company to be analyzed belongs.

The method of claim 1,
wherein calculating the default risk of the company to be analyzed based on the generated feature vectors comprises:
classifying, through a decision tree algorithm, the feature vectors of the grouped articles to be analyzed, and calculating a default risk prediction value of the company to be analyzed based on the classification result.

The method of claim 8,
wherein classifying, through the decision tree algorithm, the feature vectors of the grouped articles to be analyzed comprises:
dividing data including the feature vectors generated for the groups of articles to be analyzed into n data sets, and classifying the feature vectors by applying the decision tree algorithm to the n data sets in an n-fold cross-validation manner.

A computing system for predicting a company's default risk based on news articles, the system comprising:
a news article collection unit that collects a plurality of news articles on the Internet;
an analysis company selection unit that selects a company to be analyzed;
an analysis target article classification unit that classifies, among the plurality of collected news articles, news articles related to the company to be analyzed as articles to be analyzed;
a risk level calculation unit that calculates a risk level for each of the articles to be analyzed;
a feature vector generation unit that groups the articles to be analyzed based on the calculated risk levels and generates feature vectors representing each group; and
a default risk prediction unit that calculates the default risk of the company to be analyzed based on the generated feature vectors.