KR20200077315A

KR20200077315A - Method and apparatus for interpreting individual machine learning predictions

Info

Publication number: KR20200077315A
Application number: KR1020180166697A
Authority: KR
Inventors: 강정석
Original assignee: 주식회사 에이젠글로벌
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2020-06-30
Also published as: KR102185252B1

Abstract

According to an embodiment of the present invention, an apparatus for interpreting a machine learning model includes: a model learning unit that generates a prediction model providing the probability that a result to be predicted occurs, by applying a predetermined machine learning algorithm to a training data set composed of a plurality of data items each representing a predetermined transaction, and that calculates, for each variable constituting the data, a degree of influence on the prediction model; a probability prediction unit that calculates the predicted probability that the result occurs by applying the prediction model to the training data set; a relative frequency coefficient calculation unit that selects, based on the predicted probability, an upper group made up of a top first ratio and a lower group made up of a bottom second ratio of the training data set, and calculates, for each category of each variable, a relative frequency coefficient, which is the ratio of the proportion of lower-group data belonging to that category to the proportion of upper-group data belonging to that category; a correlation coefficient calculation unit that, for each of the plurality of data items in the training data set, either calculates the sum of the relative frequency coefficients of the variables and the correlation coefficient between that sum and the predicted probability, or calculates the sum of the products of each variable's relative frequency coefficient and its degree of influence and the correlation coefficient between that sum and the predicted probability; and a control unit that causes the relative frequency coefficient calculation unit to calculate the relative frequency coefficient and the correlation coefficient calculation unit to calculate the correlation coefficient until the correlation coefficient converges to a minimum value.

Description

METHOD AND APPARATUS FOR INTERPRETING INDIVIDUAL MACHINE LEARNING PREDICTIONS

The present invention relates to an apparatus and method for interpreting a machine learning model, which interpret a prediction model generated as a result of machine learning and provide the interpretation information to a user.

With the recent development of electronic technology, electronic devices are being used to solve highly difficult problems. For example, electronic devices solve such problems using machine learning. Machine learning may refer to a technical field in which information processing capability improves by itself based on data and experience in processing data. Through its own learning process, an electronic device can acquire information that was not explicitly input and generate a solution, that is, a prediction model for deriving the result to be predicted.

When data is input to the prediction model, information related to the result to be predicted can be derived for that data. The user of the prediction model may then question how the result was derived, and accordingly an interpretation with assured objectivity and reliability may be required for results produced by a prediction model built using machine learning.

Korean Registered Patent No. 10-1388654 (registered on April 17, 2014)

The problem to be solved by the present invention is to provide a method and apparatus for interpreting a machine learning model that generate, using machine learning, a prediction model predicting the probability that a result to be predicted occurs, and that provide an interpretation with assured objectivity and reliability for the decisions made by that prediction model.

However, the problems to be solved by the present invention are not limited to those mentioned above, and may include objects that are not mentioned but can be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

According to an embodiment of the present invention, a method for interpreting a machine learning model includes: a model learning step of generating a prediction model that provides the probability that a result to be predicted occurs, by applying a predetermined machine learning algorithm to a training data set composed of a plurality of data items each representing a predetermined transaction; a probability prediction step of calculating the predicted probability that the result occurs by applying the prediction model to the training data set; a step of selecting, based on the predicted probability, an upper group made up of a top first ratio and a lower group made up of a bottom second ratio of the training data set, and calculating, for each category of each variable constituting the data, a relative frequency coefficient, which is the ratio of the proportion of lower-group data belonging to that category to the proportion of upper-group data belonging to that category; and a step of calculating, for each of the plurality of data items in the training data set, the sum of the relative frequency coefficients of the variables, and calculating the correlation coefficient between that sum and the predicted probability, wherein the relative frequency coefficient calculation step and the correlation coefficient calculation step are repeated until the correlation coefficient converges to a minimum value.

In one variation, the first ratio and the second ratio are the same.

In another variation, the first ratio and the second ratio are different from each other.

The method further includes: calculating the degree of influence of each variable on the prediction model; and providing, as visual information for new data, the product of each variable's relative frequency coefficient at the time the correlation coefficient converges to the minimum value and that variable's degree of influence.

The visual information indicates where the product of the relative frequency coefficient and the degree of influence for each variable of the new data falls within the range between the minimum and maximum values of that product.
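The position within the minimum-to-maximum range can be illustrated with a short sketch. It assumes that the range is taken per variable over all of that variable's categories, and that the position is reported as a fraction between 0 and 1; neither detail is stated in the text, and the names `position_in_range` and `contribution_positions` are hypothetical.

```python
def position_in_range(value, minimum, maximum):
    """Return where `value` lies between `minimum` and `maximum` as a fraction in [0, 1]."""
    if maximum == minimum:
        # Degenerate range: every category has the same product, so report the midpoint.
        return 0.5
    return (value - minimum) / (maximum - minimum)


def contribution_positions(record, coeffs, influence):
    """For each variable, locate the record's (coefficient x influence) product
    within the range of that product over all categories of the variable.

    record:    {variable: category} for the new data item
    coeffs:    {variable: {category: relative frequency coefficient}}
    influence: {variable: degree of influence on the prediction model}
    """
    positions = {}
    for variable, table in coeffs.items():
        products = [influence[variable] * c for c in table.values()]
        value = influence[variable] * table[record[variable]]
        positions[variable] = position_in_range(value, min(products), max(products))
    return positions
```

A bar or gauge per variable could then be drawn from these fractions, which is one possible way to render the visual information.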

The method further includes: calculating the degree of influence of each variable on the prediction model; and providing, as a scoring value for new data, the product of each variable's relative frequency coefficient at the time the correlation coefficient converges to the minimum value and that variable's degree of influence.

In addition, the relative frequency coefficient at the time the correlation coefficient converges to the minimum value is used, in the model learning step, as the weight of the variable corresponding to that relative frequency coefficient.

According to an embodiment of the present invention, a method for interpreting a machine learning model includes: a model learning step of generating a prediction model that provides the probability that a result to be predicted occurs, by applying a predetermined machine learning algorithm to a training data set composed of a plurality of data items each representing a predetermined transaction; a probability prediction step of calculating the predicted probability that the result occurs by applying the prediction model to the training data set; a step of calculating the degree of influence of each variable constituting the data on the prediction model; a step of selecting, based on the predicted probability, an upper group made up of a top first ratio and a lower group made up of a bottom second ratio of the training data set, and calculating, for each category of each variable, a relative frequency coefficient, which is the ratio of the proportion of lower-group data belonging to that category to the proportion of upper-group data belonging to that category; and a step of calculating, for each of the plurality of data items in the training data set, the sum of the products of each variable's relative frequency coefficient and its degree of influence, and calculating the correlation coefficient between that sum and the predicted probability, wherein the relative frequency coefficient calculation step and the correlation coefficient calculation step are repeated until the correlation coefficient converges to a minimum value.

The method further includes providing, as visual information for new data, the product of each variable's relative frequency coefficient at the time the correlation coefficient converges to the minimum value and that variable's degree of influence.

The visual information indicates where the product of the relative frequency coefficient and the degree of influence for the new data falls within the range between the minimum and maximum values of the product of the relative frequency coefficient, at the time the correlation coefficient converges to the minimum value, and the degree of influence of each variable.

The method further includes providing, as a scoring value for new data, the product of each variable's relative frequency coefficient at the time the correlation coefficient converges to the minimum value and that variable's degree of influence.

According to an embodiment of the present invention, an apparatus for interpreting a machine learning model includes: a model learning unit that generates a prediction model providing the probability that a result to be predicted occurs, by applying a predetermined machine learning algorithm to a training data set composed of a plurality of data items each representing a predetermined transaction, and that calculates the degree of influence of each variable constituting the data on the prediction model; a probability prediction unit that calculates the predicted probability that the result occurs by applying the prediction model to the training data set; a relative frequency coefficient calculation unit that selects, based on the predicted probability, an upper group made up of a top first ratio and a lower group made up of a bottom second ratio of the training data set, and calculates, for each category of each variable, a relative frequency coefficient, which is the ratio of the proportion of lower-group data belonging to that category to the proportion of upper-group data belonging to that category; a correlation coefficient calculation unit that, for each of the plurality of data items in the training data set, either calculates the sum of the relative frequency coefficients of the variables and the correlation coefficient between that sum and the predicted probability, or calculates the sum of the products of each variable's relative frequency coefficient and its degree of influence and the correlation coefficient between that sum and the predicted probability; and a control unit that causes the relative frequency coefficient calculation unit to calculate the relative frequency coefficient and the correlation coefficient calculation unit to calculate the correlation coefficient until the correlation coefficient converges to a minimum value.

The apparatus further includes an output unit that provides, as visual information for new data, the product of each variable's relative frequency coefficient at the time the correlation coefficient converges to the minimum value and that variable's degree of influence.

The apparatus further includes a scoring unit that provides, as a scoring value for new data, the product of each variable's relative frequency coefficient at the time the correlation coefficient converges to the minimum value and that variable's degree of influence.

In addition, the relative frequency coefficient at the time the correlation coefficient converges to the minimum value is used, in the model learning step, as the weight of the variable corresponding to that relative frequency coefficient.

The method and system for interpreting a machine learning model according to an embodiment of the present invention produce an individual prediction result for the data representing each transaction, thereby making it possible to interpret individual machine learning predictions.

The method and apparatus for interpreting a machine learning model according to an embodiment of the present invention can support the decision-making of a person using the machine learning model by presenting a rational basis for the factors that led to a given prediction result.

The method and apparatus for interpreting a machine learning model according to an embodiment of the present invention can help the user make decisions easily by providing, as visual information, the effect of each variable of the data on the result of the prediction model.

The method and apparatus for interpreting a machine learning model according to an embodiment of the present invention are applicable regardless of the type of machine learning algorithm, and can therefore also be used for prediction models built with machine learning algorithms whose prediction results are known to be difficult to interpret.

The method and apparatus for interpreting a machine learning model according to an embodiment of the present invention can be highly useful as modules when the individual steps of interpreting the machine learning model are implemented separately. However, the effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present disclosure pertains.

FIG. 1 shows an example of a functional block diagram of a machine learning model interpretation apparatus according to an embodiment of the present invention.
FIG. 2 shows the flow of the steps of a machine learning model interpretation method according to an embodiment of the present invention.
FIG. 3 shows the flow of the steps of a machine learning model interpretation method according to another embodiment of the present invention.
FIG. 4 shows an example of information for calculating the correlation coefficient, provided by the machine learning model interpretation method and apparatus according to an embodiment of the present invention.
FIG. 5 shows an example of information visualizing the degree of influence of each variable of the data on the prediction model, provided by the machine learning model interpretation method and apparatus according to an embodiment of the present invention.
FIG. 6 shows another example of information visualizing the degree of influence of each variable of the data on the prediction model, provided by the machine learning model interpretation method and apparatus according to an embodiment of the present invention.
FIG. 7 shows yet another example of information visualizing the degree of influence of each variable of the data on the prediction model, provided by the machine learning model interpretation method and apparatus according to an embodiment of the present invention.
FIG. 8 shows yet another example of information visualizing the degree of influence of each variable of the data on the prediction model, provided by the machine learning model interpretation method and apparatus according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains, and the scope of the present invention is defined only by the claims.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted unless actually necessary. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

The present invention may be modified in various ways and may include various embodiments; specific embodiments are illustrated in the drawings and described in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between.

FIG. 1 shows an example of the functional configuration of a machine learning model interpretation apparatus according to an embodiment of the present invention. Terms such as "... unit" and "... er" used below refer to a unit that processes at least one function or operation, and such a unit may be implemented in hardware, software, or a combination of hardware and software.

Referring to FIG. 1, the machine learning model interpretation apparatus 100 includes a model learning unit 110, a probability prediction unit 120, a relative frequency coefficient calculation unit 130, a correlation coefficient calculation unit 140, a control unit 150, an output unit 160, and a scoring unit 170.

The model learning unit 110 may generate a prediction model that calculates the probability that a result to be predicted occurs, by applying a predetermined machine learning algorithm to a training data set. Each data item in the training data set may represent an individual transaction, such as an insurance claim or a bank transaction. For example, to generate a model that predicts insurance fraud or excessive claims by insurance claimants, previously collected data on insurance claim records, in the form shown in [Table 1] below, may be used as the training data set. The training data set of [Table 1] consists of data for variables such as occupation, age, region of residence, and the hospital where treatment was received, and may additionally include variables such as the claim submission channel and the claimed amount. The training data set of [Table 1] and the prediction of insurance fraud or excessive claims based on it are merely examples used for convenience of description, and the application of the present invention is not limited to them.

[Table 1]
No.  Occupation     Age  Region   Hospital
1    Civil servant  30s  Jeonbuk  Korean medicine hospital
2    Sales          40s  Gwangju  General hospital
3    Homemaker      60s  Busan    Clinic

The model learning unit 110 of the machine learning model interpretation apparatus 100 according to an embodiment of the present invention performs machine learning on a training data set of the form shown in [Table 1] to generate a prediction model that calculates the probability of insurance fraud (or an excessive insurance claim). The algorithm used may be whichever of a plurality of machine learning algorithms suits the result to be predicted, for example a deep neural network (DNN). The type of algorithm is not limited, and binary-class or multi-class machine learning algorithms may also be used.

The probability prediction unit 120 may apply the prediction model to the training data set to calculate the predicted probability that the result to be predicted, for example insurance fraud, occurs. For example, the probability prediction unit 120 may apply the prediction model generated by the model learning unit 110 to the training data set of [Table 1], which contains data on a plurality of insurance claim transactions, and calculate the predicted probability of insurance fraud for each claim transaction. [Table 2] below shows the predicted probability of insurance fraud obtained by applying the (insurance fraud) prediction model generated by the model learning unit 110 to the training data set of [Table 1].

[Table 2]
No.  Occupation     Age  Region   Hospital                  Predicted probability
1    Civil servant  30s  Jeonbuk  Korean medicine hospital  0.526828
2    Sales          40s  Gwangju  General hospital          0.294994
3    Homemaker      60s  Busan    Clinic                    0.231529

The relative frequency coefficient calculation unit 130 may sort the training data set by predicted probability, select an upper group made up of a top first ratio and a lower group made up of a bottom second ratio, and calculate, for each category of each variable constituting the data, a relative frequency coefficient, which is the ratio of the proportion of lower-group data belonging to that category to the proportion of upper-group data belonging to that category.

For example, the relative frequency coefficient calculation unit 130 may sort the training data set of [Table 1] in descending order of the predicted probability in [Table 2], and select the top 5% by predicted probability as the upper group and the bottom 3% as the lower group. Here, the first ratio and the second ratio may be the same, or may differ depending on the case.
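The group selection described above can be sketched in a few lines. This is a minimal illustration rather than the apparatus's actual implementation: the function name `select_groups`, the rounding of group sizes, and the rule that each group holds at least one record are assumptions, since the text specifies only the sorting and the two ratios.

```python
def select_groups(records, probs, first_ratio=0.05, second_ratio=0.03):
    """Split records into an upper and a lower group by predicted probability.

    records:      list of data items (any type)
    probs:        predicted probability for each record, same order as `records`
    first_ratio:  share of records taken from the top (upper group)
    second_ratio: share of records taken from the bottom (lower group)
    """
    if len(records) != len(probs):
        raise ValueError("records and probs must have the same length")
    # Indices sorted from highest to lowest predicted probability.
    order = sorted(range(len(records)), key=lambda i: probs[i], reverse=True)
    n = len(order)
    # Assumption: sizes are rounded and each group has at least one record.
    n_upper = max(1, round(n * first_ratio))
    n_lower = max(1, round(n * second_ratio))
    upper = [records[i] for i in order[:n_upper]]
    lower = [records[i] for i in order[n - n_lower:]]
    return upper, lower
```

The defaults mirror the 5% and 3% figures in the example; passing equal values covers the case where the first and second ratios are the same.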

The relative frequency coefficient calculation unit 130 may calculate, for each category of each variable, a relative frequency coefficient, which is the ratio of the proportion of lower-group data belonging to that category to the proportion of upper-group data belonging to that category. For example, the "age" variable of [Table 1] may be expressed in categories such as "10s", "20s", "30s", "40s", "50s", and "60s", and the "region" variable in categories such as "Jeonbuk", "Gwangju", and "Busan". To obtain the relative frequency coefficient of the "female" category of a "sex" variable, for instance, the relative frequency coefficient calculation unit 130 determines the proportion of upper-group data belonging to the "female" category (a%) and the proportion of lower-group data belonging to the "female" category (b%), expresses the ratio of b to a using one of various mathematical expressions, and takes the result as the relative frequency coefficient of the "female" category. This operation may be performed for every category of every categorized variable. The calculation of the relative frequency coefficient is described in detail later.

The relative frequency coefficient has a positive (+) or negative (-) direction. If the relative frequency coefficient of a given category is positive, the category can be interpreted as lowering the probability that the result to be predicted will occur; if it is negative, the category can be interpreted as raising that probability.

The correlation coefficient calculating unit 140 may verify the accuracy of the relative frequency coefficients by calculating, for each of the plurality of data in the training data set, the sum of the relative frequency coefficients of its variables, and then calculating the correlation coefficient between that sum and the predicted probability. In another embodiment, it may calculate the correlation coefficient between the predicted probability and the sum of the products of each variable's relative frequency coefficient and degree of influence. The correlation coefficient may include, for example, the Pearson correlation coefficient. The minimum value of the Pearson correlation coefficient between the sum of the per-variable relative frequency coefficients and the predicted probability may be -1, and the closer the value is to -1, the stronger the negative linear relationship. When this correlation coefficient (using either the plain sum or the sum of products with the degree of influence) is close to the minimum value, the relative frequency coefficients at that point explain well the effect of each variable on the predicted probability. This is described later with reference to FIG. 4.

Accordingly, the first ratio and the second ratio at which the Pearson correlation coefficient between the sum of the per-variable relative frequency coefficients and the predicted probability converges to the minimum value may be fixed as the ratios for selecting the upper group and the lower group, respectively. The first ratio and the second ratio may be the same value, or may be different values that are each 50% or less.

In some cases, the correlation coefficient calculating unit 140 may calculate, for each of the plurality of data, the sum of the products of each variable's relative frequency coefficient and that variable's degree of influence. Here, the degree of influence of a variable means the influence that the variable has on the prediction model, as calculated by the model learning unit 110. The degree of influence indicates the importance of a variable and may be calculated per variable; a high degree of influence means that the variable is an important factor in predicting the result.

Accordingly, by using the product of each variable's relative frequency coefficient and its degree of influence, the upper group and the lower group can be selected with both quantities taken into account. Because the relative frequency coefficient is an indicator of direction and the degree of influence is an indicator of magnitude, using their product considers both properties together and can improve accuracy with respect to the result to be predicted. The correlation coefficient calculating unit 140 may calculate the correlation coefficient between the predicted probability and the sum of the products of the calculated relative frequency coefficients and degrees of influence.

The control unit 150 controls the relative frequency coefficient calculating unit 130 (selection of the upper and lower groups and calculation of the relative frequency coefficients) and the correlation coefficient calculating unit 140 (calculation of the correlation coefficient) so that their operations are repeated until the correlation coefficient converges to the minimum value. This yields, for each category of each categorized variable, relative frequency coefficients that allow a more accurate interpretation of the prediction model.

Meanwhile, the relative frequency coefficients obtained when the correlation coefficient converges to the minimum value may be used, in the machine learning algorithm of the model learning unit 110, as weights of the variables to which they correspond. More specifically, a machine learning algorithm may be trained continually; when the prediction model is retrained, the calculated relative frequency coefficients can be used as initial weights of the machine learning algorithm, which can improve retraining performance and shorten training time. For example, when the machine learning algorithm is a multi-layer neural network, retraining may be performed with the relative frequency coefficients applied as weights for the first hidden layer.

For newly input data, the output unit 160 may provide, as visual information, the product of each variable's relative frequency coefficient (as obtained when the correlation coefficient converges to the minimum value) and its degree of influence. That is, the product is provided for data on a new transaction that is not part of the training data set. The output unit 160 may provide visual information in various forms; examples are shown in FIGS. 5 to 7, which are described in detail later.

For newly input data, the scoring unit 170 may provide, as a scoring value, the product of each variable's relative frequency coefficient (as obtained when the correlation coefficient converges to the minimum value) and its degree of influence. That is, the scoring value is provided for newly input data that is not part of the training data set. The scoring unit 170 may provide the scoring value in various forms; see FIG. 7 for an example.

The detailed operation of the method for interpreting a machine learning model according to embodiments of the present invention is described below with reference to FIGS. 2 and 3.

FIG. 2 shows the flow of the steps of a method for interpreting a machine learning model according to an embodiment of the present invention.

In the embodiment of FIG. 2, the model learning unit 110 performs a step (S110) of generating a prediction model by applying a predetermined machine learning algorithm to the training data set (see [Table 1]); the prediction model provides the probability that the result to be predicted (insurance fraud or an excessive insurance claim) will occur.

The probability prediction unit 120 applies the prediction model to the training data set to calculate the predicted probability that the result to be predicted will occur (S120). For example, as in [Table 2], it calculates, for each insurance payment claim transaction, the probability that the claim constitutes insurance fraud or an excessive insurance claim.

The relative frequency coefficient calculating unit 130 sorts the training data set in descending order of predicted probability and selects an upper group of a top first ratio and a lower group of a bottom second ratio (S130). Then, for each category of each variable constituting the training data set, it calculates the relative frequency coefficient, which is the ratio of the proportion of data belonging to that category among the data in the lower group to the proportion of data belonging to that category among the data in the upper group (S140).

The relative frequency coefficient may be calculated by substituting the ratio between the proportion (a%) of data in the upper group that belongs to the category and the proportion (b%) of data in the lower group that belongs to the category into one of various formulas. The formula may be one that returns 0 when a = b, returns a positive number when a < b, and returns values of equal magnitude and opposite sign when the relationship between a and b is reciprocal (that is, when their ratio is inverted). Under a formula that satisfies these conditions, the higher the relative frequency coefficient of a specific category ('public official') of a variable ('occupation'), the lower the probability that a transaction belonging to that category corresponds to the result to be predicted (insurance fraud); and the lower the relative frequency coefficient of a specific category ('housewife') of a variable ('occupation'), the higher the probability that a transaction belonging to that category corresponds to the result to be predicted (insurance fraud).

The formula f(x) for calculating the relative frequency coefficient may be one of Equations 1 to 4 below, but is not limited thereto; any formula that satisfies the conditions described above may be used. Here, f(x) denotes the relative frequency coefficient for a given value of x.

[Equations 1 to 4 and the expression defining x are not reproduced in this text.]

In Equation 1, k may be any integer.
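One function that meets the three stated conditions (zero when a = b, positive when a < b, and sign-flipped when the ratio is inverted) is the natural logarithm of b/a. The sketch below uses it purely as an assumed example; it is not claimed to be any of Equations 1 to 4, and the function name `relative_frequency_coefficient` is hypothetical.

```python
import math


def relative_frequency_coefficient(a, b):
    """Return f(b / a) with f = natural log (an assumed choice of f, not the patent's)."""
    if a <= 0 or b <= 0:
        raise ValueError("both proportions must be positive for the log form")
    return math.log(b / a)


same = relative_frequency_coefficient(0.2, 0.2)      # a == b  -> zero
positive = relative_frequency_coefficient(0.1, 0.3)  # a < b   -> positive
negative = relative_frequency_coefficient(0.3, 0.1)  # swapped -> same size, opposite sign
```

A logarithm is undefined when a category is absent from one group, so a practical variant would need some smoothing of a and b; that choice is left open here.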

Next, the correlation coefficient calculating unit 140 calculates, for each data item of the training data set, the sum of the relative frequency coefficients of its variables (S150), and calculates the correlation coefficient between these sums and the predicted probabilities (S160). Steps S130 to S160 are repeated ('No' in S170) until the correlation coefficient converges to the minimum value ('Yes' in S170). The correlation coefficient calculated by the correlation coefficient calculating unit 140 may be the Pearson correlation coefficient; when the Pearson correlation coefficient converges to the minimum value, a larger sum of relative frequency coefficients means a smaller probability that the result to be predicted will occur. In other words, the first ratio and the second ratio at which the correlation coefficient converges to the minimum value are the ratios that give the optimal sizes of the upper and lower groups for calculating the relative frequency coefficient of each category of each variable.
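The loop over steps S130 to S160 can be illustrated with a small search over candidate ratios. The sketch below makes several assumptions that are not in the specification: it replaces "repeat until convergence" with an exhaustive grid search, uses log(b/a) as the coefficient formula, and adds a smoothing constant `eps` so that empty categories do not produce log(0). All names and data are hypothetical.

```python
import math
from itertools import product


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def coefficients(rows, probs, first_ratio, second_ratio, eps=0.5):
    """S130-S140: pick the groups, then compute log(b / a) for every category."""
    order = sorted(range(len(rows)), key=lambda i: probs[i], reverse=True)
    n_up = max(1, round(len(rows) * first_ratio))
    n_lo = max(1, round(len(rows) * second_ratio))
    upper = [rows[i] for i in order[:n_up]]
    lower = [rows[i] for i in order[-n_lo:]]
    coef = {}
    for var in rows[0]:
        for cat in {r[var] for r in rows}:
            a = (sum(r[var] == cat for r in upper) + eps) / (n_up + eps)  # smoothed share
            b = (sum(r[var] == cat for r in lower) + eps) / (n_lo + eps)  # smoothed share
            coef[(var, cat)] = math.log(b / a)
    return coef


def search(rows, probs, candidates):
    """Try every (first_ratio, second_ratio) pair and keep the lowest correlation."""
    best = None
    for r1, r2 in product(candidates, repeat=2):
        coef = coefficients(rows, probs, r1, r2)
        sums = [sum(coef[(v, r[v])] for v in r) for r in rows]  # S150
        rho = pearson(sums, probs)                              # S160
        if best is None or rho < best[0]:
            best = (rho, r1, r2, coef)
    return best


# Hypothetical toy data: eight claims with one variable.
jobs = ["housewife"] * 3 + ["sales"] * 2 + ["official"] * 3
rows = [{"job": j} for j in jobs]
probs = [0.90, 0.80, 0.70, 0.50, 0.45, 0.20, 0.10, 0.05]

best_rho, best_r1, best_r2, best_coef = search(rows, probs, (0.25, 0.5))
```

On this toy data the selected coefficients are negative for the category concentrated in the upper group and positive for the one concentrated in the lower group, which matches the sign convention described in the text.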

As an example, consider a training data set such as [Table 1] used to generate a prediction model for insurance fraud. Taking the per-category relative frequency coefficients obtained when the correlation coefficient converges to the minimum value, and looking up the coefficient corresponding to each variable value of each data item in the training data set, gives [Table 3] below. As [Table 3] shows, for the 'occupation' variable of the person who filed the claim, the relative frequency coefficient is 1.923 for the 'public official' category, 0.061 for the 'sales' category, and -0.5865 for the 'housewife' category. Among these three occupations, this means that a claim is most likely to be insurance fraud when the claimant is a 'housewife' and least likely when the claimant is a 'public official'.

[Table 3]
No.  Occupation  Age      Region   Hospital
1    1.923       4.121    0.303    1.998
2    0.061       4.062    1.858    -0.703
3    -0.5865     -0.919   -1.013   -0.001

For each data item of the training data set (each insurance payment claim transaction), the correlation coefficient calculating unit 140 may calculate the sum of the relative frequency coefficients corresponding to its variables. For example, for data item 1 in [Table 3], the sum is obtained by adding 1.923 (the 'public official' category of the 'occupation' variable), 4.121 (the '30s' category of the 'age' variable), 0.303 (the 'Jeonbuk' category of the 'region' variable), and 1.998 (the 'Oriental medicine hospital' category of the 'hospital' variable). The sums calculated for the claim transactions are shown in [Table 4].

[Table 4]
No.  Occupation  Age      Region   Hospital  Sum of relative frequency coefficients
1    1.923       4.121    0.303    1.998     8.345
2    0.061       4.062    1.858    -0.703    5.278
3    -0.5865     -0.919   -1.013   -0.001    -2.5195
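The sums in [Table 4] follow directly from the per-variable values in [Table 3]. The short check below reproduces them; the dictionary layout and key names are assumptions chosen for the example, while the numbers are those given in the tables.

```python
# Per-variable relative frequency coefficients for data items 1 to 3 ([Table 3]).
table3 = {
    1: {"occupation": 1.923, "age": 4.121, "region": 0.303, "hospital": 1.998},
    2: {"occupation": 0.061, "age": 4.062, "region": 1.858, "hospital": -0.703},
    3: {"occupation": -0.5865, "age": -0.919, "region": -1.013, "hospital": -0.001},
}

# Sum across variables for each data item, rounded to the precision of the table.
sums = {item: round(sum(values.values()), 4) for item, values in table3.items()}
```

Item 3 has the smallest sum, which under the sign convention above corresponds to the highest likelihood of the predicted result among the three.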

The correlation coefficient calculating unit 140 may calculate the correlation coefficient between the sum of the relative frequency coefficients and the predicted probability, and the first ratio for the upper group and the second ratio for the lower group may be chosen so that this correlation coefficient takes its minimum value.

FIG. 4 is a graph of the relationship between the sum of the relative frequency coefficients and the predicted probability for each data item, obtained by performing the method of FIG. 2 on a training data set of insurance payment claims such as [Table 1]. A strong negative linear relationship is visible, which means that the smaller the sum of the relative frequency coefficients, the higher the probability of insurance fraud.

FIG. 3 shows the flow of the steps of a method for interpreting a machine learning model according to another embodiment of the present invention.

In the embodiment of FIG. 3, the steps of generating the prediction model (S210) and calculating the predicted probability (S220) are the same as in the embodiment of FIG. 2. In the embodiment of FIG. 3, however, the model learning unit 110 additionally performs a step (S230) of calculating the degree of influence of each variable on the prediction model. The degree of influence is a value representing the effect of each variable in the prediction model and may be calculated for each variable. For example, the degrees of influence of the 'occupation', 'age', 'region', and 'hospital' variables on the prediction model may be calculated.

The degree of influence of a variable may be calculated by a method that differs from one machine learning algorithm to another. However, this is not a limitation, and the calculation method may be the same for some machine learning algorithms. One example of a method for calculating the degree of influence is the Gini impurity of a decision tree; such methods are readily available to a person skilled in the art, so a detailed description is omitted.
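For readers unfamiliar with the Gini impurity mentioned above, the following sketch computes it for a set of class labels and measures how much a single split reduces it. Tree-based importance measures commonly accumulate such reductions over the splits that use a given variable; how the specification's model learning unit does this is not stated, so the function names and the toy labels here are assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k ** 2) of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Hypothetical node with two fraud and two non-fraud claims, split perfectly.
parent = ["fraud", "fraud", "ok", "ok"]
decrease = impurity_decrease(parent, ["fraud", "fraud"], ["ok", "ok"])
```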

Based on the predicted probability, the relative frequency coefficient calculating unit 130 selects from the training data set an upper group of a top first ratio and a lower group of a bottom second ratio (S240). Then, for each category of each variable constituting the training data set, it calculates the relative frequency coefficient, which is the ratio of the proportion of data belonging to that category among the data in the lower group to the proportion of data belonging to that category among the data in the upper group (S250).

The correlation coefficient calculating unit 140 calculates, for each of the plurality of data in the training data set, the sum of the products of each variable's relative frequency coefficient and degree of influence (S260), and calculates the correlation coefficient between these sums and the predicted probabilities (S270). Steps S240 to S270 may be repeated ('No' in S280) until the correlation coefficient converges to the minimum value ('Yes' in S280).
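Steps S260 and S270 amount to a weighted sum per data item followed by a Pearson correlation. The sketch below shows this with hand-picked numbers; the coefficient values, degrees of influence, records, and probabilities are all hypothetical and serve only to make the arithmetic concrete.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-category coefficients and per-variable degrees of influence.
coef = {
    ("job", "official"): 1.9, ("job", "housewife"): -0.6,
    ("region", "Jeonbuk"): 0.3, ("region", "Busan"): -1.0,
}
influence = {"job": 0.7, "region": 0.3}

rows = [
    {"job": "official", "region": "Jeonbuk"},
    {"job": "official", "region": "Busan"},
    {"job": "housewife", "region": "Jeonbuk"},
    {"job": "housewife", "region": "Busan"},
]
probs = [0.05, 0.20, 0.60, 0.90]

# S260: sum over variables of (relative frequency coefficient x degree of influence).
weighted_sums = [sum(coef[(v, r[v])] * influence[v] for v in r) for r in rows]
# S270: correlation between the weighted sums and the predicted probabilities.
rho = pearson(weighted_sums, probs)
```

Because the weighted sums fall as the probabilities rise, the correlation here is strongly negative, which is the direction the method drives toward.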

FIGS. 5 to 8 show examples of visually presenting information on the product of the relative frequency coefficient and the degree of influence of each variable of the data, as provided by the method and apparatus for interpreting a machine learning model according to embodiments of the present invention.

In FIG. 5, for each variable, the range from the minimum to the maximum of the product of the finally determined relative frequency coefficient (obtained when the correlation coefficient converges to the minimum value) and the degree of influence is shown as an empty bar, and the product for newly acquired data is shown as a hatched bar. Each graph in FIG. 5 corresponds to a different new transaction, and the scoring value at the top of each graph is the total of the products of relative frequency coefficient and degree of influence for that new data item. The scoring value may be provided as a single quantity representing this total, or as a vector whose components are the per-variable products of relative frequency coefficient and degree of influence.
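The quantities plotted in FIG. 5 can be computed as in the sketch below: a per-variable range for the empty bar, a per-variable product for the new record, and the total used as the scoring value. The coefficient and influence values, the record, and the function names `product_range` and `score` are hypothetical.

```python
# Hypothetical final coefficients (per category) and degrees of influence (per variable).
coef = {
    ("job", "official"): 1.9, ("job", "housewife"): -0.6,
    ("region", "Jeonbuk"): 0.3, ("region", "Busan"): -1.0,
}
influence = {"job": 0.7, "region": 0.3}


def product_range(variable):
    """Min and max of coefficient x influence over the categories of `variable`."""
    products = [c * influence[variable] for (v, _), c in coef.items() if v == variable]
    return min(products), max(products)


def score(record):
    """Return (per-variable products, their total) for one new record."""
    vector = {v: coef[(v, cat)] * influence[v] for v, cat in record.items()}
    return vector, sum(vector.values())


new_record = {"job": "housewife", "region": "Jeonbuk"}
vector, total = score(new_record)   # vector form and single-quantity form of the score
job_range = product_range("job")    # extent of the empty bar for the 'job' variable
```

Returning both the vector and the total mirrors the two presentation options described in the text; a caller can plot the vector and print the total.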

As in FIG. 5, products of relative frequency coefficient and degree of influence that are negative and those that are positive may be displayed with different colors, patterns, and so on relative to 0, which shows more clearly how much each variable of the new data affects the prediction result. A negative product means that the value of the variable may raise the probability of insurance fraud, and a positive product means that it may lower that probability. Clearly separating the values around 0 in this way lets the decision-making user grasp more easily the per-variable relative frequency coefficients of the customer under consideration.

FIG. 6 shows the graph of FIG. 5 adjusted so that the minimum value of the relative frequency coefficient and degree of influence for each variable is 0.

FIG. 7 shows visual information in which an arithmetic sequence is added to the variables, in descending order of the difference between the minimum and maximum of the product of relative frequency coefficient and degree of influence, so that the reference values are displayed linearly.

FIG. 8 shows the bar graph of FIG. 5 in the form of a radar chart. In this form, the larger the product of relative frequency coefficient and degree of influence, the more the graph is skewed toward the corresponding category, so variables with a large product can be identified at a glance.

The visual information in FIGS. 5 to 8 all includes, as a scoring value, the total of the products of relative frequency coefficient and degree of influence across the variables.

The method and apparatus for interpreting a machine learning model according to an embodiment of the present invention can be applied in various fields in which a result to be predicted for a customer is derived from the customer's information, giving the user a reasonable basis for the result and supporting the decisions of the user of the machine learning model. For example, the method and apparatus may be used to derive a result to be predicted in the financial field. More specifically, by automatically calculating whether a claim is insurance fraud, they can effectively and easily provide a reasonable basis for the fraud determination and, on that basis, assist the user's decision-making.

The method and apparatus for interpreting a machine learning model according to an embodiment of the present invention provide, as visual information, the effect that each variable of the data has on the result of the machine-learned prediction model, so that the user can make decisions easily.

Combinations of the blocks in the block diagrams and the steps in the flowcharts attached to this specification may be carried out by computer program instructions. Because these computer program instructions can be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by that processor create means for performing the functions described in each block of the block diagrams or each step of the flowcharts. Because these computer program instructions can also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, the instructions stored in that memory can also produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagrams or each step of the flowcharts. Because the computer program instructions can also be loaded onto a computer or other programmable data processing equipment, a series of operational steps can be performed on that equipment to create a computer-executed process, so that the instructions executed on the equipment provide steps for carrying out the functions described in each block of the block diagrams and each step of the flowcharts.

In addition, each block or step may represent a module, a segment, or a portion of code that includes one or more executable instructions for carrying out the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or may sometimes be performed in reverse order, depending on the functions involved.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed in this specification are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the rights of the present invention.

100: machine learning model interpretation apparatus
110: model learning unit
120: probability prediction unit
130: relative frequency coefficient calculation unit
140: correlation coefficient calculation unit
150: control unit
160: output unit
170: scoring unit

Claims

A method of interpreting a machine learning model, the method comprising:
a model learning step of generating a prediction model that provides a probability that a result to be predicted will occur, by applying a predetermined machine learning algorithm to a training data set consisting of a plurality of data, each representing a predetermined transaction;
a probability prediction step of calculating a prediction probability that the result will occur by applying the prediction model to the training data set;
a step of selecting from the training data set, based on the prediction probability, an upper group comprising a top first ratio and a lower group comprising a bottom second ratio, and calculating, for each category of each variable constituting the data, a relative frequency coefficient that is the ratio of the proportion of data in the lower group belonging to the category to the proportion of data in the upper group belonging to the category; and
a step of calculating, for each of the plurality of data of the training data set, a sum of the relative frequency coefficients of the variables, and calculating a correlation coefficient between the sum of the relative frequency coefficients and the prediction probability,
wherein the step of calculating the relative frequency coefficient and the step of calculating the correlation coefficient are performed repeatedly until the correlation coefficient converges to a minimum value.
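
As an illustration of the group selection and relative frequency coefficient steps recited in this claim, the following Python sketch computes one coefficient per category of each variable. It is a minimal sketch under stated assumptions, not the patented implementation: the function names, the dictionary-based record layout, the default ratios of 0.2, and the handling of categories that never appear in the upper group are all hypothetical. The ratio direction follows the claim wording (lower-group proportion over upper-group proportion); the abstract words the ratio the other way round, in which case numerator and denominator would be swapped. The repetition until the correlation coefficient converges is not shown, because this section does not state what is varied between iterations.

```python
from collections import Counter


def select_groups(probs, first_ratio, second_ratio):
    """Return (upper, lower) row indices: top first_ratio and bottom second_ratio."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    n_upper = max(1, int(len(probs) * first_ratio))
    n_lower = max(1, int(len(probs) * second_ratio))
    return order[:n_upper], order[-n_lower:]


def relative_frequency_coefficients(rows, probs, first_ratio=0.2, second_ratio=0.2):
    """Map (variable, category) to lower-group proportion / upper-group proportion."""
    upper, lower = select_groups(probs, first_ratio, second_ratio)
    coeffs = {}
    for var in rows[0]:
        up = Counter(rows[i][var] for i in upper)
        lo = Counter(rows[i][var] for i in lower)
        for cat in set(up) | set(lo):
            p_up = up[cat] / len(upper)
            p_lo = lo[cat] / len(lower)
            # Assumption: a category absent from the upper group gets no coefficient.
            coeffs[(var, cat)] = p_lo / p_up if p_up > 0 else None
    return coeffs


# Hypothetical data: four transactions with one categorical variable.
rows = [{"job": "a"}, {"job": "b"}, {"job": "b"}, {"job": "b"}]
probs = [0.9, 0.8, 0.2, 0.1]
coeffs = relative_frequency_coefficients(rows, probs, 0.5, 0.5)
print(coeffs[("job", "a")])  # 0.0
print(coeffs[("job", "b")])  # 2.0
```

Passing the same value for `first_ratio` and `second_ratio` corresponds to equal ratios, and passing different values corresponds to unequal ratios.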

The method of claim 1,
wherein the first ratio and the second ratio are the same.

The method of claim 1,
wherein the first ratio and the second ratio are different from each other.

The method of claim 1, further comprising:
calculating an influence degree of each variable on the prediction model; and
providing, as visual information, for new data, a product of the relative frequency coefficient of each variable obtained when the correlation coefficient converges to the minimum value and the influence degree of that variable.
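
The per-variable product recited in this claim can be sketched as a simple lookup and multiplication. This is only an illustrative sketch: the function name `variable_products`, the `(variable, category)` key layout, and the choice to skip categories without a coefficient are assumptions that do not come from the specification.

```python
def variable_products(record, coeffs, influence):
    """Return {variable: relative frequency coefficient * influence degree}."""
    products = {}
    for var, cat in record.items():
        coeff = coeffs.get((var, cat))
        if coeff is None:
            # Assumption: categories with no known coefficient are skipped.
            continue
        products[var] = coeff * influence[var]
    return products


# Hypothetical coefficients and influence degrees for two variables.
coeffs = {("job", "b"): 2.0, ("home", "rent"): 1.5}
influence = {"job": 0.25, "home": 0.5}
print(variable_products({"job": "b", "home": "rent"}, coeffs, influence))
# {'job': 0.5, 'home': 0.75}
```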

The method of claim 4,
wherein the visual information indicates the position of the product of the relative frequency coefficient and the influence degree for each variable of the new data within the range between the minimum value and the maximum value of the product of the relative frequency coefficient and the influence degree.
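
One way to picture the position described in this claim is as a min-max scaling of the product, drawn as a marker on a bar. The sketch below is hypothetical: the function names, the text rendering, the clamping of out-of-range values, and the midpoint fallback for a zero-width range are assumptions, since the claim only requires that the position within the range be indicated.

```python
def position_in_range(value, minimum, maximum):
    """Return where value sits between minimum and maximum, scaled to 0..1."""
    if maximum == minimum:
        return 0.5  # Assumption: a zero-width range is drawn at the midpoint.
    pos = (value - minimum) / (maximum - minimum)
    return min(1.0, max(0.0, pos))  # Assumption: clamp values outside the range.


def render_bar(pos, width=10):
    """Draw the position as a '|' marker on a bar of `width` dashes."""
    k = round(pos * width)
    return "[" + "-" * k + "|" + "-" * (width - k) + "]"


pos = position_in_range(3.0, 1.0, 5.0)
print(pos)              # 0.5
print(render_bar(pos))  # [-----|-----]
```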

The method of claim 1, further comprising:
calculating an influence degree of each variable on the prediction model; and
providing, as a scoring value, for new data, a product of the relative frequency coefficient of each variable obtained when the correlation coefficient converges to the minimum value and the influence degree of that variable.

The method of claim 1,
wherein the relative frequency coefficient obtained when the correlation coefficient converges to the minimum value is used, in the model learning step, as a weight of the variable corresponding to that relative frequency coefficient.

A method of interpreting a machine learning model, the method comprising:
a model learning step of generating a prediction model that provides a probability that a result to be predicted will occur, by applying a predetermined machine learning algorithm to a training data set composed of a plurality of data, each representing a predetermined transaction;
a probability prediction step of calculating a prediction probability that the result will occur by applying the prediction model to the training data set;
a step of calculating an influence degree of each variable constituting the data on the prediction model;
a step of selecting from the training data set, based on the prediction probability, an upper group comprising a top first ratio and a lower group comprising a bottom second ratio, and calculating, for each category of each variable, a relative frequency coefficient that is the ratio of the proportion of data in the lower group belonging to the category to the proportion of data in the upper group belonging to the category; and
a step of calculating, for each of the plurality of data in the training data set, a sum of the products of the relative frequency coefficient of each variable and the influence degree of that variable, and calculating a correlation coefficient between the sum and the prediction probability,
wherein the step of calculating the relative frequency coefficient and the step of calculating the correlation coefficient are performed repeatedly until the correlation coefficient converges to a minimum value.
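
The influence-weighted sum and the correlation step recited in this claim can be sketched as follows. The claim does not name a specific correlation coefficient, so the use of the Pearson coefficient here is an assumption, as are the function names and the dictionary-based data layout. The sketch covers a single pass only and does not implement the repetition until convergence.

```python
import math


def weighted_sums(rows, coeffs, influence):
    """For each record, sum coefficient * influence degree over its variables."""
    return [
        sum(coeffs[(var, row[var])] * influence[var] for var in row)
        for row in rows
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient (assumed; the claim names no specific one)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical coefficients, influence degrees, and prediction probabilities.
rows = [{"job": "a"}, {"job": "b"}, {"job": "c"}]
coeffs = {("job", "a"): 0.5, ("job", "b"): 1.0, ("job", "c"): 2.0}
influence = {"job": 2.0}
probs = [0.9, 0.5, 0.1]

sums = weighted_sums(rows, coeffs, influence)
print(sums)  # [1.0, 2.0, 4.0]
print(pearson(sums, probs))  # negative: larger sums pair with lower probabilities
```

Setting every influence degree to 1 reduces `weighted_sums` to a plain sum of relative frequency coefficients.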

The method of claim 8,
wherein the first ratio and the second ratio are the same.

The method of claim 8,
wherein the first ratio and the second ratio are different from each other.

The method of claim 8,
further comprising providing, as visual information, for new data, a product of the relative frequency coefficient of each variable obtained when the correlation coefficient converges to the minimum value and the influence degree of that variable.

The method of claim 11,
wherein the visual information indicates the position of the product of the relative frequency coefficient and the influence degree for the new data within the range between the minimum value and the maximum value of the product of the relative frequency coefficient and the influence degree of each variable obtained when the correlation coefficient converges to the minimum value.

The method of claim 11,
further comprising providing, as a scoring value, for new data, a product of the relative frequency coefficient of each variable obtained when the correlation coefficient converges to the minimum value and the influence degree of that variable.

The method of claim 11,
wherein the relative frequency coefficient obtained when the correlation coefficient converges to the minimum value is used, in the model learning step, as a weight of the variable corresponding to that relative frequency coefficient.

A machine learning model interpretation apparatus comprising:
a model learning unit configured to generate a prediction model that provides a probability that a result to be predicted will occur, by applying a predetermined machine learning algorithm to a training data set composed of a plurality of data, each representing a predetermined transaction, and to calculate an influence degree of each variable constituting the data on the prediction model;
a probability prediction unit configured to calculate a prediction probability that the result will occur by applying the prediction model to the training data set;
a relative frequency coefficient calculation unit configured to select from the training data set, based on the prediction probability, an upper group comprising a top first ratio and a lower group comprising a bottom second ratio, and to calculate, for each category of each variable, a relative frequency coefficient that is the ratio of the proportion of data in the lower group belonging to the category to the proportion of data in the upper group belonging to the category;
a correlation coefficient calculation unit configured to calculate, for each of the plurality of data in the training data set, either a sum of the relative frequency coefficients of the variables and a correlation coefficient between that sum and the prediction probability, or a sum of the products of the relative frequency coefficient of each variable and the influence degree of that variable and a correlation coefficient between that sum and the prediction probability; and
a control unit configured to cause the relative frequency coefficient calculation unit to calculate the relative frequency coefficient and the correlation coefficient calculation unit to calculate the correlation coefficient until the correlation coefficient converges to a minimum value.

The apparatus of claim 15,
further comprising an output unit configured to provide, as visual information, for new data, a product of the relative frequency coefficient of each variable obtained when the correlation coefficient converges to the minimum value and the influence degree of that variable.

The apparatus of claim 15,
further comprising a scoring unit configured to provide, as a scoring value, for new data, a product of the relative frequency coefficient of each variable obtained when the correlation coefficient converges to the minimum value and the influence degree of that variable.

The apparatus of claim 15,
wherein the relative frequency coefficient obtained when the correlation coefficient converges to the minimum value is used as a weight of the variable corresponding to that relative frequency coefficient in model training.