KR20240055469A

KR20240055469A - Apparatus and Method for Fairness Visualization of AI Learning Dataset

Info

Publication number: KR20240055469A
Application number: KR1020220135781A
Authority: KR
Inventors: 김구; 최동규; 서형윤; 변경수
Original assignee: (주)아이소프트
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2024-04-29

Abstract

The present invention relates to an apparatus and method for visualizing the fairness of an artificial intelligence (AI) learning dataset. More specifically, the apparatus includes: an uploader that uploads an original dataset in batch form; a structural analysis module that receives a plurality of sensitive attributes (Sensitive Columns) as input and derives, from the original dataset, a subset corresponding to each sensitive attribute (Sensitive Column); a sampling module that samples according to the sizes of the plurality of subsets and then outputs a corrected dataset whose fairness has been corrected; and a visualization module that visualizes at least one of the original dataset and the corrected dataset in a preset format so that data fairness can be checked easily.

Description

Apparatus and Method for Fairness Visualization of AI Learning Dataset

The present invention relates to an apparatus and method for visualizing the fairness of an artificial intelligence learning dataset. More specifically, it relates to an apparatus and method that output a corrected dataset, whose fairness has been corrected, from an original dataset and visually present the degree of bias and fairness through fairness metrics.

Artificial intelligence algorithms have been advancing rapidly, and they can be used to assist or automate human judgment by predicting risks and impacts before a decision is made. AI algorithms can learn decision-making models in fields such as public safety, policy, finance, medicine, and hiring, and these models often learn the biases that exist in society as they are. Decisions made by algorithms that can significantly affect a person's life may be biased by characteristics that should be irrelevant to the decision, and they work against individuals who belong to particular groups defined by gender, race, religion, and the like.

As artificial intelligence (AI) technology develops rapidly and is applied across industries, its adverse effects and its impact on society as a whole need to be discussed. In particular, fairness problems such as bias with respect to gender, race, or social group, and a lack of transparency, have become issues.

For an AI system to operate properly and produce proper results, it is very important to ensure the fairness of the dataset used for training. For example, for an AI model that determines whether a person has had a stroke, using a training dataset in which the ratio of stroke patients to healthy people is unfair (e.g., patients : general population = 80,000 : 20,000 records) can lead to incorrect results because of the unfairness of the training data. Furthermore, correcting only the ratio of a single sensitive attribute in that data (e.g., stroke status, corrected to patients : general population = 20,000 : 20,000 records) can still leave bias in other, dependent attributes (e.g., patient-male : patient-female = 15,000 : 5,000 records), so proper AI training results cannot be expected.

In addition, existing approaches remove bias only from static datasets, and techniques for additionally correcting newly collected data are lacking. Even when a correction technique is applied, there is no technology for visually checking how much the dataset has been corrected before and after the correction, so such technology is urgently needed in this field.

Republic of Korea Patent Publication No. 10-2022-0090359
Republic of Korea Patent Publication No. 10-2021-0028724

The present invention is intended to solve the problems described above. Its object is to obtain an apparatus and method for visualizing the fairness of an artificial intelligence learning dataset that sample according to the sizes of a plurality of subsets and then output a corrected dataset whose fairness has been corrected, so that the unfairness of vast amounts of indiscriminately collected data can be resolved.

Another object of the present invention is to provide an apparatus and method for visualizing the fairness of an artificial intelligence learning dataset that visualize at least one of the original dataset and the corrected dataset in a preset format so that data fairness can be checked easily.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description of the present invention.

To achieve the above objects, the fairness visualization device for an artificial intelligence learning dataset of the present invention provides: an uploader that uploads an original dataset in batch form; a structural analysis module that receives a plurality of sensitive attributes (Sensitive Columns) as input and derives, from the original dataset, a subset corresponding to each sensitive attribute (Sensitive Column); a sampling module that samples according to the sizes of the plurality of subsets and then outputs a corrected dataset whose fairness has been corrected; and a visualization module that visualizes at least one of the original dataset and the corrected dataset in a preset format so that data fairness can be checked easily.

To achieve the above objects, the fairness visualization method for an artificial intelligence learning dataset of the present invention provides: an original dataset upload step in which an original dataset in batch form is uploaded by an uploader; a subset derivation step in which a plurality of sensitive attributes (Sensitive Columns) are input and a subset corresponding to each sensitive attribute (Sensitive Column) is derived from the original dataset by a structural analysis module; a corrected dataset output step in which a sampling module samples according to the sizes of the plurality of subsets and then outputs a corrected dataset whose fairness has been corrected; and a visualization step in which at least one of the original dataset and the corrected dataset is visualized in a preset format by a visualization module so that data fairness can be checked easily.

As described above, according to the present invention, the unfairness of an acquired dataset can be resolved incrementally, and the conventional problem that an AI model's performance may degrade when it is trained on vast amounts of indiscriminately collected data can be solved.

In addition, because the present invention visualizes the degree of bias and fairness of the uploaded original dataset and of the fairness-corrected dataset, the user can visually compare and check the bias and fairness of the datasets, and unfair items within a dataset can be identified specifically so that the fairness of those items can be improved.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the detailed description and the claims.

Figure 1 is a configuration diagram of a fairness visualization device for an artificial intelligence learning dataset of the present invention.
Figure 2 is a diagram showing a plurality of sensitive attributes (Sensitive Columns) input according to an embodiment of the present invention.
Figure 3 is a diagram visualizing data in matrix form according to sensitive attributes (Sensitive Columns) input according to an embodiment of the present invention.
Figure 4 is a diagram visualizing fairness and bias in graph form according to an embodiment of the present invention.
Figure 5 is a diagram (a) visualizing the numerical characteristics of a numeric attribute (Numeric Column) and a diagram (b) visualizing the categorical characteristics of a categorical attribute (Categorical Column) according to an embodiment of the present invention.
Figure 6 is a diagram visualizing all data in the original dataset (Dataset₀) or the corrected dataset (Dataset₂) according to an embodiment of the present invention.
Figure 7 is a flowchart of the fairness visualization method of the artificial intelligence learning dataset of the present invention.
Figure 8 is a diagram showing a first sampling step according to an embodiment of the present invention.
Figure 9 is a diagram showing a second sampling step and a third sampling step according to an embodiment of the present invention.

The terms used in this specification were chosen, as far as possible, from general terms that are currently in wide use, taking into account their function in the present invention; however, they may vary depending on the intention of those working in the field, precedent, the emergence of new technology, and so on. In certain cases a term has been selected arbitrarily by the applicant, and in such cases its meaning is described in detail in the relevant part of the description. Therefore, a term used in the present invention should be defined on the basis of the meaning it carries and the overall content of the present invention, not simply by its name.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless clearly defined in the present application, are not to be interpreted in an idealized or excessively formal sense.

Hereinafter, embodiments according to the present invention will be described in detail with reference to the accompanying drawings. Figure 1 is a configuration diagram of a fairness visualization device for an artificial intelligence learning dataset of the present invention. Figure 2 is a diagram showing a plurality of sensitive attributes (Sensitive Columns) input according to an embodiment of the present invention. Figure 3 is a diagram visualizing data in matrix form according to the sensitive attributes (Sensitive Columns) input according to an embodiment of the present invention. Figure 4 is a diagram visualizing fairness and bias in graph form according to an embodiment of the present invention. Figure 5 is a diagram (a) visualizing the numerical characteristics of a numeric attribute (Numeric Column) and a diagram (b) visualizing the categorical characteristics of a categorical attribute (Categorical Column) according to an embodiment of the present invention. Figure 6 is a diagram visualizing all data in the original dataset (Dataset₀) or the corrected dataset (Dataset₂) according to an embodiment of the present invention.

Figure 7 is a flowchart of the fairness visualization method for an artificial intelligence learning dataset of the present invention. Figure 8 is a diagram showing the first sampling step (S320) according to an embodiment of the present invention. Figure 9 is a diagram showing the second sampling step (S340) and the third sampling step (S360) according to an embodiment of the present invention.

Fairness visualization device for an artificial intelligence learning dataset

The fairness visualization device for an artificial intelligence learning dataset referred to in the present invention may be a system in which devices and modules are provided in combination. Alternatively, it may be a platform server that implements a platform and provides platform services so that a plurality of user terminals can access it over the Internet, correct the fairness of a learning dataset, and visually check the fairness of the corrected learning dataset. Alternatively, it may be a computer-readable recording medium storing a program that performs a method of correcting the fairness of an artificial intelligence learning dataset and a method of visualizing the fairness of the corrected learning dataset. In other words, the fairness visualization device can be implemented in various ways.

Referring to Figure 1, the fairness visualization device for an artificial intelligence learning dataset of the present invention includes an uploader (100), a structural analysis module (200), a sampling module (300), and a visualization module (400).

The uploader (100) uploads an original dataset in batch form.

A batch, as referred to in the present invention, is one form of computer data processing in which the data to be processed is handled in units of a fixed period or a fixed amount. That is, rather than acquiring data record by record in real time, the uploader (100) can process data that has been accumulated over a certain period or up to a certain amount as a batch.
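As an illustration of the "fixed amount" form of batching described above, the following minimal sketch groups an incoming stream of records into batches. The function name `iter_batches` and the parameter `batch_size` are hypothetical names chosen for this example; the specification does not prescribe how the uploader forms its batches.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` records.

    Hypothetical helper: groups records by a fixed amount, one of the two
    batch units (period or amount) mentioned in the description.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


# Seven records grouped three at a time; the last batch is shorter.
batches = list(iter_batches(range(7), batch_size=3))
print(batches)
```

Batching by period would follow the same pattern, with a timestamp on each record deciding where one batch ends and the next begins.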

According to an embodiment of the present invention, the uploader (100) can obtain the original dataset from servers and databases that provide big data in fields such as transportation, healthcare, disaster safety, environment, and culture and tourism, for example the National Transportation DB, the Healthcare Big Data Open System, the National Disaster and Safety Portal, the Ministry of Environment, and the Culture Data Plaza public data portal. The original dataset may be a batch-form set of data stored in each server or database over a certain period, or a batch-form set of data stored up to a certain amount.

Next, the structural analysis module (200) receives a plurality of sensitive attributes (Sensitive Columns) as input and derives, from the original dataset, a subset corresponding to each sensitive attribute (Sensitive Column).

The sensitive attributes (Sensitive Columns) referred to in the present invention are items, such as stroke status or sex, that can affect the output values produced by AI technology. The structural analysis module (200) can receive the plurality of sensitive attributes (Sensitive Columns) through an input module (600) included in the fairness visualization device of the present invention or through a separate user terminal.

The structural analysis module (200) can then divide the original dataset into any number of subsets according to the plurality of sensitive attributes (Sensitive Columns). For example, if a gender attribute is input as a sensitive attribute, the structural analysis module (200) can divide the original dataset into a male subset and a female subset. Alternatively, if an evaluation-grade attribute with grades A through D is input as a sensitive attribute, the structural analysis module (200) can divide the original dataset into a grade A subset, a grade B subset, a grade C subset, and a grade D subset.
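The subset derivation described above can be sketched as a simple group-by over one sensitive attribute. This is only an illustrative sketch: the function name `derive_subsets`, the row layout (a list of dictionaries), and the sample column names are assumptions for this example and are not taken from the specification.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Mapping, Sequence


def derive_subsets(
    rows: Sequence[Mapping[str, Any]], sensitive_column: str
) -> Dict[Hashable, List[Mapping[str, Any]]]:
    """Split rows into one subset per distinct value of a sensitive column."""
    subsets: Dict[Hashable, List[Mapping[str, Any]]] = defaultdict(list)
    for row in rows:
        subsets[row[sensitive_column]].append(row)
    return dict(subsets)


# Hypothetical records; "gender" plays the role of the sensitive attribute.
rows = [
    {"id": 1, "gender": "male", "grade": "A"},
    {"id": 2, "gender": "female", "grade": "B"},
    {"id": 3, "gender": "male", "grade": "C"},
]

by_gender = derive_subsets(rows, "gender")
# The size of a subset is the number of records it contains.
subset_sizes = {value: len(members) for value, members in by_gender.items()}
print(subset_sizes)
```

Calling `derive_subsets(rows, "grade")` on the same rows would yield one subset per grade instead, mirroring the grade A to grade D example.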

Next, the sampling module (300) samples according to the sizes of the plurality of subsets and then outputs a corrected dataset whose fairness has been corrected.

The sampling module (300) corrects the fairness of the dataset incrementally over several rounds of sampling and, as a result, can output a corrected dataset with improved fairness.

For the first sampling, the sampling module (300) includes an average value calculation module (310) that calculates the average size of the plurality of subsets (Subset₀) derived from the original dataset (Dataset₀), and a first sampling module (320) that performs first sampling on the plurality of subsets (Subset₀) using the average value.

The first sampling module (320) can compare the size of each subset against the average value. If the size of a given subset is larger than the average value, the subset is sampled down to the average value; if the size of a given subset is smaller than or equal to the average value, the subset is sampled so that it keeps its size.

The size of a subset, as referred to in the present invention, is the number of data records the subset contains. According to an embodiment of the present invention, if four subsets are derived by the structural analysis module (200) and their sizes are 200, 600, 100, and 300, the average value calculation module (310) calculates the average value of the four subsets as 300. The first sampling module (320) then compares the size of each subset against the average value of 300. Subset 1, with a size of 200, is smaller than the average value of 300, so it is sampled to keep its size of 200. Subset 2, with a size of 600, is larger than the average value of 300, so it is sampled to a size of 300, with the remaining 300 records excluded. Subset 3, with a size of 100, is smaller than the average value of 300, so it is sampled to keep its size of 100. Finally, subset 4, with a size of 300, is equal to the average value of 300, so it is sampled to keep its size of 300. As a result, the sizes of the four subsets change from 200, 600, 100, and 300 to 200, 300, 100, and 300 after the first sampling by the first sampling module (320).
Therefore, the first sampling module (320) of the present invention has the notable effect of providing a first reduction in the bias between subsets.
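The first sampling rule can be sketched as follows, using the four subset sizes from the embodiment. This is a minimal sketch under stated assumptions: the function name `first_sampling` is hypothetical, records within an oversized subset are chosen uniformly at random (the specification does not say how the retained records are selected), and a fractional average is truncated to an integer.

```python
import random
from statistics import mean
from typing import Dict, List, Tuple


def first_sampling(
    subsets: Dict[str, List[int]], seed: int = 0
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Cap every subset at the average subset size.

    Returns (sampled, residual): `sampled` holds the retained records and
    `residual` holds the records left over from oversized subsets.
    Assumption: retained records are picked uniformly at random.
    """
    rng = random.Random(seed)
    # Assumption: a fractional average is truncated to an integer target.
    target = int(mean(len(members) for members in subsets.values()))
    sampled: Dict[str, List[int]] = {}
    residual: Dict[str, List[int]] = {}
    for key, members in subsets.items():
        if len(members) > target:
            chosen = set(rng.sample(range(len(members)), target))
            sampled[key] = [m for i, m in enumerate(members) if i in chosen]
            residual[key] = [m for i, m in enumerate(members) if i not in chosen]
        else:
            # Subsets at or below the average keep all of their records.
            sampled[key] = list(members)
            residual[key] = []
    return sampled, residual


# Subset sizes from the embodiment: 200, 600, 100, 300 (average 300).
subsets = {
    f"subset_{i}": list(range(n))
    for i, n in enumerate([200, 600, 100, 300], start=1)
}
sampled, residual = first_sampling(subsets)
sampled_sizes = [len(v) for v in sampled.values()]
residual_sizes = [len(v) for v in residual.values()]
print(sampled_sizes, residual_sizes)
```

Only subset 2 exceeds the average, so it is the only subset that contributes records to the residual.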

Next, for the second sampling, the structural analysis module (200) derives a plurality of subsets (Subset_A, Subset_B) from, respectively, the residual dataset (Dataset₁), which is the part of the original dataset (Dataset₀) remaining after the first sampling by the first sampling module (320), and the original dataset (Dataset₀). The sampling module (300) further includes a difference value calculation module (330) that compares the sizes of corresponding subsets and calculates the difference in size for each corresponding pair, and a second sampling module (340) that performs second sampling on the plurality of subsets (Subset_A, Subset_B) using the difference values.

According to an embodiment of the present invention, the subsets (Subset_A) derived from the residual dataset (Dataset₁) may be A-1 and A-2, with sizes of 200 and 300, and the subsets (Subset_B) derived from the original dataset (Dataset₀) may be B-1 and B-2, with sizes of 200 and 600. Because the structural analysis module (200) derives the subsets of the residual dataset (Dataset₁) and of the original dataset (Dataset₀) using the same sensitive attribute, the subsets correspond to one another.

The difference value calculation module (330) compares the size of subset A-1 with that of the corresponding subset B-1 and calculates the difference between them as 0. It likewise compares the size of subset A-2 with that of the corresponding subset B-2 and calculates the difference between them as 300.

Because the difference between subsets A-1 and B-1 is 0, the second sampling module (340) can keep their sizes at 200 and 200. Because the difference between subsets A-2 and B-2 is 300, subset B-2 can be sampled to a size of 300 by subtracting the difference value from its size of 600. Here, of each pair of corresponding subsets, it is the larger subset that is sampled using the difference value. Accordingly, through the second sampling, the second sampling module (340) can set the size of subset A-1 to 200, the size of subset A-2 to 200, the size of subset B-1 to 300, and the size of subset B-2 to 300.

Therefore, the second sampling module (340) of the present invention has the notable effect of performing a second round of sampling on the subsets (Subset_A, Subset_B) derived from the residual dataset (Dataset₁) remaining after the first sampling by the first sampling module (320) and from the original dataset (Dataset₀) received in batch form from the uploader (100).
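The pairwise difference rule of the second sampling can be sketched on subset sizes alone. This is a sketch under an assumption: "sampled using the difference value" is read here as reducing the larger subset of each corresponding pair by the difference, which is how the 600 to 300 step in the embodiment works out. The function name `second_sampling` and the pair keys are hypothetical.

```python
from typing import Dict, Tuple


def second_sampling(
    residual_sizes: Dict[str, int], original_sizes: Dict[str, int]
) -> Dict[str, Tuple[int, int, int]]:
    """For each pair of corresponding subsets, return (size_a, size_b, difference).

    Assumption: the larger subset of a pair is reduced by the size
    difference; the smaller (or equal) subset keeps its size.
    """
    result: Dict[str, Tuple[int, int, int]] = {}
    for key, size_a in residual_sizes.items():
        size_b = original_sizes[key]
        difference = abs(size_a - size_b)
        if size_a > size_b:
            size_a -= difference
        elif size_b > size_a:
            size_b -= difference
        result[key] = (size_a, size_b, difference)
    return result


# Sizes from the embodiment: A-1 = 200, A-2 = 300 (residual dataset),
# B-1 = 200, B-2 = 600 (original dataset); pairs share the same attribute value.
pairs = second_sampling(
    residual_sizes={"pair_1": 200, "pair_2": 300},
    original_sizes={"pair_1": 200, "pair_2": 600},
)
print(pairs)
```

The sketch reports results per corresponding pair; the smallest size after this step is 200, which is the value the third sampling takes as its minimum.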

Next, for the third sampling, the sampling module (300) further includes a minimum value selection module (350) that compares the sizes of the subsets resulting from the second sampling by the second sampling module (340) and selects the minimum value, and a third sampling module (360) that performs third sampling by adding the minimum value to each of the subsets resulting from the first sampling by the first sampling module (320).

According to an embodiment of the present invention, after the second sampling by the second sampling module (340), the size of A-1 may be 200, the size of A-2 may be 200, the size of B-1 may be 300, and the size of B-2 may be 300. The minimum value selection module (350) then selects the smallest of these sizes, 200, as the minimum value.

If the sizes of subsets 1 through 4 after the first sampling by the first sampling module (320) are 200, 300, 100, and 300, they are in a ratio of 2:3:1:3. The third sampling module (360) can add the minimum value of 200 to the size of each of subsets 1 through 4. The sizes of subsets 1 through 4 then become 400, 500, 300, and 500, a ratio of 1.33:1.67:1:1.67.

Therefore, the third sampling module (360) can markedly reduce the differences between subsets, and through the first to third sampling processes the fairness of the original dataset (Dataset₀) is corrected incrementally so that the corrected dataset (Dataset₂) can be output.
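The arithmetic of the third sampling can be checked with a short sketch that works on subset sizes only. The function name `third_sampling` is hypothetical, and expressing the ratio relative to the smallest subset, rounded to two decimals, is a presentation choice made here to match the figures quoted in the embodiment.

```python
from typing import List, Tuple


def third_sampling(
    first_sizes: List[int], second_sizes: List[int]
) -> Tuple[int, List[int], List[float]]:
    """Add the minimum second-sampling size to every first-sampling size.

    Returns (minimum, new_sizes, ratios), where ratios are relative to the
    smallest new size and rounded to two decimals.
    """
    minimum = min(second_sizes)
    new_sizes = [size + minimum for size in first_sizes]
    smallest = min(new_sizes)
    ratios = [round(size / smallest, 2) for size in new_sizes]
    return minimum, new_sizes, ratios


# Sizes from the embodiment: first sampling gave 200, 300, 100, 300;
# second sampling gave 200, 200, 300, 300.
minimum, new_sizes, ratios = third_sampling(
    first_sizes=[200, 300, 100, 300],
    second_sizes=[200, 200, 300, 300],
)
print(minimum, new_sizes, ratios)
```

Before this step the largest subset is three times the smallest (300 versus 100); afterwards it is about 1.67 times the smallest (500 versus 300), which is the narrowing the description refers to.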

Next, the visualization module (400) visualizes at least one of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) in a preset format so that data fairness can be checked easily.

The visualization module 400 can visualize bias and fairness in various formats.

First, in the embodiment of FIGS. 2 and 3, the uploader 100 can upload an original dataset (Dataset₀) in CSV file format. The structural analysis module 200 can receive speed limit (A), weather (B), lighting (C), first collision type (D), road surface condition (E), and damage (F) as the plurality of sensitive attributes (Sensitive Columns). The visualization module 400 then visualizes the data included in the original dataset (Dataset₀) in matrix form according to the plurality of sensitive attributes (Sensitive Columns). The user can therefore inspect each data value in detail, attribute by attribute.

In addition, in the embodiment of FIG. 4, the visualization module 400 uses the original dataset (Dataset₀) and the corrected dataset (Dataset₂) as series and visualizes, in graph form, the fairness (Fair) and bias (Bias) for each fairness index, including Statistical Parity Difference, Equal Opportunity Difference, Average Odds Difference, Disparate Impact, and Theil index.
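The description names the fairness indices but does not give their formulas. As a rough illustration, two of them can be computed from the commonly used definitions: Statistical Parity Difference as the favorable-outcome rate of the unprivileged group minus that of the privileged group, and Disparate Impact as the ratio of the same two rates. The record layout, group labels, and toy data below are assumptions made only for this sketch.

```python
def favorable_rate(records, group):
    """Share of records in `group` whose label is the favorable outcome (1)."""
    members = [r for r in records if r["group"] == group]
    return sum(r["label"] for r in members) / len(members)


def statistical_parity_difference(records, unprivileged, privileged):
    """Rate(unprivileged) - Rate(privileged); 0 indicates parity."""
    return favorable_rate(records, unprivileged) - favorable_rate(records, privileged)


def disparate_impact(records, unprivileged, privileged):
    """Rate(unprivileged) / Rate(privileged); 1 indicates parity."""
    return favorable_rate(records, unprivileged) / favorable_rate(records, privileged)


# Hypothetical toy data: 8 records per group, label 1 = favorable outcome.
records = (
    [{"group": "u", "label": 1}] * 2 + [{"group": "u", "label": 0}] * 6
    + [{"group": "p", "label": 1}] * 4 + [{"group": "p", "label": 0}] * 4
)

spd = statistical_parity_difference(records, "u", "p")
di = disparate_impact(records, "u", "p")
print(spd, di)
```

In this toy data the unprivileged group receives the favorable outcome a quarter of the time and the privileged group half of the time, so both indices report a disadvantage for the unprivileged group.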

From the graph visualized by the visualization module 400, the user can make the following judgments. First, in the graph for any fairness index, if the index of the corrected dataset (Dataset₂) (Mitigated) is closer to the fairness (Fair) baseline than the index of the original dataset (Dataset₀) (Original), the dataset can be interpreted as having been corrected toward fairness. For example, for the Statistical Parity Difference index in FIG. 4, the index of the corrected dataset (Dataset₂) (Mitigated) is -0.09, which is closer to the fairness (Fair) baseline of 0 than the index of the original dataset (Dataset₀) (Original), -0.18. The user can therefore judge that fairness has been corrected, as far as the Statistical Parity Difference index is concerned.
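The comparison described here reduces to checking which value lies nearer to the baseline. A minimal sketch follows, assuming a hypothetical helper name and a baseline of 0 for difference-type indices; a ratio-type index such as Disparate Impact is usually read against a baseline of 1 instead.

```python
def closer_to_fair(original, mitigated, fair_baseline=0.0):
    """Return True if the mitigated index is nearer to the fair baseline than the original."""
    return abs(mitigated - fair_baseline) < abs(original - fair_baseline)


# Statistical Parity Difference values quoted for FIG. 4.
spd_improved = closer_to_fair(original=-0.18, mitigated=-0.09)
print(spd_improved)
```

The same check returns False when the mitigated value drifts away from the baseline, which is the situation in which a user may prefer the original dataset for a given index.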

In contrast, for the Equal Opportunity Difference index and the Average Odds Difference index, the index of the original dataset (Dataset₀) (Original) can be seen to be closer to the fairness (Fair) baseline than the index of the corrected dataset (Dataset₂) (Mitigated). In other words, the sampling module 300 has introduced bias on these other indices. The visualization module 400 therefore visually assists a user who has confirmed that the sampling module 300 produced an unfair result in deciding which dataset to use, for example whether to use the original dataset (Dataset₀) or the corrected dataset (Dataset₂), according to the purpose for which the dataset is intended.

In addition, in the embodiment of FIG. 5, the visualization module 400 includes a numeric feature display unit 410, which identifies the sensitive attributes (Sensitive Column) holding numeric data among the plurality of sensitive attributes (Sensitive Columns), analyzes the numeric features (Numeric Features) of each numeric attribute (Numeric Column), and displays them in a chart, and a categorical feature display unit 420, which identifies the sensitive attributes (Sensitive Column) holding character data among the plurality of sensitive attributes (Sensitive Columns), analyzes the categorical features (Categorical Features) of each categorical attribute (Categorical Column), and displays them in a chart, so that data fairness can be checked feature by feature.

In the embodiment of FIG. 5(a), for the speed limit attribute among the plurality of sensitive attributes (Sensitive Columns), the numeric feature display unit 410 can derive, for the original dataset (Dataset₀) and the corrected dataset (Dataset₂) respectively, the number of data (Count), mean (Mean), standard deviation (Std dev), proportion of data equal to 0 (Zeros), minimum (Min), median (Median), and maximum (Max) as numeric features (Numeric Features). The numeric feature display unit 410 can then visualize the data of each dataset in chart form, reflecting the derived features.
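The numeric features listed above can be derived with the Python standard library alone. The sketch below is illustrative: the function name and the sample speed-limit values are hypothetical, and the population standard deviation is used as an assumption because the description does not say which variant is intended.

```python
import statistics


def numeric_features(values):
    """Summarize one numeric column with the features named in the embodiment."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std_dev": statistics.pstdev(values),  # assumption: population std dev
        "zeros": sum(1 for v in values if v == 0) / len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Hypothetical speed-limit values, not taken from the patent figures.
speed_limits = [30, 30, 0, 50, 40]
features = numeric_features(speed_limits)
print(features)
```

Running the same function on the column from the original dataset and from the corrected dataset gives the two summaries that the chart places side by side.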

In the embodiment of FIG. 5(b), for the weather attribute among the plurality of sensitive attributes (Sensitive Columns), the categorical feature display unit 420 can derive, for the original dataset (Dataset₀) and the corrected dataset (Dataset₂) respectively, the number of data (Count), the number of distinct weather types in the attribute (Unique), the weather type with the most data (Top), the number of data for that most frequent weather type (Freq top), and the average string length (Avg str len) as categorical features (Categorical Features). The categorical feature display unit 420 can then visualize the data of each dataset in chart form, reflecting the derived features.
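A comparable sketch for a categorical column is shown below. The function name and the sample weather values are hypothetical and serve only to illustrate how Count, Unique, Top, Freq top, and Avg str len could be derived.

```python
from collections import Counter


def categorical_features(values):
    """Summarize one categorical column with the features named in the embodiment."""
    counts = Counter(values)
    top, freq_top = counts.most_common(1)[0]
    return {
        "count": len(values),
        "unique": len(counts),
        "top": top,
        "freq_top": freq_top,
        "avg_str_len": sum(len(v) for v in values) / len(values),
    }


# Hypothetical weather values, not taken from the patent figures.
weather = ["CLEAR", "CLEAR", "RAIN", "SNOW", "CLEAR"]
summary = categorical_features(weather)
print(summary)
```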

Accordingly, the numeric feature display unit 410 and the categorical feature display unit 420 allow the user to visually compare and check the data distributions of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) for the sensitive attributes (Sensitive Columns).

In addition, the visualization module 400 further includes a binning setting unit 430 that bins (Binning) at least one of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) using the categorical attribute (Categorical Column), a scattering setting unit 440 that scatters (Scattering) at least one of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) using the numeric attribute (Numeric Column), a series setting unit 450 that sets the categorical features (Categorical Features) within the categorical attribute (Categorical Column) to be expressed in different colors, and a detailed feature display unit 460 that, when an arbitrary data item is selected, displays in detail the data value of each attribute for that item, so that the visualization allows the fairness of the dataset as a whole to be checked.

In general, binning refers to grouping and managing pixels together in an image composed of many pixels. In the present invention, the binning setting unit 430 sets a categorical attribute on at least one of the binning X-axis and the binning Y-axis so that the large amount of data in the dataset can be grouped and examined by one of the categorical attributes (Categorical Column), such as weather, first collision type, road surface condition, and damage, and the data is displayed visually, divided according to the set categorical attribute.

Conversely, scattering refers to spreading data out. In the present invention, the scattering setting unit 440 sets a numeric attribute on at least one of the scattering X-axis and the scattering Y-axis so that the large amount of data in the dataset can be examined by one of the numeric attributes (Numeric Column), such as speed limit and lighting, and the data is displayed visually, divided according to the set numeric attribute.

In the embodiment of FIG. 6, the visualization target may be an arbitrary original dataset (Dataset₀) with one numeric attribute (Numeric Column) and six categorical attributes (Categorical Column). The binning setting unit 430 can select the accident type (Crush_type), one of the categorical attributes (Categorical Column), to be visualized on the binning X-axis. The accident type (Crush_type) can be divided into two categories: Injury and/or Tow due to crash, and No Injury and/or Drive away. The data in the original dataset (Dataset₀) can then be divided by the accident type (Crush_type) selected in the binning setting unit 430 and visualized separately. The binning setting unit 430 can also select Damage, another of the categorical attributes (Categorical Column), to be visualized on the binning Y-axis. Damage can be divided into 1500 or more and less than 1500. The data in the original dataset (Dataset₀) can then be divided by the Damage category selected in the binning setting unit 430 and visualized separately. As a result, as in the embodiment of FIG. 6, the data in the original dataset (Dataset₀) can be grouped and displayed in four quadrants.
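The four-quadrant grouping can be sketched as a simple two-key partition. The function name and the sample rows below are hypothetical; the column names Crush_type and Damage and the category labels follow the wording of the embodiment.

```python
from collections import defaultdict


def bin_records(records, x_column, y_column):
    """Group records into cells keyed by (x category, y category)."""
    cells = defaultdict(list)
    for record in records:
        cells[(record[x_column], record[y_column])].append(record)
    return dict(cells)


# Hypothetical rows used only to exercise the grouping.
records = [
    {"Crush_type": "Injury and/or Tow due to crash", "Damage": "1500 or more"},
    {"Crush_type": "No Injury and/or Drive away", "Damage": "1500 or more"},
    {"Crush_type": "No Injury and/or Drive away", "Damage": "1500 or more"},
    {"Crush_type": "No Injury and/or Drive away", "Damage": "less than 1500"},
]

cells = bin_records(records, "Crush_type", "Damage")
cell_counts = {cell: len(rows) for cell, rows in cells.items()}
print(cell_counts)
```

A cell with a comparatively small count, or a combination that does not appear at all, is the kind of imbalance the quadrant view is meant to make visible.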

The series setting unit 450 can set the graph visualized in each quadrant so that the categorical features within the first crash type (First_crash_type) are expressed in different colors. When an arbitrary data item is selected in the graph visualized in a quadrant, the detailed feature display unit 460 can display the value of each attribute for that item in detail. In this example, it can be seen visually that the original dataset (Dataset₀) contains relatively few data items for which Damage is 1500 or more and the accident type (Crush_type) is Injury and/or Tow due to crash, which confirms that the dataset is highly unfair.

With the visualization module 400 of the present invention, which can visualize data in various ways, the degree of bias and the degree of fairness of the original dataset (Dataset₀) uploaded by the uploader 100 and of the corrected dataset (Dataset₂) whose fairness has been corrected by the sampling module 300 can each be visualized according to the fairness indices.

In addition, the visualization module 400 of the present invention visualizes together the categorical features (Categorical Features) of a given categorical attribute (Categorical Column) in the original dataset (Dataset₀) and the corrected dataset (Dataset₂), so that the user can visually compare and check how much fairness has improved for that categorical attribute. Likewise, the numeric features (Numeric Features) of a numeric attribute (Numeric Column) in the original dataset (Dataset₀) and the corrected dataset (Dataset₂) are visualized together, so that the user can visually compare and check how much fairness has improved for that numeric attribute.

In addition, with the visualization module 400 of the present invention, at least one of a categorical attribute (Categorical Column) on the binning X and Y axes and a numeric attribute (Numeric Column) on the scattering X and Y axes is selected for the original dataset (Dataset₀), and the data distribution according to multiple attributes is visualized, so that the user can visually check the data distribution of the dataset across multiple attributes.

Next, the present invention further includes a storage module 500 that stores the residual data remaining after sampling so that the residual data can be included in another original dataset (Dataset₀₀) to be uploaded by the uploader 100.

That is, the present invention does not simply discard the residual data remaining after sampling. The storage module 500 stores the residual data, and when another original dataset in batch form is later uploaded by the uploader 100, the residual data is combined with it so that a corrected dataset (Dataset₂) with corrected fairness can be output again.
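One way to picture the role of the storage module 500 is as a small buffer that holds the rows left over after sampling and prepends them to the next uploaded batch. The class name, method names, and row format below are hypothetical; the description does not specify how the storage is implemented.

```python
class ResidualStore:
    """Keeps rows left over after sampling until the next batch arrives."""

    def __init__(self):
        self._residual = []

    def save(self, rows):
        """Store the rows that were not selected by the current sampling pass."""
        self._residual.extend(rows)

    def merge_into(self, new_batch):
        """Return the stored residual rows followed by the new batch, then clear the store."""
        merged = self._residual + list(new_batch)
        self._residual = []
        return merged


store = ResidualStore()
store.save([{"id": 1}, {"id": 2}])                      # leftovers from the first batch
next_dataset = store.merge_into([{"id": 3}, {"id": 4}])  # next uploaded batch
print(len(next_dataset))
```

The merged rows would then pass through subset derivation and sampling again, which is how residual data from one batch can contribute to a later corrected dataset.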

Accordingly, the present invention can resolve the unfairness of a dataset incrementally, and can address the conventional problem in which the performance of an artificial intelligence model degrades because the model is trained on vast amounts of indiscriminately collected data.

In addition, the present invention allows the degree of bias and fairness of the uploaded original dataset and of the corrected dataset to be compared and checked visually, and allows unfair items in the dataset to be identified so that fairness can be supplemented for those items.

Method for Visualizing the Fairness of an AI Learning Dataset

Referring to FIG. 7, the method for visualizing the fairness of an AI learning dataset according to the present invention includes an original dataset upload step (S100), a subset derivation step (S200), a corrected dataset output step (S300), and a visualization step (S400).

First, in the original dataset upload step (S100), an original dataset in batch form is uploaded by the uploader 100.

A batch (Batch), as referred to in the present invention, is a form of computer data processing in which the data to be processed is accumulated over a certain period or up to a certain amount and then processed together. That is, in the original dataset upload step (S100), data is not acquired item by item in real time. Instead, data acquired over a certain period or up to a certain amount can be processed in batch (Batch) units.

According to one embodiment of the present invention, in the original dataset upload step (S100), the original dataset can be obtained from servers and databases that provide big data in fields such as transportation, healthcare, disaster safety, environment, and culture and tourism, for example the National Transport DB, the Healthcare Big Data Open System, the National Disaster and Safety Portal, the Ministry of Environment, and the Culture Data Plaza public data portal. The original dataset may be a batch (Batch) data collection stored on each server or database over a certain period, or a batch (Batch) data collection stored up to a certain amount.

Next, in the subset derivation step (S200), a plurality of sensitive attributes (Sensitive Columns) are input, and the structural analysis module 200 derives from the original dataset the subsets corresponding to each sensitive attribute (Sensitive Column).

The sensitive attributes (Sensitive Columns) referred to in the present invention are items that can affect the output values produced by artificial intelligence technology, such as whether a stroke occurred or sex. In the subset derivation step (S200), the plurality of sensitive attributes (Sensitive Columns) can be input through an input module 600 included in the apparatus for visualizing the fairness of an AI learning dataset, or through a separate user terminal.

In the subset derivation step (S200), the original dataset can be divided into a plurality of subsets according to the plurality of sensitive attributes (Sensitive Columns). For example, if a sex attribute is input as a sensitive attribute (Sensitive Column), the original dataset can be divided into a male subset and a female subset. Alternatively, if an evaluation grade attribute containing grades A to D is input as a sensitive attribute (Sensitive Column), the original dataset can be divided into a grade A subset, a grade B subset, a grade C subset, and a grade D subset.
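Subset derivation of this kind amounts to partitioning rows by the value of the sensitive column. The sketch below assumes rows are dictionaries and uses a hypothetical function name and sample data; it is not the structural analysis module itself.

```python
from collections import defaultdict


def derive_subsets(rows, sensitive_column):
    """Partition rows into one subset per distinct value of the sensitive column."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[sensitive_column]].append(row)
    return dict(subsets)


# Hypothetical rows with an evaluation grade attribute (grades A to D).
rows = [
    {"id": 1, "grade": "A"},
    {"id": 2, "grade": "B"},
    {"id": 3, "grade": "A"},
    {"id": 4, "grade": "D"},
    {"id": 5, "grade": "C"},
]

subsets = derive_subsets(rows, "grade")
subset_sizes = {grade: len(members) for grade, members in subsets.items()}
print(subset_sizes)
```

The subset sizes computed this way are the quantities that the subsequent sampling steps compare.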

Next, in the corrected dataset output step (S300), the sampling module 300 performs sampling according to the sizes of the plurality of subsets and then outputs a corrected dataset whose fairness has been corrected.

In the corrected dataset output step (S300), the fairness of the dataset is corrected incrementally over several sampling passes, and as a result a corrected dataset with improved fairness is output.

Referring to FIG. 8, for the first sampling process, the corrected dataset output step (S300) includes an average value calculation step (S310), in which the average value calculation module 310 calculates the average value of the sizes of the plurality of subsets (Subset₀) derived from the original dataset (Dataset₀), and a first sampling step (S320), in which the first sampling module 320 uses the average value to perform the first sampling of the plurality of subsets (Subset₀).

The size of a subset, as referred to in the present invention, is the number of data items included in that subset. According to one embodiment of the present invention, if four subsets are derived in the subset derivation step (S210) and their sizes are 200, 600, 100, and 300, the average value calculation step (S310) can yield an average value of 300 for the four subsets. In the first sampling step (S320), the size of each subset can then be compared against this average value of 300. Subset 1, with a size of 200, is smaller than the average value of 300, so it can be sampled so as to keep its size of 200. Subset 2, with a size of 600, is larger than the average value of 300, so it can be sampled to a size of 300, with the remaining 300 excluded. Subset 3, with a size of 100, is smaller than the average value of 300, so it can be sampled so as to keep its size of 100. Finally, subset 4, with a size of 300, is equal to the average value of 300, so it can be sampled so as to keep its size of 300. As a result, the sizes of the four subsets change from 200, 600, 100, and 300 to 200, 300, 100, and 300 after the first sampling. The average value calculation step (S310) and the first sampling step (S320) of the present invention therefore provide a first reduction of the bias between subsets.
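The first sampling in this embodiment can be read as capping each subset at the average size. The sketch below follows that reading; the function name is hypothetical, and flooring a non-integer average is an added assumption that the description does not address.

```python
def first_sampling(sizes):
    """Cap every subset size at the (floored) average; return kept and residual sizes."""
    cap = int(sum(sizes) / len(sizes))  # assumption: floor a fractional average
    kept = [min(size, cap) for size in sizes]
    residual = [size - k for size, k in zip(sizes, kept)]
    return kept, residual


# Subset sizes from the embodiment: average is 300.
kept, residual = first_sampling([200, 600, 100, 300])
print(kept)      # sizes after the first sampling
print(residual)  # rows left over from each subset
```

Only the subset above the average contributes residual rows, which is the data that remains available after the first sampling.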

Referring to FIG. 9, for the second sampling process, once the first sampling has completed the first-pass reduction of bias between subsets, the corrected dataset output step (S300) can return to the subset derivation step (S200) so that the second sampling can be performed.

In the subset derivation step after the first sampling (S220a, S220b), the structural analysis module 200 can derive a plurality of subsets (Subset_A, Subset_B) from, respectively, the residual dataset (Dataset₁) remaining from the original dataset (Dataset₀) after the first sampling in the first sampling step (S320), and the original dataset (Dataset₀).

The corrected dataset output step (S300) further includes a difference value calculation step (S330), in which the difference value calculation module 330 compares the sizes of corresponding subsets and calculates the difference value between each pair, and a second sampling step (S340), in which the second sampling module 340 uses the difference values to perform the second sampling of the plurality of subsets (Subset_A, Subset_B).

According to one embodiment of the present invention, the plurality of subsets (Subset_A) derived from the residual dataset (Dataset₁) may be A-1 and A-2, with sizes of 200 and 300, and the plurality of subsets (Subset_B) derived from the original dataset (Dataset₀) may be B-1 and B-2, with sizes of 200 and 600. Because the subsets of the residual dataset (Dataset₁) and of the original dataset (Dataset₀) are derived with the same sensitive attribute in the subset derivation step after the first sampling (S220a, S220b), the subsets correspond to one another.

In the difference value calculation step (S330), the size of subset A-1 is compared with that of the corresponding subset B-1, and the difference value between them can be calculated as 0. The size of subset A-2 is compared with that of the corresponding subset B-2, and the difference value between them can be calculated as 300.

In the second sampling step (S340), because the difference value between subset A-1 and subset B-1 is 0, their sizes can be kept at 200 and 200. Because the difference value between subset A-2 and subset B-2 is 300, the size of subset B-2 can be reduced from 600 by the difference value and sampled to a size of 300. Here, the larger of the two corresponding subsets is the one sampled using the difference value. Accordingly, after the second sampling in the second sampling step (S340), the size of subset A-1 may be 200, the size of subset A-2 may be 200, the size of subset B-1 may be 300, and the size of subset B-2 may be 300.
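The pairwise rule described here, in which the larger of two corresponding subsets is reduced by the difference value, can be sketched as follows. The function name is hypothetical, and the sketch operates on sizes only; applied to the two pairs of the embodiment it keeps 200 and 200 for the first pair and reduces 600 to 300 for the second.

```python
def second_sampling(size_a, size_b):
    """Reduce the larger of two corresponding subset sizes by their difference."""
    difference = abs(size_a - size_b)
    if size_a > size_b:
        return size_a - difference, size_b
    if size_b > size_a:
        return size_a, size_b - difference
    return size_a, size_b  # equal sizes are kept as they are


pair_1 = second_sampling(200, 200)  # A-1 versus B-1, difference 0
pair_2 = second_sampling(300, 600)  # A-2 versus B-2, difference 300
print(pair_1, pair_2)
```

Under this rule both members of a pair end up at the size of the smaller one, so corresponding subsets are equalized pair by pair.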

Accordingly, the second sampling step (S340) of the present invention allows the subsets (Subset_A, Subset_B) derived from the residual dataset (Dataset₁) remaining after the first sampling in the first sampling step (S320) and from the original dataset (Dataset₀) received in batch form in the original dataset upload step (S100) to be sampled a second time.

Referring to FIG. 9, for the third sampling, the corrected dataset output step (S300) further includes a minimum value selection step (S350), in which the minimum value selection module 350 compares the sizes of the plurality of subsets secondarily sampled in the second sampling step (S340) and selects the minimum value, and a third sampling step (S360), in which the third sampling module 360 performs the third sampling by adding the minimum value to each of the plurality of subsets first sampled in the first sampling step (S320).

According to one embodiment of the present invention, the subsets secondarily sampled in the second sampling step (S340) may have sizes of 200 for A-1, 200 for A-2, 300 for B-1, and 300 for B-2. In the minimum value selection step (S350), 200, the smallest of these sizes, can then be selected as the minimum value.

If the sizes of subsets 1 to 4 first sampled in the first sampling step (S320) are 200, 300, 100, and 300, they have a ratio of 2:3:1:3. In the third sampling step (S360), the minimum value of 200 can be added to the size of each of subsets 1 to 4. The sizes of subsets 1 to 4 then become 400, 500, 300, and 500, giving a ratio of approximately 1.33:1.67:1:1.67.

Accordingly, in the present invention, the third sampling step (S360) can significantly reduce the differences between subsets, and the fairness of the original dataset (Dataset₀) is corrected incrementally so that the corrected dataset (Dataset₂) can be output.

Next, in the visualization step (S400), the visualization module (400) visualizes at least one of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) in a preset format so that data fairness can be readily checked.

In the visualization step (S400), bias and fairness may be visualized in various formats.

First, referring to the embodiment of FIGS. 2 and 3, an original dataset (Dataset₀) in CSV file format may be uploaded in the original dataset upload step (S100). In the subset derivation step (S200), speed limit (A), weather (B), lighting (C), first crash type (D), road surface condition (E), and damage (F) may be input as the plurality of sensitive attributes (Sensitive Columns). The visualization step (S400) then visualizes the data contained in the original dataset (Dataset₀) in matrix form according to the plurality of sensitive attributes (Sensitive Columns). The user can therefore inspect each data value in detail, attribute by attribute.

Further, referring to the embodiment of FIG. 4, the visualization step (S400) takes the original dataset (Dataset₀) and the corrected dataset (Dataset₂) as series and visualizes, in graph form, the fairness (Fair) and bias (Bias) for each fairness metric, including Statistical Parity Difference, Equal Opportunity Difference, Average Odds Difference, Disparate Impact, and the Theil index.

From the graph produced in the visualization step (S400), the user can make the following judgments. First, in the graph for any fairness metric, if the value for the corrected dataset (Dataset₂) (Mitigated) is closer to the fairness (Fair) baseline than the value for the original dataset (Dataset₀) (Original), the dataset can be interpreted as having been corrected toward fairness. For example, for the Statistical Parity Difference metric in FIG. 4, the value for the corrected dataset (Dataset₂) (Mitigated) is -0.09, which is closer to the fairness (Fair) baseline of 0 than the value of -0.18 for the original dataset (Dataset₀) (Original), so fairness can be judged to have been corrected as far as the Statistical Parity Difference metric is concerned.

By contrast, for the Equal Opportunity Difference and Average Odds Difference metrics, the value for the original dataset (Dataset₀) (Original) can be seen to be closer to the fairness (Fair) baseline than the value for the corrected dataset (Dataset₂) (Mitigated). That is, the corrected dataset output step (S300) actually introduced bias on these other metrics. The visualization step (S400) therefore visually assists the user's decision on which dataset to use: a user who sees that the corrected dataset output step (S300) produced an unfair result can choose the original dataset (Dataset₀) or the corrected dataset (Dataset₂) according to the purpose for which the dataset is to be used.
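The per-metric comparison described above reduces to asking which dataset's value lies closer to the fairness baseline. The following is a minimal sketch of that rule. The function name `closer_to_fair` is hypothetical, the Statistical Parity Difference values are those quoted for FIG. 4, and the Equal Opportunity Difference values are invented placeholders, since the specification gives no numbers for that metric.

```python
def closer_to_fair(original, mitigated, baseline=0.0):
    """Return which dataset's metric value is nearer the fairness baseline."""
    dist_original = abs(original - baseline)
    dist_mitigated = abs(mitigated - baseline)
    if dist_mitigated < dist_original:
        return "mitigated"
    if dist_original < dist_mitigated:
        return "original"
    return "tie"


metrics = {
    # (original, mitigated); values quoted in the text for FIG. 4
    "Statistical Parity Difference": (-0.18, -0.09),
    # hypothetical placeholder values, not taken from the specification
    "Equal Opportunity Difference": (-0.05, -0.12),
}

for name, (original, mitigated) in metrics.items():
    print(name, "->", closer_to_fair(original, mitigated))
```

The baseline defaults to 0, as stated in the text for Statistical Parity Difference. A ratio-style metric such as Disparate Impact is commonly read against a baseline of 1, which is an assumption here and can be supplied through the `baseline` argument.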

Further, referring to the embodiment of FIG. 5, the visualization step (S400) includes: a numeric feature display step (S410) in which the numeric feature display unit (410) identifies, among the plurality of sensitive attributes (Sensitive Columns), the sensitive attributes having numeric data, analyzes the numeric features (Numeric Features) of each such numeric column (Numeric Column), and displays them in a chart; and a categorical feature display step (S420) in which the categorical feature display unit (420) identifies, among the plurality of sensitive attributes (Sensitive Columns), the sensitive attributes having character data, analyzes the categorical features (Categorical Features) of each such categorical column (Categorical Column), and displays them in a chart. Data fairness can thereby be checked feature by feature.

In the embodiment of FIG. 5(a), the numeric feature display step (S410) may derive, for the speed limit attribute among the plurality of sensitive attributes (Sensitive Columns), the following numeric features (Numeric Features) for each of the original dataset (Dataset₀) and the corrected dataset (Dataset₂): the number of data items (Count), the mean (Mean), the standard deviation (Std dev), the proportion of data items equal to 0 (Zeros), the minimum (Min), the median (Median), and the maximum (Max). The numeric feature display step (S410) may then visualize the data of each dataset in chart form, reflecting the derived features.
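As an illustration of the numeric features listed above, the sketch below computes them for one column using only the Python standard library. The function name `numeric_features` and the sample speed-limit values are hypothetical. The specification does not say whether Std dev is the sample or population standard deviation; the sample form is assumed here.

```python
import statistics


def numeric_features(values):
    """Summarize one numeric column: Count, Mean, Std dev, Zeros, Min, Median, Max."""
    if not values:
        raise ValueError("column is empty")
    count = len(values)
    return {
        "count": count,
        "mean": statistics.mean(values),
        # assumption: sample standard deviation; undefined for a single value
        "std_dev": statistics.stdev(values) if count > 1 else 0.0,
        "zeros": sum(1 for v in values if v == 0) / count,  # proportion of zeros
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Hypothetical speed-limit values for an original and a corrected dataset.
original_speed_limits = [30, 30, 40, 0, 50]
corrected_speed_limits = [30, 40, 50]

for label, column in [("original", original_speed_limits),
                      ("corrected", corrected_speed_limits)]:
    print(label, numeric_features(column))
```

Computing the same summary for both datasets side by side is what lets a chart show how sampling shifted the distribution of the attribute.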

In the embodiment of FIG. 5(b), the categorical feature display step (S420) may derive, for the weather attribute among the plurality of sensitive attributes (Sensitive Columns), the following categorical features (Categorical Features) for each of the original dataset (Dataset₀) and the corrected dataset (Dataset₂): the number of data items (Count), the number of distinct weather types in the attribute (Unique), the weather type with the most data items (Top), the number of data items for that weather type (Freq top), and the average string length (Avg str len). The categorical feature display step (S420) may then visualize the data of each dataset in chart form, reflecting the derived features.
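The categorical features listed above can be sketched in the same way. The function name `categorical_features` and the sample weather values are hypothetical, and Unique is interpreted here as the number of distinct values in the column.

```python
from collections import Counter


def categorical_features(values):
    """Summarize one categorical column: Count, Unique, Top, Freq top, Avg str len."""
    if not values:
        raise ValueError("column is empty")
    counts = Counter(values)
    top, freq_top = counts.most_common(1)[0]  # most frequent value and its count
    return {
        "count": len(values),
        "unique": len(counts),
        "top": top,
        "freq_top": freq_top,
        "avg_str_len": sum(len(v) for v in values) / len(values),
    }


# Hypothetical weather values.
weather = ["CLEAR", "RAIN", "CLEAR", "SNOW"]
print(categorical_features(weather))
```

When two values are equally frequent, `Counter.most_common` returns the one encountered first, so Top is deterministic for a given column order.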

Thus, through the numeric feature display step (S410) and the categorical feature display step (S420) of the present invention, the user can visually compare and check the data distributions of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) for each sensitive attribute (Sensitive Column).

Further, referring to the embodiment of FIG. 6, the visualization step (S400) further includes: a binning setting step (S430) in which the binning setting unit (430) bins at least one of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) using the categorical column (Categorical Column); a scattering setting step (S440) in which the scattering setting unit (440) scatters at least one of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) using the numeric column (Numeric Column); a series setting step (S450) in which the series setting unit (450) sets the categorical features (Categorical Features) within a categorical column (Categorical Column) to be shown in different colors; and a detailed feature display step (S460) in which, when an arbitrary data item is selected, the detailed feature display unit (460) displays that item's data values in detail, attribute by attribute. The visualization thereby allows the fairness of the dataset as a whole to be checked.

In general, binning refers to the function of grouping and managing pixels in an image composed of many pixels. In the binning setting step (S430) of the present invention, so that the many data items in a dataset can be examined grouped by one of the categorical columns (Categorical Column), such as weather, first crash type, road surface condition, and damage, a categorical column is set on at least one of the binning X-axis and the binning Y-axis, and the data are partitioned according to the set categorical column and displayed visually.

Conversely, scattering refers to the function of spreading data out. In the scattering setting step (S440) of the present invention, so that the many data items in a dataset can be examined by one of the numeric columns (Numeric Column), such as speed limit and lighting, a numeric column is set on at least one of the scattering X-axis and the scattering Y-axis, and the data are partitioned according to the set numeric column and displayed visually.

In the embodiment of FIG. 6, the visualization target may be an arbitrary original dataset (Dataset₀) having one numeric column (Numeric Column) and six categorical columns (Categorical Column). In the binning setting step (S430), the accident type (Crush_type), one of the categorical columns, may be selected to be visualized on the binning X-axis. The accident type (Crush_type) may be divided into "Injury and/or Tow due to crash" and "No Injury and/or Drive away". The data in the original dataset (Dataset₀) may then be partitioned by the accident type (Crush_type) selected in the binning setting step (S430) and visualized separately. Likewise, in the binning setting step (S430), Damage, another of the categorical columns, may be selected to be visualized on the binning Y-axis. Damage may be divided into "1500 or more" and "less than 1500". The data in the original dataset (Dataset₀) may then be partitioned by the Damage category selected in the binning setting step (S430) and visualized separately.

Accordingly, as in the embodiment of FIG. 6, the data in the original dataset (Dataset₀) may be displayed grouped into four quadrants. In the series setting step (S450), the graph in each quadrant may be set so that the categorical features within the first crash type (First_crash_type) are shown in different colors. When an arbitrary data item is selected in the graph of a quadrant, the detailed feature display step (S460) may display that item's per-attribute values in detail. In this example, the user can see at a glance that the original dataset (Dataset₀) contains relatively few data items whose Damage is 1500 or more and whose accident type (Crush_type) is Injury and/or Tow due to crash, and can thereby confirm that the dataset is highly unfair.
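The four-quadrant grouping can be illustrated by partitioning records on two categorical columns and counting each cell. This is only a sketch of the grouping logic, not of the rendering: the function name `bin_records`, the record layout, and the sample rows are all hypothetical, while the column names `Crush_type` and `Damage` and their category labels follow the embodiment.

```python
from collections import defaultdict


def bin_records(records, x_column, y_column):
    """Group records into cells keyed by (x category, y category)."""
    cells = defaultdict(list)
    for record in records:
        cells[(record[x_column], record[y_column])].append(record)
    return dict(cells)


# Hypothetical rows; only the two binning columns are shown.
INJURY = "Injury and/or Tow due to crash"
NO_INJURY = "No Injury and/or Drive away"
records = [
    {"Crush_type": NO_INJURY, "Damage": "less than 1500"},
    {"Crush_type": NO_INJURY, "Damage": "less than 1500"},
    {"Crush_type": NO_INJURY, "Damage": "1500 or more"},
    {"Crush_type": NO_INJURY, "Damage": "1500 or more"},
    {"Crush_type": INJURY, "Damage": "less than 1500"},
    {"Crush_type": INJURY, "Damage": "1500 or more"},
]

cells = bin_records(records, "Crush_type", "Damage")
cell_sizes = {key: len(rows) for key, rows in cells.items()}
for key, size in sorted(cell_sizes.items()):
    print(key, size)
```

A cell whose count is much smaller than the others is the kind of under-represented combination the quadrant view is meant to expose.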

Through the visualization step (S400) of the present invention, which supports data visualization in various ways, the degree of bias and the degree of fairness of the original dataset (Dataset₀) uploaded in the original dataset upload step (S100) and of the corrected dataset (Dataset₂) whose fairness was corrected in the corrected dataset output step (S300) can each be visualized according to the fairness metrics.

Further, through the visualization step (S400) of the present invention, the categorical features (Categorical Features) of a given categorical column (Categorical Column) of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) are visualized together, so that the user can visually compare and check how much the fairness of that categorical column has improved. Likewise, the numeric features (Numeric Features) of a numeric column (Numeric Column) of the original dataset (Dataset₀) and the corrected dataset (Dataset₂) are visualized together, so that the user can visually compare and check how much the fairness of that numeric column has improved.

Further, through the visualization step (S400) of the present invention, with the original dataset (Dataset₀) as the target, at least one of a categorical column (Categorical Column) on the binning X/Y axes and a numeric column (Numeric Column) on the scattering X/Y axes is selected and the data distribution across multiple attributes is visualized, so that the user can visually check the dataset's data distribution across multiple attributes.

Next, the present invention further includes a storage step (S500) in which the storage module (500) stores the residual data remaining after sampling so that the residual data can be included in another original dataset (Dataset₀₀) to be uploaded in the original dataset upload step (S100). The visualization step (S400) and the storage step (S500) are not limited to the order described: the storage step (S500) may be performed after the visualization step (S400), the visualization step (S400) may be performed after the storage step (S500), or the visualization step (S400) and the storage step (S500) may be performed simultaneously.

That is, in the present invention, the residual data remaining after sampling are not simply discarded. They are stored in the storage step (S500), and when another original dataset in batch form is later uploaded in the original dataset upload step (S100), they are combined with it so that a corrected dataset (Dataset₂) with corrected fairness can again be output and visualized.
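The carry-over of residual data between batches can be sketched as a small in-memory store. The class name `ResidualStore` and its methods are hypothetical; the specification does not describe how the storage module (500) persists data, and a real system would presumably use durable storage rather than a Python list.

```python
class ResidualStore:
    """Holds rows left over after sampling until the next batch arrives."""

    def __init__(self):
        self._residual = []

    def save(self, rows):
        """Keep the rows that sampling did not select (storage step S500)."""
        self._residual.extend(rows)

    def merge_into(self, new_batch):
        """Combine stored residual rows with a newly uploaded batch (S100)
        and clear the store so rows are not merged twice."""
        merged = list(new_batch) + self._residual
        self._residual = []
        return merged


store = ResidualStore()
store.save(["row-7", "row-9"])                 # left over from the first batch
next_input = store.merge_into(["row-10", "row-11"])
print(next_input)                              # ['row-10', 'row-11', 'row-7', 'row-9']
print(store.merge_into(["row-12"]))            # ['row-12']
```

Clearing the store on merge is a design choice made for this sketch; it keeps a residual row from being counted in more than one subsequent batch.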

Thus, according to the present invention, dataset unfairness can be resolved incrementally, which addresses the conventional problem that the performance of an artificial intelligence model may degrade when the model is trained on vast amounts of indiscriminately collected data.

Further, the present invention allows the degree of bias and fairness of the uploaded original dataset and of the fairness-corrected dataset to be visually compared and checked, and allows unfair items in a dataset to be identified so that the fairness of those items can be improved.

The embodiments may be implemented in hardware, software, firmware, middleware, microcode, a hardware description language, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments that perform the necessary tasks may be stored in a computer-readable storage medium and executed by one or more processors.

Aspects of the subject matter described herein may be described in the general context of computer-executable instructions, such as program modules or components executed by a computer. In general, program modules or components include routines, programs, objects, and data structures that perform particular tasks or implement particular data types. Aspects of the subject matter described herein may also be practiced in distributed computing environments in which tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

100.. Uploader
200.. Structural analysis module
300.. Sampling module
310.. Average value calculation unit
320.. Primary sampling unit
330.. Difference value calculation unit
340.. Secondary sampling unit
350.. Minimum value selection unit
360.. Tertiary sampling unit
400.. Visualization module
410.. Numeric feature display unit
420.. Categorical feature display unit
430.. Binning setting unit
440.. Scattering setting unit
450.. Series setting unit
460.. Detailed feature display unit
500.. Storage module
600.. Input module

Claims

An apparatus for visualizing the fairness of an artificial intelligence learning dataset, comprising:
an uploader that uploads an original dataset in batch form;
a structural analysis module that receives a plurality of sensitive attributes (Sensitive Columns) as input and derives, from the original dataset, a subset corresponding to each sensitive attribute (Sensitive Column);
a sampling module that performs sampling according to the sizes of the plurality of subsets and then outputs a corrected dataset whose fairness has been corrected; and
a visualization module that visualizes at least one of the original dataset and the corrected dataset in a preset format so that data fairness can be readily checked.

The apparatus of claim 1,
wherein the visualization module comprises:
a numeric feature display unit that identifies, among the plurality of sensitive attributes (Sensitive Columns), the sensitive attributes having numeric data, analyzes the numeric features (Numeric Features) of each such numeric column (Numeric Column), and displays them in a chart; and
a categorical feature display unit that identifies, among the plurality of sensitive attributes (Sensitive Columns), the sensitive attributes having character data, analyzes the categorical features (Categorical Features) of each such categorical column (Categorical Column), and displays them in a chart,
so that data fairness can be checked feature by feature.

The apparatus of claim 2,
wherein the visualization module further comprises:
a binning setting unit that bins at least one of the original dataset and the corrected dataset using the categorical column (Categorical Column);
a scattering setting unit that scatters at least one of the original dataset and the corrected dataset using the numeric column (Numeric Column);
a series setting unit that sets the categorical features (Categorical Features) within a categorical column (Categorical Column) to be shown in different colors; and
a detailed feature display unit that, when an arbitrary data item is selected, displays that item's data values in detail, attribute by attribute,
so that the visualization allows the fairness of the dataset as a whole to be checked.

The apparatus of claim 1,
further comprising a storage module that stores the residual data remaining after sampling so that the residual data can be included in another original dataset to be uploaded by the uploader.

The apparatus of claim 1,
wherein the sampling module comprises:
an average value calculation module that calculates an average value for the plurality of subsets derived from the original dataset; and
a primary sampling module that performs primary sampling on the plurality of subsets using the average value.

The apparatus of claim 5,
wherein the structural analysis module
derives a plurality of subsets from the original dataset and from a residual dataset that remains, out of the original dataset, after the primary sampling by the primary sampling module, and
wherein the sampling module further comprises:
a difference value calculation module that compares the sizes of corresponding subsets and calculates the difference values between those sizes; and
a secondary sampling module that performs secondary sampling on the plurality of subsets using the difference values.

The apparatus of claim 6,
wherein the sampling module further comprises:
a minimum value selection module that compares the sizes of the plurality of subsets secondarily sampled by the secondary sampling module and selects the minimum value; and
a tertiary sampling module that performs tertiary sampling by adding the minimum value to each of the plurality of subsets primarily sampled by the primary sampling module.

A method for visualizing the fairness of an artificial intelligence learning dataset, comprising:
an original dataset upload step in which an original dataset in batch form is uploaded by an uploader;
a subset derivation step in which a structural analysis module receives a plurality of sensitive attributes (Sensitive Columns) as input and derives, from the original dataset, a subset corresponding to each sensitive attribute (Sensitive Column);
a corrected dataset output step in which a sampling module performs sampling according to the sizes of the plurality of subsets and then outputs a corrected dataset whose fairness has been corrected; and
a visualization step in which a visualization module visualizes at least one of the original dataset and the corrected dataset in a preset format so that data fairness can be readily checked.