WO2022114639A1 - Apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using the same - Google Patents

Apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using the same

Info

Publication number
WO2022114639A1
Authority
WO
WIPO (PCT)
Prior art keywords
reference value
column
column values
values
data set
Prior art date
Application number
PCT/KR2021/016695
Other languages
English (en)
Korean (ko)
Inventor
권준호
김구
김진우
신현실
Original Assignee
부산대학교 산학협력단
Priority date
Filing date
Publication date
Application filed by 부산대학교 산학협력단
Publication of WO2022114639A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation

Definitions

  • The present invention relates to an apparatus for ensuring the fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, which removes bias in the attribute values corresponding to the attributes of a raw dataset for artificial intelligence learning and creates an artificial intelligence learning dataset with guaranteed fairness, and to a method for ensuring the fairness of an artificial intelligence learning dataset using the apparatus.
  • AI algorithms can learn decision-making models, and these models can also learn the biases that often exist in society. Decisions made by an algorithm that can significantly affect a subject's life may be biased by characteristics that should be unrelated to the decision, to the detriment of individuals belonging to a specific group defined by, for example, gender, race, or religion.
  • AI: artificial intelligence
  • a step 6-1 of including only as many column values as the reference value when the number of column values of the cluster selected in the fifth step is equal to or greater than the reference value;
  • a step 7-1 of creating a sub-dataset by combining the data for which, in step 6-2, the number of column values of the selected cluster is 10% or more of the reference value with the data that, in step 6-1, include only as many column values as the reference value;
  • the extraction is performed by checking one or more of the number of column values, the number of columns, the clustering of columns, the number of clusters of columns, and the column headers.
  • the reference value is calculated by [Equation 1].
  • in the ninth step, the additionally generated sub-datasets are collected by taking the number of column values of the selected cluster, the reference value, and the selected column as input values and the sub-dataset additionally generated in step 7-1 as the result value, and are assembled into an association module.
  • the present invention provides an apparatus for ensuring fairness of artificial intelligence learning datasets based on multidimensional subset association analysis
  • an upload unit 10 for uploading a raw data set file prepared in CSV (Comma-Separated Values) format
  • a parsing unit 20 for parsing the structure of the uploaded raw data set file
  • a number check unit 30 for checking the number of clustered columns and the total number of column values in the parsed structure
  • a reference value calculator 40 for calculating a reference value using the number of clustering of the columns and the number of total column values
  • a reference value comparison unit 50 for selecting one cluster and comparing the number of column values of the selected cluster with the reference value
  • a data collection unit 90 that collects the data of the sub-dataset additionally generated by the iteration unit 80.
  • the parsing unit,
  • the reference value is characterized in that it is calculated by [Equation 1] below.
  • The present invention can resolve the problem that decisions made by an algorithm may be biased by characteristics that should be independent of those decisions, and can thereby ensure fairness.
  • FIG. 1 is a flowchart showing a method of ensuring the fairness of an AI learning dataset based on multidimensional subset association analysis according to the present invention.
  • FIG. 2 is a block diagram illustrating an apparatus for ensuring the fairness of an AI learning dataset based on multidimensional subset association analysis according to the present invention.
  • FIG. 3 is a graph showing a conventional dataset.
  • The present invention examines datasets and evaluates classifiers with fairness in mind, noting how undesirable biases can arise in machine learning (ML). To frame the machine learning process in terms of fairness, biases are identified and the long-term impact of model predictions that arises when these biases are not addressed is considered.
  • In the first step (S10), a raw dataset file prepared in CSV (Comma-Separated Values) format is uploaded. More specifically, the CSV format is organized into a table, fields, columns, and column values.
  • CSV: Comma-Separated Values
  • The table is a collection of data in a format suited to quick reference, and represents the structure of data in a relational database model as a two-dimensional table. In other words, the data is managed as column values and columns, and by designating a key the desired data can be found quickly and easily.
  • The field is a piece of information with a defined meaning, and is the smallest unit of processing in the database system.
  • The column value is also called a record or a tuple in a relational database, and denotes a single structured data item in a table.
  • A database table can be regarded as being composed of column values and columns (fields). The column values in each table represent a series of related data, and all column values in a table have the same structure.
  • The column is a series of data values of a specific simple data type in a relational database table.
  • In the second step (S20), the structure of the uploaded raw dataset file is parsed.
  • By parsing the structure of the raw dataset file prepared in CSV format, the bias of the attribute values corresponding to the attributes of the raw dataset for artificial intelligence learning can be examined.
  • Parsing the structure of the raw dataset file preferably extracts one or more of the following: the number of column values, the number of columns, the clustering of columns, the number of clusters of columns, and the column headers.
  • The clustering of the columns classifies the values of each field in the raw dataset file into categories, so that each column is grouped according to its features.
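  • The parsing step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `parse_structure`, the keys of the returned dictionary, and the three sample rows are hypothetical, and a "cluster" is taken to be one distinct category value of the chosen column.

```python
import csv
import io
from collections import Counter


def parse_structure(csv_text, column):
    """Summarise a CSV file and the clusters (categories) of one column.

    Hypothetical helper: each CSV row is treated as one "column value"
    (record), and each distinct value of `column` as one cluster.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    headers = list(reader.fieldnames or [])
    clusters = Counter(row[column] for row in rows)
    return {
        "num_column_values": len(rows),
        "num_columns": len(headers),
        "headers": headers,
        "clusters": dict(clusters),
        "num_clusters": len(clusters),
    }


# Made-up sample rows using feature names and values mentioned in the text.
sample = (
    "weather_condition,roadway_surface_cond\n"
    "CLEAR,DRY\n"
    "RAIN,WET\n"
    "CLEAR,DRY\n"
)
info = parse_structure(sample, "weather_condition")
print(info["num_column_values"], info["num_clusters"], info["clusters"])
```

    The cluster counts returned here are the quantities that the later steps compare against the reference value.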
  • A reference value is calculated from the number of clusters of the column and the total number of column values. More specifically, the reference value is calculated by [Equation 1].
  • The reference value is obtained by dividing the total number of column values by the number of clusters of the column. It is therefore the average number of column values per cluster, and serves as the standard against which each cluster is checked.
  • For example, the reference value may be 500.
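  • The calculation described above can be sketched as follows. [Equation 1] itself is not reproduced in this text, so the sketch follows only the verbal definition (total number of column values divided by the number of clusters). Rounding down to an integer is an assumption, and the inputs 2000 and 4 are hypothetical values chosen to give the reference value of 500 used in the examples.

```python
def reference_value(total_column_values, num_clusters):
    """Average number of column values per cluster.

    Assumption: the result is floored so it can be used as a row count.
    """
    if num_clusters <= 0:
        raise ValueError("num_clusters must be positive")
    return total_column_values // num_clusters


# Hypothetical inputs: 2000 column values spread over 4 clusters.
print(reference_value(2000, 4))  # 500
```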
  • In the fifth step (S50), one cluster is selected and the number of column values of the selected cluster is compared with the reference value.
  • This comparison is repeated (LOOP) for each cluster.
  • The number of column values to include is determined from the comparison made in the fifth step (S50).
  • The sixth step (S60) is divided into steps 6-1 (S61) and 6-2 (S62) below.
  • In step 6-1 (S61), when the number of column values of the cluster selected in the fifth step (S50) is equal to or greater than the reference value, only as many column values as the reference value are included.
  • For example, when the reference value is 500 and the first cluster has 600 column values, the first cluster keeps only 500 column values, matching the reference value, and the remaining 100 column values are discarded or removed.
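  • Step 6-1 can be sketched as a simple truncation. The text does not say which column values are discarded, so keeping the first ones is an assumption (random sampling would be an equally plausible reading); the name `cap_cluster` is hypothetical.

```python
def cap_cluster(column_values, reference):
    """Step 6-1 sketch: keep at most `reference` column values of a cluster.

    Assumption: the first `reference` values are kept and the rest dropped.
    """
    if len(column_values) >= reference:
        return column_values[:reference]
    return column_values


first_cluster = list(range(600))  # stand-in for 600 records
kept = cap_cluster(first_cluster, 500)
print(len(kept), len(first_cluster) - len(kept))  # 500 100
```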
  • In step 6-2 (S62), when the number of column values of the cluster selected in the fifth step (S50) is less than the reference value, the number of column values of the selected cluster is compared with 10% of the reference value.
  • For example, when the reference value is 500, 10% of the reference value is 50; the second cluster, with 100 column values, is therefore confirmed to exceed 10% of the reference value.
  • The seventh step (S70) uses the results of steps 6-1 (S61) and 6-2 (S62), and creates a sub-dataset by combining the data containing the determined number of column values.
  • The seventh step (S70) is divided into steps 7-1 (S71) and 7-2 (S72) below.
  • Step 7-1 (S71) creates a sub-dataset by combining the data of clusters whose number of column values was found in step 6-2 (S62) to be 10% or more of the reference value with the data from step 6-1 (S61) that contain only as many column values as the reference value.
  • For example, since the number of column values of the second cluster (100) is greater than 10% of the reference value (50), the second cluster is included in the sub-dataset.
  • The first cluster is included in the sub-dataset with only the 500 column values corresponding to the reference value.
  • In step 7-2 (S72), when the number of column values of the cluster selected in step 6-2 (S62) is less than 10% of the reference value, the column values of the selected cluster are removed.
  • For example, when the reference value is 500, 10% of the reference value is 50; a cluster with only 30 column values falls below this threshold, so its column values are removed.
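  • The three outcomes described in steps 6-1, 7-1, and 7-2 can be summarised in one decision function. This is a sketch under stated assumptions: the labels "cap", "keep", and "remove" and the name `classify_cluster` are hypothetical, and the 10% comparison is done in integer arithmetic to avoid floating-point rounding.

```python
def classify_cluster(num_column_values, reference, min_percent=10):
    """Decide what happens to a cluster, given its size and the reference value."""
    if num_column_values >= reference:
        return "cap"     # step 6-1: keep only `reference` column values
    if num_column_values * 100 >= min_percent * reference:
        return "keep"    # step 7-1: include the cluster as it is
    return "remove"      # step 7-2: drop the cluster's column values


# Cluster sizes taken from the examples in the text, with reference value 500.
for size in (600, 100, 30):
    print(size, classify_cluster(size, 500))
```

    With a reference value of 500 this reports "cap" for 600, "keep" for 100, and "remove" for 30, matching the examples above.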
  • The fifth step (S50) through the seventh step (S70) are repeated (Loop) once for each cluster of the column, additionally generating sub-datasets.
  • In the ninth step (S90), the data of the additionally generated sub-datasets are collected. More specifically, the number of column values of the selected cluster, the reference value, and the selected column are taken as input values, the sub-dataset additionally generated in step 7-1 (S71) is taken as the result value (output), and the data are collected by re-executing the loop.
  • For example, if a column with a total of 100 values contains 35 white males, 30 white females, 20 black males, and 15 black females as input values, then after column value extraction in the ninth step (S90), 25 white males, 25 white females, 20 black males, and 15 black females are collected, and the result is a fair subset of 85 values.
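  • The worked example above can be reproduced with a short sketch that operates on cluster counts only. The function name `fair_counts` is hypothetical, and the reference value is assumed to be the floored average (100 // 4 = 25), which is consistent with the numbers in the example.

```python
def fair_counts(cluster_counts, min_percent=10):
    """Apply the cap / keep / remove rule to every cluster of one column.

    Returns the reference value and the resulting count per kept cluster.
    """
    if not cluster_counts:
        raise ValueError("at least one cluster is required")
    total = sum(cluster_counts.values())
    reference = total // len(cluster_counts)  # assumption: floored average
    result = {}
    for name, count in cluster_counts.items():
        if count >= reference:
            result[name] = reference      # step 6-1: cap at the reference value
        elif count * 100 >= min_percent * reference:
            result[name] = count          # step 7-1: keep unchanged
        # otherwise step 7-2: the cluster is removed
    return reference, result


counts = {
    "white male": 35,
    "white female": 30,
    "black male": 20,
    "black female": 15,
}
reference, balanced = fair_counts(counts)
print(reference, balanced, sum(balanced.values()))
```

    For these inputs the reference value is 25 and the balanced counts sum to 85, as in the example.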
  • The apparatus consists of an upload unit 10, a parsing unit 20, a number check unit 30, a reference value calculation unit 40, a reference value comparison unit 50, a sub-dataset generation unit 60, a column value removal unit 70, an iteration unit 80, and a data collection unit 90.
  • The upload unit 10 uploads the raw dataset file after checking that it is in CSV (Comma-Separated Values) format. More specifically, the CSV format is organized into a table, fields, columns, and column values.
  • CSV: Comma-Separated Values
  • The parsing unit 20 parses the structure of the raw dataset file.
  • When parsing the structure of the raw dataset file, the parsing unit 20 preferably extracts one or more of the following: the number of column values, the number of columns, the clustering of columns, the number of clusters of columns, and the column headers.
  • The clustering of the columns classifies the values of each field in the raw dataset file into categories, so that each column is grouped according to its features.
  • The number check unit 30 checks the number of clusters of the column and the total number of column values in the parsed structure.
  • The reference value calculation unit 40 calculates a reference value from the number of clusters of the column and the total number of column values.
  • The reference value is calculated by [Equation 1].
  • The reference value is obtained by dividing the total number of column values by the number of clusters of the column. It is the average number of column values per cluster, so that a standard value for each cluster can be identified.
  • For example, the reference value may be 500.
  • The reference value comparison unit 50 selects one cluster and compares the number of column values of the selected cluster with the reference value.
  • For example, when the reference value is 500 and the first cluster has 600 column values, the first cluster includes only 500 column values, matching the reference value.
  • The sub-dataset generation unit 60 creates a sub-dataset by combining the data of clusters whose number of column values is 10% or more of the reference value with the data containing only as many column values as the reference value.
  • For example, when the reference value is 500, 10% of the reference value is 50; the second cluster, with 100 column values, is therefore confirmed to exceed 10% of the reference value.
  • The column value removal unit 70 removes the column values of the selected cluster when their number is less than 10% of the reference value.
  • The column values of the third cluster are deleted or removed, as in step 7-2 (S72).
  • The iteration unit 80 repeats (Loop) the operations of the reference value comparison unit 50, the sub-dataset generation unit 60, and the column value removal unit 70 once for each cluster of the column, additionally generating sub-datasets.
  • The data collection unit 90 collects the data of the sub-datasets additionally generated by the iteration unit 80.
  • The data collection unit 90 takes the number of column values of a specific cluster, the reference value, and a specific column as input values and the sub-dataset as the result value, and collects the data by re-executing the loop.
  • The apparatus for ensuring the fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and the method using it, were applied to the Traffic Crashes - Crashes dataset in the Transportation domain of the Chicago Data Portal.
  • The dataset and the expected task are as follows.
  • The numeric features are as follows.
  • posted_speed_limit: information on the speed of the vehicle recorded by the police at the time of the accident.
  • The categorical features are as follows.
  • weather_condition: weather information recorded by the police at the time of the accident. Examples of input values: CLEAR, RAIN, etc.
  • lighting_condition: lighting (brightness) information recorded by the police at the time of the accident.
  • roadway_surface_cond: information on the condition of the road surface at the accident location, recorded by the police at the time of the accident. Examples of input values: DRY, ICE, WET, etc.
  • first_crash_type: information on the type of accident recorded by the police at the time of the accident.
  • A prediction task is run to determine whether the repair cost resulting from a car accident exceeds $1,500.
  • The label indicates whether the repair cost resulting from a car accident is over $1,500.
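  • The label can be sketched as a simple threshold test. The text does not state how the repair cost is encoded in the dataset, so a numeric cost in US dollars is assumed here, and the name `label_repair_cost` is hypothetical.

```python
def label_repair_cost(repair_cost_usd, threshold_usd=1500):
    """Return True when the repair cost exceeds the threshold (assumed numeric)."""
    return repair_cost_usd > threshold_usd


# Hypothetical repair costs; exactly $1,500 does not count as "over".
print([label_repair_cost(cost) for cost in (800, 1500, 2000)])
```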
  • FIG. 4 shows a dataset processed by the present invention. When the dataset is filtered by categorical features, the graph confirms that the deviation within the data of each feature is small.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an apparatus for ensuring the fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and to a method for ensuring the fairness of an artificial intelligence learning dataset using the same, the apparatus removing bias in attribute values that correspond to the attributes of a raw dataset for artificial intelligence learning, and creating an artificial intelligence learning dataset with guaranteed fairness.
PCT/KR2021/016695 2020-11-27 2021-11-16 Apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using the same WO2022114639A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2020-0161943 2020-11-27
KR20200161943 2020-11-27

Publications (1)

Publication Number Publication Date
WO2022114639A1 (fr) 2022-06-02

Family

ID=78521431

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/016695 WO2022114639A1 (fr) 2020-11-27 2021-11-16 Apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using the same

Country Status (2)

Country Link
KR (1) KR102321735B1 (fr)
WO (1) WO2022114639A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
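KR102321735B1 (ko) * 2020-11-27 2021-11-04 부산대학교 산학협력단 Apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using the same
KR20230149914A (ko) 2022-04-20 2023-10-30 서울시립대학교 산학협력단 Apparatus and method for training an artificial intelligence model
KR20240032288A (ko) 2022-09-02 2024-03-12 동국대학교 산학협력단 Apparatus and method for building a hybrid green tea learning dataset, comprising a quantitative green tea learning dataset generation module and a qualitative green tea learning dataset generation module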

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
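KR20160096460A (ko) * 2015-02-05 2016-08-16 삼성전자주식회사 Deep learning-based recognition system including a plurality of classifiers, and control method thereof
JP2018113048A (ja) * 2013-02-20 2018-07-19 ハートフォード スチーム ボイラー インスペクション アンド インシュアランス カンパニー System and method for reducing outlier bias
KR102005628B1 (ko) * 2017-04-26 2019-07-30 김정희 Method and system for preprocessing training data
US20200081865A1 (en) * 2018-09-10 2020-03-12 Google Llc Rejecting Biased Data Using a Machine Learning Model
KR20200046899A (ko) * 2018-10-26 2020-05-07 삼성에스디에스 주식회사 Method and apparatus for extracting data of interest
KR102321735B1 (ko) * 2020-11-27 2021-11-04 부산대학교 산학협력단 Apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using the same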

Also Published As

Publication number Publication date
KR102321735B1 (ko) 2021-11-04

Similar Documents

Publication Publication Date Title
WO2022114639A1 (fr) Apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using the same
WO2019037197A1 (fr) Method and device for training a topic classifier, and computer-readable storage medium
WO2021080102A1 (fr) Method for training and testing an adaptation network corresponding to an obfuscation network capable of processing data to be concealed for privacy, and training device and testing device using the same
WO2021080103A1 (fr) Method for training and testing a user learning network used to recognize obfuscated data created by concealing original data to protect personal information, and training device and testing device using the same
WO2020082562A1 (fr) Symbol identification method, apparatus, device, and storage medium
WO2020073495A1 (fr) Artificial intelligence-based re-examination method, apparatus, and device, and storage medium
EP3461290A1 (fr) Learning model for salient facial region detection
WO2017007084A1 (fr) Topic extraction device and method
WO2018058959A1 (fr) SQL language verification method and apparatus, server, and storage device
WO2014193041A1 (fr) System and method for analyzing yield using sensor data from manufacturing equipment
WO2020034534A1 (fr) Method, apparatus, and device for producing evidence based on a driving recorder, and storage medium
WO2021091022A1 (fr) Machine learning system and operating method for a machine learning system
WO2015129983A1 (fr) Device and method for recommending a movie based on distributed mining of fuzzy association rules
WO2022039318A1 (fr) Artificial intelligence training method and system using de-identified image data
WO2024063197A1 (fr) Image retouching apparatus and method
WO2022019675A1 (fr) Device and method for analyzing symbols included in a floor plan of a site
WO2020085558A1 (fr) High-speed analysis image processing apparatus and control method thereof
WO2019074316A1 (fr) Convolutional artificial neural network-based recognition system in which registration, search, and reproduction of an image and a video are divided between, and performed by, a mobile device and a server
WO2022086147A1 (fr) Method for training and testing a user learning network used to recognize obfuscated data created by obfuscating original data to protect personal information, and user learning device and testing device using the same
WO2021012508A1 (fr) AI image recognition method, apparatus, and device, and storage medium
WO2021002722A1 (fr) Method for perceiving a situation based on event tagging, and system therefor
WO2023191129A1 (fr) Method for monitoring bills and legal regulations, and program therefor
WO2018236120A1 (fr) Method and device for identifying quasispecies using a negative marker
WO2020204219A1 (fr) Method for classifying outliers in object recognition learning using artificial intelligence, classification device, and robot
WO2020071618A1 (fr) Entropy-based method and system for partial training of a neural network

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21898465; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21898465; Country of ref document: EP; Kind code of ref document: A1)