WO2022114639A1 - Device for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using same - Google Patents
Device for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using same
- Publication number
- WO2022114639A1 (application PCT/KR2021/016695)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- reference value
- column
- column values
- values
- data set
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
Definitions
- the present invention relates to a device for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, which removes the bias of the attribute values corresponding to the attributes of a raw dataset for artificial intelligence learning and thereby creates an artificial intelligence learning dataset with guaranteed fairness, and to a method for ensuring fairness of an artificial intelligence learning dataset using the same.
- Artificial intelligence (AI) algorithms can learn decision-making models, and these models can also learn the biases that often exist in society. Decisions made by an algorithm that can significantly affect a subject's life may be biased by characteristics that should be unrelated to the decision, and may adversely affect individuals who belong to a specific group defined by, for example, gender, race, or religion.
- step 6-1, in which, when the number of column values of the cluster selected in the fifth step is equal to or greater than the reference value, only as many column values as the reference value are included;
- step 7-1, in which a sub-dataset is created by combining the data of a cluster whose number of column values was found in step 6-2 to be 10% or more of the reference value with the data that, in step 6-1, include only as many column values as the reference value;
- the extraction is performed by checking one or more of the number of column values, the number of columns, clustering of columns, the number of clustering of columns, and column headers.
- the reference value is characterized in that it is calculated by [Equation 1] below.
- in the ninth step, the data of the additionally generated sub-dataset are collected by using the number of column values of the selected cluster, the reference value, and the selected column as input values and the sub-dataset additionally generated in step 7-1 as the result value, and are assembled into an association module.
- the present invention provides an apparatus for ensuring fairness of artificial intelligence learning datasets based on multidimensional subset association analysis
- an upload unit 10 for uploading a raw data set file prepared in CSV (Comma-Separated Values) format
- a parsing unit 20 for parsing the structure of the uploaded raw data set file
- a number check unit 30 for checking the number of clustered columns and the total number of column values in the parsed structure
- a reference value calculator 40 for calculating a reference value using the number of clustering of the columns and the number of total column values
- a reference value comparison unit 50 for selecting one cluster and comparing the number of column values of the selected cluster with the reference value
- a data collection unit 90 that collects the data of the sub-dataset additionally generated by the iteration unit 80.
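- the parsing unit 20 extracts the structure by checking one or more of the number of column values, the number of columns, the clustering of columns, the number of column clusters, and the column headers.
- in the apparatus, the reference value is likewise calculated by [Equation 1].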
- the present invention can resolve the problem that decisions made by an algorithm may be biased by characteristics that should be independent of those decisions, and can thereby ensure fairness.
- FIG. 1 is a flowchart showing a method of ensuring fairness of an AI learning dataset based on multidimensional subset association analysis according to the present invention.
- FIG. 2 is a block diagram illustrating an apparatus for ensuring fairness of an AI learning dataset based on multidimensional subset association analysis according to the present invention.
- FIG. 3 is a graph showing a conventional dataset.
- the present invention examines datasets and evaluates classifiers with fairness in mind, noting how undesirable biases can arise in machine learning (ML). To place the machine learning process in the context of fairness, biases are identified and the long-term impact of model predictions made when these biases are not addressed is considered.
- in the first step (S10), a raw dataset file prepared in CSV (Comma-Separated Values) format is uploaded. More specifically, the CSV format is organized into a table, fields, columns, and column values.
- the table is a collection of data in an appropriate format for quick reference, and represents the structure of data in a relational database model as a two-dimensional table. In other words, it is managed in the form of column values and columns, and by designating the key, you can quickly and easily find the data you want.
- the field is a piece of information having a certain meaning, and is the smallest unit of processing in the database system.
- the column value is also called a record or a tuple in a relational database, and indicates a single structured data item in a table.
- a database table can be considered to be composed of column values and columns or fields. Column values in each table represent a series of related data, and all column values in a table have the same structure.
- the column refers to a series of data values of a specific simple data type in a relational database table, one for each column value (record) in the table.
- the second step (S20) parses the structure of the uploaded raw data set file.
- the bias of the attribute values corresponding to the attributes of the raw dataset for artificial intelligence learning is identified by parsing the structure of the raw dataset file prepared in CSV format.
- when parsing the structure of the raw dataset file, it is preferable to extract it by checking any one or more of the number of column values, the number of columns, the clustering of columns, the number of column clusters, and the column names (column headers).
- the clustering of the columns is to classify the values corresponding to each field in the raw data set file into categories, and each column is classified according to features.
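The parsing step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `parse_structure`, the parameter `cluster_column`, and the sample rows are all assumptions introduced here, and each distinct value of the chosen field is treated as one cluster.

```python
import csv
import io
from collections import Counter


def parse_structure(csv_text, cluster_column):
    """Summarize a CSV: record count, column count, headers, and clusters.

    `cluster_column` is a hypothetical parameter naming the field whose
    distinct values are treated as clusters (categories).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    headers = reader.fieldnames or []
    # Count how many records fall into each category of the chosen field.
    clusters = Counter(row[cluster_column] for row in rows)
    return {
        "num_column_values": len(rows),  # records ("column values" in the text)
        "num_columns": len(headers),
        "headers": list(headers),
        "clusters": dict(clusters),      # cluster label -> record count
        "num_clusters": len(clusters),
    }


# Made-up sample data, used only to exercise the function.
sample = (
    "weather_condition,lighting_condition\n"
    "CLEAR,DAYLIGHT\n"
    "RAIN,DARKNESS\n"
    "CLEAR,DAYLIGHT\n"
)
summary = parse_structure(sample, "weather_condition")
print(summary)
```

The returned dictionary carries the two quantities the later steps need: the total number of column values and the number of clusters.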
- a reference value is calculated using the number of column clusters and the total number of column values. More specifically, the reference value is calculated by [Equation 1] below.
- the reference value is the value obtained by dividing the total number of column values by the number of column clusters, that is, reference value = total number of column values / number of column clusters; it is the average number of column values per cluster of a field and serves as the standard value for each cluster.
- the reference value may be 500.
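As a small sketch of the calculation described in words above, the reference value can be computed as the total number of column values divided by the number of clusters. The function name `reference_value` is hypothetical, and the use of integer (floor) division is an assumption, since the text does not state how a non-integer quotient is rounded.

```python
def reference_value(total_column_values, num_clusters):
    """Average number of column values per cluster.

    Floor division is an assumption; the source does not specify rounding.
    """
    if num_clusters <= 0:
        raise ValueError("num_clusters must be positive")
    return total_column_values // num_clusters


# 2000 column values spread over 4 clusters give a reference value of 500.
print(reference_value(2000, 4))
```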
- in the fifth step (S50), one cluster is selected and the number of column values of the selected cluster is compared with the reference value.
- this comparison is executed repeatedly (LOOP) for each cluster.
- the number of column values to be included is determined using the values compared in the fifth step (S50).
- the sixth step (S60) is performed separately in the following steps 6-1 (S61) and 6-2 (S62).
- in step 6-1 (S61), when the number of column values of the cluster selected in the fifth step (S50) is equal to or greater than the reference value, only as many column values as the reference value are included.
- for example, when the reference value is 500 and the first cluster has 600 column values, the first cluster includes only the 500 column values corresponding to the reference value, and the remaining 100 column values are discarded or removed.
- in step 6-2 (S62), when the number of column values of the cluster selected in the fifth step (S50) is less than the reference value, the number of column values of the selected cluster is compared with 10% of the reference value.
- for example, when the reference value is 500, 10% of the reference value is 50; if the second cluster has 100 column values, it is confirmed that the number of column values of the second cluster exceeds 10% of the reference value.
- the seventh step (S70) uses the results of step 6-1 (S61) and step 6-2 (S62) and creates a sub-dataset by combining the data that include the determined number of column values.
- the seventh step (S70) is performed separately in the following steps 7-1 (S71) and 7-2 (S72).
- in step 7-1 (S71), a sub-dataset is created by combining the data of a cluster whose number of column values was found in step 6-2 (S62) to be 10% or more of the reference value with the data that, in step 6-1 (S61), include only as many column values as the reference value.
- for example, since the number of column values of the second cluster (100) is greater than 10% of the reference value (50), the second cluster is included in the sub-dataset.
- the first cluster is included in the sub-dataset with only the 500 column values corresponding to the reference value.
- in step 7-2 (S72), when the number of column values of the cluster selected in step 6-2 (S62) is less than 10% of the reference value, the column values of the selected cluster are removed.
- for example, when the reference value is 500, 10% of the reference value is 50; a cluster with 30 column values falls below this threshold, so its column values are removed.
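The three outcomes of steps 6-1, 7-1, and 7-2 can be summarized in a short decision function. This is only a sketch of the rule as described in the text; the name `column_values_to_keep` is hypothetical, and the 10% comparison is written with integer arithmetic (`10 * cluster_size >= reference`) to avoid floating-point rounding at the boundary.

```python
def column_values_to_keep(cluster_size, reference):
    """Return how many column values of a cluster go into the sub-dataset."""
    if cluster_size >= reference:
        # Step 6-1: cap the cluster at the reference value.
        return reference
    if 10 * cluster_size >= reference:
        # Step 7-1: at least 10% of the reference value, keep the whole cluster.
        return cluster_size
    # Step 7-2: below 10% of the reference value, remove the cluster.
    return 0


# The examples from the text, with a reference value of 500.
for size in (600, 100, 30):
    print(size, "->", column_values_to_keep(size, 500))
```

With a reference value of 500, a cluster of 600 is cut to 500, a cluster of 100 is kept whole, and a cluster of 30 is removed.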
- the fifth step (S50) through the seventh step (S70) are repeated (Loop) as many times as there are column clusters, so that sub-datasets are additionally generated.
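The loop over clusters might look like the following sketch when applied to actual records. Several details are assumptions made for illustration: the function name `build_sub_dataset`, the representation of records as dictionaries, floor division for the reference value, and the choice to keep the first records in file order when a cluster is capped (the text does not say which records are discarded).

```python
from collections import defaultdict


def build_sub_dataset(rows, column):
    """Group records by `column`, then cap, keep, or drop each cluster."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    if not groups:
        return []

    reference = len(rows) // len(groups)  # average records per cluster
    sub_dataset = []
    for members in groups.values():       # one loop pass per cluster
        if len(members) >= reference:
            # Cap at the reference value (first records kept: an assumption).
            sub_dataset.extend(members[:reference])
        elif 10 * len(members) >= reference:
            # At least 10% of the reference value: keep the whole cluster.
            sub_dataset.extend(members)
        # Otherwise the cluster is left out entirely.
    return sub_dataset


# Made-up data: 60 CLEAR, 29 RAIN, 1 SNOW (90 records, reference value 30).
rows = (
    [{"weather_condition": "CLEAR"}] * 60
    + [{"weather_condition": "RAIN"}] * 29
    + [{"weather_condition": "SNOW"}] * 1
)
result = build_sub_dataset(rows, "weather_condition")
print(len(result))
```

In this made-up input, CLEAR is capped at 30 records, RAIN keeps all 29, and the single SNOW record is dropped because it is below 10% of the reference value.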
- in the ninth step (S90), the data of the additionally generated sub-datasets are collected. More specifically, the collection uses the number of column values of the selected cluster, the reference value, and the selected column as input values, takes the sub-dataset additionally generated in step 7-1 (S71) as the result value (output), and proceeds through re-execution of the loop.
- for example, if the input values in a column with a total of 100 values are 35 white males, 30 white females, 20 black males, and 15 black females, then after column value extraction in the ninth step (S90), 25 white males, 25 white females, 20 black males, and 15 black females are collected, and the result is a fair subset with 85 values.
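The arithmetic of this example can be checked with a few lines of code. The sketch below works on counts only and assumes, consistently with the numbers in the text, that each combination of race and gender is one cluster and that the reference value is 100 / 4 = 25; the variable names are illustrative.

```python
# Cluster sizes from the example in the text.
counts = {
    ("white", "male"): 35,
    ("white", "female"): 30,
    ("black", "male"): 20,
    ("black", "female"): 15,
}

total = sum(counts.values())      # total number of column values
reference = total // len(counts)  # average per cluster


def kept(count, reference):
    """Cap at the reference value; drop clusters below 10% of it."""
    if 10 * count < reference:
        return 0
    return min(count, reference)


fair_counts = {cluster: kept(n, reference) for cluster, n in counts.items()}
print(reference, fair_counts, sum(fair_counts.values()))
```

The two clusters above the reference value are cut to 25 each, the two smaller clusters are kept unchanged, and the collected subset has 25 + 25 + 20 + 15 = 85 values, matching the figure stated above.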
- the apparatus consists of an upload unit 10, a parsing unit 20, a number check unit 30, a reference value calculation unit 40, a reference value comparison unit 50, a sub-dataset generation unit 60, a column value removal unit 70, an iteration unit 80, and a data collection unit 90.
- the upload unit 10 uploads the file after checking whether the raw dataset file is in CSV (Comma-Separated Values) format. More specifically, the CSV format is organized into a table, fields, columns, and column values.
- the parsing unit 20 parses the structure of the raw data set file.
- when the parsing unit 20 parses the structure of the raw dataset file, it is preferable to extract it by checking any one or more of the number of column values, the number of columns, the clustering of columns, the number of column clusters, and the column headers.
- the clustering of the columns is to classify the values corresponding to each field in the raw data set file into categories, and each column is classified according to features.
- the number check unit 30 checks the number of clustered columns and the total number of column values in the parsed structure.
- the reference value calculator 40 calculates a reference value using the number of clusters of the column and the number of total column values.
- the reference value is characterized in that it is calculated by [Equation 1] below.
- the reference value is the value obtained by dividing the total number of column values by the number of column clusters; it is the average number of column values per cluster of a field, so that a standard value for each cluster can be identified.
- the reference value may be 500.
- the reference value comparison unit 50 selects one cluster and compares the number of column values of the selected cluster with the reference value.
- for example, when the reference value is 500 and the first cluster has 600 column values, the first cluster includes only the 500 column values corresponding to the reference value.
- the sub-dataset generation unit 60 creates a sub-dataset by combining the data of a selected cluster whose number of column values is 10% or more of the reference value with the data that include only as many column values as the reference value.
- for example, when the reference value is 500, 10% of the reference value is 50; if the second cluster has 100 column values, it is confirmed that the number of column values of the second cluster exceeds 10% of the reference value.
- the column value removal unit 70 removes the column values of the selected cluster when the number of column values of the selected cluster is less than 10% of the reference value.
- the column values of the third cluster are deleted or removed, as in step 7-2 (S72).
- the iteration unit 80 repeats (Loop) the operations of the reference value comparison unit 50, the sub-dataset generation unit 60, and the column value removal unit 70 as many times as there are column clusters, thereby creating additional sub-datasets.
- the data collection unit 90 collects data of the sub-dataset additionally generated by the iteration unit 80 .
- the data collection unit 90 collects, through re-execution of the loop, the number of column values of a specific cluster, the reference value, and the specific column as input values and the sub-dataset as the result value.
- the apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis and the method of ensuring fairness of an artificial intelligence learning dataset using the same were applied to the TRAFFIC CRASHES-CRASHES dataset in the TRANSPORTATION domain of the CHICAGO DATA PORTAL.
- the dataset and the intended task are as follows.
- the NUMERIC FEATURES are as follows.
- posted_speed_limit: information on the speed of the vehicle recorded by the police at the time of the accident.
- the CATEGORICAL FEATURES are as follows.
- weather_condition: weather information recorded by the police at the time of the accident. Examples of input values: CLEAR, RAIN, etc.
- lighting_condition: lighting (brightness) information recorded by the police at the time of the accident.
- roadway_surface_cond: information on the condition of the road surface at the accident point recorded by the police at the time of the accident. Examples of input values: DRY, ICE, WET, etc.
- first_crash_type: information on the type of accident recorded by the police at the time of the accident.
- a PREDICTION task is run to determine if the cost of repairing a car accident exceeds $1,500.
- LABEL indicates whether the cost of repairing a car accident is over $1,500.
- FIG. 4 shows a dataset processed by the present invention; when it is filtered by the categorical features (CATEGORICAL FEATURES), the graph confirms that the deviation in the data of each feature is small.
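One simple way to put a number on the deviation across the categories of a feature is to compare the largest and smallest category counts before and after processing. The sketch below is an illustration only: the text does not define how deviation is measured, so the max-minus-min spread, the function name `category_spread`, and the sample rows are all assumptions.

```python
from collections import Counter


def category_spread(rows, feature):
    """Difference between the largest and smallest category counts."""
    counts = Counter(row[feature] for row in rows)
    if not counts:
        return 0
    return max(counts.values()) - min(counts.values())


# Made-up data: 8 CLEAR and 2 RAIN records (reference value 10 // 2 = 5).
before = [{"weather_condition": "CLEAR"}] * 8 + [{"weather_condition": "RAIN"}] * 2
# After capping CLEAR at the reference value of 5 and keeping RAIN as is.
after = [{"weather_condition": "CLEAR"}] * 5 + [{"weather_condition": "RAIN"}] * 2

print(category_spread(before, "weather_condition"))
print(category_spread(after, "weather_condition"))
```

In this toy case the spread falls from 6 to 3, which is the kind of reduction in deviation the figure is described as showing.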
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a device for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and to a method for ensuring fairness of an artificial intelligence learning dataset using same, the device removing the bias of attribute values, which correspond to raw dataset attributes for artificial intelligence learning, and creating an artificial intelligence learning dataset with guaranteed fairness.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2020-0161943 | 2020-11-27 | ||
KR20200161943 | 2020-11-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022114639A1 (fr) | 2022-06-02 |
Family
ID=78521431
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2021/016695 WO2022114639A1 (fr) | 2020-11-27 | 2021-11-16 | Device for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using same |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR102321735B1 (fr) |
WO (1) | WO2022114639A1 (fr) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
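KR102321735B1 (ko) * | 2020-11-27 | 2021-11-04 | 부산대학교 산학협력단 | Device for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using same |
KR20230149914A (ko) | 2022-04-20 | 2023-10-30 | 서울시립대학교 산학협력단 | Apparatus and method for training an artificial intelligence model |
KR20240032288A (ko) | 2022-09-02 | 2024-03-12 | 동국대학교 산학협력단 | Apparatus and method for constructing a hybrid green tea learning dataset comprising a quantitative green tea learning dataset generation module and a qualitative green tea learning dataset generation module |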
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20160096460A (ko) * | 2015-02-05 | 2016-08-16 | 삼성전자주식회사 | Deep-learning-based recognition system including a plurality of classifiers, and control method thereof |
JP2018113048A (ja) * | 2013-02-20 | 2018-07-19 | ハートフォード スチーム ボイラー インスペクション アンド インシュアランス カンパニー | System and method for reducing outlier bias |
KR102005628B1 (ko) * | 2017-04-26 | 2019-07-30 | 김정희 | Method and system for preprocessing learning data |
US20200081865A1 (en) * | 2018-09-10 | 2020-03-12 | Google Llc | Rejecting Biased Data Using a Machine Learning Model |
KR20200046899A (ko) * | 2018-10-26 | 2020-05-07 | 삼성에스디에스 주식회사 | Method and apparatus for extracting data of interest |
KR102321735B1 (ko) * | 2020-11-27 | 2021-11-04 | 부산대학교 산학협력단 | Device for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and method for ensuring fairness of an artificial intelligence learning dataset using same |
- 2020-12-29: KR application KR1020200185398A, patent KR102321735B1 (ko), active, IP Right Grant
- 2021-11-16: WO application PCT/KR2021/016695, patent WO2022114639A1 (fr), active, Application Filing
Also Published As
Publication number | Publication date |
---|---|
KR102321735B1 (ko) | 2021-11-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21898465; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: PCT application non-entry in European phase | Ref document number: 21898465; Country of ref document: EP; Kind code of ref document: A1 |