WO2022114639A1 - Device for ensuring fairness of artificial intelligence learning data set based on multidimensional subset association analysis, and method for ensuring fairness of artificial intelligence learning data set by using same

Info

Publication number: WO2022114639A1
Application number: PCT/KR2021/016695
Authority: WO
Inventors: 권준호; 김구; 김진우; 신현실
Original assignee: 부산대학교 산학협력단
Priority date: 2020-11-27
Filing date: 2021-11-16
Publication date: 2022-06-02
Also published as: KR102321735B1

Abstract

The present invention relates to a device for ensuring fairness of an artificial intelligence learning data set based on a multidimensional subset association analysis, and a method for ensuring fairness of an artificial intelligence learning data set by using same, the device removing the biases of attribute values, which correspond to raw data set attributes for artificial intelligence learning, and creating an artificial intelligence learning data set with ensured fairness.

Description

Apparatus for ensuring fairness of AI learning dataset based on multidimensional subset association analysis and method of ensuring fairness of AI training dataset using the same

The present invention relates to a device for ensuring the fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, which removes the bias of attribute values corresponding to the attributes of a raw dataset for artificial intelligence learning and creates an artificial intelligence learning dataset with guaranteed fairness, and to a method for ensuring the fairness of an artificial intelligence learning dataset using the same.

Recently, artificial intelligence algorithms have been advancing rapidly, and they can be used to predict risks and impacts in advance in order to assist or automate human judgment. In fields such as public security, policy, finance, medicine, and recruitment, AI algorithms learn decision-making models, and these models can also learn the biases that often exist in society. Decisions made by an algorithm that can significantly affect a person's life may be biased by characteristics that should be unrelated to the decision, and may adversely affect individuals belonging to a specific group defined by, for example, gender, race, or religion.

With the rapid development of artificial intelligence (AI) technology and its application to various industries, it is necessary to discuss the adverse effects of artificial intelligence and its effect on society as a whole. In particular, fairness problems such as bias or lack of transparency with respect to gender, race, and social group are becoming an issue.

For the proper operation and results of artificial intelligence (AI), it is very important to ensure the fairness of the dataset used for learning. For example, if an artificial intelligence that distinguishes males from females is trained on a dataset in which the numbers of male and female records are plainly unbalanced (e.g., male : female = 80,000 rows : 20,000 rows), incorrect results will occur, and the cause can be said to be the unfairness of the dataset.

In the example described above, if bias also occurs in other data-dependent properties of the male and female records, proper AI learning results cannot be expected merely by matching the male and female data ratios. Currently, specific methods and technologies for ensuring the fairness of datasets have not been developed.

The present invention has been devised to solve the above problems, and an object of the present invention is to provide a device and a method for ensuring fairness that solve the problem of decisions made by an algorithm being biased by characteristics that should be independent of those decisions.

The technical problems to be solved by the invention are not limited to the technical problems mentioned above, and other technical problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention belongs from the description below.

A method of ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis according to the present invention includes:

A first step of uploading a raw data set file prepared in CSV (Comma-Separated Values) format;

a second step of parsing the structure of the uploaded raw data set file;

a third step of checking the number of clustering columns and the total number of column values in the parsed structure;

a fourth step of calculating a reference value using the number of clustered columns and the total number of column values;

a fifth step of selecting one cluster and comparing the number of column values of the selected cluster with the reference value;

a step 6-1 of including only as many column values as the reference value when the number of column values of the selected cluster in the fifth step is equal to or greater than the reference value;

a step 6-2 of comparing the number of column values of the selected cluster with a value of 10% of the reference value when the number of column values of the selected cluster in step 5 is less than the reference value;

a step 7-1 of creating a sub-dataset by combining the data of the selected cluster when, in step 6-2, the number of column values of the selected cluster is 10% or more of the reference value, with the data that, in step 6-1, includes only as many column values as the reference value;

a 7-2 step of removing the column value of the selected cluster when the number of column values of the selected cluster in step 6-2 is less than 10% of the reference value;

an eighth step of repeating (looping) steps 5 to 7 as many times as the number of column clusters to additionally create sub-datasets; and

a ninth step of collecting the data of the additionally generated sub-datasets.

The parsing of the structure of the raw data set file in the second step is characterized in that extraction is performed by checking one or more of the number of column values, the number of columns, the clustering of columns, the number of clustering of columns, and the column headers.

The reference value is characterized in that it is calculated by [Equation 1] below.

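[Equation 1]

$$\text{reference value} = \frac{\text{total number of column values}}{\text{number of column clusters}}$$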

The data collection of the sub-datasets additionally generated in the ninth step is characterized in that an association module takes the number of column values of the selected cluster, the reference value, and the selected column as input values and collects the sub-data additionally generated in step 7-1 as the result value.

In addition, the present invention provides an apparatus for ensuring fairness of artificial intelligence learning datasets based on multidimensional subset association analysis,

an upload unit 10 for uploading a raw data set file prepared in CSV (Comma-Separated Values) format;

a parsing unit 20 for parsing the structure of the uploaded raw data set file;

a number check unit 30 for checking the number of clustered columns and the total number of column values in the parsed structure;

a reference value calculator 40 for calculating a reference value using the number of clustering of the columns and the number of total column values;

a reference value comparison unit 50 for selecting one cluster and comparing the number of column values of the selected cluster with the reference value;

a lower data set generating unit 60 for creating a lower data set by combining the data of the selected cluster when its number of column values is 10% or more of the reference value with the data that includes only as many column values as the reference value;

a column value removal unit 70 that removes the column values of the selected cluster when the number of column values in the selected cluster is less than 10% of the reference value;

an iterative performing unit 80 that repeatedly runs (loops) the reference value comparison unit 50, the lower data set generating unit 60, and the column value removal unit 70 as many times as the number of column clusters to additionally generate lower data sets; and

a data collection unit 90 that collects the data of the sub-datasets additionally generated by the iterative performing unit 80.

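The parsing unit 20 is characterized in that it extracts by checking one or more of the number of column values, the number of columns, the clustering of columns, the number of clustering of columns, and the column header in the structure of the raw data set file.

The reference value is characterized in that it is calculated by [Equation 1] below.

[Equation 1]

$$\text{reference value} = \frac{\text{total number of column values}}{\text{number of column clusters}}$$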

The data collection unit 90 is characterized in that an association module takes the number of column values of the selected cluster, the reference value, and the selected column as input values and collects the additionally generated sub-data as the result value.

Through the above means, the present invention can ensure fairness by solving the problem of decisions made by an algorithm being biased by characteristics that should be independent of those decisions.

FIG. 1 is a flowchart showing a method of ensuring fairness of an AI learning dataset based on multidimensional subset association analysis according to the present invention.

FIG. 2 is a block diagram illustrating an apparatus for ensuring fairness of an AI learning dataset based on multidimensional subset association analysis according to the present invention.

FIG. 3 is a graph showing a conventional data set.

FIG. 4 is a graph showing a data set processed by the present invention.

Terms used in this specification will be briefly described, and the present invention will be described in detail.

The terms used in the present invention have been selected, as far as possible, from general terms that are currently in wide use, taking into account their functions in the present invention; these terms may vary depending on the intention of a person skilled in the art, precedent, the emergence of new technology, and the like. Therefore, a term used in the present invention should be defined based on its meaning and the overall content of the present invention, rather than simply on its name.

In the entire specification, when a part “includes” a certain element, it means that other elements may be further included, rather than excluding other elements, unless otherwise stated.

Hereinafter, with reference to the accompanying drawings, the embodiments of the present invention will be described in detail so that those of ordinary skill in the art can easily implement them. However, the present invention may be embodied in several different forms and is not limited to the embodiments described herein.

Specific details including the problem to be solved for the present invention, the means for solving the problem, and the effect of the invention are included in the embodiments and drawings to be described below. Advantages and features of the present invention, and a method for achieving them will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings.

Hereinafter, the present invention will be described in more detail with reference to the accompanying drawings.

The present invention examines datasets and evaluates classifiers with fairness in mind, noting how undesirable biases can arise in machine learning (ML). To frame the machine learning process in terms of fairness, we identify biases and consider the long-term impact of model predictions when these biases are not addressed.

The method of the present invention for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis is performed by the following steps, as shown in FIG. 1.

First, in the first step (S10), a raw data set file prepared in CSV (Comma-Separated Values) format is uploaded. More specifically, the CSV (Comma-Separated Values) format is organized as a table made up of fields, columns, and column values.

The table is a collection of data in an appropriate format for quick reference, and represents the structure of data in a relational database model as a two-dimensional table. In other words, it is managed in the form of column values and columns, and by designating the key, you can quickly and easily find the data you want.

The field is a piece of information having a certain meaning, and is the smallest unit of processing in the database system.

The column value (that is, a row) is also called a record or a tuple in a relational database, and indicates a single structured data item in a table. In simple terms, a database table can be considered to be composed of column values and columns or fields. The column values in each table represent a series of related data, and all column values in a table have the same structure.

The column refers to a set of data values of a particular simple data type in a relational database table, one value for each row of the table.

Next, in the second step (S20), the structure of the uploaded raw data set file is parsed. In the present invention, the structure of the raw dataset file prepared in CSV format is parsed as the basis for removing the bias of the attribute values corresponding to the attributes of the raw dataset for artificial intelligence learning.

In the second step (S20), the structure of the raw data set file is preferably parsed by checking and extracting any one or more of the number of column values, the number of columns, the clustering of columns, the number of clustering of columns, and the column names (column headers).

The clustering of the columns is to classify the values corresponding to each field in the raw data set file into categories, and each column is classified according to features.

Next, in the third step (S30), the number of clustered columns and the total number of column values in the parsed structure are checked.
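As a non-limiting illustration of the second and third steps, the following sketch parses a CSV text and extracts the column headers, the number of columns, the number of column values (rows), and the clusters of one chosen column. The function name `parse_structure`, the dictionary keys, and the sample data are assumptions made for illustration and do not appear in the original disclosure.

```python
import csv
import io
from collections import Counter


def parse_structure(csv_text, cluster_column):
    """Extract basic structural facts from a CSV text (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)                      # each row is one "column value" (record)
    headers = list(reader.fieldnames or [])  # column headers from the first line
    # Cluster the chosen column by its distinct values and count each cluster.
    clusters = Counter(row[cluster_column] for row in rows)
    return {
        "headers": headers,
        "num_columns": len(headers),
        "num_rows": len(rows),
        "clusters": dict(clusters),
        "num_clusters": len(clusters),
    }


# Illustrative sample only; not data from the disclosure.
sample = (
    "weather_condition,lighting_condition\n"
    "CLEAR,DAYLIGHT\n"
    "RAIN,DARKNESS\n"
    "CLEAR,DARKNESS\n"
)
info = parse_structure(sample, "weather_condition")
print(info)
```

The counts returned here (`num_rows` and `num_clusters`) are the two quantities that the fourth step needs in order to calculate the reference value.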

Next, in the fourth step (S40), a reference value is calculated using the number of column clusters and the total number of column values. More specifically, the reference value is calculated by [Equation 1] below.

[Equation 1]

$$\text{reference value} = \frac{\text{total number of column values}}{\text{number of column clusters}}$$

The reference value is obtained by dividing the total number of column values by the number of column clusters. It is the average number of column values per cluster and serves as the standard size against which each cluster is checked.

As an example, when the total number of column values is 2,000 and the number of clustering columns is 4, the reference value may be 500.
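The calculation in the example above can be sketched as follows. The function name `reference_value` is an assumption, and the disclosure does not state how a non-integer quotient should be rounded, so this sketch simply returns the unrounded quotient.

```python
def reference_value(total_rows, num_clusters):
    """Average number of column values (rows) per cluster."""
    if num_clusters <= 0:
        raise ValueError("at least one cluster is required")
    return total_rows / num_clusters


ref = reference_value(2000, 4)
print(ref)  # 500.0
```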

Next, in the fifth step (S50), one cluster is selected and the number of column values of the selected cluster is compared with the reference value. The fifth step (S50) is repeated (looped) for each cluster.

Next, in the sixth step (S60), the number of column values to be included is determined using the values compared in the fifth step (S50). The sixth step (S60) is performed separately in the following steps 6-1 (S61) and 6-2 (S62).

In step 6-1 (S61), when the number of column values of the cluster selected in the fifth step (S50) is equal to or greater than the reference value, only as many column values as the reference value are included.

In one embodiment, when the reference value is 500 and the number of column values in the first cluster is 600, only 500 column values of the first cluster, corresponding to the reference value, are included, and the remaining 100 column values are discarded or removed.
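The disclosure does not specify which of the surplus column values are discarded in step 6-1. The following sketch shows two possible choices under that assumption: keeping the first rows in file order, or drawing a reproducible random sample. The function name `truncate_cluster` and the `seed` parameter are hypothetical.

```python
import random


def truncate_cluster(rows, limit, seed=None):
    """Keep at most `limit` rows of one cluster (selection rule is an assumption)."""
    if len(rows) <= limit:
        return list(rows)
    if seed is None:
        # Assumption 1: keep the first `limit` rows in file order.
        return list(rows[:limit])
    # Assumption 2: keep a reproducible random sample of `limit` rows.
    return random.Random(seed).sample(list(rows), limit)


first_cluster = list(range(600))   # stand-in for 600 rows
kept = truncate_cluster(first_cluster, 500)
print(len(kept), len(first_cluster) - len(kept))  # 500 100
```

A random sample avoids favoring rows that happen to appear early in the file, whereas keeping the first rows is simpler and fully deterministic; either choice yields the 500 retained and 100 discarded column values of the embodiment.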

In step 6-2 (S62), when the number of column values of the cluster selected in the fifth step (S50) is less than the reference value, the number of column values of the selected cluster is compared with 10% of the reference value.

In one embodiment, when the reference value is 500 and the number of column values in the second cluster is 100, the value 50, which is 10% of the reference value, is compared with 100, the number of column values in the second cluster, and it is confirmed that the number of column values in the second cluster exceeds 10% of the reference value.

Next, the seventh step (S70) uses the results of step 6-1 (S61) and step 6-2 (S62) and combines the data containing the determined number of column values to create a sub-dataset. The seventh step (S70) is performed separately as the following steps 7-1 (S71) and 7-2 (S72).

In step 7-1 (S71), a sub-dataset is created by combining the data of the clusters whose number of column values was found in step 6-2 (S62) to be 10% or more of the reference value with the data that, in step 6-1 (S61), was limited to as many column values as the reference value.

In an embodiment, since the number of column values of the second cluster (100) is greater than 10% of the reference value (50), the second cluster is included in the generated sub-dataset.

In addition, the first cluster is included in the generated sub-dataset with only the 500 column values corresponding to the reference value.

In step 7-2 (S72), when the number of column values of the cluster selected in step 6-2 (S62) is less than 10% of the reference value, the column values of the selected cluster are removed.

In one embodiment, when the reference value is 500 and the number of column values in the third cluster is 30, the number of column values in the third cluster is less than the reference value, so 30 is compared with 50, which is 10% of the reference value. Because the number is determined in step 6-2 (S62) to be less than 10% of the reference value, the column values of the third cluster are deleted or removed in step 7-2 (S72).
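The branching of steps 6-1, 6-2, 7-1, and 7-2 for a single cluster can be summarized in the following sketch, which replays the three embodiments above (clusters of 600, 100, and 30 column values against a reference value of 500). The function name `decide` and the returned labels are assumptions chosen for illustration.

```python
def decide(cluster_size, reference, min_ratio=0.10):
    """Return the action for one cluster according to steps 6-1 to 7-2."""
    if cluster_size >= reference:
        return "truncate"   # step 6-1: keep only `reference` column values
    if cluster_size >= min_ratio * reference:
        return "keep"       # steps 6-2 and 7-1: keep the whole cluster
    return "remove"         # steps 6-2 and 7-2: drop the whole cluster


for size in (600, 100, 30):
    print(size, decide(size, 500))
# 600 truncate
# 100 keep
# 30 remove
```

A cluster whose size is exactly 10% of the reference value is kept in this sketch, following the wording "10% or more" used for step 7-1.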

Next, in the eighth step (S80), the fifth step (S50) to the seventh step (S70) are repeatedly performed (Loop) by the number of clustering of the column to additionally generate a sub-dataset.

Next, the ninth step (S90) collects the data of the additionally generated sub-datasets. More specifically, the data collection in the ninth step takes the number of column values of the selected cluster, the reference value, and the selected column as input values, collects the sub-dataset additionally generated in step 7-1 (S71) as the result value (output), and gathers these results as the loop is re-executed.

In an embodiment according to the present invention, if a column with a total of 100 values contains 35 white males, 30 white females, 20 black males, and 15 black females as input, then after the column values are extracted in the ninth step (S90), 25 white males, 25 white females, 20 black males, and 15 black females are collected, and the result is a fair subset with 85 values.
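The embodiment above can be reproduced with the following minimal sketch of steps 3 to 9, offered as one possible reading rather than the definitive implementation. The function name `balance`, the use of a tuple of attribute values as the cluster key, flooring a fractional reference value with `int`, and keeping the first rows of an oversized cluster are all assumptions not stated in the disclosure.

```python
from collections import defaultdict


def balance(rows, cluster_keys, min_ratio=0.10):
    """Build a balanced subset by truncating, keeping, or dropping each cluster."""
    if not rows:
        return []
    # Step 3: group rows into clusters keyed by the chosen attribute values.
    clusters = defaultdict(list)
    for row in rows:
        clusters[tuple(row[k] for k in cluster_keys)].append(row)
    # Step 4: reference value = total rows / number of clusters.
    reference = len(rows) / len(clusters)
    collected = []
    # Step 8: loop steps 5 to 7 once per cluster.
    for members in clusters.values():
        if len(members) >= reference:
            # Step 6-1 (assumption: floor the reference, keep the first rows).
            collected.extend(members[: int(reference)])
        elif len(members) >= min_ratio * reference:
            collected.extend(members)   # step 7-1: keep the whole cluster
        # Step 7-2: clusters below 10% of the reference value are dropped.
    return collected                    # step 9: collected sub-datasets


counts = {
    ("white", "male"): 35,
    ("white", "female"): 30,
    ("black", "male"): 20,
    ("black", "female"): 15,
}
rows = [{"race": r, "gender": g} for (r, g), n in counts.items() for _ in range(n)]
balanced = balance(rows, ["race", "gender"])
print(len(balanced))  # 85
```

With 100 rows in 4 clusters the reference value is 25, so the two larger clusters are cut to 25 rows each and the two smaller ones are kept whole, giving 25 + 25 + 20 + 15 = 85 rows.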

In addition, as shown in FIG. 2, the apparatus for ensuring fairness of artificial intelligence learning datasets based on multidimensional subset association analysis according to the present invention consists of an upload unit 10, a parsing unit 20, a number check unit 30, a reference value calculation unit 40, a reference value comparison unit 50, a lower data set generation unit 60, a column value removal unit 70, an iterative execution unit 80, and a data collection unit 90.

First, the upload unit 10 uploads the file after checking whether the raw data set file is in CSV (Comma-Separated Values) format. More specifically, the CSV (Comma-Separated Values) format forms a table, a field, a column, and a column value.

Next, the parsing unit 20 parses the structure of the raw data set file. The parsing unit 20 preferably parses the structure of the raw data set file by checking and extracting any one or more of the number of column values, the number of columns, the clustering of columns, the number of clustering of columns, and the column names (column headers).

Next, the number check unit 30 checks the number of clustered columns and the total number of column values in the parsed structure.

Next, the reference value calculator 40 calculates a reference value using the number of column clusters and the total number of column values. The reference value is calculated by [Equation 1] below.

[Equation 1]

$$\text{reference value} = \frac{\text{total number of column values}}{\text{number of column clusters}}$$

The reference value is obtained by dividing the total number of column values by the number of column clusters. It is the average number of column values per cluster, so that a standard value for each cluster can be identified.

As an embodiment, when the total number of column values is 2,000 and the number of column clusters is 4, the reference value may be 500.

Next, the reference value comparison unit 50 selects one cluster and compares the number of column values of the selected cluster with the reference value.

In an embodiment, when the reference value is 500, when the number of column values in the first cluster is 600, the first cluster includes only 500 column values corresponding to the reference value.

Next, the lower data set generating unit 60 creates a lower data set by combining the data of the selected cluster when its number of column values is 10% or more of the reference value with the data limited to as many column values as the reference value.

Next, the column value removal unit 70 removes the column values of the selected cluster when the number of column values of the selected cluster is less than 10% of the reference value.

For example, when the reference value is 500 and the number of column values in the third cluster is 30, the number of column values in the third cluster is less than the reference value, so 30 is compared with 50, which is 10% of the reference value. Because the number is less than 10% of the reference value, the column values of the third cluster are deleted or removed, as in step 7-2 (S72).

Next, the iteration performing unit 80 repeatedly runs (loops) the reference value comparison unit 50, the lower data set generation unit 60, and the column value removal unit 70 as many times as the number of column clusters to create additional sub-datasets.

Next, the data collection unit 90 collects the data of the sub-datasets additionally generated by the iteration performing unit 80. The data collection unit 90 takes the number of column values of a specific cluster, the reference value, and a specific column as input values, collects the sub-dataset as the result value, and gathers these results as the loop is re-executed.

In one embodiment, the present inventors applied the apparatus for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis, and the method of ensuring fairness of an artificial intelligence learning dataset using the same, to the TRAFFIC CRASHES - CRASHES dataset in the TRANSPORTATION domain of the CHICAGO DATA PORTAL.

The dataset and the prediction task are as follows.

- SOURCE : TRAFFIC CRASHES - CRASHES

- Only FIELDs that can affect machine learning fairness in the original dataset are selected and used for training

- Use a modified version of the DAMAGE FIELD of the original dataset

· $500 OR LESS : ≤ 1500

· $501 - $1,500 : ≤ 1500

· OVER $1,500 : > 1500

The NUMERIC FEATURES are as follows.

· posted_speed_limit : information on the speed of the vehicle recorded by the police at the time of the accident

The categorical features are as follows.

· weather_condition: Weather information recorded by the police at the time of the accident. Examples of input values: CLEAR , RAIN etc.

· lighting_condition : Lighting (brightness) information recorded by the police at the time of the accident. Example input values: DAYLIGHT, DARKNESS, DARKNESS LIGHTED ROAD etc.

· roadway_surface_cond : Information on the condition of the road surface at the accident point recorded by the police at the time of the accident. Examples of input values: DRY, ICE, WET etc.

· first_crash_type : Information on the type of accident recorded by the police at the time of the accident. Example input values: REAR END, TURNING, ANGLE, PARKED MOTOR VEHICLE etc.

A PREDICTION task is run to determine if the cost of repairing a car accident exceeds $1,500.

LABEL indicates whether the cost of repairing a car accident is over $1,500.
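As a non-limiting illustration of how the modified DAMAGE field could be turned into the LABEL, the following sketch maps each damage category to a binary value. The exact category strings, the mapping name `DAMAGE_OVER_1500`, and the function name `damage_label` are assumptions and should be checked against the actual dataset.

```python
# Assumed spellings of the DAMAGE categories; verify against the source file.
DAMAGE_OVER_1500 = {
    "$500 OR LESS": False,
    "$501 - $1,500": False,
    "OVER $1,500": True,
}


def damage_label(damage):
    """Return 1 if the recorded damage exceeds $1,500, otherwise 0."""
    return int(DAMAGE_OVER_1500[damage.strip().upper()])


labels = [damage_label(v) for v in ("$500 OR LESS", "$501 - $1,500", "OVER $1,500")]
print(labels)  # [0, 0, 1]
```

An unknown category raises a `KeyError` in this sketch, which makes unexpected spellings visible instead of silently assigning them a label.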

FIG. 3 shows a conventional dataset, and when it is filtered by categorical features (CATEGORICAL FEATURES), as shown in the graph, it can be seen that the graph deviation is large in the data of each feature.

On the other hand, FIG. 4 shows a dataset executed by the present invention, and when it is filtered by categorical features (CATEGORICAL FEATURES), as shown in the graph, it can be confirmed that the graph deviation is small in the data of each feature.
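The disclosure compares the two graphs visually and does not name a numeric measure of the deviation. One possible way to quantify it, offered purely as an assumption, is the population standard deviation of the category counts of a feature. The sketch below applies this to the cluster counts from the 100-value embodiment described earlier (35, 30, 20, 15 before and 25, 25, 20, 15 after), not to the traffic crash data; the function name `count_spread` is hypothetical.

```python
from collections import Counter
from statistics import pstdev


def count_spread(values):
    """Population standard deviation of the per-category counts."""
    counts = Counter(values)
    return pstdev(list(counts.values()))


before = ["a"] * 35 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15
after = ["a"] * 25 + ["b"] * 25 + ["c"] * 20 + ["d"] * 15

print(count_spread(before), count_spread(after))
```

A smaller value after processing indicates that the category counts lie closer to one another, which corresponds to the smaller graph deviation described for FIG. 4.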

As such, those skilled in the art to which the present invention pertains will understand that the above-described technical configuration of the present invention may be implemented in other specific forms without changing the technical spirit or essential characteristics of the present invention.

Therefore, the embodiments described above are to be understood as illustrative and not restrictive in all respects. The scope of the present invention is indicated by the following claims rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present invention.

Claims

a first step of uploading, by the uploading unit 10, a raw data set file prepared in CSV (Comma-Separated Values) format;

a second step of parsing, by the parsing unit 20, the structure of the uploaded raw data set file;

a third step in which the number check unit 30 checks the number of clustered columns and the total number of column values in the parsed structure;

a fourth step in which the reference value calculator 40 calculates a reference value by using the number of clustered columns and the total number of column values;

a fifth step of comparing the reference value with the number of column values of the selected cluster after the reference value comparison unit 50 selects one cluster;

a sixth step in which the lower data set generating unit 60 determines the number of column values to be included using the values compared in the fifth step;

a seventh step in which the column value removal unit 70 combines data including the determined number of column values to create a lower data set;

an eighth step in which the iterative unit 80 repeats the fifth to seventh steps as many as the number of clustering of the columns to additionally generate a sub-dataset; and

a ninth step in which the data collection unit 90 collects the data of the additionally generated sub-dataset, the above steps constituting a method for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis.

The method of claim 1,

wherein the parsing of the structure of the raw data set file in the second step is performed by checking and extracting at least one of the number of column values, the number of columns, the clustering of columns, the number of clustering of columns, and the column header.

The method of claim 1,

wherein the reference value is calculated by the following [Equation 1]:

[Equation 1]

$$\text{reference value} = \frac{\text{total number of column values}}{\text{number of column clusters}}$$

The method of claim 1,

wherein, in the data collection of the sub-dataset additionally generated in the ninth step, an association module takes the number of column values of the selected cluster, the reference value, and the selected column as input values and collects the sub-data additionally generated in the seventh step as an output value.

The method of claim 1,

wherein the sixth step is performed separately as:

a step 6-1 of including only as many column values as the reference value when the number of column values of the selected cluster in the fifth step is equal to or greater than the reference value; and

a step 6-2 of comparing the number of column values of the selected cluster with 10% of the reference value when the number of column values of the selected cluster in the fifth step is less than the reference value.

The method of claim 5,

wherein the seventh step is performed separately as:

a step 7-1 of creating a sub-dataset by combining the data of the selected cluster when, in step 6-2, the number of column values of the selected cluster is 10% or more of the reference value, with the data that, in step 6-1, includes only as many column values as the reference value; and

a step 7-2 of removing the column values of the selected cluster when, in step 6-2, the number of column values of the selected cluster is less than 10% of the reference value.

an upload unit 10 for uploading a raw data set file prepared in CSV (Comma-Separated Values) format;

a parsing unit 20 for parsing the structure of the uploaded raw data set file;

a number check unit 30 for checking the number of clustered columns and the total number of column values in the parsed structure;

a reference value calculator 40 for calculating a reference value by using the number of clustering of the columns and the number of total column values;

a reference value comparison unit 50 for selecting one cluster and comparing the number of column values of the selected cluster with the reference value;

a lower data set generating unit 60 for creating a lower data set by combining the data of the selected cluster when its number of column values is 10% or more of the reference value with the data that includes only as many column values as the reference value;

a column value removal unit 70 that removes the column values of the selected cluster when the number of column values in the selected cluster is less than 10% of the reference value;

an iterative performing unit 80 that repeatedly runs (loops) the reference value comparison unit 50, the lower data set generating unit 60, and the column value removal unit 70 as many times as the number of column clusters to additionally generate lower data sets;

the above units constituting a device for ensuring fairness of an artificial intelligence learning dataset based on multidimensional subset association analysis.

The device of claim 7,

wherein the parsing unit 20 extracts by checking one or more of the number of column values, the number of columns, the clustering of columns, the number of clustering of columns, and the column header in the structure of the raw data set file.

The device of claim 7,

wherein the reference value is calculated by the following [Equation 1]:

[Equation 1]

$$\text{reference value} = \frac{\text{total number of column values}}{\text{number of column clusters}}$$

The device of claim 7,

wherein the data collection unit 90 is combined with an association module that takes the number of column values of the selected cluster, the reference value, and the selected column as input values and uses the additionally generated sub-data as an output value.