KR102109854B1

KR102109854B1 - Data pre processing system and method for clinical data analysis

Info

Publication number: KR102109854B1
Application number: KR1020190135494A
Authority: KR
Inventors: 유진태; 유진호
Original assignee: 유진바이오소프트 주식회사
Priority date: 2019-10-29
Filing date: 2019-10-29
Publication date: 2020-05-13

Abstract

The present invention relates to an automated data preprocessing system for clinical data analysis, which automates the preprocessing of clinical data to be used in a statistical analysis system so that preprocessing proceeds sequentially and automatically for each variable. Conventionally, preprocessing such as missing value handling and typo correction was performed manually in a data editor such as Excel. Such methods are practical only when the data are relatively small, and are closer to individual data organization than to data preprocessing. The present invention automates, for each variable, the preprocessing applied to clinical data: it distinguishes and sets continuous and categorical variables in the clinical data, provides a data preprocessing process so that the preprocessing required for each variable is carried out automatically, and lets the user easily check and reset the properties of each variable.

Description

Data pre-processing automation system and method used for clinical data analysis {DATA PRE PROCESSING SYSTEM AND METHOD FOR CLINICAL DATA ANALYSIS}

The present invention relates to an automated data preprocessing system for clinical data analysis, which automates the preprocessing of clinical data to be used in a statistical analysis system so that preprocessing proceeds sequentially and automatically for each variable.

In various technical fields such as science and medicine, statistical analysis methods that sample data from a population, or that analyze various parameters from the sampled data, have become commonplace. For example, clinical data collected in the medical field are analyzed by various statistical methods and then published as papers.
Proper statistical analysis requires a data preprocessing step.

[Necessity of preprocessing data to be used for statistical analysis]

Data prepared for statistical analysis contain entries that must be excluded from the actual analysis, or whose values must be changed, owing not only to errors in manual human work but also to incorrect computer processing.

In addition, analyses frequently use not only the variables contained in the data but also new variables created by combining them.

If such data are used for analysis without first being cleaned, proper analysis results cannot be obtained, and the researcher ultimately cannot carry out the intended study.

[Conventional data preprocessing methods]

When the data are relatively small (e.g., data consisting of 10 variables measured in 100 patients), preprocessing such as missing value handling and typo correction has been performed manually using a data editor such as Excel.

Such a method can hardly be regarded as data preprocessing; it is closer to individual data organization.

When the data are considerably large, manual preprocessing is difficult, so preprocessing has been performed either by directly coding a program in a statistical programming language such as SAS or R script, or by using GUI-based statistical analysis software such as SPSS.

However, coding a program directly is hard to approach for anyone who is not an IT and statistics expert, and existing software does not organize preprocessing methods systematically, so the expected benefit is difficult to obtain.

Korean Patent Laid-open Publication No. 10-2018-0132238, titled "Method for automatically generating and editing statistical result tables or figures for insertion into papers, computer program implementing the same, and information processing apparatus performing the same."

The present invention automates, for each variable, the data preprocessing applied to clinical data. It distinguishes and sets continuous and categorical variables in the clinical data, provides a data preprocessing process so that the preprocessing required for each variable is carried out automatically, and lets the user easily check and reset the properties of each variable, so that data preprocessing of clinical data is performed automatically.

The automated data preprocessing system for clinical data analysis of the present invention determines each variable and automatically sets its properties through missing value processing, variable property changes, and the like, and sequentially presents to the user the per-variable preprocessing steps, including data replacement, outlier processing, and creation of new variables, so that data to be excluded from statistical analysis and values to be changed for each variable are handled either automatically per variable or easily by the user.

The automated data preprocessing system for clinical data analysis of the present invention comprises:

interface means that provides an interface for the user's data preprocessing; variable property automatic setting means that checks the characters contained in the input data to set variables and automatically sets the properties of each set variable; data preprocessing process management means in which the data preprocessing processes are stored and managed, and which provides and manages the per-variable data preprocessing process at the request of the data preprocessing control means; data preprocessing control means that presents the user with an interface for data preprocessing for each variable according to the data preprocessing process and controls data preprocessing according to the setting values entered through the interface; variable combination variable management means that, after data preprocessing is complete, provides through the interface means a user interface for creating new variables by combining variables, and creates, controls, and manages the new variables according to the user's combination; statistical analysis means that provides through the interface means a user interface for checking the distribution of data combined across variables, and provides distribution checks and statistical information at the user's request; and data extraction means that provides interface data for composing and entering conditions for extracting data, and extracts and provides data according to the conditions entered by the user.

The data preprocessing control means includes continuous-variable data preprocessing process means and categorical-variable data preprocessing process means, which control data preprocessing according to the data preprocessing process for continuous and categorical variables respectively. It supplies, according to the data preprocessing process, the information to be presented to the user through the interface required for data preprocessing, and performs and controls data preprocessing according to the user's setting values entered through the user interface.

After automatic setting of the variable properties of a continuous variable is complete, the continuous-variable data preprocessing process means performs: a step of asking the user to confirm whether to keep the variable as continuous or to convert it into a (new) categorical variable, thereby determining whether to carry out the continuous-variable or the categorical-variable data preprocessing; a variable name setting step that, when the variable is kept as continuous, provides the user with an interface for changing the variable name and sets the name according to the user's input; an outlier definition step in which the user selects and sets the numeric range to be used for outliers; a data replacement step in which the user can change data values; a missing value definition step in which the user can newly register and delete missing values; an imputation step that replaces missing values with non-missing numbers; a new variable creation step; and a variable data distribution check step.

After automatic setting of a categorical variable is complete, the categorical-variable data preprocessing process means performs: a step of asking the user to confirm whether to keep the variable as categorical or to convert it into a (new) continuous variable, thereby determining whether to carry out the categorical-variable or the continuous-variable data preprocessing; a step that, when conversion to a continuous variable is chosen and the data contain even one character, provides an interface asking the user to confirm whether to treat it as a missing value or replace it with a number, processes the data according to the user's confirmation, and then creates the continuous variable; a variable name setting step that provides the user with an interface for changing the variable name and subcategory names and changes them according to the user's input; a subcategory order change step that changes the order of the subcategories; a data replacement step in which the user can change data values; a missing value definition step in which the user can newly register and delete missing values; an imputation step that replaces missing values with non-missing numbers or characters; a new variable creation step; and a variable data distribution check step.
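The missing value definition, outlier definition, and imputation steps listed for continuous variables can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the sketch assumes that the user-selected range `[low, high]` is the accepted range and that values outside it are turned into missing values, and it assumes mean imputation, whereas the text only states that missing values are replaced with non-missing numbers.

```python
from statistics import mean
from typing import Iterable, List, Optional

Value = Optional[float]  # None stands for a missing value


def define_missing(values: Iterable[Value], codes: Iterable[float]) -> List[Value]:
    """Missing value definition: turn user-registered codes (e.g. -9) into None."""
    registered = set(codes)
    return [None if v in registered else v for v in values]


def apply_outlier_range(values: Iterable[Value], low: float, high: float) -> List[Value]:
    """Outlier definition (assumption): values outside [low, high] become missing."""
    return [v if v is not None and low <= v <= high else None for v in values]


def impute_mean(values: Iterable[Value]) -> List[Value]:
    """Imputation (assumption): replace missing values with the mean of observed ones."""
    values = list(values)
    observed = [v for v in values if v is not None]
    if not observed:
        return values  # nothing to impute from; leave the column unchanged
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical systolic blood pressure column: -9 is a missing code, 999 an entry error.
raw = [120.0, 130.0, -9.0, 999.0]
step1 = define_missing(raw, codes=[-9.0])
step2 = apply_outlier_range(step1, low=50.0, high=300.0)
step3 = impute_mean(step2)
```

Keeping each step as a separate function mirrors the sequential, per-variable flow described above, where the user confirms settings at every step before the next one runs.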

When clinical data are input, the system of the present invention carries out: a character inclusion determination step that checks whether the variables of the clinical data contain characters;

a variable setting step that, when the character inclusion determination step confirms that characters are included, checks for missing values by referring to a missing value list and deletes the variable from the data if all of its characters are missing values, sets the variable as categorical if it contains even one non-numeric character, and, if the characters consist only of numbers, determines the number of distinct numeric values and sets the variable as categorical when the number of distinct values (N) is smaller than a set number (C) and as continuous when N is greater than C;

an automatic variable property setting step for categorical variables that, when the variable setting step determines the variable to be categorical, extracts the variable name and sets the variable title, and extracts the subcategory names and sets the variable title;

an automatic variable property setting step for continuous variables that, when the variable setting step determines the variable to be continuous, extracts the variable name and sets the variable title, and calculates the numeric distribution;

a continuous-variable data preprocessing step that, once the automatic variable property setting step for the continuous variable is complete, includes: asking the user to confirm whether to keep the variable as continuous or to convert it into a (new) categorical variable, thereby determining whether to carry out the continuous-variable or the categorical-variable data preprocessing; a variable name setting step that, when the variable is kept as continuous, provides the user with an interface for changing the variable name and sets the name according to the user's input; an outlier definition step in which the user selects and sets the numeric range to be used for outliers; a data replacement step in which the user can change data values; a missing value definition step in which the user can newly register and delete missing values; an imputation step that replaces missing values with non-missing numbers; a new variable creation step; and a variable data distribution check step;

a categorical-variable data preprocessing step that, once the automatic setting step for the categorical variable is complete, includes: asking the user to confirm whether to keep the variable as categorical or to convert it into a (new) continuous variable, thereby determining whether to carry out the categorical-variable or the continuous-variable data preprocessing; a step that, when conversion to a continuous variable is chosen and the data contain even one character, provides an interface asking the user to confirm whether to treat it as a missing value or replace it with a number, processes the data according to the user's confirmation, and then creates the continuous variable; a variable name setting step that provides the user with an interface for changing the variable name and subcategory names and changes them according to the user's input; a subcategory order change step that changes the order of the subcategories; a data replacement step in which the user can change data values; a missing value definition step in which the user can newly register and delete missing values; an imputation step that replaces missing values with non-missing numbers or characters; a new variable creation step; and a variable data distribution check step;

a missing-data deletion step that, once the continuous-variable and categorical-variable data preprocessing steps are complete, deletes from the data any sample that has only missing values; and

a data preprocessing completion step that includes creating new variables by combining variables, checking the distribution of the combined variable data, and extracting data that meet given conditions.
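The closing steps of the method (deleting samples that have only missing values, creating a new variable by combining variables, and extracting data that meet a condition) can be sketched as follows. This is an illustrative sketch under stated assumptions: samples are modeled as dictionaries with `None` for missing values, and the function names, column names, and the BMI example are hypothetical and do not come from the patent.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]  # one sample; None marks a missing value


def drop_all_missing(rows: List[Row]) -> List[Row]:
    """Delete samples in which every variable is missing."""
    return [r for r in rows if any(v is not None for v in r.values())]


def add_variable(rows: List[Row], name: str, combine: Callable[[Row], Any]) -> List[Row]:
    """Create a new variable from existing ones; the input rows are not modified."""
    return [{**r, name: combine(r)} for r in rows]


def extract(rows: List[Row], condition: Callable[[Row], bool]) -> List[Row]:
    """Return only the samples that satisfy the user-supplied condition."""
    return [r for r in rows if condition(r)]


def bmi(row: Row) -> Optional[float]:
    """Hypothetical combination of two continuous variables."""
    h, w = row.get("height_cm"), row.get("weight_kg")
    if h is None or w is None:
        return None
    return w / (h / 100) ** 2


rows = [
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": None, "weight_kg": None},   # only missing values: removed
    {"height_cm": 160.0, "weight_kg": 80.0},
]
kept = drop_all_missing(rows)
with_bmi = add_variable(kept, "bmi", bmi)
high_bmi = extract(with_bmi, lambda r: r["bmi"] is not None and r["bmi"] >= 30)
```

Passing the combination rule and the extraction condition as functions stands in for the user interfaces of FIGS. 30 to 32 and FIG. 36, where the user composes them interactively.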

According to the present invention, the variables of the clinical data are determined automatically, their properties are set automatically, and a process is provided so that preprocessing for each variable then proceeds sequentially and automatically, whereby clinical data to be used for statistical analysis are preprocessed effectively.

That is, while the preprocessing of clinical data proceeds automatically, the user can replace specific data values and set properties for each variable during each per-variable preprocessing step, and can also create new variables by combining variables, so that more accurate data preprocessing is achieved.

In addition, the user can adaptively receive statistics and analysis information on the clinical data during each preprocessing step, which enables data preprocessing that supports more accurate statistical analysis.

FIG. 1 is a flowchart showing the execution of the automated data preprocessing method for clinical data analysis of the present invention.
FIG. 2A shows the definition of missing value codes, and FIG. 2B shows a list of missing value characters.
FIG. 3 is a table defining the basic properties of a variable.
FIGS. 4A and 4B are tables showing an example data file in which the basic properties of variables are defined.
FIG. 5 is a flowchart showing the continuous-variable data preprocessing process.
FIG. 6 is a flowchart showing the categorical-variable data preprocessing process.
FIG. 7 is a block diagram showing the configuration of the automated data preprocessing system for clinical data analysis of the present invention.
FIGS. 8 to 15 show the continuous-variable data preprocessing control interface provided to the user through the interface means.
FIG. 8 shows the case where variable name/property change is selected after the user has selected continuous-variable data preprocessing.
FIG. 9 shows the user interface for outlier definition.
FIGS. 10 and 11 show user interfaces for the data replacement step: FIG. 10 shows the individual value change interface, in which values are selected directly from the data value list and edited, and FIG. 11 shows the conditional replacement interface, in which data values matching a condition are selected and changed.
FIG. 12 shows the user interface for the missing value definition step.
FIG. 13 shows the interface for the imputation step.
FIGS. 14 and 15 show interfaces for the new variable creation step: FIG. 14 shows the interface for creating a new variable by applying a function transformation, and FIG. 15 shows the interface for creating a new variable by interval categorization.
FIGS. 16 to 22 show the categorical-variable data preprocessing control interface provided to the user through the interface means.
FIG. 16 shows the case where variable/subcategory name and property change is selected after the user has selected categorical-variable data preprocessing.
FIGS. 17 and 18 show user interfaces for the data replacement step: FIG. 17 shows the individual value change interface, in which values are selected directly from the data value list and edited, and FIG. 18 shows the conditional replacement interface, in which data values matching a condition are selected and changed.
FIG. 19 shows the user interface for the missing value definition step.
FIG. 20 shows the interface for the imputation step.
FIGS. 21 and 22 show interfaces for the new variable creation step: FIG. 21 shows the interface for creating a new variable by merging subcategories, and FIG. 22 shows the interface for creating a new variable by splitting subcategories.
FIG. 23 shows the interface for checking the data distribution of a continuous variable.
FIG. 24 shows the box plot visualization parameter editing interface that appears when the figure editing means of the box plot is selected in the interface of FIG. 23, and FIG. 25 shows the histogram visualization parameter editing interface that appears when the figure editing means of the histogram is selected in the interface of FIG. 23.
FIG. 26 shows the interface for checking the distribution of categorical variable data.
FIG. 27 shows the bar plot visualization parameter editing interface that appears when the figure editing means of the bar plot is selected in the interface of FIG. 26.
FIG. 28 shows the interface provided through the means for setting text to be inserted into the figures provided by the interfaces of FIGS. 25 and 27.
FIG. 29 shows the color adjustment interface provided through the color adjustment and image file format setting means offered by the interfaces for continuous and categorical variables of FIGS. 25 and 27.
FIGS. 30 to 32 show interfaces for creating new variables by combining variables: FIG. 30 shows the interface for combining continuous variables, FIG. 31 shows the interface for combining categorical variables, and FIG. 32 shows the interface for combining a continuous variable with a categorical variable.
FIGS. 33 to 35 show interfaces for checking the distribution of combined variable data: FIG. 33 is for data combined across continuous variables, FIG. 34 is for data combined across a continuous variable and a categorical variable, and FIG. 35 is for data combined across categorical variables.
FIG. 36 shows the interface for extracting data that meet given conditions.

The automated data preprocessing system for clinical data analysis of the present invention is described below with reference to the accompanying FIG. 7.

The system includes interface means (10) that provides an interface for the user's data preprocessing;

variable property automatic setting means (20) that checks the characters contained in the input data to set variables and automatically sets the properties of each set variable;

data preprocessing process management means (30) in which the data preprocessing processes are stored and managed, and which provides and manages the per-variable data preprocessing process at the request of the data preprocessing control means (40);

data preprocessing control means (40) that presents the user with an interface for data preprocessing for each variable according to the data preprocessing process, and controls data preprocessing according to the setting values entered through the interface;

variable combination variable management means (50) that, after data preprocessing is complete, provides through the interface means a user interface for creating new variables by combining variables, and creates, controls, and manages the new variables according to the user's combination;

statistical analysis means (60) that provides through the interface means a user interface for checking the distribution of data combined across variables, and provides distribution checks and statistical information at the user's request; and

data extraction means (70) that provides interface data for composing and entering conditions for extracting data, and extracts and provides data according to the conditions entered by the user.

The automated data preprocessing system for clinical data analysis of the present invention thus checks the clinical data to determine whether each variable is continuous or categorical, and presents the process items that require preprocessing for each variable in sequence so that the user can easily perform tasks such as setting (changing) properties and replacing data, whereby data preprocessing of clinical data is carried out automatically.

The interface means (10) provides the user interface for entering and checking the information required for data preprocessing.

The variable property automatic setting means (20) checks the characters contained in the input data to set variables and automatically sets the properties of each set variable. It includes: missing value processing means (21) that checks the characters contained in the input data and processes missing values among them by referring to the missing value list information; variable setting means (22) that sets a continuous or categorical variable according to the numeric information contained in the characters; variable property setting means (23) that automatically sets variable properties according to the continuous or categorical variable set by the variable setting means (22); and missing value management means (24) in which the missing value list is stored and managed.

Excluding the missing values, the automatic variable property setting means (20) sets a variable as continuous when all of its characters are numeric and the number of distinct numeric values is at least a predetermined count. It sets the variable as categorical when the characters are not all numeric, or when the number of distinct numeric values is below the predetermined count.

The variable property setting means (23) automatically sets the variable properties separately for continuous and categorical variables. The basic properties set automatically are, for a continuous variable, the variable title, the variable name, and the numeric distribution information; and, for a categorical variable, the variable title, the variable name, the subcategory title (title of variable subtype), and the subcategory name (name of variable subtype).

The data preprocessing process management means (30) stores and manages the data preprocessing processes, and holds a separate preprocessing process for continuous variables and for categorical variables.

The data preprocessing process management means (30) may further include data preprocessing editing means that allow preprocessing processes to be added and deleted.

The data preprocessing control means (40) controls preprocessing of continuous and categorical variables according to the corresponding data preprocessing process. Following that process, it supplies the information to be presented to the user through the interface, and it carries out and controls preprocessing according to the setting values that the user enters through the user interface.

Data preprocessing of a continuous variable begins once the automatic property setting for the continuous variable is complete, and includes: a step of asking the user to confirm whether to keep the variable as continuous or to convert it to (or newly create) a categorical variable, and thereby determining whether to run the continuous or the categorical preprocessing; a variable name setting step which, if the variable is kept as continuous, provides an interface for changing the variable name and sets the name according to the user's input; an outlier definition step in which the user selects and sets the numeric range used to identify outliers; a data substitution step in which the user can change data values; a missing value definition step in which the user can register new missing values and delete existing ones; an imputation step that replaces missing values with non-missing numbers; a new variable creation step; and a variable data distribution check step.

Data preprocessing of a categorical variable begins once the automatic setting for the categorical variable is complete, and includes: a step of asking the user to confirm whether to keep the variable as categorical or to convert it to (or newly create) a continuous variable, and thereby determining whether to run the categorical or the continuous preprocessing; a step which, if conversion to a continuous variable is chosen and the data contain even one non-numeric character, provides an interface asking the user whether to treat that character as a missing value or replace it with a number, processes the data accordingly, and then creates the continuous variable; a variable name setting step that provides an interface for changing the variable name and subcategory names and changes them according to the user's input; a subcategory order change step that changes the order of the subcategories; a data substitution step in which the user can change data values; a missing value definition step in which the user can register new missing values and delete existing ones; an imputation step that replaces missing values with non-missing numbers or characters; a new variable creation step; and a variable data distribution check step.

The variable combination management means (50), the statistical analysis means (60), and the data extraction means (70) control the additional operations that the user can perform after data preprocessing is complete.

The variable combination management means (50) creates new variables by combining preprocessed variables. It presents information on the preprocessed variables and controls the creation of a new variable according to the combination the user specifies from that information.

The statistical analysis means (60) lets the user check the distribution of data combined across preprocessed variables and provides the associated statistical information.

The data extraction means (70) extracts and provides data according to conditions entered by the user. It provides an interface for composing and entering extraction conditions, and extracts and returns the data that match the conditions and values entered.

The data preprocessing automation method for clinical data analysis, as performed in the system of the present invention, is as follows.

When clinical data is input, a character inclusion determination step first checks whether the variables of the clinical data contain characters.

If the character inclusion determination step finds characters, a variable setting step follows. Missing values are identified by referring to the missing value list, and a variable whose characters are all missing values is deleted from the data. A variable that contains at least one non-numeric character is set as categorical. For a variable whose characters are all numeric, the number of distinct numeric values (N) is determined: the variable is set as categorical if N is less than a set count (C), and as continuous if N is equal to or greater than C.

Once the automatic property setting for a continuous variable is complete, a data preprocessing step for continuous variables is performed. It includes: a step of asking the user to confirm whether to keep the variable as continuous or to convert it to (or newly create) a categorical variable, and thereby determining whether to run the continuous or the categorical preprocessing; a variable name setting step which, if the variable is kept as continuous, provides an interface for changing the variable name and sets the name according to the user's input; an outlier definition step; a data substitution step; a missing value definition step; an imputation step; a new variable creation step; and a variable data distribution check step.

Once the automatic setting for a categorical variable is complete, a data preprocessing step for categorical variables is performed. It includes: a step of asking the user to confirm whether to keep the variable as categorical or to convert it to (or newly create) a continuous variable, and thereby determining whether to run the categorical or the continuous preprocessing; a step which, if conversion to a continuous variable is chosen and the data contain even one non-numeric character, provides an interface asking the user whether to treat that character as a missing value or replace it with a number, processes the data accordingly, and then creates the continuous variable; a variable name setting step that provides an interface for changing the variable name and subcategory names and changes them according to the user's input; a subcategory order change step; a data substitution step; a missing value definition step; an imputation step; a new variable creation step; and a variable data distribution check step.

Finally, a data preprocessing completion step includes creating new variables by combining variables, checking the distribution of the combined-variable data, and extracting data that match given conditions.

The present invention thus relates to a method of preprocessing data used for statistical analysis of clinical data, in which the data are divided into categorical and continuous variables and preprocessing is performed separately for each type.

FIG. 1 is a flowchart showing the execution of the data preprocessing automation method for clinical data analysis according to the present invention.

The character inclusion determination step and the variable setting step determine whether each variable in the data is continuous or categorical.

The character inclusion determination step determines whether the data contain characters; here, "characters" include digits.

The variable setting step proceeds as follows.

Missing values are identified by referring to the missing value list, and then:

(a). If all characters are missing values, the variable is deleted from the data.

(b). If, excluding missing values, all characters are numeric, the number of distinct numeric values (N) is extracted. The variable is set as continuous if N is at least a fixed count (C), and as categorical otherwise.

FIG. 2A shows the definition of the missing value codes, and FIG. 2B shows the list of missing value characters.

(c). If the characters are not all numeric, the variable is set as categorical.

In other words, a variable made up entirely of numbers is set as continuous and a variable containing non-numeric characters is set as categorical, except that even an all-numeric variable is set according to its number of distinct values (N).
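Rules (a) to (c) above can be sketched in Python as follows. This is only an illustrative sketch: the function names, the contents of `MISSING_VALUES`, and the default of `min_distinct` (standing in for the count C) are assumptions, since the description does not fix them.

```python
# Hypothetical missing value list; the system keeps its own list (FIG. 2B).
MISSING_VALUES = frozenset({"", "NA", "N/A", "null", "."})


def is_number(text):
    """Return True if the text is accepted by Python's float()."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def classify_variable(values, missing=MISSING_VALUES, min_distinct=10):
    """Return 'drop', 'continuous', or 'categorical' for one column of raw strings.

    min_distinct plays the role of the count C; 10 is an assumed default.
    """
    observed = [v.strip() for v in values if v.strip() not in missing]
    if not observed:
        return "drop"  # (a) every entry is a missing value
    if not all(is_number(v) for v in observed):
        return "categorical"  # (c) at least one non-numeric entry
    distinct = {float(v) for v in observed}  # N: distinct numeric values
    # (b) continuous only when N >= C, otherwise categorical
    return "continuous" if len(distinct) >= min_distinct else "categorical"
```

With these assumed defaults, a column coded only as 0 and 1 is treated as categorical even though every entry is numeric, which is the behavior the last paragraph above describes.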

The automatic property setting steps for categorical and continuous variables automatically set the basic properties of each variable.

For a categorical variable, the variable title, the variable name, the subcategory title (title of variable subtype), and the subcategory name (name of variable subtype) are extracted and set automatically.

For a continuous variable, the variable title and the variable name are extracted and set automatically, and the numeric distribution is calculated.

FIG. 3 is a table defining the basic properties of a variable.

FIGS. 4A and 4B are tables showing an example data file in which the basic properties of variables are defined.

The data preprocessing steps for continuous and categorical variables let the user easily check and modify (reset) the properties of each variable.

First, the data preprocessing of a continuous variable includes:

(a). a step of confirming whether to run the continuous-variable preprocessing or to newly create or convert the variable;

(b). a variable name setting step; (c). an outlier definition step; (d). a data substitution step; (e). a missing value definition step; (f). an imputation step; (g). a new variable creation step; and (h). a variable data distribution check step.

FIG. 5 shows the data preprocessing flow for a continuous variable.

First, the user is asked whether to keep the continuous variable or convert it to a categorical variable, and preprocessing proceeds according to the user's choice.

That is, once the automatic property setting is complete, the user is asked to confirm whether to keep the continuous variable, convert it to a categorical variable, or newly create a categorical variable from it. The answer determines whether the continuous-variable or the categorical-variable preprocessing is run.

For example, if the continuous variable ‘conVar1’, consisting of the numbers (1,2,3,...,10), is converted to a categorical variable, ‘conVar1’ changes from continuous to categorical and is then preprocessed as a categorical variable.

If instead a categorical variable is newly created from ‘conVar1’, the original ‘conVar1’ remains unchanged and a new variable holding ‘conVar1’ converted to categorical form is created.

The variable name setting step provides the user with an interface for changing the variable name and sets the name according to the user's input.

The outlier definition step lets the user select and set the numeric range used to identify outliers.

One or more selectable numeric ranges are offered, together with an interface in which the user can enter the bounds directly; the outlier range is set according to the selection or input.

Outlier definition applies only to continuous variables. An outlier is a number that falls outside the minimum-to-maximum range to be used in the actual analysis.

The outlier-related values calculated automatically as initial values are as follows.

Q1 (1st quartile), Q3 (3rd quartile), and IQR (interquartile range): these are all computed automatically from the data and cannot be modified by the user.

Outlier lower bound = one of (minimum value, Q1 - 1.5 x IQR, Q1 - 3 x IQR): computed automatically from the Q1, Q3, and IQR values above. The user can select one of these options or edit the value directly (initial value = minimum value).

Outlier upper bound = one of (maximum value, Q3 + 1.5 x IQR, Q3 + 3 x IQR): computed automatically from the Q1, Q3, and IQR values above. The user can select one of these options or edit the value directly (initial value = maximum value).
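The three selectable bound options can be sketched in Python as follows. The function names and the `rule` labels are hypothetical, and the quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption, because the description does not state how Q1 and Q3 are interpolated.

```python
import statistics


def outlier_bounds(values, rule="minmax"):
    """Return (lower, upper) bounds for one continuous variable.

    rule: 'minmax' (the initial setting), '1.5iqr', or '3iqr'.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if rule == "minmax":
        return min(values), max(values)
    if rule == "1.5iqr":
        return q1 - 1.5 * iqr, q3 + 1.5 * iqr
    if rule == "3iqr":
        return q1 - 3 * iqr, q3 + 3 * iqr
    raise ValueError(f"unknown rule: {rule}")


def flag_outliers(values, lower, upper):
    """Return the values that fall outside the chosen [lower, upper] range."""
    return [v for v in values if v < lower or v > upper]
```

A user-entered range corresponds to calling `flag_outliers` with bounds typed in directly instead of those returned by `outlier_bounds`.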

The data substitution step lets the user change data values. An interface shows the current value and accepts a new value, and the data value is set according to the user's input.

Values substituted by the user can be registered in a substitution list and managed separately.

Data substitution means replacing a non-missing value with another non-missing number or character string.

Data can be substituted in two ways: (a). entering individual values directly; (b). automatically replacing all data that meet a condition.

Example: replacing every value that matches a given string or number in full with another string or number, or replacing a specified part of a string or number with another string or number.
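The two condition-based substitutions in the example above (full match and partial match) can be sketched in Python as follows. The function names and the shape of the substitution list are assumptions made for illustration only.

```python
def substitute_exact(values, old, new, log=None):
    """Replace every value equal to `old` with `new`.

    If a list is passed as `log`, the (old, new) pair is appended to it,
    standing in for the separately managed substitution list.
    """
    if log is not None:
        log.append((old, new))
    return [new if v == old else v for v in values]


def substitute_partial(values, old_part, new_part, log=None):
    """Replace the substring `old_part` with `new_part` inside every string value."""
    if log is not None:
        log.append((old_part, new_part))
    return [v.replace(old_part, new_part) for v in values]
```

For instance, `substitute_exact(["M", "F", "m"], "m", "M")` unifies a mistyped category code, and `substitute_partial(["10mg", "5mg"], "mg", "")` strips a unit suffix.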

The missing value definition step lets the user register new missing values and delete existing ones. It provides an interface for editing the missing value list, in which the user can enter a new missing value to add to the list, or read an entry from the list and delete it.

The imputation step replaces missing values with non-missing numbers.

An interface is provided in which the user can replace missing values, and the missing values are replaced according to the value the user enters.

Imputation means replacing a missing value with a non-missing number or character string. The specific imputation methods are as follows.

(a). Imputation per variable: all missing values of the variable are replaced with one of the mean, the median, or the minimum/maximum value (initial setting = median).

(b). Data measured repeatedly over time: the LOCF (Last Observation Carried Forward) method, which replaces a missing value with the most recently observed non-missing value, is applied as the initial setting; alternatively, one of the mean, the median, or the minimum/maximum of the repeated measurements can be selected.
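Methods (a) and (b) can be sketched in Python as follows. The function names are hypothetical, `None` is assumed to mark a missing value, and leaving a leading missing value untouched under LOCF is a design choice of this sketch rather than something the description specifies.

```python
import statistics

# Statistic options selectable for imputation; the median is the initial setting.
_FILL_FUNCTIONS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
}


def impute_variable(values, method="median"):
    """Replace every missing value (None) with one statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = _FILL_FUNCTIONS[method](observed)
    return [fill if v is None else v for v in values]


def impute_locf(series):
    """Last Observation Carried Forward for repeated measurements in time order."""
    result, last = [], None
    for v in series:
        if v is not None:
            last = v  # remember the most recent observed value
        result.append(last)  # stays None until a first observation exists
    return result
```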

The new variable creation step creates a new variable. It provides an interface for selecting a function to apply to the variable chosen by the user, as well as an interface for entering a function directly. It builds a list of the new variables created by the user's selections or inputs and provides an interface for editing and managing that list.

The data distribution check step lets the user check the data distribution of the registered continuous variables. The numeric distribution can be checked for each continuous variable, and an interface presents it as shapes (blocks) and graphs.

For a continuous variable, the data distribution of the variable comprises:

(a). Minimum, maximum, mean, standard deviation, median

(b). 1st quartile, 3rd quartile, IQR (interquartile range)

(c). 95% confidence interval

(d). Shape of the numeric distribution

- Skewness, kurtosis

- Normality

Kolmogorov-Smirnov normality test result (p value)

Lilliefors normality test result (p value)

Shapiro-Wilk normality test result (p value)

(e). Number and percentage (%) of valid samples

(f). Number and percentage (%) of missing samples

(g). Cannot be modified

(h). Automatically generated results

- Per-variable data distribution table, histogram with normal distribution line, box plot

- Result table consolidating the data distributions of the variables
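Items (a) to (c), (e), and (f) of the list above can be sketched with the Python standard library as follows. The function name and the result keys are hypothetical. Two points are assumptions of this sketch: the 95% confidence interval is taken to be a normal-approximation interval for the mean (the description does not say which interval is meant), and quartiles use the "inclusive" convention. The normality tests in (d) are omitted.

```python
import math
import statistics


def summarize_continuous(values):
    """Summarize one continuous variable; None marks a missing value."""
    observed = [v for v in values if v is not None]
    n, total = len(observed), len(values)
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # sample standard deviation
    q1, median, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    half_width = 1.96 * sd / math.sqrt(n)  # assumed: normal-approximation CI of the mean
    return {
        "min": min(observed), "max": max(observed),
        "mean": mean, "sd": sd, "median": median,
        "q1": q1, "q3": q3, "iqr": q3 - q1,
        "ci95": (mean - half_width, mean + half_width),
        "valid_n": n, "valid_pct": 100 * n / total,
        "missing_n": total - n, "missing_pct": 100 * (total - n) / total,
    }
```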

The data preprocessing of a categorical variable includes:

(a). a step of confirming whether to run the categorical-variable preprocessing or to newly create or convert the variable;

(b). a variable name setting step that changes the variable name and subcategory names; (c). a subcategory order change step; (d). a data substitution step; (e). a missing value definition step; (f). an imputation step; (g). a new variable creation step; and (h). a variable data distribution check step.

In addition, when a variable must be newly created or converted, a continuous conversion / new creation step is further included: if the data contain only numbers, a continuous variable is created and the automatic setting step for continuous variables is run; if the data contain even one non-numeric character, that character is either defined as a missing value or replaced with a number.

FIG. 6 shows the data preprocessing flow for a categorical variable.

First, the user is asked whether to keep the categorical variable or convert it to a continuous variable, and preprocessing proceeds according to the user's choice.

Once the automatic property setting is complete, the user is asked to confirm whether to keep the categorical variable, convert it to a continuous variable, or newly create a continuous variable from it. The answer determines whether the categorical-variable or the continuous-variable preprocessing is run.

As described for continuous variables, when a categorical variable is converted to a continuous variable, the existing categorical variable is forcibly converted and the continuous-variable preprocessing described above is run.

If instead a continuous variable is newly created, the original variable remains categorical and a new continuous variable is created.

The variable name setting step provides the user with an interface for changing the variable name and subcategory names, and sets them according to the user's input.

The subcategory order change step changes the order of the subcategories. It presents a list of subcategory names, provides an interface in which the user can reorder the list, and sets the subcategory order as the user selects.

The data substitution step lets the user change data values. An interface shows the current value and accepts a new value, and the data value is set according to the user's input.

As described for the preprocessing of continuous variables, data can be substituted in two ways: (a). entering individual values directly; (b). automatically replacing all data that meet a condition.

The imputation step replaces missing values with non-missing numbers or characters.

As also described for the preprocessing of continuous variables, imputation means replacing a missing value with a non-missing number or character string, and the specific imputation methods are as follows.

The new variable creation step creates a new variable. It provides an interface for selecting a function to apply to the variable chosen by the user, as well as an interface for entering a function directly. It builds a list of the new variables created by the user's selections or inputs and provides an interface for editing and managing that list.

For a categorical variable:

(a). The number of samples in each sub-category, the sample proportion (%), and the 95% confidence interval of the sample proportion, where

- calculated using only valid samples, excluding missing samples

- calculated using all samples excluding missing samples

(b). Number and proportion (%) of valid samples

(c). Number and proportion (%) of missing samples

(d). Not editable

(e). Automatically generated results

- Data distribution table for each variable, bar plot
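The summary items listed above can be sketched in a few lines of Python. This is only an illustration: the text does not say how the 95% confidence interval is computed, so the normal-approximation (Wald) interval used here is an assumption, as are the function name `summarize_categorical` and the use of `None` as the missing marker.

```python
import math
from collections import Counter

MISSING = None  # hypothetical marker for a missing sample


def summarize_categorical(values, z=1.96):
    """Per sub-category count, proportion (%) and Wald 95% CI over valid samples."""
    if not values:
        raise ValueError("values must not be empty")
    valid = [v for v in values if v is not MISSING]
    n_valid = len(valid)
    n_missing = len(values) - n_valid
    rows = {}
    for category, count in Counter(valid).items():
        p = count / n_valid
        half = z * math.sqrt(p * (1 - p) / n_valid)  # Wald half-width (assumption)
        rows[category] = {
            "n": count,
            "pct": 100 * p,
            "ci95_pct": (100 * max(0.0, p - half), 100 * min(1.0, p + half)),
        }
    return {
        "categories": rows,
        "valid": (n_valid, 100 * n_valid / len(values)),
        "missing": (n_missing, 100 * n_missing / len(values)),
    }


summary = summarize_categorical(["M", "F", "F", None, "M", "F", "F", "M", None, "F"])
```

Here the proportions are taken over valid samples only; dividing by `len(values)` instead would give the all-samples variant.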

Meanwhile, the method further includes a continuous conversion/new creation process: when a variable must be newly created or converted, data consisting only of numbers are created as a continuous variable and the continuous variable automatic setting process is performed, and if the data contain even one non-numeric character, that character is either defined as a missing value or replaced with a number.

The continuous conversion/new creation process is intended for the case where a variable is converted to, or newly created as, a continuous variable: if the data contain even one character, it is first decided whether to treat that character as a missing value or replace it with a number, and the continuous variable is then converted or created. To this end, an interface is provided to the user, the data are processed according to the user's confirmation, and the continuous variable is created so that the continuous variable data preprocessing described above can proceed.
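A minimal sketch of this decision, assuming raw values arrive as strings. The function name `to_continuous` and its parameters are hypothetical; the user's choices are modelled as a set of tokens to treat as missing and a mapping of tokens to replacement numbers, and any text left undecided is returned so that an interface could ask the user about it.

```python
def to_continuous(raw_values, missing_tokens=(), replacements=None):
    """Convert strings to floats; text must be declared missing or mapped to a number."""
    replacements = replacements or {}
    converted, unresolved = [], set()
    for raw in raw_values:
        text = raw.strip()
        if text in replacements:          # user chose "replace with a number"
            converted.append(float(replacements[text]))
        elif text in missing_tokens:      # user chose "define as missing value"
            converted.append(None)
        else:
            try:
                converted.append(float(text))
            except ValueError:            # undecided text: report it back
                unresolved.add(text)
                converted.append(None)
    return converted, unresolved


values, pending = to_continuous(
    ["1.5", "N/A", "<5", "7"], missing_tokens={"N/A"}, replacements={"<5": 5}
)
```

Note that `float()` also accepts strings such as "nan" and "inf"; a production version would probably want to treat those explicitly.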

The missing-value data deletion process deletes data that contain only missing values after the data preprocessing of the continuous variables and of the categorical variables has been completed.
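As a rough illustration, and assuming that "data that contain only missing values" refers to records (rows) in which every variable is missing, the deletion step could look like the following. The record layout (a list of dicts) and the function name are hypothetical.

```python
def drop_all_missing_rows(rows, is_missing=lambda v: v is None):
    """Keep only records that have at least one non-missing value."""
    return [row for row in rows if not all(is_missing(v) for v in row.values())]


records = [
    {"age": 52, "sex": "F"},
    {"age": None, "sex": None},   # only missing values: dropped
    {"age": None, "sex": "M"},    # partially missing: kept
]
kept = drop_all_missing_rows(records)
```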

The variable-combination new variable creation process creates new variables from combinations of variables.

(a). When one variable is used:

For a continuous variable: (1) a new continuous variable is created by applying a mathematical function to the variable (e.g., log transformation); (2) a new categorical variable is created by dividing the numeric range of the variable into intervals (e.g., under 40, 41-59, 60 and over).

For a categorical variable: (1) a new categorical variable is created by merging sub-categories of the variable (e.g., non-smoker, former smoker who has quit, current smoker → non-smoking, smoking); (2) as many new binary categorical variables as there are categories are created in the form (specific sub-category vs. others) (e.g., DM, HTN, HL → DM vs. others, HTN vs. others, HL vs. others).

(b). When two or more continuous variables are combined:

A new continuous variable is created from two or more continuous variables using the four arithmetic operations (e.g., body mass index = weight / height²).

(c). When two or more categorical variables are combined:

A new categorical variable is created by combining the sub-categories of each categorical variable, e.g., (under 40, 40 and over) × (male, female) → (male under 40, female under 40, male 40 and over, female 40 and over).
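The three kinds of derived variable in this list can be sketched as small Python functions. The function names, the use of `None` for missing values, and the units (kilograms and metres for the body mass index) are assumptions made for the example; the text only gives the formula weight / height².

```python
import math


def log_transform(x):
    """Natural log of a continuous value; None for missing or non-positive input."""
    return math.log(x) if x is not None and x > 0 else None


def bmi(weight_kg, height_m):
    """Body mass index = weight / height^2 (units are an assumption)."""
    if weight_kg is None or height_m is None or height_m <= 0:
        return None
    return weight_kg / height_m ** 2


def age_sex_group(age, sex):
    """Cross (under 40, 40 and over) with (male, female) into one category."""
    if age is None or sex is None:
        return None
    band = "under 40" if age < 40 else "40 and over"
    return f"{sex} {band}"


example_bmi = bmi(70, 1.75)
example_group = age_sex_group(35, "male")
```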

The variable-combination data distribution check process checks the distribution of the variable-combination data formed as described above, as follows.

(a). Relationship between a continuous variable and a continuous variable:

[Automatically applied statistical analysis] Pearson correlation and Spearman correlation analysis

[Automatically generated results] Correlation coefficient and p value, scatter plot with regression line
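For reference, the two correlation coefficients named above can be computed with the standard library alone. This sketch covers only the coefficients; the p values and the scatter plot are omitted, and it assumes both inputs have the same length, contain no missing values, and have non-zero variance. Function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because `y` increases monotonically but not linearly with `x`, the Spearman coefficient is 1 while the Pearson coefficient is slightly below 1.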

(b). Relationship between a continuous variable and a categorical variable:

[Automatically applied statistical analysis]

(1). When the categorical variable has two sub-categories (two-sample t test)

Parametric method: p values from Student's t test and Welch's t test; non-parametric method: p value from the Mann-Whitney U test

(2). When the categorical variable has three or more sub-categories

Parametric method: p value from one-way ANOVA; non-parametric method: p value from the Kruskal-Wallis H test

[Automatically generated results]

(1). Table summarizing the number of samples, proportion (%), mean ± standard deviation, and p value for the whole sample and for each sub-category

(2). Bar plot

(c). Relationship between a categorical variable and a categorical variable:

[Automatically applied statistical analysis]

(1). p value from the chi-squared test (with/without Yates' continuity correction)

(2). p value from Fisher's exact test

[Automatically generated results]

(1). Table summarizing the number of samples and proportion (%) in each sub-category combination, and the p value

(2). Bar plot
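The rules in (a) to (c) amount to a lookup from the pair of variable types, plus the number of sub-categories, to the tests that are applied. A minimal sketch of that dispatch follows; the function name and the string labels are hypothetical, and the tests themselves are not implemented here.

```python
def select_tests(kind_a, kind_b, n_subcategories=None):
    """Return the names of the tests applied for a pair of variable types."""
    kinds = sorted([kind_a, kind_b])
    if kinds == ["continuous", "continuous"]:
        return ["Pearson correlation", "Spearman correlation"]
    if kinds == ["categorical", "continuous"]:
        if n_subcategories is None or n_subcategories < 2:
            raise ValueError("need at least two sub-categories")
        if n_subcategories == 2:
            return ["Student t test", "Welch t test", "Mann-Whitney U test"]
        return ["one-way ANOVA", "Kruskal-Wallis H test"]
    if kinds == ["categorical", "categorical"]:
        return [
            "chi-squared test (with/without Yates' continuity correction)",
            "Fisher's exact test",
        ]
    raise ValueError(f"unknown variable types: {kind_a!r}, {kind_b!r}")


two_group = select_tests("continuous", "categorical", n_subcategories=2)
multi_group = select_tests("categorical", "continuous", n_subcategories=4)
```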

The condition-based data extraction process extracts only the data that fall within the condition range, by combining numeric ranges of continuous variables and sub-categories of categorical variables.

Here, the data extraction is performed by combining one or more of the AND, OR, and NOT operators.
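One way to picture this extraction is as composable predicates over records. In the sketch below the helper names (`in_range`, `in_categories`, `AND`, `OR`, `NOT`, `extract`) and the record layout are hypothetical; a record with a missing value never matches a range condition.

```python
def in_range(var, low, high):
    """Condition on a continuous variable: low <= value <= high."""
    return lambda row: row[var] is not None and low <= row[var] <= high


def in_categories(var, *categories):
    """Condition on a categorical variable: value is one of the sub-categories."""
    return lambda row: row[var] in categories


def AND(*preds):
    return lambda row: all(p(row) for p in preds)


def OR(*preds):
    return lambda row: any(p(row) for p in preds)


def NOT(pred):
    return lambda row: not pred(row)


def extract(rows, pred):
    """Return only the records that satisfy the combined condition."""
    return [row for row in rows if pred(row)]


patients = [
    {"id": 1, "age": 35, "sex": "male", "smoking": "current"},
    {"id": 2, "age": 62, "sex": "female", "smoking": "never"},
    {"id": 3, "age": 48, "sex": "male", "smoking": "never"},
    {"id": 4, "age": None, "sex": "female", "smoking": "current"},
]
condition = AND(in_range("age", 40, 70), NOT(in_categories("smoking", "current")))
selected_ids = [row["id"] for row in extract(patients, condition)]
```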

The preprocessing automation method of the data preprocessing automation system used for clinical data analysis according to the present invention, configured as above, is described below with reference to the embodiments shown in the accompanying drawings.

The present invention examines clinical data to automatically set continuous and categorical variables, and provides the user with an interface so that preprocessing of each continuous and categorical variable proceeds automatically according to the preprocessing workflow.

First, when data are input, the data preprocessing control means (40) has the variable attribute automatic setting means (20) check the characters contained in the data and set the variables.

In the variable attribute automatic setting means (20), the missing value processing means (21) checks whether the input data contain characters and then determines whether they are missing values by referring to the missing value list, shown in FIG. 2B, held in the missing value management means (24).

If all of the characters are missing values, the corresponding variable is deleted.

If not all of the characters are missing values, the variable setting means (22) counts the number of distinct numeric values to determine whether the variable is continuous or categorical.

That is, when the data consist entirely of numbers, the variable is set as a categorical variable; when the number of distinct numeric values (N) is equal to or greater than a predetermined number (C), it is set as a continuous variable; and when N is less than C, it is set as a categorical variable.
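The automatic setting described in the preceding paragraphs can be sketched as follows. The rule "N ≥ C gives continuous, N < C gives categorical" and the deletion of all-missing variables come from the text; treating a variable that still contains non-numeric, non-missing text as categorical is an assumption of this sketch, as are the function names and the string return values. The threshold C is passed in because the text does not give its value.

```python
def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_variable_type(values, missing_list, threshold_c):
    """Return 'delete', 'continuous' or 'categorical' for one variable."""
    present = [v for v in values if v not in missing_list]
    if not present:
        return "delete"            # every value is a listed missing value
    if not all(is_number(v) for v in present):
        return "categorical"       # assumption: leftover text means categorical
    n_distinct = len({float(v) for v in present})   # N in the description
    return "continuous" if n_distinct >= threshold_c else "categorical"


missing = {"NA", ""}
kind_empty = infer_variable_type(["NA", "", "NA"], missing, 5)
kind_code = infer_variable_type(["1", "2", "1", "NA"], missing, 5)
kind_lab = infer_variable_type(["1.0", "2.5", "3.1", "4.7", "5.2"], missing, 5)
```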

Thereafter, the variable attribute setting means (23) sets the automatic attribute information of the continuous and categorical variables.

Once the automatic setting information for the continuous and categorical variables has been set by the variable attribute automatic setting means (20), the data preprocessing control means (40) provides the user with a data preprocessing interface through the interface means (10) so that the continuous and categorical variables can be preprocessed.

The interface provided for per-variable data preprocessing comprises variable name display means and variable data preprocessing selection means for selecting the variable to be preprocessed, and includes a data preprocessing sub-menu that appears according to the selection made in the variable data preprocessing selection means.

FIGS. 8 to 15 show the continuous variable data preprocessing control interface provided to the user through the interface means (10).

When the user selects continuous variable data preprocessing, the variable name is shown in the variable name display means at the top, and the user checks and resets the attributes of the variable by selecting, in order, the sub-menus below the continuous variable data preprocessing selection means.

FIG. 8 shows the case where variable name/attribute change is selected after the user has chosen continuous variable data preprocessing.

When variable name/attribute change is selected, variable name change means is presented below it, showing the currently set variable name; below that, variable setting means for resetting the variable lets the user choose whether to keep it as a continuous variable or convert it to a categorical variable.

The interface also includes data preprocessing step selection means (proceed to next step, return to previous step) for moving to the next or previous preprocessing step, and data attribute/distribution check means for checking data attributes and distribution.

Through the variable setting means, the user can choose whether to reset the continuous variable as a categorical variable or keep it as a continuous variable.

If the user sets it as a categorical variable, categorical variable data preprocessing is executed; if it is kept as a continuous variable, the current data preprocessing continues.

The user can then proceed to the outlier definition process by selecting outlier definition in the sub-menu or by selecting proceed to next step.

FIG. 9 shows the user interface for outlier definition.

The outlier definition interface provided to the user includes range selection means for selecting the numeric range to be treated as outliers.

The range selection means includes selectable items for choosing the values to be treated as outliers, and input means through which the user can enter the values directly.

FIGS. 10 and 11 show user interfaces for the data substitution process. FIG. 10 shows the individual value change interface, in which values are selected directly from the data value list and changed by input, and FIG. 11 shows the conditional substitution interface, in which data values satisfying a condition are selected and changed.

As shown in FIG. 10, the individual value change interface includes a data substitution list displaying the previous data value (Old Value) and the newly changed value (New Value), and data value input means for changing a data value.

The user can check the previous data value and enter a new data value through the data value input means; the changed value is recorded under 'New Value' in the data substitution list.

As shown in FIG. 11, the conditional substitution interface comprises input means for the condition value to be changed and input means for the replacement value, and includes a data substitution list.

The user selects the data value to be replaced, either by entering it directly or by choosing it from the data substitution list, then enters the replacement value and adds the pair to the substitution list.

The user can also delete values from the data substitution list by selecting them (double-click) or by choosing delete all.
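Behind these two interfaces, the substitution list can be thought of as a set of (old value, new value) pairs, and conditional substitution as a predicate plus a replacement value. The sketch below is one possible reading; the function names and the example condition are hypothetical.

```python
def apply_substitutions(values, substitution_list):
    """Replace each value that appears as an Old Value with its New Value."""
    mapping = dict(substitution_list)
    return [mapping.get(v, v) for v in values]


def replace_where(values, condition, new_value):
    """Replace every value that satisfies the condition with new_value."""
    return [new_value if condition(v) else v for v in values]


# Individual value change: (Old Value, New Value) pairs
sex_clean = apply_substitutions(
    ["m", "M", "male", "F"], [("m", "M"), ("male", "M")]
)

# Conditional substitution: implausible readings become missing (None)
sbp_clean = replace_where([120, 999, 135], lambda v: v > 300, None)
```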

FIG. 12 shows the user interface for the missing value definition process.

As shown in FIG. 12, the missing value definition interface includes a missing value list and missing value input means for adding entries to the list, and the missing value list includes means for deleting missing values.

The user can add a value to be defined as a missing value by entering it in the missing value input unit, and can delete entries from the missing value list individually (double-click) or delete all entries by choosing delete all.

FIG. 13 shows the interface for the imputation process.

As shown in FIG. 13, the imputation (missing value replacement) interface includes items for selecting the missing value replacement (imputation) method.

The user can perform the imputation (missing value replacement) process by selecting one of the following items.

(a). Do not use

(b). Select a value to replace missing values

(c). Use values of a variable measured repeatedly over time
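The three options might map onto code roughly as follows. Options (a) and (b) follow the list directly. For option (c) the text does not say how the repeated measurements are used, so carrying the last observed value forward (LOCF) is purely an assumption chosen for illustration; the function name and method labels are also hypothetical.

```python
def impute(values, method="none", fill_value=None):
    """Replace None entries in a time-ordered series of one variable."""
    if method == "none":        # (a) do not use imputation
        return list(values)
    if method == "value":       # (b) replace with a user-selected value
        return [fill_value if v is None else v for v in values]
    if method == "locf":        # (c) assumed: last observation carried forward
        result, last = [], None
        for v in values:
            if v is not None:
                last = v
            result.append(last)
        return result
    raise ValueError(f"unknown method: {method!r}")


filled_fixed = impute([1, None, 3], "value", fill_value=0)
filled_locf = impute([None, 5, None, None, 7], "locf")
```

With LOCF, a missing value at the start of the series stays missing because there is no earlier observation to carry forward.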

FIGS. 14 and 15 show interfaces for the new variable creation process. FIG. 14 shows the new variable creation interface using function transformation, and FIG. 15 shows the new variable creation interface using interval categorization.

As shown in FIG. 14, the new variable creation interface using function transformation provides selectable functions to apply and direct input means, and includes a list of the new variables created by applying the function transformation.

The new variable list shows information on the newly created variables and includes means for deleting a selected variable or deleting all of them.

The user can create a new variable by choosing one of the selectable functions or by entering a function through the direct input means, and can delete a new variable by selecting it (double-click) in the new variable list, or delete the entire list by choosing delete all.

As shown in FIG. 15, the new variable creation interface using interval categorization comprises interval setting means for the categorization and includes a list of the new variables created with the categorization intervals applied.

The interval setting means comprises means for selecting the number of intervals and means for selecting the number of decimal places, and presents a list of the interval ranges determined by these selections so that the ranges can be modified.
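As an illustration of how a number of intervals and a number of decimal places could produce an initial range list, the sketch below splits the observed range into equal-width intervals. Equal width is an assumption (the text does not say how the initial ranges are chosen), and the function names and label format are hypothetical. Input values are assumed to contain no missing entries.

```python
def interval_ranges(values, n_intervals, decimals):
    """Equal-width (low, high) ranges over the observed range, edges rounded."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    edges = [round(lo + i * width, decimals) for i in range(n_intervals + 1)]
    return list(zip(edges[:-1], edges[1:]))


def categorize(value, ranges):
    """Label a value by its range; the last range also includes its upper edge."""
    for index, (low, high) in enumerate(ranges):
        is_last = index == len(ranges) - 1
        if low <= value < high or (is_last and value == high):
            return f"{low}-{high}"
    return None  # outside every range, e.g. after the user edits the list


ranges = interval_ranges([20, 35, 50, 80], n_intervals=3, decimals=1)
labels = [categorize(v, ranges) for v in [20, 40, 80]]
```

The returned list plays the role of the editable range list: the user could adjust the tuples before `categorize` is applied.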

The user can set the intervals to categorize, apply them to create a new variable, and delete selected or all new variables in the new variable list.

FIGS. 8 to 15 thus show the data preprocessing process for continuous variables. Each interface includes, in common, data attribute/distribution check means, finish data preprocessing selection means for ending preprocessing partway through, and run data preprocessing selection means, which is activated once all continuous variable data preprocessing sub-menus have been performed so that the preprocessing can be applied.

FIGS. 16 to 22 show the categorical variable data preprocessing control interface provided to the user through the interface means.

When the user selects categorical variable data preprocessing, the variable name is shown in the variable name display means at the top, and the user checks and resets the attributes of the variable by selecting, in order, the sub-menus below the categorical variable data preprocessing selection means.

FIG. 16 shows the case where variable/sub-category name and attribute change is selected after the user has chosen categorical variable data preprocessing.

When variable/sub-category name and attribute change is selected, variable name change means is presented below it, showing the currently set variable name; below that, variable setting means for resetting the variable lets the user choose whether to keep it as a categorical variable or convert it to a continuous variable.

It also includes sub-category name and order change means, which in turn includes selection means for resetting the sub-category order.

The interface also includes data preprocessing step selection means (proceed to next step, return to previous step) for moving to the next or previous preprocessing step, and data attribute/distribution check means for checking data attributes and distribution.

Through the variable setting means, the user can choose whether to reset the categorical variable as a continuous variable or keep it as a categorical variable.

If the user sets it as a continuous variable, the continuous variable data preprocessing described above is executed; if it is kept as a categorical variable, the current data preprocessing continues.

The user can also rename each sub-category or change the order of the sub-categories through the sub-category name and order change means.

The user can then proceed to the individual value change step of the data substitution process by selecting individual value change in the sub-menu or by selecting proceed to next step.

FIGS. 17 and 18 show user interfaces for the data substitution process. FIG. 17 shows the individual value change interface, in which values are selected directly from the data value list and changed by input, and FIG. 18 shows the conditional substitution interface, in which data values satisfying a condition are selected and changed.

As shown in FIG. 17, the individual value change interface includes a data substitution list displaying the previous data value (Old Value) and the newly changed value (New Value), and the data substitution list includes data change means for changing a data value.

The user can check the previous data value and directly enter a new data value through the data value input unit.

As shown in FIG. 18, the conditional substitution interface comprises input means for the condition value to be changed and input means for the replacement value, and includes a data substitution list.

The user can also delete values from the data substitution list by selecting them (double-click) or by choosing delete all.

FIG. 19 shows the user interface for the missing value definition process.

As shown in FIG. 19, the missing value definition interface includes a missing value list and missing value input means for adding entries to the list, and the missing value list includes means for deleting missing values.

The user can add a value to be defined as a missing value by entering it in the missing value input means, and can delete entries from the missing value list individually (double-click) or delete all entries by choosing delete all.

FIG. 20 shows the interface for the imputation process.

As shown in FIG. 20, the imputation (missing value replacement) interface includes means for selecting the missing value replacement (imputation) method.

The user can perform the imputation (missing value replacement) process by selecting one of the following items offered by the missing value replacement method selection means.

(a). Do not use

FIGS. 21 and 22 show interfaces for the new variable creation process. FIG. 21 shows the new variable creation interface using sub-category merging, and FIG. 22 shows the new variable creation interface using sub-category separation.

As shown in FIG. 21, the new variable creation interface using sub-category merging provides display means showing the sub-categories contained in the variable, and the display means includes setting means for setting the sub-categories.

It further includes sub-category merge application means for creating a new variable by merging the sub-categories selected in the setting means, and a list of the new variables created by sub-category merging.

The new variable list shows information on the newly created variables and includes means for deleting a selected variable or deleting all of them.

As shown in FIG. 22, the new variable creation interface using sub-category separation includes selection means for choosing whether to create the new variable without separating the sub-categories or after separating them, and a new variable list.

The user chooses whether to leave the sub-categories unseparated or to separate them and create individual variables, and applies the choice to create new variables; the variables thus created are recorded in the new variable list.

The user can delete a new variable by selecting it (double-click) in the new variable list, or delete the entire list by choosing delete all.
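Sub-category merging and separation can be illustrated with two short functions. The names `merge_categories` and `one_vs_others`, the "X vs. others" naming of the generated variables, and the use of `None` for missing values are assumptions for this sketch; the example categories (smoking status; DM, HTN, HL) follow the examples given earlier in the description.

```python
def merge_categories(values, merge_map):
    """Merging: map selected sub-categories onto a shared new sub-category."""
    return [merge_map.get(v, v) for v in values]


def one_vs_others(values):
    """Separation: one binary variable per sub-category (sub-category vs. others)."""
    categories = sorted({v for v in values if v is not None})
    return {
        f"{c} vs. others": [
            None if v is None else (c if v == c else "others") for v in values
        ]
        for c in categories
    }


smoking = ["never", "former", "current", "never"]
smoking_binary = merge_categories(
    smoking, {"never": "non-smoking", "former": "non-smoking", "current": "smoking"}
)

diagnosis = ["DM", "HTN", "HL", "DM"]
separated = one_vs_others(diagnosis)
```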

FIGS. 16 to 22 thus show the data preprocessing process for categorical variables. Each interface includes, in common, data attribute/distribution check means, finish data preprocessing selection means for ending preprocessing partway through, and run data preprocessing selection means, which is activated once all categorical variable data preprocessing sub-menus have been performed so that the preprocessing can be applied.

FIG. 23 shows the data distribution check interface for continuous variables.

It is launched through the data attribute/distribution check means and includes variable name display means, numeric display means showing the distribution as text values, box plot display means showing the distribution as boxes, and histogram display means showing the distribution as a graph. The box plot and histogram display means each include figure editing means for modifying the figure and pop-up viewing means for viewing the figure in a pop-up window.

FIG. 24 shows the box plot visualization parameter editing interface that appears when the figure editing means of the box plot is selected in the interface of FIG. 23, and FIG. 25 shows the histogram visualization parameter editing interface that appears when the figure editing means of the histogram is selected in the interface of FIG. 23.

Through the box plot visualization parameter editing interface, the user can set the box width, horizontal whisker width, vertical whisker display range, axis display range, axis display method, figure orientation, figure width and height, resolution, figure file format, and so on.

Through the histogram visualization parameter editing interface, the user can likewise set the number of bars, axis display range, axis display method, figure width and height, resolution, and figure file format.

도 26은 범주형 변수 데이터의 분포를 확인하기 위한 인터페이스를 나타낸다. 26 shows an interface for confirming the distribution of categorical variable data.

범주형 변수 데이터의 변수이름표시수단과, 수치분포를 텍스트 수치로 표시하는 수치표시수단, 수치분포를 바 플로트(Bar plot)로 표시하는 바 플로트표시수단을 포함하며, 상기 바 플로트표시수단은 각 그림을 수정할 수 있도록 그림수정수단, 팝업창으로 그림을 볼 수 있도록 하는 팝업창으로 그림보기 수단을 포함한다.And variable name display means for categorical variable data, numerical display means for displaying numerical distributions as text figures, and bar float display means for displaying numerical distributions as bar plots, wherein the bar float display means are It includes picture modification means so that you can modify the picture, and picture viewing means as a pop-up window that allows you to view the picture in a pop-up window.

FIG. 27 shows the bar plot visualization parameter modification interface that appears when the picture modification means of the bar plot is selected in the interface of FIG. 26.

Through the bar plot visualization parameter modification interface, the user can set the bar width, the spacing between bars, the Y-axis display method (sample count or sample ratio), the Y-axis display range, the plot orientation (vertical or horizontal), the text shown above each bar (sample count or sample ratio), the width and height of the picture, the picture resolution, the picture file format, and so on.
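The two Y-axis display methods named above, sample count and sample ratio, can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `bar_heights` and the mode strings `"count"` and `"ratio"` are hypothetical.

```python
from collections import Counter


def bar_heights(categories, y_mode="count"):
    """Bar heights per category, as sample counts or as sample ratios."""
    counts = Counter(categories)
    if y_mode == "count":
        return dict(counts)
    if y_mode == "ratio":
        total = sum(counts.values())
        return {level: n / total for level, n in counts.items()}
    raise ValueError("y_mode must be 'count' or 'ratio'")
```

For example, `bar_heights(["M", "F", "M", "M"], "ratio")` gives a height of 0.75 for `"M"` and 0.25 for `"F"`; the same values could serve as the text shown above each bar.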

FIG. 28 shows the interface provided through the means for setting text to be inserted into a picture, which is offered by the interfaces of FIGS. 25 and 27.

In the means for setting text to be inserted into a picture, the text settings include the text input, text size, horizontal text position, vertical text position, font, and color. The interface also includes a list of the texts to be inserted, and that list includes means for deleting a registered text.

FIG. 29 shows the color adjustment interface provided through the color adjustment and picture file format setting means, which is offered by the interfaces of FIGS. 25 and 27 for continuous and categorical variables.

The color adjustment interface allows the fill color of a shape, the outline color of a shape, and the background color to be adjusted, and includes means for adjusting each color by name or by numeric (R, G, B) value.

FIGS. 30 to 32 show interfaces for creating new variables by combining variables. FIG. 30 shows the interface for creating a new variable by combining continuous variables, FIG. 31 shows the interface for creating a new variable by combining categorical variables, and FIG. 32 shows the interface for creating a new variable by combining continuous and categorical variables.

When the user chooses to create a new variable by combining variables, an interface is presented with three menus, as in FIGS. 30 to 32: create a variable by combining continuous variables, create a variable by combining categorical variables, and create a variable by combining continuous and categorical variables. Selecting a menu opens the corresponding new variable creation interface shown in FIGS. 30 to 32.

As shown in FIG. 30, the interface for creating a new variable by combining continuous variables comprises variable selection means that displays the registered continuous variables and lets the user select among them, new variable creation means for creating a new variable from the variables selected with the variable selection means, and new variable list providing means that lists the newly created variables. The new variable list providing means includes new variable deletion means so that a newly created and registered variable can be deleted at the user's choice.
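The select, create, list, and delete flow described above can be sketched in a few lines. This is only an illustration under assumptions: the patent does not say how the combination is expressed, so the class name `NewVariableRegistry`, the column names, and the use of a caller-supplied function are all hypothetical.

```python
class NewVariableRegistry:
    """Keeps the data columns plus the list of newly created variables."""

    def __init__(self, data):
        self.data = data      # column name -> list of numbers (None = missing)
        self.created = []     # names shown in the "new variable list"

    def create(self, name, sources, func):
        """Create `name` by applying `func` row by row to the selected columns."""
        columns = [self.data[s] for s in sources]
        self.data[name] = [
            # Assumption: a missing input yields a missing result.
            None if any(v is None for v in row) else func(*row)
            for row in zip(*columns)
        ]
        self.created.append(name)

    def delete(self, name):
        """Remove a newly created variable from the list and from the data."""
        self.created.remove(name)
        del self.data[name]


registry = NewVariableRegistry({"weight_kg": [70.0, 80.0], "height_m": [1.75, 2.0]})
registry.create("bmi", ["weight_kg", "height_m"], lambda w, h: w / h ** 2)
```

Only variables recorded in `created` can be removed with `delete`, which mirrors the description: the deletion means belongs to the new variable list, not to the original data.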

As shown in FIG. 31, the interface for creating a new variable by combining categorical variables includes variable selection means for selecting the variables to combine, detailed category resetting means for redefining the detailed categories, new variable creation means, and means for providing a list of newly created variables. The list providing means includes deletion means for deleting a newly created variable.

As shown in FIG. 32, the interface for creating a new variable by combining continuous and categorical variables includes means for selecting the continuous variable to be categorized and setting its intervals, means for selecting the categorical variable to combine with the continuous variable, means for adjusting the detailed categories that the combination will produce, new variable creation means, and means for providing a list of newly created variables. The list providing means includes deletion means for deleting a newly created variable.
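A minimal sketch of this combination is given below: the continuous variable is first cut into intervals, and each interval label is then joined with the level of the categorical variable. The cut points, labels, separator, and function names are hypothetical; the patent leaves the interval boundaries and the resulting category names to the user.

```python
from bisect import bisect_right


def categorize(value, cut_points, labels):
    """Map a number to an interval label; a value equal to a cut point goes to the upper interval."""
    return labels[bisect_right(cut_points, value)]


def combine(continuous, categorical, cut_points, labels, sep="/"):
    """Build the combined detailed category for every row."""
    return [
        f"{categorize(x, cut_points, labels)}{sep}{level}"
        for x, level in zip(continuous, categorical)
    ]


ages = [34, 61, 70]
sexes = ["F", "M", "F"]
combined = combine(ages, sexes, [40, 65], ["<40", "40-64", ">=65"])
# combined == ["<40/F", "40-64/M", ">=65/F"]
```

With `k` cut points there must be `k + 1` labels. The adjustment of the generated detailed categories mentioned in the text would amount to renaming or merging the resulting strings.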

FIGS. 33 to 35 show interfaces for checking the data distribution of combined variables. FIG. 33 is the interface for checking the combined data distribution between continuous variables, FIG. 34 is the interface for checking the combined data distribution between a continuous variable and a categorical variable, and FIG. 35 is the interface for checking the combined data distribution between categorical variables.

As shown in FIG. 33, the interface for checking the combined data distribution between continuous variables includes means for selecting the variables to combine, means for checking the data distribution and statistics of the combined variables, picture means that displays the combined variable data distribution obtained through the distribution and statistics checking means, and means for providing the result of the correlation analysis between the variables (Result of correlation analysis).

Taking the properties of continuous variables into account, the picture means displays the result as a scatter plot.
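The patent does not name the correlation measure it reports. As one common choice, the sketch below computes the Pearson correlation coefficient for two continuous variables; the function name `pearson_r` and the decision to drop rows with a missing value (`None`) are assumptions of this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equally long columns, ignoring rows with None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("at least two complete pairs are required")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)
```

The same `(x, y)` pairs are the points that a scatter plot would draw.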

As shown in FIG. 34, the interface for checking the combined data distribution between a continuous variable and a categorical variable includes means for selecting the variables to combine, means for checking the data distribution and statistics of the combined variables, picture means that displays the combined variable data distribution obtained through the distribution and statistics checking means, and means for providing the result of the mean difference test between the variables (Result of mean difference test).

Taking the properties of the continuous and categorical variables into account, the picture means displays the result as a graph showing the difference in mean values.
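The specific mean difference test is not identified in the text. Purely as an illustration, the sketch below groups a continuous variable by the levels of a categorical variable and computes Welch's t statistic for two groups; the choice of Welch's test and the names `group_values` and `welch_t` are assumptions, and no p-value is computed.

```python
from math import sqrt
from statistics import mean, variance


def group_values(values, categories):
    """Split a continuous column by the level of a categorical column."""
    groups = {}
    for value, level in zip(values, categories):
        if value is not None and level is not None:
            groups.setdefault(level, []).append(value)
    return groups


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference of two group means (each group needs >= 2 values)."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error
```

The per-group means returned by `mean(...)` are the values that a mean difference graph would show; with more than two levels a different test would be needed.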

As shown in FIG. 35, the interface for checking the combined data distribution between categorical variables includes means for selecting the variables to combine, means for checking the data distribution and statistics of the combined variables, picture means that displays the combined variable data distribution obtained through the distribution and statistics checking means, and means for providing the result of the contingency table analysis between the variables (Result of contingency table analysis).

Taking the properties of categorical variables into account, the picture means displays the result as a graph.
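A contingency table for two categorical variables, together with Pearson's chi-square statistic, can be sketched as below. The patent only says that a contingency table analysis result is provided, so the chi-square statistic and the names `contingency_table` and `chi_square` are assumptions of this example.

```python
from collections import Counter


def contingency_table(a, b):
    """Cross-tabulate two categorical columns; returns row levels, column levels, counts."""
    counts = Counter(zip(a, b))
    rows, cols = sorted(set(a)), sorted(set(b))
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof
```

A table built by `contingency_table` always satisfies the positive-totals assumption, because its row and column levels are taken from observed values.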

FIG. 36 shows an interface for extracting data that meets specified conditions.

The interface includes means for providing a variable list for variable selection, means for selecting the sub-levels of a categorical variable, means for specifying the range of a continuous variable, condition setting means for setting conditions, registration means for registering data extraction conditions, means for providing a list of the conditions registered through the registration means, and data extraction and storage means for extracting and storing data according to the set conditions.
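The extraction flow above (choose a variable, pick sub-levels or a numeric range, register the condition, then extract) can be sketched as follows. The names `make_condition` and `extract`, the inclusive range bounds, and the `"and"`/`"or"` combination mode are assumptions; the patent does not say how registered conditions are combined.

```python
def make_condition(variable, levels=None, low=None, high=None):
    """Build one registered condition: sub-levels for a categorical variable,
    or an inclusive [low, high] range for a continuous variable."""
    def check(row):
        value = row.get(variable)
        if value is None:          # assumption: missing values never match
            return False
        if levels is not None:
            return value in levels
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
        return True
    return check


def extract(rows, conditions, mode="and"):
    """Return the rows that satisfy all ('and') or any ('or') registered conditions."""
    combine = all if mode == "and" else any
    return [row for row in rows if combine(cond(row) for cond in conditions)]


rows = [{"sex": "F", "age": 34}, {"sex": "M", "age": 61}, {"sex": "F", "age": 70}]
registered = [make_condition("sex", levels={"F"}), make_condition("age", low=50, high=80)]
selected = extract(rows, registered)
# selected == [{"sex": "F", "age": 70}]
```

Storing the result, for example by writing `selected` to a file, would correspond to the data extraction and storage means.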

In this way, the present invention examines the clinical data to determine which variables are continuous and which are categorical, performs automatic setting for each continuous variable, and then presents the user with the pre-processing steps required for each variable, so that pre-processing of the clinical data proceeds automatically through checking and changing (resetting) the data values.
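The continuous/categorical decision summarized here is spelled out in the claims: values are checked for characters, a missing value list is consulted, any remaining non-numeric value makes the variable categorical, and an all-numeric variable is categorical when its number of distinct values N is below a set number C. The sketch below is one reading of that rule, not the patented implementation. In particular, it assumes that values on the missing value list are simply set aside before the decision and that N equal to or greater than C means continuous; the names `classify_variable` and `is_number` are hypothetical.

```python
def is_number(text):
    """True if the text parses as a number.

    Note: float() also accepts 'nan' and 'inf'; list such tokens as missing if they occur.
    """
    try:
        float(text)
        return True
    except ValueError:
        return False


def classify_variable(values, missing_tokens, c_threshold):
    """Return 'categorical' or 'continuous' for one column of raw text values."""
    # Assumption: tokens on the missing value list are ignored for the decision.
    observed = [v.strip() for v in values if v.strip() not in missing_tokens]
    if any(not is_number(v) for v in observed):
        return "categorical"                 # a non-numeric character remains
    n_distinct = len({float(v) for v in observed})
    return "categorical" if n_distinct < c_threshold else "continuous"
```

For example, a column coded `"1"`, `"2"`, `"NA"` with `c_threshold=3` is classified as categorical, while a column of twenty distinct measurements with `c_threshold=10` is classified as continuous.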

Claims

A data pre-processing automation system used for clinical data analysis, comprising:
interface means (10) for providing an interface for a user's data pre-processing;
variable attribute automatic setting means (20) for determining whether the input data contains characters, setting a variable according to the characters contained, and automatically setting variable attributes for each variable so set;
data pre-processing process management means (30) in which data pre-processing processes are stored and managed, and which provides and manages the data pre-processing process for each variable at the request of data pre-processing control means (40);
the data pre-processing control means (40), which provides the user with an interface for data pre-processing for each variable according to the data pre-processing process, and controls data pre-processing according to the setting values for data pre-processing entered through the interface;
variable combination variable management means (50), which provides a user interface for creating new variables by combining variables after data pre-processing is completed, and creates and controls new variables according to the user's combination;
statistical analysis means (60), which provides, through the interface means (10), a user interface for checking the combined data distribution between variables, and provides distribution checks and statistical information at the user's request; and
data extraction means (70), which provides an interface for creating and entering conditions for extracting data, and extracts and provides data according to the conditions entered by the user.

The data pre-processing automation system of claim 1, wherein the variable attribute automatic setting means (20) comprises: missing value processing means (21) for identifying characters in the input data and processing missing values among those characters by referring to missing value list information; variable setting means (22) for setting a continuous variable or a categorical variable according to numeric information; variable attribute setting means (23) for automatically setting variable attributes according to the continuous variable or categorical variable set by the variable setting means (22); and missing value management means (24) in which the missing value list is stored and managed.

The data pre-processing automation system of claim 2, wherein the variable attribute setting means (23) automatically sets variable attributes for each continuous variable and each categorical variable, and the basic attribute information that is automatically set includes, for a continuous variable, the variable title, the variable name, and numeric distribution information, and, for a categorical variable, the variable title, the variable name, the title of each variable subtype, and the name of each variable subtype.

The data pre-processing automation system of claim 1, wherein the data pre-processing process management means (30) holds a data pre-processing process for each of continuous variables and categorical variables, and further includes data pre-processing editing means for adding and deleting data pre-processing processes.

The data pre-processing automation system of claim 1, wherein the data pre-processing control means (40) includes continuous variable data pre-processing process means and categorical variable data pre-processing process means, which control data pre-processing according to the data pre-processing process for continuous variables and for categorical variables respectively, provides the user, through an interface, with the information required for data pre-processing according to the data pre-processing process, and performs and controls data pre-processing according to the setting values entered by the user through the user interface,
wherein the continuous variable data pre-processing process means performs: once automatic setting of the variable attributes of a continuous variable is complete, a process of asking the user whether to keep the variable as a continuous variable or convert it into a (new) categorical variable, and determining from the information entered by the user whether to perform the continuous variable data pre-processing process or the categorical variable data pre-processing process; if it is determined to proceed as a continuous variable, data pre-processing of the continuous variable, including a variable name setting process that provides the user with an interface for changing the variable name and sets the variable name according to the information entered by the user, an outlier definition process in which the user selects and sets the numeric range to be treated as outliers, a data substitution process in which the user can change data values, a missing value definition process in which the user newly registers and deletes missing values, an imputation process that replaces missing values with numbers that are not missing values, a new variable creation process, and a variable data distribution checking process; and, if it is determined to proceed with the categorical variable data pre-processing process, a process of switching to a categorical variable so that the categorical variable data pre-processing process is performed by the categorical variable data pre-processing process means,
and wherein the categorical variable data pre-processing process means performs: once automatic setting of the categorical variable is complete, a process of asking the user whether to keep the variable as a categorical variable or convert it into a (new) continuous variable, and determining from the information entered by the user whether to perform the categorical variable data pre-processing process or the continuous variable data pre-processing process; if it is determined to proceed as a categorical variable, data pre-processing of the categorical variable, including a variable name setting process that provides an interface for changing the variable name and the detailed category names and changes them according to the information entered by the user, a detailed category order changing process for changing the order of the detailed categories, a data substitution process in which the user can change data values, a missing value definition process in which the user registers and deletes missing values, an imputation process that replaces missing values with numbers or characters, a new variable creation process, and a variable data distribution checking process; and, if it is determined to proceed by converting to a continuous variable, a process of providing, when the data contains any characters, an interface asking the user whether to treat them as missing values or replace them with numbers, processing them according to the information entered by the user, and then having the continuous variable data pre-processing process means perform the continuous variable data pre-processing process.

The data pre-processing automation system of claim 1, wherein the interface for data pre-processing for each variable provided by the data pre-processing control means (40) is configured such that,
when the user selects continuous variable data pre-processing or categorical variable data pre-processing, the variable name is displayed on variable name display means at the top and a submenu is provided for checking and resetting the attributes of the selected variable while its data is pre-processed, and the interface includes variable setting means for resetting whether to proceed with the selected variable as it is (continuous variable or categorical variable) or to convert it into the other type (categorical variable or continuous variable), data pre-processing step selection means for proceeding to the next step or returning to the previous step of the data pre-processing configured and set for each variable, and data attribute composition checking means for checking data attributes and distribution.

The data pre-processing automation system of claim 5, wherein the outlier definition interface provided to the user by the data pre-processing control means (40) according to the continuous variable data pre-processing process
includes range selection means for selecting the range of numbers to be treated as outliers, and the range selection means includes selection items for selecting the numbers to be treated as outliers and input means for entering numbers directly.

The data pre-processing automation system of claim 5, wherein the user interface for performing the data substitution process, provided by the data pre-processing control means (40) according to the variable data pre-processing process, includes:
an individual value change interface including a data substitution list that displays the previous data values (old values) and the new values (new values), and data value input means for changing data values; and
a substitution-after-condition-setting interface comprising input means for the condition value to be changed and input means for the value to change to, and including a data substitution list,
wherein the user selects the data value to replace either by entering it directly or by choosing it from the data substitution list and adds the value after the change to the substitution list, and means are included for deleting values from the data substitution list by user selection (double-click or delete all).

The data pre-processing automation system of claim 5, wherein the user interface for performing the missing value definition process, provided by the data pre-processing control means (40) according to the variable data pre-processing process,
includes a missing value list and missing value input means for adding to the missing value list, the missing value list including deleting means.

The data pre-processing automation system of claim 5, wherein the user interface for performing the imputation process, provided by the data pre-processing control means (40) according to the variable data pre-processing process,
includes items for selecting the missing value replacement method,
the items from which the user selects the missing value replacement method comprising:
(a) not used;
(b) select a value to replace the missing values; and
(c) use in the case of a variable measured repeatedly over time.

The data pre-processing automation system of claim 5, wherein the user interface for performing the new variable creation process, provided by the data pre-processing control means (40) according to the continuous variable data pre-processing process, includes a new variable creation interface that applies a function transformation and a new variable creation interface that uses interval categorization settings,
the new variable creation interface using function transformation provides function selection items and direct input means for applying a function transformation, and includes a list of new variables created by applying a function transformation, the list providing information on the newly created variables and including means for selecting and deleting a newly created variable or deleting all of them,
the new variable creation interface using interval categorization settings includes interval setting means for the intervals to be categorized, and includes a list of new variables created by applying the categorization intervals,
the interval setting means includes means for selecting the number of intervals and means for selecting the number of decimal places, and provides a list of the interval ranges determined by those two selections, in which the interval ranges can be modified,
means are included for selecting and deleting, or deleting all of, the new variables in the new variable creation list, together with means for changing the detailed category names and order, the latter including means for selecting initialization of the detailed category order, and
data pre-processing step selection means (proceed to the next step, return to the previous step) and data attribute composition checking means for checking data attributes and distribution are included, so that data pre-processing can proceed to the next step or return to the previous step.

The data pre-processing automation system of claim 8, wherein the individual value change interface includes a data substitution list that displays the previous data values (old values) and the new values (new values), the data substitution list including data changing means for changing data values,
the substitution-after-condition-setting interface comprises input means for the condition value to be changed and input means for the value to change to, and includes a data substitution list, and
deleting means is included for selecting and deleting values in the data substitution list.

The data pre-processing automation system of claim 5, wherein the user interface for performing the new variable creation process, provided by the data pre-processing control means (40) according to the categorical variable data pre-processing process, includes a new variable creation interface based on detailed category integration and a new variable creation interface based on detailed category separation,
the new variable creation interface based on detailed category integration provides display means for the detailed categories contained in the variable, the display means including setting means for selecting detailed categories, and includes detailed category integration application means for creating a new variable by integrating the detailed categories selected with the setting means, and a list of new variables created by detailed category integration, the list providing information on the newly created variables and including means for selecting and deleting a newly created variable or deleting all of them, and
the new variable creation interface based on detailed category separation includes selection means for choosing whether to create the new variable without separating the detailed categories or to create it after separation, and includes a new variable list.

The data pre-processing automation system of any one of claims 6 to 13, further comprising data attribute/distribution checking means, data pre-processing finish selection means for ending the data pre-processing process partway through, and data pre-processing execution selection means that is activated, so that the data pre-processing can be applied, when all of the variable data pre-processing submenus have been completed,
wherein the interface for checking the data distribution of a continuous variable, opened through the data attribute/distribution checking means, includes variable name display means, numeric display means that shows the numeric distribution as text figures, box plot display means that shows the numeric distribution in box form, and histogram display means that shows the numeric distribution as a graph, the box plot display means and the histogram display means each including picture modification means for editing the picture and pop-up viewing means for viewing the picture in a pop-up window, and
the interface for checking the distribution of categorical variable data
includes variable name display means for the categorical variable data, numeric display means that shows the numeric distribution as text figures, and bar plot display means that shows the numeric distribution as a bar plot, the bar plot display means including picture modification means for editing the picture and pop-up viewing means for viewing the picture in a pop-up window.

The data pre-processing automation system of claim 5, wherein the interface for creating new variables through variable combination, provided by the variable combination variable management means (50), includes an interface for creating a new variable by combining continuous variables, an interface for creating a new variable by combining categorical variables, and an interface for creating a new variable by combining continuous and categorical variables,
the interface for creating a new variable by combining continuous variables comprises variable selection means that displays the registered continuous variables and lets the user select among them, new variable creation means for creating a new variable from the variables selected with the variable selection means, and new variable list providing means that lists the newly created variables, the new variable list providing means including new variable deletion means so that a newly created and registered variable can be deleted at the user's choice,
the interface for creating a new variable by combining categorical variables includes variable selection means for selecting the variables to combine, detailed category resetting means for redefining the detailed categories, new variable creation means, and means for providing a list of newly created variables, the list providing means including deletion means for deleting a newly created variable, and
the interface for creating a new variable by combining continuous and categorical variables includes means for selecting the continuous variable to be categorized and setting its intervals, means for selecting the categorical variable to combine with the continuous variable, means for adjusting the detailed categories that the combination will produce, new variable creation means, and means for providing a list of newly created variables, the list providing means including deletion means for deleting a newly created variable.

The data pre-processing automation system of claim 5, wherein the interface for checking the data distribution of combined variables, provided by the statistical analysis means (60), includes an interface for checking the combined data distribution between continuous variables, an interface for checking the combined data distribution between a continuous variable and a categorical variable, and an interface for checking the combined data distribution between categorical variables,
the interface for checking the combined data distribution between continuous variables includes means for selecting the variables to combine, means for checking the data distribution and statistics of the combined variables, picture means that displays the combined variable data distribution obtained through the distribution and statistics checking means, and means for providing the result of the correlation analysis between the variables (Result of correlation analysis),
the interface for checking the combined data distribution between a continuous variable and a categorical variable includes means for selecting the variables to combine, means for checking the data distribution and statistics of the combined variables, picture means that displays the combined variable data distribution obtained through the distribution and statistics checking means, and means for providing the result of the mean difference test between the variables (Result of mean difference test), and
the interface for checking the combined data distribution between categorical variables includes means for selecting the variables to combine, means for checking the data distribution and statistics of the combined variables, picture means that displays the combined variable data distribution obtained through the distribution and statistics checking means, and means for providing the result of the contingency table analysis between the variables (Result of contingency table analysis).

The data pre-processing automation system of claim 16, wherein the picture means of the interface for checking the combined data distribution between continuous variables displays the result as a scatter plot, taking the properties of continuous variables into account,
the picture means of the interface for checking the combined data distribution between a continuous variable and a categorical variable displays the result as a graph showing the difference in mean values, taking the properties of the continuous and categorical variables into account, and
the picture means of the interface for checking the combined data distribution between categorical variables displays the result as a graph, taking the properties of categorical variables into account.

The data pre-processing automation system of claim 6, wherein the interface for extracting data that meets specified conditions, provided by the data extraction means (70),
includes means for providing a variable list for variable selection, means for selecting the sub-levels of a categorical variable, means for specifying the range of a continuous variable, condition setting means for setting conditions, registration means for registering data extraction conditions, means for providing a list of the conditions registered through the registration means, and data extraction and storage means for extracting and storing data according to the set conditions.

The method consists of a variable attribute automatic setting process of the variable attribute automatic setting means 20, in which, when clinical data is input, the input clinical data is checked to determine whether it contains characters, each variable is automatically set according to the characters it contains, and the attributes of each set variable are automatically set; and a data preprocessing process of the data preprocessing control means 40, in which data preprocessing is performed according to the variables and variable properties set in the automatic setting process,
wherein the variable attribute automatic setting process of the variable attribute automatic setting means 20 comprises:
a character inclusion judgment process that, when clinical data is input, determines whether a variable of the clinical data contains characters; a variable setting process in which, when the character inclusion judgment process confirms that characters are included, the missing value list is consulted to identify missing values, the variable is deleted from the data if all of the included characters are missing values, the variable is set as a categorical variable if any non-numeric characters are included, and, if the data consists only of numbers, the number of distinct numeric values (N) is determined, the variable being set as a categorical variable when N is less than the set number (C) and as a continuous variable when N is not less than the set number (C); and a categorical variable property automatic setting process that, if the variable is judged to be a categorical variable in the variable setting process, extracts the name of the variable and sets the title of the variable, and extracts the names of the detailed categories and sets the corresponding titles; and
a continuous variable property automatic setting process that, if the variable is judged to be a continuous variable in the variable setting process, extracts the name of the variable, sets the title of the variable, and calculates the numerical distribution,
and wherein the data preprocessing process of the data preprocessing control means 40 comprises:
a data preprocessing process of a continuous variable, in which, when the automatic setting of the variable properties of the continuous variable is completed, the user is asked whether to keep the variable as a continuous variable or to convert it (as a new variable) into a categorical variable, and it is determined from the user's input whether to perform data preprocessing of the continuous variable or data preprocessing of a categorical variable; if it is determined to proceed as a continuous variable, the process includes a variable name setting process that provides the user with an interface for changing the variable name and sets the variable name according to the user's input, an outlier definition process that lets the user select and set the numerical range to be treated as outliers, a data substitution process that lets the user change data values, a missing value definition process that lets the user register and delete missing values, an imputation process that replaces missing values with numbers that are not missing values, a new variable generation process, and a variable data distribution check process; and if it is determined to perform data preprocessing of a categorical variable, the process includes converting the variable into a categorical variable so that the data preprocessing of the categorical variable can be performed;
a data preprocessing process of a categorical variable, in which, when the automatic setting of the variable properties of the categorical variable is completed, the user is asked whether to keep the variable as a categorical variable or to convert it (as a new variable) into a continuous variable, and it is determined from the user's input whether to perform data preprocessing of the categorical variable or data preprocessing of a continuous variable; if it is determined to proceed as a categorical variable, the process includes a variable name setting process that provides the user with an interface for changing the variable name and the detailed category names and changes them according to the user's input, a detailed category order change process that changes the order of the detailed categories, a data substitution process that lets the user change data values, a missing value definition process that lets the user register and delete missing values, an imputation process that replaces missing values with numbers or characters that are not missing values, a new variable generation process, and a variable data distribution check process; and if it is determined to convert the variable into a continuous variable and perform data preprocessing of the continuous variable, the process provides an interface that asks the user to confirm whether any characters present in the data should be replaced with a number or treated as missing values, processes the data according to the user's input, and then generates a continuous variable so that the data preprocessing process of the continuous variable can be performed;
a missing data deletion process that, when the data preprocessing process of the continuous variable and the data preprocessing process of the categorical variable are completed, deletes from the data any sample that has only missing values; and
a data preprocessing completion process that includes creating new variables by combining variables, checking the distribution of variable combination data, and extracting data that meets specified conditions.
An automated data pre-processing method used in clinical data analysis, characterized by comprising the above processes.
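The variable setting process in the claim above can be sketched as a small classification routine. This is one possible reading of the claim, not the patented implementation: the missing value list `MISSING_VALUES`, the threshold value chosen for `C`, and the function names are hypothetical, and each variable is assumed to arrive as a list of raw strings.

```python
MISSING_VALUES = {"", "NA", "N/A", "."}  # hypothetical missing value list
C = 5  # hypothetical set number: fewer distinct numbers than this means categorical


def is_number(text):
    """Return True if the string can be parsed as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def classify_variable(values, missing=MISSING_VALUES, c=C):
    """Classify one variable as 'delete', 'categorical', or 'continuous'.

    values: raw string values of one variable, one per sample.
    """
    observed = [v.strip() for v in values if v.strip() not in missing]
    if not observed:
        return "delete"  # every entry is a missing value
    if any(not is_number(v) for v in observed):
        return "categorical"  # non-numeric characters are present
    n = len({float(v) for v in observed})  # number of distinct numeric values (N)
    return "categorical" if n < c else "continuous"


print(classify_variable(["NA", "", "."]))
print(classify_variable(["1", "2", "male"]))
print(classify_variable(["0", "1", "1", "0", "NA"]))
print(classify_variable(["61.5", "70", "58.2", "66", "49", "NA"]))
```

The threshold comparison follows the claim wording: a numeric variable with fewer than C distinct values is treated as categorical (for example a 0/1 coded variable), and otherwise as continuous.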

The method of claim 19, wherein, in the data preprocessing process of the continuous variable and the data preprocessing process of the categorical variable, the data substitution process comprises (a) a method of directly inputting individual data values and (b) a method of automatically replacing, in a batch, all data that meets a condition.
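The two substitution modes can be sketched as follows. The function names `replace_one` and `replace_where` and the example coding of the variable are hypothetical; the sketch returns new lists rather than modifying the input, which is a design choice of the example and not something the claim specifies.

```python
def replace_one(values, index, new_value):
    """(a) Direct input: replace the value of a single, individually chosen entry."""
    result = list(values)
    result[index] = new_value
    return result


def replace_where(values, condition, new_value):
    """(b) Batch replacement: replace every value that meets the condition."""
    return [new_value if condition(v) else v for v in values]


# Hypothetical categorical variable with inconsistent coding.
sex = ["M", "male", "F", "m", "F"]

fixed_one = replace_one(sex, 1, "M")
fixed_all = replace_where(sex, lambda v: v.lower() in {"m", "male"}, "M")
print(fixed_one)
print(fixed_all)
```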

The method of claim 19, wherein, in the data preprocessing process of the continuous variable and the data preprocessing process of the categorical variable, the missing value definition process comprises providing an interface for modifying the missing value list so that the user can register new missing values in, and delete missing values from, the missing value list.

The method of claim 19, wherein, in the data preprocessing process of the continuous variable and the data preprocessing process of the categorical variable, the imputation process replaces each missing value with a number or character that is not a missing value, as follows:
(a) Imputation by variable: all missing values of the variable are replaced with a value selected from among the mean, median, minimum, and maximum of that variable (default = median).
(b) Repeatedly measured data with a time difference: the LOCF (Last Observation Carried Forward) method, which replaces a missing value with the most recently observed non-missing value, is applied as the initial setting, or the replacement value is selected from among the mean, median, minimum, and maximum of the repeatedly measured data.
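The two imputation options can be sketched as follows. The function names and example values are hypothetical, missing values are represented by `None`, and the behavior for a series that starts with a missing value (it stays missing under LOCF) is an assumption of this sketch, since the claim does not address that case.

```python
from statistics import mean, median


def impute_by_variable(values, method="median"):
    """(a) Replace every missing value (None) with one statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to compute a statistic from
    statistic = {"mean": mean, "median": median, "min": min, "max": max}[method]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


def impute_locf(values):
    """(b) Last Observation Carried Forward for repeated measurements in time order."""
    result, last = [], None
    for v in values:
        if v is not None:
            last = v
        result.append(last)  # stays None until a first observation exists
    return result


lab_value = [1.0, None, 3.0, 8.0]
print(impute_by_variable(lab_value))          # median is the default
print(impute_by_variable(lab_value, "mean"))

systolic_by_visit = [None, 120, None, None, 118]
print(impute_locf(systolic_by_visit))
```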

The method of claim 19, wherein, in the data preprocessing completion process, creating a new variable by combining variables comprises providing an interface for selecting a function item to be applied to a variable and an interface for the user to directly input a function, generating a list of the variables newly created through the user's selection or input, and providing an interface for the user to modify and manage that list.

The method according to claim 19, wherein, in the data preprocessing completion process, the variable combination data distribution check comprises providing an interface that lets the user check the data distribution of the registered continuous variables, presenting the numerical distribution of each continuous variable both as figures (blocks) and as a graph.
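One way to present the numerical distribution of a continuous variable as figures is to count the values falling into equal-width intervals. The sketch below assumes that the "blocks" in the claim correspond to such intervals, which the claim text does not confirm; the function name `bin_counts`, the number of intervals, and the example data are hypothetical, and the input is assumed to be non-empty with no missing values.

```python
def bin_counts(values, n_bins):
    """Count values in n_bins equal-width intervals between the minimum and maximum.

    Returns the interval edges and the count per interval; the maximum value
    is counted in the last interval.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - lo) / width), n_bins - 1) if width else 0
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


ages = [20, 25, 30, 35, 40, 60]
edges, counts = bin_counts(ages, 4)

# Text rendering of the same distribution, one row per interval.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count} ({count})")
```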

delete

The method of claim 19, wherein, in the data preprocessing completion process, the process of generating new variables by combining variables comprises:
(a) When one variable is used:
for a continuous variable, (1) creating a new continuous variable by applying a mathematical function to the variable, and (2) creating a categorical variable by segmenting the numerical range of the variable;
for a categorical variable, (1) creating a new categorical variable by merging detailed categories of the variable, and (2) creating as many new dichotomous (two-level) categorical variables as the variable has detailed categories (a specific detailed category vs. all others);
(b) When two or more continuous variables are used in combination:
creating a new continuous variable by applying the four arithmetic operations to the two or more continuous variables;
(c) When two or more categorical variables are used in combination:
creating a new categorical variable by combining the detailed categories of each categorical variable.
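The variable generation cases (a) to (c) can be sketched with a few small helpers. All function names, cut points, labels, and example values are hypothetical and serve only to illustrate each case; body mass index is used as an example of an arithmetic combination and is not mentioned in the claim.

```python
import math
from bisect import bisect_right


def transform(values, func):
    """(a) continuous (1): apply a mathematical function to a variable."""
    return [func(v) for v in values]


def segment(values, cut_points, labels):
    """(a) continuous (2): segment the range; needs len(labels) == len(cut_points) + 1."""
    return [labels[bisect_right(cut_points, v)] for v in values]


def merge_levels(values, mapping):
    """(a) categorical (1): merge detailed categories using a level-to-level mapping."""
    return [mapping.get(v, v) for v in values]


def one_vs_rest(values):
    """(a) categorical (2): one dichotomous variable per detailed category."""
    return {level: [level if v == level else "others" for v in values]
            for level in sorted(set(values))}


def arithmetic(xs, ys, op):
    """(b): combine two continuous variables element by element."""
    return [op(x, y) for x, y in zip(xs, ys)]


def combine_levels(xs, ys, sep="_"):
    """(c): combine the detailed categories of two categorical variables."""
    return [f"{x}{sep}{y}" for x, y in zip(xs, ys)]


roots = transform([1, 4, 9], math.sqrt)
age_group = segment([25, 40, 67], [30, 65], ["<30", "30-64", ">=65"])
stage = merge_levels(["I", "II", "III", "IV"], {"I": "early", "II": "early"})
dummies = one_vs_rest(["A", "B", "A", "C"])
bmi = arithmetic([70.0, 80.0], [2.0, 2.0], lambda w, h: w / h ** 2)
sex_smoker = combine_levels(["F", "M"], ["yes", "no"])
```

In `segment`, a value equal to a cut point falls into the upper interval, so with cut points 30 and 65 a value of 30 is labeled "30-64".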