KR101546041B1

KR101546041B1 - Method and apparatus for converting data, and method for verifying data conversion

Info

Publication number: KR101546041B1
Application number: KR1020120146639A
Authority: KR
Inventors: 윤창노; 이진각; 한원석
Original assignee: 한국과학기술연구원
Priority date: 2012-12-14
Filing date: 2012-12-14
Publication date: 2015-08-20
Also published as: KR20140077629A

Abstract

The present invention relates to a data conversion method and apparatus for converting the binary data of each of a plurality of binary data sets into real-valued data, and to a data conversion verification method. According to the invention, starting from binary data that can express only one of two states of the measurement data (normal or abnormal), real-valued data can be obtained that expresses a wider variety of states, for example the process of moving from normal to abnormal or the degree of closeness to abnormal.

Description

METHOD AND APPARATUS FOR CONVERTING DATA, AND METHOD FOR VERIFYING DATA CONVERSION

The present invention relates to a data conversion method and apparatus and to a data conversion verification method, and more particularly to a data conversion method and apparatus for converting the binary data of each of a plurality of binary data sets into real-valued data, and to a data conversion verification method.

State data describing the condition indicated by biometric measurement data used for disease association analysis is generally binary data that can distinguish only between disease and normal. In particular, a patient is classified as diseased only after diagnosis of the disease has been completed, so a patient who is in the process of changing from the normal state to the disease state cannot readily be represented by current state data. Consequently, when biometric measurement data is searched on the basis of current state data, it is difficult to separately extract patients who are in the process of changing from the normal state to the disease state.

To solve this problem, binary data that can distinguish only between disease and normal needs to be converted into real-valued data that can indicate the specific degree of association with the disease.

The technical problem to be solved by the present invention is to provide a data conversion method and apparatus, and a data conversion verification method, that convert the binary data of each of a plurality of binary data sets, each set including at least one item of measurement data and binary data expressing one of two states as the state of that measurement data, into real-valued data capable of expressing three or more different states.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the invention pertains.

To achieve the above technical problem, a data conversion method according to an embodiment of the present invention may include: a binary data set input step of receiving a plurality of binary data sets, each including at least one item of measurement data and binary data expressing one of a first state and a second state as the state of the measurement data; a real-valued data permutation generation step of generating, based on the binary data of each of the plurality of binary data sets, a plurality of real-valued data permutations in which different real-valued data are assigned to each of the plurality of binary data sets; a regression equation derivation step of performing multiple regression analysis on each of the generated real-valued data permutations to derive the regression equations corresponding to the respective permutations; an optimal regression equation selection step of selecting one optimal regression equation based on at least one of the coefficient of determination of the derived regression equations, the cross-validated correlation coefficient of the coefficient of determination, and the Pearson correlation coefficient; and a step of providing, based on the selected optimal regression equation, the real-valued data corresponding to each of the plurality of binary data sets.

To achieve the above technical problem, a data conversion verification method according to another embodiment of the present invention may include: a step of classifying some of a plurality of binary data sets, each including at least one item of measurement data and binary data expressing one of a first state and a second state as the state of the measurement data, as a plurality of training binary data sets, and classifying the remainder as a plurality of verification binary data sets; a real-valued data permutation generation step of generating, based on the binary data of each of the training binary data sets, a plurality of real-valued data permutations in which different real-valued data are assigned to each of the training binary data sets; a regression equation derivation step of performing multiple regression analysis on each of the generated real-valued data permutations to derive the regression equations corresponding to the respective permutations; an optimal regression equation selection step of selecting one optimal regression equation based on at least one of the coefficient of determination of the derived regression equations, the cross-validated correlation coefficient of the coefficient of determination, and the Pearson correlation coefficient; a step of applying the selected optimal regression equation to the verification binary data sets to calculate the real-valued data corresponding to each of the verification binary data sets; and a step of verifying the optimal regression equation based on the real-valued data corresponding to the verification binary data sets.

To achieve the above technical problem, a data conversion apparatus according to another embodiment of the present invention may include: a binary data set loading unit that loads at least some of a plurality of pre-stored binary data sets, each including at least one item of measurement data and binary data expressing one of a first state and a second state as the state of the measurement data; a real-valued data permutation generation unit that generates, based on the binary data of each of the loaded binary data sets, a plurality of real-valued data permutations in which different real-valued data are assigned to each of the plurality of binary data sets; a regression equation derivation unit that performs multiple regression analysis on each of the generated real-valued data permutations to derive the regression equations corresponding to the respective permutations; a regression equation selection unit that selects one optimal regression equation based on at least one of the coefficient of determination of the derived regression equations, the cross-validated correlation coefficient of the coefficient of determination, and the Pearson correlation coefficient; and a real-valued data calculation unit that calculates, based on the selected optimal regression equation, the real-valued data corresponding to each of the plurality of binary data sets.

To solve the above technical problem, another embodiment of the present invention may provide a computer-readable recording medium on which is recorded a program for causing a computer to execute the data conversion method or the data conversion verification method described above.

According to the present invention, starting from binary data that can express only one of normal and abnormal for the measurement data, real-valued data can be obtained that expresses a wider variety of states, for example the process of moving from normal to abnormal or the degree of closeness to abnormal.

In particular, when the data conversion method according to the present invention is applied to the processing of biometric measurement data of patients whose state was previously classified only as disease or normal, the real-valued data can indicate that a patient is in the process of changing from normal to disease, or can express numerically how close the patient is to the disease, so that biometric measurement data can be used more effectively for predicting and diagnosing disease.

FIG. 1 is a diagram illustrating the configuration of a data conversion apparatus according to a preferred embodiment of the present invention.
FIG. 2 is a diagram illustrating the flow of a data conversion method according to a preferred embodiment of the present invention.
FIG. 3 is a diagram illustrating a process of generating a plurality of real-valued data permutations by a brute-force search in a data conversion method according to a preferred embodiment of the present invention.
FIG. 4 is a diagram illustrating a data conversion verification process for verifying the optimal regression equation selected in a data conversion method according to a preferred embodiment of the present invention.
FIG. 5 is a diagram illustrating the result of generating real-valued data representing a patient's state by applying a data conversion method according to a preferred embodiment of the present invention to the processing of a patient's biometric measurement data.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various apparatuses that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. All conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and the invention is not limited to the specifically listed embodiments and conditions. All detailed descriptions that recite the principles, aspects, and embodiments of the invention, as well as specific embodiments, are intended to encompass their structural and functional equivalents. Such equivalents include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Accordingly, the functions of the various elements shown in the drawings, including functional blocks labeled as processors or similar concepts, may be provided through dedicated hardware as well as through hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared. The use of terms such as processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile memory. Other well-known, conventional hardware may also be included.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In describing the present invention, detailed descriptions of related known art are omitted or abbreviated where they would unnecessarily obscure the gist of the invention.

When a part is said to "include" a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

Hereinafter, the present invention is described in detail according to preferred embodiments with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the configuration of a data conversion apparatus 100 according to a preferred embodiment of the present invention.

Referring to FIG. 1, the data conversion apparatus 100 according to the present invention may include a storage unit 110, a binary data loading unit 120, a real-valued data permutation generation unit 130, a regression equation derivation unit 140, an optimal regression equation selection unit 150, an optimal regression equation verification unit 160, and a display unit 170. Components other than those listed above may, of course, also be included in the data conversion apparatus 100 according to this embodiment.

At least some of the binary data loading unit 120, the real-valued data permutation generation unit 130, the regression equation derivation unit 140, the optimal regression equation selection unit 150, and the optimal regression equation verification unit 160 according to this embodiment may be program modules that communicate with an external terminal device, an external server, or the like. These program modules may be included in the data conversion apparatus 100 as an operating system, application program modules, and other program modules, and may be physically stored in various kinds of known storage devices. They may also be stored in a remote storage device capable of communicating with the data conversion apparatus 100. Such program modules encompass routines, subroutines, programs, objects, components, data structures, and the like that perform the particular tasks described below or implement particular abstract data types according to the present invention, but the invention itself is not limited to these.

The storage unit 110 is a component that stores a plurality of binary data sets, and may be a storage medium inside the data conversion apparatus 100 according to this embodiment or a DBMS on that storage medium.

A binary data set according to this embodiment may include at least one item of measurement data and state data indicating the state of that measurement data. Such state data generally consists of binary data that can signify one of two states, such as normal and abnormal.

A representative example of a binary data set is a patient diagnostic data set used in a medical information database or the like.

A patient diagnostic data set includes biometric measurement data, such as the patient's hormone measurements, and also expresses the patient's condition or diagnosis result indicated by that biometric measurement data as one of two states, disease or normal. The data type of the state data expressing this condition or diagnosis result is usually binary data taking the value 0 or 1.

The binary data set loading unit 120 loads, for data conversion, the data sets corresponding to at least some of the plurality of binary data sets stored in the storage unit 110.

The real-valued data permutation generation unit 130 assigns different real-valued data to each of the binary data sets loaded by the binary data set loading unit 120, based on the binary data of each set, and generates a plurality of real-valued data permutations, each of length equal to the number of loaded binary data sets.

Here, the real-valued data assigned to the binary data sets are arbitrary, mutually different real values, and for the brute-force search a real-valued data permutation is generated for every possible arrangement of the real values within a predefined range.

When assigning the real-valued data, it is preferable to separate the binary data sets whose binary data value is “1” from those whose binary data value is “0” and to assign them real-valued data from non-overlapping intervals.

For example, binary data sets whose binary data value is “1” may be assigned mutually different real-valued data of 1 or more, and binary data sets whose binary data value is “0” may be assigned mutually different real-valued data of 0 or more and less than 1.

Accordingly, if a total of 14 binary data sets are loaded by the binary data set loading unit 120, and these binary data sets consist of 8 training sets and 6 test sets, the number of converted data sets covering every possible ordering is 8! × 6! = 29,030,400.

Assuming that, of these 14 binary data sets, 8 have a binary data value of “1” and 6 have a binary data value of “0”, the real-valued data permutation generation unit 130 generates 8! × 6! = 29,030,400 data permutations when it generates a permutation for every possible arrangement.
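The count above, and the per-group brute-force enumeration it describes, can be sketched as follows. This is a minimal illustration only: the function names and the candidate value lists are hypothetical, and the sketch assumes that one predefined list of distinct real values is supplied for each group (values of 1 or more for label 1, values in [0, 1) for label 0).

```python
from itertools import permutations, product
from math import factorial


def count_permutations(n_first: int, n_second: int) -> int:
    """Number of arrangements when each group is permuted independently."""
    return factorial(n_first) * factorial(n_second)


def real_value_permutations(labels, values_for_1, values_for_0):
    """Yield every assignment of the candidate real values to the data sets.

    labels       -- binary label (0 or 1) of each data set, in order
    values_for_1 -- distinct candidate values (assumed >= 1) for label-1 sets
    values_for_0 -- distinct candidate values (assumed in [0, 1)) for label-0 sets
    """
    idx1 = [i for i, b in enumerate(labels) if b == 1]
    idx0 = [i for i, b in enumerate(labels) if b == 0]
    if len(values_for_1) != len(idx1) or len(values_for_0) != len(idx0):
        raise ValueError("need exactly one candidate value per data set")
    # Permute the two groups independently so their intervals never mix.
    for p1, p0 in product(permutations(values_for_1), permutations(values_for_0)):
        assignment = [None] * len(labels)
        for i, v in zip(idx1, p1):
            assignment[i] = v
        for i, v in zip(idx0, p0):
            assignment[i] = v
        yield assignment


print(count_permutations(8, 6))
for a in real_value_permutations([1, 0, 1, 0], [1.0, 1.5], [0.2, 0.6]):
    print(a)
```

Because the two groups are permuted independently, the total is the product of the two factorials rather than 14!, which is what keeps an exhaustive search feasible for a data set of this size.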

The regression equation derivation unit 140 may perform multiple regression analysis on each of the real-valued data permutations generated by the real-valued data permutation generation unit 130 to derive the regression equation corresponding to each permutation.

The multiple regression analysis according to this embodiment can be performed using a multiple regression analysis model previously published by the inventors of the present application (reference: Moon, T., Chi, M.H., Kim, D.H., Yoon, C.N., Choi, Y.S., Quantitative structure-activity relationships (QSAR) study of flavonoid derivatives for inhibition of cytochrome P450 1A2. Quant Struct-Act Rel 19(3), 257-263, 2000), and a cross-validation method can be used in this process to obtain the optimal descriptors. Multicollinearity among the variables of this multiple regression analysis model can be assessed using the variance inflation factor (reference: Myers, R.H., Classical and modern regression with applications. PWS/KENT: Boston, 1990). These are, however, merely one embodiment given for convenience of explanation; the invention itself is not limited to them, and the multiple regression analysis according to this embodiment can be performed by applying any other conventionally known multiple regression analysis model.
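As one example of a conventionally known model, an ordinary least-squares fit of the assigned real values against the measurement data can be sketched as below. This is not the QSAR model of the cited reference; it is a generic sketch that solves the normal equations with Gaussian elimination, and the names `fit_multiple_regression` and `predict` are hypothetical.

```python
def fit_multiple_regression(X, y):
    """Ordinary least squares: returns [intercept, b1, ..., bk].

    X -- list of rows, each row holding the k measurement values of one data set
    y -- the real value assigned to each data set by the current permutation
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Normal equations (A^T A) beta = A^T y, stored as an augmented matrix.
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear variables)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        s = m[i][k] - sum(m[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = s / m[i][i]
    return beta


def predict(beta, x):
    """Evaluate the regression equation for one row of measurement values."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]  # generated from y = 1 + 2*x1 + 3*x2
print(fit_multiple_regression(X, y))
```

In a brute-force run this fit would be repeated once per permutation, with `X` fixed and only `y` changing.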

The regression equation selection unit 150 selects one optimal regression equation from among the regression equations derived by the regression equation derivation unit 140, based on at least one of the coefficient of determination (r²) of those regression equations, the cross-validated correlation coefficient (q²) of the coefficient of determination, and the Pearson correlation coefficient (PCC).

The coefficient of determination (r², or r square) of a regression equation obtained by multiple regression analysis is a coefficient for evaluating the predictive ability of that regression equation, and can be defined as in Equation 1 below.

[Equation 1]

$$r^2 = 1.0 - \frac{\sum (y_{pred} - y_{actual})^2}{\sum (y_{pred} - y_{mean})^2}$$

Here, y_pred is the y value predicted by the regression equation, y_actual is the measured y value corresponding to y_pred, and y_mean is the mean of those measured values.
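Equation 1 can be evaluated directly from the predicted and measured values. The sketch below follows the equation as written in this document; the function names are hypothetical. For comparison only, it also includes the form more commonly found in statistics texts, whose denominator sums (y_actual - y_mean)² instead; that second function is an addition for illustration and is not taken from this document.

```python
def r_squared_eq1(y_pred, y_actual):
    """Coefficient of determination exactly as written in Equation 1."""
    y_mean = sum(y_actual) / len(y_actual)  # mean of the measured values
    numerator = sum((p - a) ** 2 for p, a in zip(y_pred, y_actual))
    denominator = sum((p - y_mean) ** 2 for p in y_pred)
    return 1.0 - numerator / denominator


def r_squared_conventional(y_pred, y_actual):
    """Textbook form (not from this document): 1 - SS_res / SS_tot."""
    y_mean = sum(y_actual) / len(y_actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(y_pred, y_actual))
    ss_tot = sum((a - y_mean) ** 2 for a in y_actual)
    return 1.0 - ss_res / ss_tot


print(r_squared_eq1([1, 2, 3], [1, 2, 4]))
print(r_squared_conventional([1, 2, 3], [1, 2, 4]))
```

Both forms return 1.0 for a perfect fit; they differ once the predictions deviate from the measured values.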

In addition to the coefficient of determination (r²) of the regression equation, the cross-validated correlation coefficient of the coefficient of determination ("cross-validated correlation coefficient r²", also written q² or q square), obtained by the cross-validation method described above, can be used as a criterion for determining the optimal regression equation.
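The text does not specify which cross-validation scheme is used, so the following is only a sketch under two assumptions: leave-one-out cross-validation, and the commonly used definition q² = 1 - PRESS / Σ(y - y_mean)², where PRESS is the sum of squared leave-one-out prediction errors. To keep the example short it uses a single-variable line fit; `fit_line`, `predict_line`, and `q_squared_loo` are hypothetical names, and any other fit and predict pair could be passed in.

```python
def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x


def q_squared_loo(xs, ys, fit=fit_line, predict=predict_line):
    """Leave-one-out q^2 (assumed definition: 1 - PRESS / SS_tot)."""
    y_mean = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit without sample i, then predict the held-out sample.
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - predict(model, xs[i])) ** 2
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - press / ss_tot


print(q_squared_loo([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.2, 1.9, 3.1, 4.0]))
```

Because every prediction is made by a model that never saw the held-out sample, q² is normally lower than r² for the same data, which is why it serves as a check on predictive power rather than goodness of fit.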

Furthermore, to check the correlation between the variables of the optimal regression equation, the Pearson correlation coefficient (PCC) (reference: Pearson, K., Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Philos. Trans. Royal Soc. London Ser. A 187, 253-318, 1896) can be used as a criterion for determining the optimal regression equation.
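A minimal sketch of the Pearson correlation coefficient, together with a helper that reports the largest absolute pairwise correlation among a set of variables, is shown below. The helper `max_pairwise_pcc` is a hypothetical convenience for checking variable independence and is not described in this document.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def max_pairwise_pcc(columns):
    """Largest |PCC| between any two variables (each given as a column)."""
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


print(pearson([1, 2, 3], [2, 4, 6]))
print(max_pairwise_pcc([[1, 2, 3, 4], [2, 4, 6, 8], [1, 0, 1, 0]]))
```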

The optimal regression equation verification unit 160 loads, through the binary data loading unit 120, some binary data sets stored in the storage unit 110 that differ from those previously loaded and used to derive the optimal regression equation, and uses them to verify whether the optimal regression equation selected by the regression equation selection unit 150 is suitable for deriving real-valued data expressing the state of at least one item of measurement data.

For example, the optimal regression equation verification unit 160 may apply these other binary data sets to the optimal regression equation selected by the regression equation selection unit 150 to derive the real-valued data corresponding to each of them, and may verify the optimal regression equation based on the derived results.
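The text does not state how the derived results are judged, so the following is only one possible check, given as an assumption: a verification set counts as consistent when its computed real value falls on the side of the boundary 1.0 that matches its binary label (1 or more for “1”, below 1 for “0”, following the example intervals given earlier). The names `apply_regression` and `agreement_rate`, and the coefficient layout `[intercept, b1, ..., bk]`, are hypothetical.

```python
def apply_regression(beta, measurements):
    """Real value for one data set; beta = [intercept, b1, ..., bk]."""
    return beta[0] + sum(b * m for b, m in zip(beta[1:], measurements))


def agreement_rate(beta, verification_sets, boundary=1.0):
    """Fraction of verification sets whose real value matches their label.

    verification_sets -- list of (measurements, binary_label) pairs that were
                         NOT used when the regression equation was derived
    """
    hits = 0
    for measurements, label in verification_sets:
        value = apply_regression(beta, measurements)
        hits += int((value >= boundary) == (label == 1))
    return hits / len(verification_sets)


held_out = [([1.4], 1), ([0.3], 0), ([0.8], 1)]
print(agreement_rate([0.0, 1.0], held_out))
```

A set such as the third one, labeled “1” but scored below the boundary, is exactly the kind of borderline case the real-valued output is meant to expose, so a low agreement rate is a prompt for inspection rather than an automatic rejection.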

The display unit 170 can display the real-valued data corresponding to each binary data set, obtained by applying the binary data sets loaded through the binary data loading unit 120 to the optimal regression equation selected by the regression equation selection unit 150, or the various states of the measurement data of the corresponding binary data set that each item of real-valued data signifies.

In addition, the display unit 170 can display the optimal regression equation selected by the regression equation selection unit 150, the result of verifying the optimal regression equation by the optimal regression equation verification unit 160, and the like.

The display unit 170 can provide visual or auditory information to the user of the data conversion apparatus according to this embodiment. To provide visual information, the display unit 170 may include a display panel based on elements such as an LCD (Liquid Crystal Display), a TFT (Thin Film Transistor), or an organic EL (Organic Electroluminescence) device.

FIG. 2 is a diagram illustrating the flow of a data conversion method according to a preferred embodiment of the present invention. The data conversion method according to this embodiment can be performed by the data conversion apparatus 100 shown in FIG. 1, so for matters that are the same as for the data conversion apparatus 100 of FIG. 1, refer to that description.

First, a plurality of binary data sets, whose binary data is to be converted into real-valued data, is received (S201).

Based on the binary data of each of the binary data sets received in step S201, a plurality of real-valued data permutations is generated in which different real-valued data are assigned to each of the binary data sets (S202). Here, it is preferable to assign arbitrary, mutually different real values to the binary data sets and, for the brute-force search, to generate a real-valued data permutation for every possible arrangement of the real values within a predefined range.

Multiple regression analysis is performed on each of the real-valued data permutations generated in step S202 to derive the regression equation corresponding to each permutation (S203).

One optimal regression equation is selected based on at least one of the coefficient of determination (r²) of the regression equations derived in step S203, the cross-validated correlation coefficient (q²) of the coefficient of determination, and the Pearson correlation coefficient (PCC) (S204).

For example, in selecting the optimal regression equation in step S204, a regression equation with an r² value of 0.6 or more, a q² value of 0.6 or more, or a Pearson correlation coefficient of 0.6 or less may be selected from among the regression equations derived in step S203. The r² and q² values indicate the goodness of fit and the predictive power of the regression equation, and each is preferably 0.6 or more. The PCC value indicates the correlation between variables; for example, a PCC value of 0.6 or more means that the variables of the regression equation are not independent, so that regression equation can be judged unsuitable for selection as the optimal regression equation.
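The thresholds above can be turned into a simple filter. The sketch below makes two assumptions that the text leaves open: all three criteria are required together, and among the regression equations that pass, the one with the highest q² (then r²) is chosen. The `Candidate` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str       # identifier of the permutation that produced this equation
    r2: float       # coefficient of determination
    q2: float       # cross-validated correlation coefficient
    max_pcc: float  # largest |PCC| between any two variables of the equation


def select_optimal(candidates: List[Candidate], threshold: float = 0.6) -> Optional[Candidate]:
    """Return the best candidate meeting all three thresholds, or None."""
    eligible = [
        c for c in candidates
        if c.r2 >= threshold and c.q2 >= threshold and c.max_pcc <= threshold
    ]
    if not eligible:
        return None
    # Assumed tie-breaking rule: prefer predictive power, then goodness of fit.
    return max(eligible, key=lambda c: (c.q2, c.r2))


pool = [
    Candidate("perm-1", r2=0.82, q2=0.55, max_pcc=0.30),  # q2 too low
    Candidate("perm-2", r2=0.75, q2=0.70, max_pcc=0.85),  # variables correlated
    Candidate("perm-3", r2=0.71, q2=0.66, max_pcc=0.40),
    Candidate("perm-4", r2=0.90, q2=0.81, max_pcc=0.20),
]
best = select_optimal(pool)
print(best.name if best else "no suitable regression equation")
```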

Real-valued data corresponding to each of the binary data sets are provided based on the optimal regression equation selected in step S204 (S205).

Beyond the fact that the state of the measurement data in the corresponding binary data set is either normal or abnormal, the real-valued data provided in step S205 can express richer information, such as the progression from normal to abnormal or how close the state is to abnormal.

FIG. 3 illustrates a process of generating a plurality of real-valued data permutations by brute-force search in the data conversion method according to a preferred embodiment of the present invention. The real-valued data permutation generation process of this embodiment may be performed by the real-valued data permutation generation unit 130 shown in FIG. 1 and/or in the real-valued data permutation generation step (S202) shown in FIG. 2. Accordingly, for matters identical to those of the data conversion device 100 shown in FIG. 1 and/or the data conversion method shown in FIG. 2, refer to the corresponding descriptions.

Based on the binary data of each of the plurality of binary data sets to be converted into real-valued data, the binary data sets are classified into a first group corresponding to a first state and a second group corresponding to a second state (S310). Here, the first state and the second state classify the state of the measurement data contained in a binary data set; in this embodiment they may mean "normal" and "abnormal", respectively, and the binary data values representing these states may be "1" and "0".

Different real-valued data having values of 1 or greater (e.g., 1.1, 1.2, 1.3) may be assigned to the binary data sets belonging to the first group, which corresponds to the first state ("normal") (S320).

Likewise, different real-valued data having values of 0 or greater and less than 1 (e.g., 0.1, 0.2, 0.3) may be assigned to the binary data sets belonging to the second group, which corresponds to the second state ("abnormal") (S330).

It is checked whether different real-valued data have been assigned to all of the binary data sets to be converted (S340), and one real-valued data permutation is generated that holds the different real-valued data corresponding to each of the binary data sets (S350).

Then, for the brute-force search, it is checked whether a real-valued data permutation with a different arrangement is possible (S360). If another arrangement is possible, the permutation generation process of steps S320 to S340 is repeated, so that real-valued data permutations are generated for all possible arrangements.

For example, consider seven binary data sets, one for each of seven patients, each consisting of measured hormone concentration data and binary data classifying the patient's state as normal or abnormal. According to the binary data, these can be divided into group A, the normal group containing three patients, and group B, the abnormal (disease) group containing four patients.

Assuming that the different real values 1.1, 1.2, and 1.3 are assigned to the binary data sets of the three patients in normal group A, and the different real values 0.1, 0.2, 0.3, and 0.4 are assigned to the binary data sets of the four patients in abnormal group B, converting the binary data of these seven binary data sets into real-valued data can produce 4! * 3! = 144 real-valued data sets.
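The brute-force enumeration of steps S310 to S360 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `real_value_permutations` and its parameters are hypothetical, and the label convention (1 for the first group, 0 for the second) follows the embodiment described above.

```python
from itertools import permutations, product


def real_value_permutations(labels, first_values, second_values):
    """Yield every assignment of distinct real values to the data sets.

    labels        -- binary label per data set (1 = first group, 0 = second group)
    first_values  -- candidate real values for the first group (hypothetical input)
    second_values -- candidate real values for the second group (hypothetical input)
    """
    # S310: split the data sets into the two groups by their binary label.
    first_idx = [i for i, b in enumerate(labels) if b == 1]
    second_idx = [i for i, b in enumerate(labels) if b == 0]

    # S320-S360: every ordering of the first-group values is combined with
    # every ordering of the second-group values.
    for p1, p2 in product(
        permutations(first_values, len(first_idx)),
        permutations(second_values, len(second_idx)),
    ):
        assignment = [None] * len(labels)
        for i, v in zip(first_idx, p1):
            assignment[i] = v
        for i, v in zip(second_idx, p2):
            assignment[i] = v
        yield tuple(assignment)


# Seven patients: three in group A (label 1), four in group B (label 0).
labels = [1, 1, 1, 0, 0, 0, 0]
perms = list(real_value_permutations(labels, [1.1, 1.2, 1.3], [0.1, 0.2, 0.3, 0.4]))
print(len(perms))  # 3! * 4! arrangements
```

Because the function is a generator, larger cases can be consumed one permutation at a time instead of being held in memory all at once.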

FIG. 4 illustrates a data conversion verification process for verifying the optimal regression equation selected in the data conversion method according to a preferred embodiment of the present invention. This verification process may be performed by the data conversion device 100 shown in FIG. 1. Accordingly, for matters identical to those of the data conversion device 100 shown in FIG. 1, refer to the corresponding description.

Some of the plurality of pre-stored binary data sets are classified as learning binary data sets (S410), and the remainder are classified as verification binary data sets (S420).

Candidate regression equations for the optimal regression equation are derived based on the learning binary data sets classified in step S410. For the process of deriving the candidate regression equations, refer to the process of deriving the regression equations corresponding to the real-valued data permutations described above with reference to FIGS. 1 and 2.

For example, when 144 real-valued data sets are generated as in the example described with reference to FIG. 3, a multiple regression analysis may be performed on each of them to derive 144 candidate regression equations.
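The regression step for a single real-valued data set can be sketched as follows. This is a minimal pure-Python illustration rather than the method of the patent: the helper names `solve`, `fit`, `predict`, `r_squared`, and `q_squared` are hypothetical, an intercept term is assumed, and q² is assumed to be the leave-one-out statistic 1 - PRESS/SS_tot, which the text does not specify.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(X, y):
    """Ordinary least squares via the normal equations; returns [b0, b1, ..., bk]."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 is the assumed intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def r_squared(X, y):
    """Coefficient of determination of the fit on the full data."""
    coef = fit(X, y)
    mean = sum(y) / len(y)
    ss_res = sum((t - predict(coef, r)) ** 2 for r, t in zip(X, y))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def q_squared(X, y):
    """Leave-one-out cross-validated r^2, assumed here to be 1 - PRESS / SS_tot."""
    mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - press / ss_tot


# X holds the measurement data (one row per data set); y holds the real values
# assigned by one permutation. The numbers below are made up for illustration.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
y = [1, 3, 0, 2, 4, 0]
print(fit(X, y), r_squared(X, y), q_squared(X, y))
```

In the scheme described above, `fit`, `r_squared`, and `q_squared` would be evaluated once per permutation, with the same `X` and a different `y` each time.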

For example, one of the candidate regression equations may be Equation 2 below.

[Equation 2]

Index = 2.7V₁ - 3.2V₂ + 2.5V₃ (r² = 0.92, q² = 0.67, MaxPCC = 0.5)

Here, V₁, V₂, and V₃ are the variables of the regression equation (the measurement data contained in the binary data sets), and Index is the real-valued data derived from the regression equation.

As criteria for selecting the optimal regression equation from among the candidate regression equations, at least one of a first threshold condition (r² >= 0.6), a second threshold condition (q² >= 0.6), and a third threshold condition (Max(PCC) <= 0.6) may be applied.

Because the Pearson correlation coefficient is a correlation between the variables that make up a regression equation, Equation 2 above has three Pearson correlation coefficients, PCC_v1v2, PCC_v2v3, and PCC_v1v3, and the third threshold condition is applied to the largest of them.
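The Max(PCC) quantity can be sketched as the largest pairwise Pearson correlation among the variable columns. The names `pearson` and `max_pcc` are hypothetical, and taking the absolute value of each coefficient is an assumption made here so that strong negative correlations also count; the text only says that the maximum is used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def max_pcc(columns):
    """Largest |PCC| over all variable pairs (absolute value is an assumption)."""
    return max(
        abs(pearson(columns[i], columns[j]))
        for i, j in combinations(range(len(columns)), 2)
    )


# Three variables give three pairs: v1v2, v1v3, v2v3 (made-up values).
v1 = [1.0, 2.0, 3.0, 4.0]
v2 = [2.0, 4.0, 6.0, 8.0]
v3 = [1.0, -1.0, -1.0, 1.0]
print(max_pcc([v1, v2, v3]))
```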

In this embodiment, the optimal regression equation is selected according to whether there exist regression equations that satisfy at least one of the first threshold condition (r² >= 0.6), the second threshold condition (q² >= 0.6), and the third threshold condition (Max(PCC) <= 0.6).

For example, when two or more regression equations satisfy at least one of the first threshold condition (r² >= 0.6), the second threshold condition (q² >= 0.6), and the third threshold condition (Max(PCC) <= 0.6) (S450), let Hr² denote the highest r² value. Then, among the regression equations satisfying a fourth threshold condition, namely an r² value between Max(0.6, 0.9Hr²) and Hr², the one with the highest q² value may be selected as the optimal regression equation (S460).

If only one regression equation satisfies at least one of the first threshold condition (r² >= 0.6), the second threshold condition (q² >= 0.6), and the third threshold condition (Max(PCC) <= 0.6) (S470), that regression equation may be selected as the optimal regression equation (S480).

If no regression equation satisfies at least one of the first threshold condition (r² >= 0.6), the second threshold condition (q² >= 0.6), and the third threshold condition (Max(PCC) <= 0.6), the process returns to step S410, and a different subset of the pre-stored binary data sets is classified anew as the learning binary data sets (S410).
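The selection logic of steps S450 to S480 can be sketched as follows. The `Candidate` type and the function names are hypothetical. One case is not covered by the text: when several candidates pass only through the q² or PCC condition, the r² band of the fourth condition can be empty. The sketch then falls back to all passing candidates, which is an assumption.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str       # hypothetical identifier of a candidate regression equation
    r2: float       # coefficient of determination
    q2: float       # cross-validated correlation coefficient
    max_pcc: float  # largest pairwise Pearson correlation among its variables


def passes(c: Candidate) -> bool:
    """At least one of the first, second, and third threshold conditions holds."""
    return c.r2 >= 0.6 or c.q2 >= 0.6 or c.max_pcc <= 0.6


def select_optimal(cands: Sequence[Candidate]) -> Optional[Candidate]:
    kept = [c for c in cands if passes(c)]
    if not kept:
        return None      # caller goes back to S410 and re-splits the data
    if len(kept) == 1:
        return kept[0]   # S470 -> S480
    # S450 -> S460: fourth condition, Max(0.6, 0.9 * Hr2) <= r2 <= Hr2.
    hr2 = max(c.r2 for c in kept)
    lower = max(0.6, 0.9 * hr2)
    band = [c for c in kept if lower <= c.r2 <= hr2] or kept  # fallback is an assumption
    return max(band, key=lambda c: c.q2)


cands = [
    Candidate("a", 0.92, 0.67, 0.5),
    Candidate("b", 0.95, 0.61, 0.4),
    Candidate("c", 0.70, 0.90, 0.3),
    Candidate("d", 0.10, 0.10, 0.9),
]
best = select_optimal(cands)
print(best.name)
```

With these made-up numbers, Hr² is 0.95, so the band is 0.855 to 0.95. Candidate "c" has the highest q² overall but falls outside the band, so the choice is made between "a" and "b".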

The optimal regression equation selected in step S460 or S480 may be applied to the verification binary data sets classified in step S420 to calculate the real-valued data corresponding to each verification binary data set (S490).

The optimal regression equation is verified based on the real-valued data calculated in step S490 for each of the verification binary data sets (S500).
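Steps S490 and S500 can be sketched as follows. The text does not state how the verification is scored, so the criterion used here is an assumption: a verification data set counts as consistent when its computed real value falls in the interval associated with its binary label (1 or greater for label 1, from 0 up to but not including 1 for label 0). The function names and sample measurements are hypothetical; the coefficients are those of Equation 2.

```python
def index_eq2(v1, v2, v3):
    """Equation 2 from the example above, used here as the selected equation."""
    return 2.7 * v1 - 3.2 * v2 + 2.5 * v3


def consistent(label, value):
    """Assumed criterion: the value lies in the interval tied to the binary label."""
    return value >= 1.0 if label == 1 else 0.0 <= value < 1.0


def verify(model, verification_sets):
    """Return the fraction of verification sets whose value matches its label."""
    hits = sum(consistent(label, model(*measurements))
               for measurements, label in verification_sets)
    return hits / len(verification_sets)


# Each entry: (measurement data, binary label). Values are made up.
verification_sets = [
    ((1.0, 0.5, 0.2), 1),  # index 1.60 -> consistent with label 1
    ((0.2, 0.1, 0.1), 0),  # index 0.47 -> consistent with label 0
    ((0.1, 0.1, 0.1), 1),  # index 0.20 -> not consistent with label 1
]
score = verify(index_eq2, verification_sets)
print(score)
```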

FIG. 5 illustrates the result of generating real-valued data representing a patient's state by applying the data conversion method according to a preferred embodiment of the present invention to the processing of patients' biometric data.

To apply the data conversion method according to the present invention to the analysis of biomedical data on thyroid disease, urine from the subjects (patients), who included thyroid cancer patients and normal individuals, was analyzed using a GC-MS-SIM (Gas Chromatography-Mass Spectrometry-Selected Ion-Monitoring) system (Hewlett-Packard GC-MS: 5890A GC model, 5970B mass-selective detector, HP 59970C MS chemstation) to obtain profile data for 18 metabolite hormones on the androgen and estrogen metabolic pathways. To find risk factors for thyroid disease among these hormones, a t-test analysis was performed on the hormone profile data. As a result, among the 18 metabolite hormones, 2-hydroxyestrone (2-OH-E1), 2-hydroxyestradiol (2-OH-E2), 2-methoxyestrone (2-MeO-E1), 2-methoxyestradiol (2-MeO-E2), and 2-methoxyestradiol-3-methyl ether (2-MeO-E2-3-methyl ether) were found to be significant risk factors for diagnosing thyroid cancer and judging its degree of progression.

In this embodiment, the state data of 8 normal individuals were set to 0 and the state data of 6 thyroid cancer patients were set to 1, and real-valued data conversion was performed on a total of 14 binary data sets.

To convert the binary data of each binary data set into real-valued data in this embodiment, real-valued data from 0.2 to 0.7 were assigned to the normal individuals and real-valued data from 1.2 to 1.9 were assigned to the thyroid cancer patients, and all possible arbitrary real-valued data permutations were generated. As a result, 8! * 6! = 29,030,400 real-valued data permutations were obtained.

Regression equations corresponding to all of these real-valued data permutations are obtained, and the one with the best result among them, for example the one with the largest coefficient of determination (r²), may be obtained as the regression equation of Equation 3 below.

Hereinafter, the real-valued data obtained through the regression equation of Equation 3 is called the thyroid disease index (TDI), in the sense that it indicates the degree of progression toward thyroid disease.

[Equation 3]

Thyroid disease index (TDI) = 0.0153 * (2-MeO-E2-3-methyl ether) + 0.0874 * (2-MeO-E2 / 2-OH-E2) - 2.2571 * (2-MeO-E2-3-methyl ether / 2-OH-E1) + 0.5649

Referring to FIG. 5, the biometric data 501 of each subject included in the binary data sets and the real-valued data 502 calculated by applying these biometric data to Equation 3 above are illustrated.

Comparison with the actual subject (patient) information showed that thyroid disease index values between 0 and 0.8 belonged to normal individuals, while values of 1.0 or more corresponded to thyroid cancer patients. The four subjects located in the boundary region below 1.0 were found to have, respectively, a thyroid mass (0.8055), goiter (0.8806), a thyroid mass (preoperative thyroid cancer, 0.8951), and a thyroid mass (preoperative thyroid cancer, 0.9112). These results confirm that the value of the thyroid disease index equation, the optimal regression equation obtained in this embodiment, increases with the progression of thyroid cancer, and therefore that thyroid disease can be diagnosed and its degree of progression classified according to the regression equation of Equation 3.
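Equation 3 and the ranges reported above can be written as a short Python sketch. The function and parameter names are hypothetical, the inputs are the measured hormone values in whatever units the profile data use (the text does not state them), and the band labels only restate the ranges reported for this embodiment.

```python
def tdi(me_e2_3me, me_e2, oh_e2, oh_e1):
    """Thyroid disease index per Equation 3.

    me_e2_3me -- 2-MeO-E2-3-methyl ether
    me_e2     -- 2-MeO-E2
    oh_e2     -- 2-OH-E2
    oh_e1     -- 2-OH-E1
    """
    return (0.0153 * me_e2_3me
            + 0.0874 * (me_e2 / oh_e2)
            - 2.2571 * (me_e2_3me / oh_e1)
            + 0.5649)


def tdi_band(value):
    """Map an index value to the ranges reported for this embodiment."""
    if value >= 1.0:
        return "thyroid cancer range"
    if 0.0 <= value <= 0.8:
        return "normal range"
    if 0.8 < value < 1.0:
        return "boundary region"
    return "outside reported ranges"


# The four boundary-region subjects reported above.
for v in (0.8055, 0.8806, 0.8951, 0.9112):
    print(v, tdi_band(v))
```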

Accordingly, the presence of thyroid cancer and its degree of progression can be judged from the range in which the thyroid disease index, obtained from the thyroid disease index equation of this embodiment, falls. This confirms that the data conversion method of this embodiment can be especially useful for biomedical data used in analyzing disease associations.

In addition, the data conversion method for converting the binary data of binary data sets into real-valued data according to the present invention, and the data conversion verification method for verifying such a conversion, can also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers skilled in the art to which the present invention pertains.

The present invention has been described with reference to an embodiment shown in the accompanying drawings, but this embodiment is merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible from it. Accordingly, the true scope of protection of the present invention should be determined only by the appended claims.

100: Data conversion device
110: Storage unit
120: Loading unit
130: Real-valued data permutation generation unit
140: Regression equation derivation unit
150: Regression equation selection unit
160: Optimal regression equation verification unit
170: Display unit

Claims

1. A data conversion method comprising:
a binary data set input step of receiving a plurality of binary data sets, each including at least one item of measurement data and binary data representing the state of the measurement data as one of a first state and a second state;
a real-valued data permutation generation step of generating, based on the binary data of each of the plurality of binary data sets, a plurality of real-valued data permutations in each of which different real-valued data are assigned to the respective binary data sets;
a step of performing a multiple regression analysis on each of the generated real-valued data permutations to derive a regression equation corresponding to each of the real-valued data permutations;
an optimal regression equation selection step of selecting one optimal regression equation based on at least one of a coefficient of determination of the derived regression equations, a cross-validated correlation coefficient of the coefficient of determination, and a Pearson correlation coefficient; and
a step of providing real-valued data corresponding to each of the plurality of binary data sets based on the selected optimal regression equation.

2. The method of claim 1,
wherein the real-valued data permutation generation step further includes classifying the plurality of binary data sets into a first group corresponding to the first state and a second group corresponding to the second state based on the binary data of each of the plurality of binary data sets, and
wherein, in each of the real-valued data permutations, real-valued data from mutually non-overlapping intervals are assigned to the binary data sets belonging to the first group and to the binary data sets belonging to the second group.

3. The method of claim 2,
wherein, in each of the real-valued data permutations, different real-valued data having values of 1 or greater are assigned to the binary data sets belonging to the first group, and different real-valued data having values of 0 or greater and less than 1 are assigned to the binary data sets belonging to the second group.

4. The method of claim 1,
wherein the binary data set input step further includes classifying some of the plurality of binary data sets as learning binary data sets and classifying the remainder as verification binary data sets,
wherein the real-valued data permutation generation step generates the real-valued data permutations based on the learning binary data sets among the plurality of binary data sets, and
wherein the optimal regression equation selected in the optimal regression equation selection step is an optimal regression equation for the learning binary data sets.

5. The method of claim 4, further comprising:
calculating real-valued data corresponding to each of the verification binary data sets by applying the optimal regression equation for the learning binary data sets to the verification binary data sets; and
verifying the optimal regression equation based on the real-valued data corresponding to each of the verification binary data sets.

6. The method of claim 1,
wherein the optimal regression equation selection step includes extracting, from among the derived regression equations, a regression equation that satisfies at least one of a preset first threshold condition on the coefficient of determination, a preset second threshold condition on the cross-validated correlation coefficient of the coefficient of determination, and a preset third threshold condition on the Pearson correlation coefficient.

7. The method of claim 6,
wherein, when a plurality of regression equations are extracted, the optimal regression equation selection step selects, as the optimal regression equation, the regression equation having the highest cross-validated correlation coefficient among those regression equations whose coefficient of determination satisfies a fourth threshold condition set based on the maximum of the coefficients of determination of the extracted regression equations.

8. The method of claim 1,
wherein the measurement data is biometric data for diagnosing a specific disease,
the binary data is disease occurrence data indicating whether the specific disease has occurred, and
the real-valued data is disease index data indicating the degree of progression of the specific disease.

9. A data conversion verification method comprising:
a binary data extraction step of classifying some of a plurality of binary data sets, each including at least one item of measurement data and binary data representing the state of the measurement data as one of a first state and a second state, as learning binary data sets, and classifying the remainder as verification binary data sets;
a real-valued data permutation generation step of generating, based on the binary data of each of the learning binary data sets, a plurality of real-valued data permutations in each of which different real-valued data are assigned to the respective learning binary data sets;
a step of performing a multiple regression analysis on each of the generated real-valued data permutations to derive a regression equation corresponding to each of the real-valued data permutations;
an optimal regression equation selection step of selecting one optimal regression equation based on at least one of a coefficient of determination of the derived regression equations, a cross-validated correlation coefficient of the coefficient of determination, and a Pearson correlation coefficient;
a step of calculating real-valued data corresponding to each of the verification binary data sets by applying the selected optimal regression equation to the verification binary data sets; and
a step of verifying the optimal regression equation based on the real-valued data corresponding to each of the verification binary data sets.

10. The method of claim 9,
wherein the real-valued data permutation generation step further includes classifying the plurality of binary data sets into a first group corresponding to the first state and a second group corresponding to the second state based on the binary data of each of the plurality of binary data sets, and
wherein, in each of the real-valued data permutations, real-valued data from mutually non-overlapping intervals are assigned to the binary data sets belonging to the first group and to the binary data sets belonging to the second group.

11. The method of claim 10,
wherein, in each of the real-valued data permutations, different real-valued data having values of 1 or greater are assigned to the binary data sets belonging to the first group, and different real-valued data having values of 0 or greater and less than 1 are assigned to the binary data sets belonging to the second group.

12. The method of claim 9,
wherein the binary data set input step further includes classifying some of the plurality of binary data sets as learning binary data sets and classifying the remainder as verification binary data sets,
wherein the real-valued data permutation generation step generates the real-valued data permutations based on the learning binary data sets among the plurality of binary data sets, and
wherein the optimal regression equation selected in the optimal regression equation selection step is an optimal regression equation for the learning binary data sets.

13. The method of claim 9,
wherein the optimal regression equation selection step includes extracting, from among the derived regression equations, a regression equation that satisfies at least one of a preset first threshold condition on the coefficient of determination, a preset second threshold condition on the cross-validated correlation coefficient of the coefficient of determination, and a preset third threshold condition on the Pearson correlation coefficient.

14. The method of claim 13,
wherein, when a plurality of regression equations are extracted, the optimal regression equation selection step selects, as the optimal regression equation, the regression equation having the highest cross-validated correlation coefficient among those regression equations whose coefficient of determination satisfies a fourth threshold condition set based on the maximum of the coefficients of determination of the extracted regression equations.

15. The method of claim 13,
wherein the optimal regression equation selection step includes returning to the binary data extraction step when none of the derived regression equations satisfies the first threshold condition, the second threshold condition, or the third threshold condition.

16. A data conversion device comprising:
a binary data set loading unit that loads at least some of a plurality of pre-stored binary data sets, each including at least one item of measurement data and binary data representing the state of the measurement data as one of a first state and a second state;
a real-valued data permutation generation unit that generates, based on the binary data of each of the loaded binary data sets, a plurality of real-valued data permutations in each of which different real-valued data are assigned to the respective binary data sets;
a regression equation derivation unit that performs a multiple regression analysis on each of the generated real-valued data permutations to derive a regression equation corresponding to each of the real-valued data permutations;
a regression equation selection unit that selects one optimal regression equation based on at least one of a coefficient of determination of the derived regression equations, a cross-validated correlation coefficient of the coefficient of determination, and a Pearson correlation coefficient; and
a real-valued data calculation unit that calculates real-valued data corresponding to each of the plurality of binary data sets based on the selected optimal regression equation.

17. A computer-readable recording medium on which is recorded a program for causing a computer to execute the method according to any one of claims 1 to 15.