KR100839221B1

KR100839221B1 - Method for integrated analysis of microarray

Info

Publication number: KR100839221B1
Application number: KR1020060095827A
Authority: KR
Inventors: 이재원; 김철민; 전명식; 송석헌; 이정복; 손인석; 박미라
Original assignee: 고려대학교 산학협력단
Priority date: 2006-09-29
Filing date: 2006-09-29
Publication date: 2008-06-19
Also published as: KR20080030142A

Abstract

An integrated microarray analysis method is disclosed.

For the items of a top-level menu consisting of experimental design, normalization, estimation and testing, cluster analysis, and classification, the method of the present invention includes: when experimental design is selected, displaying an interface for selecting one experimental design from among dye swap, reference design, and loop design and an interface for entering the design parameters to be used for the selected design, and generating an ANOVA design matrix for the selected design from the entered parameters; when normalization is selected, sequentially performing single-slide normalization, single-batch normalization, and multiple-slide normalization using the entered slide information to generate a normalization result; when estimation and testing is selected, applying a predetermined estimation and testing method to the normalization result to select the genes that are significant in the microarray experiment; when cluster analysis is selected, performing a predetermined cluster analysis on the normalization result and displaying the resulting clusters as a graph; and when classification is selected, selecting candidate genes using the normalization result and the ratio of between-treatment-group variation to within-treatment-group variation, performing a predetermined classification on the candidate genes to predict the group of interest defined by the design parameters, and outputting the misclassification rate.

According to the present invention, the entire process of microarray data analysis is handled in a single integrated system, so a shared database supports systematic statistical analysis of microarray scanning image data. Applying the statistical method best suited to each experiment minimizes the false-positive and false-negative error rates that result from inappropriate methods and raises the reliability of research results, and a convenient, user-friendly interface makes the analysis easy.

Description

{Method for integrated analysis of microarray}

FIGS. 1A and 1B are conceptual diagrams of the experimental design methods applied to the present invention.

FIG. 2 is a flowchart of the integrated microarray analysis method according to the present invention.

FIG. 3 shows an example screen layout for experimental design, a subprogram of FIG. 2.

FIGS. 4A to 4F show example screen layouts for normalization, a subprogram of FIG. 2.

FIGS. 5A to 5N show example option-input screens and example result graphs for estimation and testing, a subprogram of FIG. 2.

FIGS. 6A to 6O show example option-input screens and example result graphs for cluster analysis, a subprogram of FIG. 2.

FIGS. 7A to 7C show example option-input screens and example result graphs for classification, a subprogram of FIG. 2.

The present invention relates to DNA chips and, more particularly, to an integrated microarray analysis method.

As with other biological experiments, a microarray experiment must begin with experimental design, which depends on the purpose of the experiment and must also be decided with the analysis method in mind. The purpose may be to search for genes that differ significantly between two groups and, on that basis, to develop biomarkers for population screening; or it may be to distinguish the four subtypes of small round blue cell tumor, which are difficult to tell apart with current clinical techniques. Microarrays may also be used in research or diagnosis that broadly analyzes differences in gene expression. The purpose may instead be to determine the size of the interactions among three factors (age, sex, and genotype) in the Drosophila transcriptome, to characterize the variation in gene expression between and within populations of a particular species, or to search for genes that respond when two anticancer drugs are administered together. The experimental design, and with it the analysis method, naturally varies with the purpose of the study.

Because a microarray experiment passes through many steps, and experimental error can enter at each one, it is generally an error-prone experiment. The error accumulated over these steps is one factor that makes statistical analysis difficult, but what perplexes statisticians even more is that a handful of genes must be selected from among thousands while the microarrays that make up the observations number only a few dozen: the so-called 'large p, small n' problem. Here p is the number of explanatory variables and n is the number of observations. Statisticians have traditionally dealt mainly with 'large n, small p' data, whereas microarray data are a typical example of 'large p, small n' data. The statistical analysis of microarray data is therefore a new challenge for statisticians as well.

Conventional microarray analysis requires a separate program for each analysis method, data are difficult to exchange between programs, the analysis data cannot be managed systematically, and the increase in false-positive and false-negative error rates caused by applying inappropriate methods cannot be kept to a minimum.

Accordingly, the technical problem to be solved by the present invention is to provide an integrated microarray analysis method that shares a database to perform systematic statistical analysis of microarray scanning image data, minimizes the false-positive and false-negative error rates caused by inappropriate methods, raises the reliability of research results, and handles the entire process of microarray data analysis in a single integrated system so that the analysis is easy.

To solve the above technical problem, the present invention provides an integrated microarray analysis method that, for the items of a top-level menu consisting of experimental design, normalization, estimation and testing, cluster analysis, and classification, includes: when experimental design is selected, displaying an interface for selecting one experimental design from among dye swap, reference design, and loop design and an interface for entering the design parameters to be used for the selected design, and generating an ANOVA design matrix for the selected design from the entered parameters; when normalization is selected, sequentially performing single-slide normalization, single-batch normalization, and multiple-slide normalization using the entered slide information to generate a normalization result; when estimation and testing is selected, applying a predetermined estimation and testing method to the normalization result to select the genes that are significant in the microarray experiment; when cluster analysis is selected, performing a predetermined cluster analysis on the normalization result and displaying the resulting clusters as a graph; and when classification is selected, selecting candidate genes using the normalization result and the ratio of between-treatment-group variation to within-treatment-group variation, performing a predetermined classification on the candidate genes to predict the group of interest defined by the design parameters, and outputting the misclassification rate.

The integrated microarray analysis method according to the present invention relates to software that statistically analyzes image data generated by performing block indexing and spot indexing on a microarray scanning image. The method is divided broadly into five areas: experimental design, normalization, estimation and testing, cluster analysis, and classification.

First, experimental design provides the dye-swap, reference-design, and loop-design analysis methods.

Normalization provides single-slide normalization, dye-swap normalization, multiple-slide normalization, and single-slide batch methods. Single-slide normalization supports mean normalization, intensity-dependent normalization, print-tip normalization, and scale normalization; multiple-slide normalization supports multiple scale normalization and quantile normalization. Estimation and testing provides methods for estimating and testing significant genes. Cluster analysis groups the significant genes into clusters. Finally, classification supports the classical classification methods (DLDA, DQDA) and the tree method.

Preferred embodiments of the present invention are described below with reference to the drawings. The embodiments illustrated below may be modified in various other forms, however, and the scope of the present invention is not limited to them.

Experimental design is described first. Dye bias is one of several reasons normalization is needed. Because normalization does not completely remove dye bias for every spot on every slide, however, experimenters recommend dye-swap experiments. Designs commonly used in cancer research include the (common) reference design, the block design, and the loop design. The reference design, synonymous with the common reference design just mentioned, uses the same internal reference sample on every array to control the variation that can arise between spots. It is shown schematically in FIG. 1A. A simple loop design is shown in FIG. 1B. For example, a microarray experiment with a loop design can be run on fish to compare the size of between-species and within-species variation.

The integrated microarray analysis method according to the present invention is assumed to be implemented on a computer system capable of handling a large database, on a network system including a server and clients, or on a system with similar processing capability.

First, the items of the top-level menu, consisting of experimental design, normalization, estimation and testing, cluster analysis, and classification, are output to the screen (step 210).

When experimental design is selected from the items (step 221), an interface for selecting one experimental design from among dye swap, reference design, and loop design and an interface for entering the design parameters to be used for the selected design are displayed, and the design parameters are set according to the user's input (step 231). An ANOVA design matrix for the selected design is then generated from the entered parameters (step 232).

When normalization is selected (step 222), slide information is received, and single-slide normalization, single-batch normalization, and multiple-slide normalization are performed on it in sequence to generate a normalization result (step 242). The result is saved to a file so that it can be read during estimation and testing, cluster analysis, and classification (step 243).

When estimation and testing is selected (step 223), the normalization result is read (step 251). A predetermined estimation and testing method is then applied to it to select the genes that are significant in the microarray experiment (step 252). Depending on the analysis method used, the probability of false discovery for the selected genes can also be computed. The predetermined estimation methods include Newton's method, the B-statistic method, the adjusted p-value (ADP) method, Significance Analysis of Microarrays (SAM), and Empirical Bayes Analysis of Microarrays (EBAM). The estimated gene information and, where computed, the probability of false discovery are then stored in the database so that they can be referenced in the other analyses below (step 253).
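As an illustration of how adjusted p-values can drive the gene selection of step 252, the sketch below adjusts a list of per-gene p-values and keeps the genes that fall under a significance level. The source does not say which adjustment its ADP method uses, so the Benjamini-Hochberg procedure is an assumed stand-in, and the function names `bh_adjust` and `significant_genes` are hypothetical.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (assumed stand-in for the ADP method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # the minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(gene_ids: Sequence[str], pvalues: Sequence[float],
                      level: float) -> List[str]:
    """Return the genes whose adjusted p-value does not exceed the level."""
    adjusted = bh_adjust(pvalues)
    return [g for g, q in zip(gene_ids, adjusted) if q <= level]
```

The per-gene p-values themselves would come from whichever test the user selects; only the adjustment and thresholding are shown here.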

When cluster analysis is selected (step 224), the normalization result is read (step 261). A predetermined cluster analysis is then performed on it, and the resulting clusters are displayed as a graph (step 262). The predetermined cluster analyses include hierarchical clustering, k-means clustering, principal component analysis (PCA), the self-organizing map (SOM), and gene shaving. The cluster analysis result is then stored in the database so that it can be referenced in the other analyses below (step 263).

When classification is selected from the items, the normalization result is read (step 271).

Next, the ratio of between-treatment-group variation to within-treatment-group variation is computed from the normalization result, and candidate genes are selected using that ratio (step 272). A predetermined classification is then performed on the selected candidate genes to predict the group of interest defined by the design parameters, and the resulting misclassification rate is output (step 272). In particular, the groups classified by the genes significant for each treatment group, that is, the groups of interest, are output together with the misclassification rate (step 272). Here, the group of interest is the group, within the test set used in the experiment, in which the disease under investigation is expressed; it serves to identify which gene-carrying group is associated with a specific disease such as liver cancer. The predetermined classification includes the classical classification methods (DLDA, DQDA) and the tree method (decision tree). The group information for the classified significant genes and the misclassification rate are then stored in the database so that they can be referenced in other analyses (step 273).
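The classification step can be sketched roughly as follows: rank genes by the ratio of between-group to within-group sums of squares, keep the top genes, fit a diagonal linear discriminant (DLDA) classifier, and report the misclassification rate. This is a minimal sketch under assumptions, not the patented implementation: the data layout `expr[g][s]`, all function names, and the choice to evaluate on the training samples are hypothetical, and every selected gene is assumed to have nonzero pooled variance.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def bw_ratio(values: Sequence[float], labels: Sequence[int]) -> float:
    """Between-group sum of squares over within-group sum of squares for one gene."""
    overall = mean(values)
    groups: Dict[int, List[float]] = {}
    for v, k in zip(values, labels):
        groups.setdefault(k, []).append(v)
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return between / within if within > 0 else float("inf")


def select_genes(expr: List[List[float]], labels: Sequence[int], top: int) -> List[int]:
    """expr[g][s] is gene g in sample s; return indices of the top genes by ratio."""
    ranked = sorted(range(len(expr)),
                    key=lambda g: bw_ratio(expr[g], labels), reverse=True)
    return ranked[:top]


def dlda_fit(expr: List[List[float]], labels: Sequence[int],
             genes: Sequence[int]) -> Tuple[Dict[int, List[float]], List[float]]:
    """Class means and pooled per-gene variances (diagonal covariance)."""
    classes = sorted(set(labels))
    means = {k: [mean(expr[g][s] for s, l in enumerate(labels) if l == k)
                 for g in genes] for k in classes}
    n = len(labels)
    var = []
    for j, g in enumerate(genes):
        ss = sum((expr[g][s] - means[labels[s]][j]) ** 2 for s in range(n))
        var.append(ss / (n - len(classes)))
    return means, var


def dlda_predict(means: Dict[int, List[float]], var: List[float],
                 sample: Sequence[float]) -> int:
    """Assign the class whose mean is nearest in variance-scaled squared distance."""
    def dist(k: int) -> float:
        return sum((x - m) ** 2 / v for x, m, v in zip(sample, means[k], var))
    return min(means, key=dist)


def misclassification_rate(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of samples assigned to the wrong group."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)
```

In practice the misclassification rate would be estimated on held-out samples or by cross-validation rather than on the samples used for fitting.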

Finally, it is determined whether all analyses are finished (step 280). This step may include displaying a message that asks the user to choose between ending the whole process and running another analysis, and receiving the user's choice. If the user chooses to end, the process terminates; otherwise the steps above (steps 210 to 280) are repeated.
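The overall flow of steps 210 to 280 amounts to a menu loop that dispatches each selection to its subprogram until the user chooses to stop. The sketch below shows one way to express that loop; the handler table, the `"quit"` token, and the function names are hypothetical placeholders, and each handler only records its name instead of running an analysis.

```python
from typing import Callable, Dict, Iterable, List

MENU_ITEMS = [
    "experimental design",
    "normalization",
    "estimation and testing",
    "cluster analysis",
    "classification",
]


def make_handlers(log: List[str]) -> Dict[str, Callable[[], None]]:
    """Build one placeholder handler per top-level menu item."""
    return {name: (lambda n=name: log.append(n)) for name in MENU_ITEMS}


def run_menu(selections: Iterable[str]) -> List[str]:
    """Dispatch each selection in turn; stop when the user chooses to quit (step 280)."""
    log: List[str] = []
    handlers = make_handlers(log)
    for choice in selections:
        if choice == "quit":
            break
        handler = handlers.get(choice)
        if handler is None:
            log.append(f"unknown item: {choice}")
            continue
        handler()
    return log
```

A real implementation would replace the placeholder handlers with the subprograms for each menu item and read selections from the user interface.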

The ANOVA analysis software for dye-swap, reference, and loop experimental data is described below.

The input data format is as shown in Table 1.

| Column | Variable name | Remarks |
|---|---|---|
| col 1 | gene id | String without spaces that identifies the gene |
| col 2 | R | Expression of the treatment labeled with the red dye (Cy5): log(foreground red - background red) |
| col 3 | G | Expression of the treatment labeled with the green dye (Cy3): log(foreground green - background green) |
| col 4 | flag | 1 for a gene where an error occurred, 0 otherwise |

The output data format is as shown in Table 2.

| Column | Variable name | Data type | Remarks |
|---|---|---|---|
| col 1 | Gene | integer | Unique number assigned to each gene |
| col 2 | intensity | double | R and G values merged into one column |
| col 3 | Array | integer | 1 to the number of arrays |
| col 4 | Dye | integer | 1, 2 |
| col 5 | Variety | integer | 1 to the number of varieties |

In FIG. 3, Experimental Design is selected from the left-hand menu, and one of dye swap, reference, or loop is selected from the submenu that appears.

On this screen, Experiment Type selects the experimental design; Array file is the list of array files; Number of Arrays is the number of arrays to be used in the analysis; Array Replication indicates whether there is replication between arrays; Multiple Spotting indicates whether there is replication within an array; RNA Species List is the list of treatments (varieties); Array information is where the treatment information for each array is entered; Name of the ANOVA Design Matrix is the name of the file to be saved; Generate a Design Matrix is the button that generates the data set; Test sample selects the treatment to be compared; Boostrap Replications is the number of bootstrap iterations; Significance Level is the significance level; and RUN is the button that runs the ANOVA.

First, the experimental design is selected from among dye swap, reference design, and loop design.

Array file is the list of array files; each file holds the data for one array (gene id, red intensity, green intensity, and flag). Array file only displays the list and requires no action from the user.

Next, the number of arrays to be used in the analysis is entered in Number of Arrays. Because only some of the arrays from the experiment may be used, this number is entered directly. Array Replication is checked when the experiment includes replication between arrays, for example when a dye-swap experiment was repeated twice. Multiple Spotting is checked when each gene was spotted the same number of times (two or more) during chip fabrication. The treatments used in the experiment are entered in order in RNA Species List. For example, to examine differences in expression over time, enter time0, time1, time2, time3, and so on; to examine differences by drug dose, enter 0mg, 10mg, 20mg, 30mg, and so on.

Name of the ANOVA Design Matrix is the name of the file in which the design matrix will be saved.

Generate a Design Matrix is the button that generates the data set, that is, the ANOVA design matrix. Once all the experiment information has been entered, selecting this button generates the data set.

Test sample is the button that selects the treatment to be compared, Boostrap Replications is the number of bootstrap iterations, and Significance Level is the significance level.

When the dye-swap experiment is repeated three times to reduce experimental error, a total of six arrays are produced, as shown in Table 1c.

| Array | Red (Cy5) | Green (Cy3) |
|---|---|---|
| Array1 | Variety2 | Variety1 |
| Array2 | Variety2 | Variety1 |
| Array3 | Variety1 | Variety2 |
| Array4 | Variety2 | Variety1 |
| Array5 | Variety1 | Variety2 |
| Array6 | Variety1 | Variety2 |
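To make the link between the per-array input records (gene id, R, G, flag) and the long-format output (Gene, intensity, Array, Dye, Variety) concrete, the sketch below stacks the arrays using a dye layout like the one tabulated above. It is a sketch under assumptions: the Dye coding (1 for Cy5, 2 for Cy3), the decision to drop flagged spots, and the function name `to_long_format` are not stated in the source.

```python
from typing import Dict, List, Tuple

# Assumed coding: the source only says the Dye column takes the values 1 and 2.
DYE_RED, DYE_GREEN = 1, 2

Record = Tuple[str, float, float, int]       # (gene id, R, G, flag)
Row = Tuple[int, float, int, int, int]       # (Gene, intensity, Array, Dye, Variety)


def to_long_format(layout: List[Tuple[int, int]],
                   arrays: List[List[Record]]) -> List[Row]:
    """Stack per-array records into one row per gene, array, and dye.

    layout[i] gives (variety on red, variety on green) for array i + 1.
    arrays[i] holds the records read from the file for array i + 1.
    """
    gene_index: Dict[str, int] = {}
    rows: List[Row] = []
    for array_no, (records, (red_var, green_var)) in enumerate(
            zip(arrays, layout), start=1):
        for gene_id, r, g, flag in records:
            if flag == 1:
                # Dropping flagged spots is an assumption of this sketch.
                continue
            # Assign each gene a unique integer the first time it is seen.
            gene = gene_index.setdefault(gene_id, len(gene_index) + 1)
            rows.append((gene, r, array_no, DYE_RED, red_var))
            rows.append((gene, g, array_no, DYE_GREEN, green_var))
    return rows
```

For the six-array layout above, `layout` would be `[(2, 1), (2, 1), (1, 2), (2, 1), (1, 2), (1, 2)]`.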

Normalization is described next.

Normalization removes systematic variation after the microarray experiment and before statistical analysis. The normalization methods implemented in the system are as follows.

In a cDNA microarray experiment, normalization is used to correct the ratio of Cy5 to Cy3. cDNA microarray data are analyzed on the log ratio of the two intensities rather than on each absolute intensity separately, because absolute intensity can be confounded with the spot-to-spot variation inherent in the microarray; across multiple slides the log ratio is therefore more stable than the absolute intensities. For the same reason, normalization focuses on correcting the log ratio of the absolute intensities.

Among the normalization methods, single-slide normalization analyzes one microarray slide. It is a preliminary step for checking which method will work best before that method is applied in the single batch described later. In other words, it balances the ratio of Cy3 to Cy5 within one slide.

The input file format is [gene_id] [Cy5] [Cy5 Background] [Cy3] [Cy3 Background], where gene_id consists of letters and digits, and Cy5, Cy5 Background, Cy3, and Cy3 Background are numbers; the fields must be tab-separated. The result file format is [geneid] [ ] [ ], in which the two fields after geneid are given as formula images.
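Reading a file in the five-field, tab-separated input format might look like the sketch below, which subtracts each background from its foreground and takes the logarithm. The base-2 logarithm, the use of `None` for spots whose corrected intensity is not positive, and the function name `parse_slide` are assumptions; the source does not say how such spots are handled or what the output fields are.

```python
import math
from typing import List, Optional, Tuple

Spot = Tuple[str, Optional[float], Optional[float]]


def parse_slide(text: str) -> List[Spot]:
    """Parse lines of: gene_id, Cy5, Cy5 background, Cy3, Cy3 background.

    Returns (gene_id, log2(Cy5 - background), log2(Cy3 - background)),
    with None where the background-corrected intensity is not positive.
    """
    spots: List[Spot] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        gene_id, cy5, cy5_bg, cy3, cy3_bg = line.split("\t")
        red = float(cy5) - float(cy5_bg)
        green = float(cy3) - float(cy3_bg)
        spots.append((
            gene_id,
            math.log2(red) if red > 0 else None,
            math.log2(green) if green > 0 else None,
        ))
    return spots
```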

Selecting one of the analysis targets that appear when the single-slide menu is chosen brings up a screen like FIG. 4A. The screen consists of a Base Info table and a 'Statistical Analysis' table containing the information needed for the analysis.

The analysis starts when 'Execute' is selected in the 'Standardization Input File' item of the 'Statistical Analysis' table. The software can be configured to launch Java Web Start to carry out the analysis methods below.

FIG. 4B is the screen for entering the analysis options. Single-slide normalization supports four methods: Mean, Intensity, Print-tip, and Scale. Selecting the desired method and then 'ok' at the bottom starts the analysis. The software can be configured so that the 'Number of Tip' and 'Number of genes in block' options are enabled when Print-tip or Scale is selected.

Global normalization assumes that the Cy5 intensity is a constant multiple of the Cy3 intensity. Intensity-dependent normalization assumes that the relationship between the Cy5 and Cy3 intensities depends on A, the average log intensity of the spot in the usual M-versus-A notation. Print-tip normalization assumes that there is a print-tip-group effect. Scale normalization corrects the variance between print-tip groups after print-tip normalization.
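Three of these assumptions can be illustrated on a list of log ratios `m` with a print-tip group label per spot, as sketched below. This is a simplified sketch, not the system's implementation: print-tip normalization is reduced to centering each group at its median (a per-group lowess fit would normally take the place of the median), scale normalization uses the median absolute deviation relative to its geometric mean across groups, every group is assumed to have nonzero spread, and all names are hypothetical. Intensity-dependent normalization is omitted because it needs a lowess smoother.

```python
import math
from statistics import mean, median
from typing import Dict, Iterable, List


def global_normalize(m: List[float]) -> List[float]:
    """A constant factor between Cy5 and Cy3 is a constant shift in log ratio."""
    shift = mean(m)
    return [x - shift for x in m]


def _by_group(m: List[float], tip: List[int]) -> Dict[int, List[float]]:
    groups: Dict[int, List[float]] = {}
    for x, t in zip(m, tip):
        groups.setdefault(t, []).append(x)
    return groups


def print_tip_center(m: List[float], tip: List[int]) -> List[float]:
    """Center each print-tip group at its own median (simplified stand-in)."""
    center = {t: median(v) for t, v in _by_group(m, tip).items()}
    return [x - center[t] for x, t in zip(m, tip)]


def mad(values: Iterable[float]) -> float:
    """Median absolute deviation from the median."""
    data = list(values)
    med = median(data)
    return median(abs(v - med) for v in data)


def scale_normalize(m: List[float], tip: List[int]) -> List[float]:
    """Divide each group by its MAD relative to the geometric mean of all MADs."""
    mads = {t: mad(v) for t, v in _by_group(m, tip).items()}
    geo = math.exp(mean(math.log(v) for v in mads.values()))
    return [x / (mads[t] / geo) for x, t in zip(m, tip)]
```

Calling `scale_normalize(print_tip_center(m, tip), tip)` follows the order given in the text, with scale normalization applied after print-tip normalization.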

FIG. 4C is the screen showing the analysis result for the options of FIG. 4B: the input data normalized by the selected method. The screen shows the distribution of the genes and the lowess curve, drawn as a yellow line.

Among the standardization methods, dye exchange compares and standardizes two results obtained under identical conditions with only Cy3 and Cy5 reversed. Dye-swap normalization is thus the standardization method for dye-swapped experiments.

For dye exchange, two input files of the following format are supplied. The two files hold the results of experiments run under identical conditions with only Cy3 and Cy5 reversed. The input file format is [gene_id] [Cy5] [Cy5 Background] [Cy3] [Cy3 Background], where gene_id consists of letters or digits, Cy5, Cy5 Background, Cy3, and Cy3 Background are numbers, and the fields must be tab-separated. The analysis result file has the format [geneid] [ ] [ ].
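As a hedged illustration of how two such tab-separated files might be combined, the following Python sketch parses each slide, computes a background-subtracted log2 ratio per gene, and averages the forward slide with the sign-reversed swapped slide. The combination rule (M1 - M2) / 2 is a commonly used dye-swap formula and is an assumption here, since the text does not state the formula; the function names and the sample values are hypothetical.

```python
import math

def parse_slide(text):
    """Parse rows of: gene_id, Cy5, Cy5 background, Cy3, Cy3 background (tab-separated)."""
    log_ratios = {}
    for line in text.strip().splitlines():
        gene_id, cy5, cy5_bg, cy3, cy3_bg = line.split("\t")
        red = float(cy5) - float(cy5_bg)    # assumes signal exceeds background
        green = float(cy3) - float(cy3_bg)
        log_ratios[gene_id] = math.log2(red / green)
    return log_ratios

def dye_swap(forward, swapped):
    """Assumed rule: average the forward M with the negated swapped M."""
    return {g: (forward[g] - swapped[g]) / 2.0 for g in forward if g in swapped}

# Hypothetical slide pair with the dyes reversed.
slide1 = "g1\t900\t100\t300\t100\ng2\t300\t100\t500\t100"
slide2 = "g1\t300\t100\t900\t100\ng2\t500\t100\t300\t100"
combined = dye_swap(parse_slide(slide1), parse_slide(slide2))
```

Because a dye-specific bias enters the two slides with the same sign while the biological effect enters with opposite signs, the difference cancels the bias under this assumed rule.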

FIG. 4D shows the single dye-swap analysis screen. The software may be configured so that the user selects, from the slides listed on the screen, the two slides to be used in the analysis and then selects 'execute', whereupon a result file is generated.

Preferably, the software may be configured so that a slide is selected by choosing it on the left and then clicking the '->' arrow in the middle, and deselected by choosing it on the right and then clicking '<-'.

Among the standardization methods, single batch is a subprogram that standardizes several slides by the same method, based on the results of the single-slide analysis. The result file generated here is passed automatically to the list for the multiple-slide analysis method. In other words, single-slide batch standardization processes all slides at once with the same standardization method.

The input files for single batch have the same format as for dye-swap normalization, and several input files are supplied. The analysis result file has the format [geneid] [M] [M] ..., where each M value belongs to one of the selected slides and the M values of the slides appear consecutively.

FIG. 4E shows the single batch analysis screen. Its interface is similar to that of single dye swap, except that there is no limit on the number of slides that can be selected. The user selects the slides to be processed and runs the program.

Single batch requires a standardization method to be specified, using the same methods as for a single slide. Global standardization assumes that the Cy5 intensity is a constant multiple of the Cy3 intensity. Intensity-dependent normalization assumes that the relationship between the Cy5 and Cy3 intensities depends on the A value. Print-tip normalization assumes that there is a print-tip-group effect. When print-tip is chosen just above the slide selection area of the screen, a field for entering additional information appears to the right of the selection, as shown in FIG. 4E.

Among the standardization methods, multiple slide produces the actual data used in the subsequent cluster analysis, estimation and testing, and classification. It operates on the file output by the preceding single batch step. Multiple slide normalization equalizes the variance or the quantiles across replicated slides.

The input is the file produced by single batch. The analysis result file has the same format as the single batch output.

FIG. 4F shows the multiple slide analysis screen. The user selects either Multiscale or Quantile on the screen and then selects 'execute' to run the analysis.

There are two analysis options. Multi-scale normalization adjusts the variance of each slide. Quantile normalization standardizes by adjusting the quantiles of each slide. Selecting 'execute' generates the result file.
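The two options may be illustrated, without limitation, by the following Python sketch. It assumes that each slide is a list of M values for the same genes in the same order, that scale is measured by the standard deviation about zero-centered data, and that ties are not treated specially; the text does not specify these details, and the function names are hypothetical.

```python
import math
import statistics

def scale_normalize(slides):
    """Rescale each slide so that all slides share one standard deviation."""
    sds = [statistics.pstdev(s) for s in slides]
    # Common target: geometric mean of the per-slide standard deviations (assumed).
    target = math.exp(sum(math.log(sd) for sd in sds) / len(sds))
    return [[x * target / sd for x in s] for s, sd in zip(slides, sds)]

def quantile_normalize(slides):
    """Give every slide the same empirical distribution (mean of sorted values)."""
    n = len(slides[0])
    sorted_slides = [sorted(s) for s in slides]
    rank_means = [sum(s[i] for s in sorted_slides) / len(slides) for i in range(n)]
    result = []
    for s in slides:
        order = sorted(range(n), key=lambda i: s[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result

# Hypothetical M values for three genes on two slides.
quantile_result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
scale_result = scale_normalize([[-1.0, 1.0], [-2.0, 2.0]])
```

After quantile normalization each slide contains the same set of values, and only the assignment of values to genes differs between slides.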

In multiple slide standardization, the order of the slides affects the analysis result. The user can therefore directly change the slide information contained in the data. Selecting 'modify' may launch an editing program such as Excel, in which the information is edited and saved.

Estimation and testing are described next.

Before estimation and testing are carried out, the direction of the analysis must be set. This consists of two parts: first, an estimation part, which decides which statistic is used to rank the genes by expression; and second, a testing part, which, given that statistic and the estimation carried out with it, determines the confidence level for the genes found.

The B statistic (the Bayesian B statistic of Lönnstedt et al.) estimates the posterior log-odds that each gene is differentially expressed. Like the S statistic, the B statistic applies a penalty to the t statistic; in particular, it penalizes data with high t-values substantially. The log-odds ranking based on the B statistic can be obtained with the Bayesian program.

The rankings for all available statistics can be determined: the traditional t statistic with the ADP program, the S statistic with SAM and EBAM, and the B statistic with the Bayesian program. In EBAM in particular, the user enters a few representative values for the ratio used in the S statistic, and the ranking is computed from them.
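The relationship between the t statistic and a penalized S statistic may be sketched as follows for a one-group experiment. The sketch assumes the usual SAM-style form in which a constant s0 is added to the standard error in the denominator; the value of s0, the function name, and the use of the first two rows of Table 2 as input are illustrative assumptions only.

```python
import math

def t_and_s_statistics(rows, s0):
    """Return (t, S) per gene for one-group data; S adds s0 to the standard error."""
    results = []
    for values in rows:
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / (n - 1)
        se = math.sqrt(variance / n)  # assumes the replicates are not all identical
        results.append((mean / se, mean / (se + s0)))
    return results

# Replicate M values of gene1 and gene2 from Table 2; s0 = 0.1 is hypothetical.
rows = [
    [1.269, 0.277, -0.343, 0.659],
    [0.606, 0.838, -0.405, 1.241],
]
scores = t_and_s_statistics(rows, s0=0.1)
```

For any positive s0 the S statistic is smaller in magnitude than the t statistic, and the shrinkage is strongest for genes whose standard error is small.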

Adjusted p-values are as follows. The family-wise error rate is the probability that at least one of the genes declared differentially expressed is a false positive, regardless of which genes are actually differentially expressed. There are many versions of procedures for estimating the family-wise error rate, and the adjusted p-value is the most conservative approach. Adjusted p-values can be obtained from the ADP program.
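Two widely known adjustments that control the family-wise error rate, the Bonferroni and Holm step-down procedures, may be sketched as follows. They are shown only as generic examples; the text does not identify which adjustments the ADP program computes, and the function names and input values are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjustment with monotonicity enforced."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj_bonferroni = bonferroni(raw)
adj_holm = holm(raw)
```

The Holm values are never larger than the Bonferroni values, which illustrates how different family-wise procedures differ in conservativeness.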

The FDR (false discovery rate) is defined as the expected proportion of errors among the genes selected as significantly differentially expressed. Tusher's SAM method computes the expected order of each gene by random permutation and estimates the FDR from its difference from the observed gene. The FDR can be obtained from the SAM program.
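The idea of a permutation-based FDR estimate may be sketched as follows. This is a deliberately simplified version that counts scores exceeding a fixed threshold; it is not the exact SAM procedure, which works with expected order statistics and a Delta threshold. The function name and all numbers are hypothetical.

```python
def estimate_fdr(observed, permuted_sets, threshold):
    """Estimate FDR as (mean number of permuted scores called) / (observed scores called)."""
    called = sum(1 for d in observed if abs(d) >= threshold)
    if called == 0:
        return 0.0
    false_counts = [
        sum(1 for d in perm if abs(d) >= threshold) for perm in permuted_sets
    ]
    expected_false = sum(false_counts) / len(false_counts)
    return min(1.0, expected_false / called)

# Hypothetical observed scores and two sets of scores from permuted labels.
observed = [5.0, -4.0, 1.0, 0.5]
permuted_sets = [[1.0, 3.5, 0.2, -0.1], [0.3, -0.2, 1.1, 0.4]]
fdr = estimate_fdr(observed, permuted_sets, threshold=3.0)
```

Here two genes are called significant and, on average, half a permuted score exceeds the threshold, so the estimated FDR is 0.25.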

The FDR can be estimated with EBAM in much the same way as with SAM. Because the S statistic used differs, however, the FDR results differ from those of SAM. In this method the FDR is usually estimated at a posterior probability of 0.9; conversely, the investigator may choose the posterior probability empirically by considering several FDR values and the corresponding numbers of rejected genes. FDRs for various S statistics can be obtained from the EBAM program.

Among the estimation and testing methods, the Newton method is a subprogram that estimates significant genes in a single-slide microarray experiment. The input file format is [gene name] [A(_)] [M(_)].

The analysis result file contains the list of genes whose log posterior odds are greater than 0, the list of genes whose log posterior odds are less than or equal to 0, the log posterior odds values used to draw contours from the maximum and minimum values, and the four intercept values (95% and 99%) used to draw the Chen graph.

In the result graph of FIG. 5A, the black contours mark posterior odds of 1:1, 10:1, and 100:1, the light blue straight lines mark the 95% and 99% confidence intervals of the Chen method, red points are genes whose log posterior odds are greater than 0, and green points are genes whose log posterior odds are less than or equal to 0.

Among the estimation and testing methods, the B-statistic is a subprogram that estimates significant genes in replicated microarray experiments. The input file data are tab-separated, as shown in Table 2 below.

gene1    1.269    0.277   -0.343    0.659
gene2    0.606    0.838   -0.405    1.241
gene3    0.852    0.627    1.918   -1.077
gene4   -0.065   -0.570   -0.446    0.364
...

The analysis result contains the mean M(_) value of each gene and the log posterior odds of each gene.

In the option input screen of FIG. 5B, Data is a file in *.txt format, Number of rows is the number of genes, Number of replication is the number of slides, and P is the probability that an arbitrary gene is significantly expressed.

FIG. 5C is an example of a result graph obtained with the options of FIG. 5B.

Here, red marks significant genes (log posterior odds greater than 0) and green marks non-significant genes (log posterior odds less than 0).

Among the estimation and testing methods, ADP is a subprogram that computes adjusted p-values for significant genes in four types of replicated microarray experiments. As with the B-statistic, the input file is delimited by tabs or spaces.

The analysis result contains 10 kinds of adjusted p-values and 3 kinds of p-values, and provides various information related to the ADP run.

In the option input screen of FIG. 5D, Open data takes the data in text format; Response type selects one of four experiment types (one group, two groups, three or more groups, quantitative response); Number of permutations is the number of random permutation iterations; Number of bootstraps is the number of bootstrap iterations; Seed Numbers is the initial value for random number generation; Number of groups takes the number of replicates in each group; Quantitative response file is the file specifying the response for experiments with a quantitative response; p-value selects the 10 kinds of adjusted p-values; Run Result Capture displays the progress of the run; Raw p-value gives the random-permutation and bootstrap p-values; and Tmp Outfile Capture outputs the contents of the Tmp file.

FIG. 5E is an example of a result graph obtained with the options of FIG. 5D. Blue represents raw p-values and red represents adjusted p-values.

The software may be configured so that, when a significant gene in a particular plot of FIG. 5E is explored with the mouse, a screen such as FIG. 5F is displayed.

Among the estimation and testing methods, SAM is a subprogram that searches for significant genes and computes the FDR in four types of replicated microarray experiments. As with the B-statistic, the input file is delimited by tabs or spaces.

The analysis result provides the number of significant genes for each Delta value and the FDR under three selection criteria; various information related to the SAM run and the initial Delta value; the information for the SAM plot at the initial Delta value; the information for the adjusted SAM plot at other Delta values; and the list of significant genes and the Q-values for the final Delta value.

In the option input screen of FIG. 5G, Import data takes the data in text format; Response type selects one of four experiment types (one group, two groups, three or more groups, quantitative response); Number of permutations is the number of random permutation iterations; Seed Numbers is the initial value for random number generation; Number of groups takes the number of replicates in each group; Quantitative response file is the file specifying the response for experiments with a quantitative response; Gene Observed score is a plot of gene against observed score; Gene Expected score is a plot of gene against expected score; Gene plot is a plot of the denominator (standard error) of the SAM statistic against the expected score; and SAM plot is a plot of observed score against expected score. FDR plot provides plots of the mean FDR, median FDR, and 90th percentile FDR, and FDR table list presents the FDR result file.

FIG. 5H shows the FDR computed with the options of FIG. 5G. Delta can be chosen by inspecting the FDR in FIG. 5H.

FIG. 5I shows an example of a SAM plot obtained with the options of FIG. 5G. The Delta value can be chosen through Input Delta value. FIG. 5J shows an example of a SAM plot for the Delta entered on the basis of FIG. 5H or FIG. 5I. The red region marks positively significant genes and the blue region marks negatively significant genes.

Among the estimation and testing methods, EBAM is a subprogram that searches for significant genes by the empirical Bayes method and computes the FDR in four types of replicated microarray experiments. As with the B-statistic, the input file is delimited by tabs or spaces.

The analysis result provides the number of significant genes for each posterior probability and the FDR under three selection criteria; various information related to the EBAM run and the initial posterior probability; the information for the EBAM plot at the initial posterior probability; the information for the EBAM plot at other posterior probabilities; the posterior probabilities from logit regression and the spline basis values; and the list of significant genes and the Q-values for the final posterior probability.

In the option input screen of FIG. 5K, Import data takes the data in text format; Response type selects one of four experiment types (one group, two groups, three or more groups, quantitative response); Spline df is the degrees of freedom of the spline (usually 5); Furge Factor Percentile is the percentile position (a value from 0 to 1) of the standard error used to correct the standard error; Number of permutations is the number of random permutation iterations; Seed is the initial value for random number generation; Number of groups takes the number of replicates in each group; Quantitative response file is the file specifying the response for experiments with a quantitative response; Standard Z score is a plot of gene against uncorrected Z score; Expression score is a plot of gene against corrected Z score; Numerator:Denominator is a plot of the denominator (standard error) against the numerator of the EBAM statistic; Natural spline LOGIT prob.. is a plot of the LOGIT result; and EBAM plot is a plot of expected score against posterior probability. FDR plot provides plots of the mean FDR, median FDR, and 90th percentile FDR, and FDR table list presents the FDR result file.

FIG. 5L shows the output screen of the FDR computed with the options of FIG. 5K. The posterior probability can be chosen by inspecting the FDR shown.

FIG. 5M is an example of an EBAM plot obtained with the options of FIG. 5K. The posterior probability value can be chosen through Posterior probability setting. The red line is the EBAM plot, the blue line is the unadjusted EBAM plot, and the green region shows the expression scores of the genes.

The software may be configured so that, when a posterior probability is entered on the basis of FIG. 5L or FIG. 5M, the EBAM plot for that posterior probability is output.

FIG. 5N is an example of the list of significant genes for the posterior probability entered on the basis of FIG. 5L or FIG. 5M.

Cluster analysis is described next.

Hierarchical clustering comes in two forms, agglomerative and divisive. The agglomerative method starts with every object as its own cluster and successively merges similar clusters; the divisive method starts with all objects in one cluster and successively splits the group. The similarity of the objects is represented as a tree diagram.

K-means clustering assigns objects to a predetermined number (k) of clusters. Initial values for the clusters are specified, each object is assigned to one of the resulting initial clusters, and objects are then reassigned until a given criterion is optimized, which yields the final clusters. Objects are assigned according to the cluster centers, for which either the mean (k-means) or the median (k-medoids) can be used.
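The assign-and-reassign loop described above may be sketched as follows for the mean-based variant. The sketch assumes squared Euclidean distance, caller-supplied initial centers, and termination when the centers stop changing; these choices and the function name are illustrative assumptions, not the disclosed implementation.

```python
def kmeans(points, centers, max_iter=100):
    """Assign points to the nearest center, then recompute centers as cluster means."""
    labels = []
    for _ in range(max_iter):
        labels = [
            min(
                range(len(centers)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centers[k])),
            )
            for pt in points
        ]
        new_centers = []
        for k in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(centers[k]))  # keep the center of an empty cluster
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return labels, centers

# Hypothetical expression profiles (two conditions) for four genes, k = 2.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
```

The result depends on the initial centers, which is why the procedure above specifies initial cluster values before any assignment takes place.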

The self-organizing map (SOM) is a type of neural network model. It resembles k-means clustering, but in the resulting groups of genes, groups that are adjacent in position show more similar patterns than groups that are far apart. For the analysis the user must specify in advance the shape of the nodes and the numbers of rows and columns of the grid. Preferably, rectangular or hexagonal nodes can be specified.

Principal component analysis (PCA) transforms multidimensional data that are structurally complex and mutually correlated into mutually uncorrelated artificial variables called principal components, so that a few meaningful principal components account for much of the data. It is suited to simplifying or summarizing data and to searching for outliers or clusters. Principal components can be computed from the correlation matrix or the covariance matrix.
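Computing the first principal component from a covariance matrix may be sketched as follows. The sketch uses power iteration, a simple numerical technique chosen here only for brevity; it assumes that the starting vector is not orthogonal to the leading eigenvector and that the matrix is not all zeros. The function names and the data are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]

def first_principal_component(matrix, iterations=200):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, v

# Hypothetical data lying exactly on the line y = x.
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
cov = covariance_matrix(data)
eigenvalue, direction = first_principal_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
```

The ratio of the leading eigenvalue to the trace of the matrix is the proportion of variance explained by the first principal component; in this example it is 1 because the data are perfectly collinear.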

The gene shaving method computes a "super gene" as the first principal component and then removes the genes least related to it (the bottom 10%), repeating this until only one gene remains. The procedure produces a successive series of gene blocks, and each cluster is formed so that the variance between clusters is maximized and the variance within clusters is minimized. With this method a gene is not restricted to a single cluster and may be assigned to more than one.

Selecting the hierarchical clustering menu runs hierarchical clustering on the genes and samples and provides a dendrogram (tree) as the result.

The data to be analyzed must be a tab-delimited text file (*.txt). As in the example below, genes are entered in rows and experiments (samples, arrays) in columns. The first column holds the gene names and the first row holds the names of the experiments (samples, arrays).

In FIG. 6A, the uploaded data are specified in Import data.

Linkage Method selects the measure that defines the distance between clusters. Average linkage defines the distance between two clusters as the average of the distances over all possible pairs formed by taking one object from each cluster. Complete linkage defines it as the longest of the distances between two objects belonging to the respective clusters. Single linkage defines it as the shortest of those distances.
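The three linkage definitions may be sketched as follows. The sketch assumes Euclidean distance between objects; the function names and the one-dimensional sample clusters are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linkage_distance(cluster1, cluster2, method):
    """Distance between two clusters under average, complete, or single linkage."""
    distances = [euclidean(a, b) for a in cluster1 for b in cluster2]
    if method == "average":
        return sum(distances) / len(distances)
    if method == "complete":
        return max(distances)
    if method == "single":
        return min(distances)
    raise ValueError("unknown linkage method: " + method)

# Hypothetical clusters: pairwise distances are 4, 6, 3, and 5.
c1 = [[0.0], [1.0]]
c2 = [[4.0], [6.0]]
d_average = linkage_distance(c1, c2, "average")
d_complete = linkage_distance(c1, c2, "complete")
d_single = linkage_distance(c1, c2, "single")
```

For any pair of clusters the single-linkage distance is at most the average-linkage distance, which in turn is at most the complete-linkage distance.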

Similarity measure sets the measure of similarity between two objects. correlation(uncentered) uses Pearson's correlation coefficient. correlation(centered) uses the Pearson correlation coefficient with the means of the two objects set to 0. absolute correlation(uncentered) uses the absolute value of the uncentered correlation measure. absolute correlation(centered) uses the absolute value of the centered correlation measure. Euclidean distance uses the Euclidean distance.
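The correlation-based measures may be sketched as follows. The sketch follows one common convention, assumed here, in which the centered measure subtracts each profile's own mean before computing the correlation and the uncentered measure omits that subtraction; the function name and the sample profiles are hypothetical.

```python
import math

def correlation(x, y, centered=True, absolute=False):
    """Correlation similarity between two profiles (assumes neither is all zeros)."""
    if centered:
        mean_x = sum(x) / len(x)
        mean_y = sum(y) / len(y)
        x = [v - mean_x for v in x]  # shift each profile to mean 0
        y = [v - mean_y for v in y]
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    r = numerator / denominator
    return abs(r) if absolute else r

# Hypothetical profiles with opposite trends.
profile_a = [1.0, 2.0, 3.0]
profile_b = [3.0, 2.0, 1.0]
r_centered = correlation(profile_a, profile_b, centered=True)
r_uncentered = correlation(profile_a, profile_b, centered=False)
r_absolute = correlation(profile_a, profile_b, centered=True, absolute=True)
```

The absolute variants treat strongly anti-correlated profiles as similar, which groups genes that respond in opposite directions to the same conditions.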

In the dendrogram of FIG. 6B, the gene clustering result appears on the left of the screen and the experiment (sample, array) clustering result at the bottom. Data values can be read against the color scale at the far right.

Selecting the part of the tree to be enlarged opens a new window, as in FIG. 6C, showing the enlarged view with the gene names and experiment names. The software may be configured so that the sizes of the picture and the text can be adjusted on this screen.

The software may be configured so that, when the name of a gene to be found is entered on the screen of FIG. 6C, an enlarged view of only that gene is displayed in a new window.

K-means clustering runs k-means clustering on the genes or samples and provides the clustering result for the given number k together with a graph.

The data to be analyzed must be a tab-delimited text file (*.txt). As in the example below, genes are entered in rows and experiments (samples, arrays) in columns. The first column holds the gene names and the first row holds the names of the experiments (samples, arrays).

On the screen of FIG. 6D, the data to be analyzed are specified in Import data. Target selects whether to cluster genes or experiments: Gene for gene clustering, Experiments for clustering of experiments. Number of clusters specifies the desired number of clusters. centroid definition selects how the cluster center is defined: K-means takes the mean as the center of a cluster, and K-medoids takes the median. Number of iteration specifies the maximum number of iterations for the analysis.

FIG. 6E is an example of a result graph according to FIG. 6D. It plots the mean profile of each cluster.

Selecting a cluster displays a screen such as FIG. 6F with detailed information. The names of the genes (experiments) belonging to the cluster are listed on the left, and a graph of their profiles appears on the right. Selecting one of the gene (experiment) names on the left shows its profile on the right. The software may be configured so that several genes can be selected at once.

Self Organizing Map (SOM) runs a self-organizing map analysis on the genes or samples and provides the clustering result for the given number and shape of nodes together with a graph.

The data to be analyzed must be a tab-delimited text file (*.txt). As in the example below, genes are entered in rows and experiments (samples, arrays) in columns. The first column holds the gene names and the first row holds the names of the experiments (samples, arrays).

On the screen of FIG. 6G, the uploaded data are specified in Import data. Target selects whether to cluster genes or experiments: Gene for gene clustering, Experiments for clustering of experiments. Number of rows specifies the desired number of rows, and Number of columns the desired number of columns; the figure assumes a 3*3 SOM, with both set to 3. Number of iteration specifies the maximum number of iterations for the analysis. Initial learning rate specifies the initial learning rate, a value between 0 and 1. Radius specifies the size of the neighborhood on the map, which must be greater than 1. Neighborhood function specifies the form of the kernel function, either Bubble or Gauss. Topology specifies the shape of the grid, either rectangular or hexagonal.
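How the learning rate, radius, and neighborhood function might interact in a single SOM training step may be sketched as follows. The update rule shown is the standard one for self-organizing maps and is assumed rather than taken from the disclosed software; only a rectangular grid is covered, and the function name and the numbers are hypothetical.

```python
import math

def som_step(weights, grid, sample, learning_rate, radius, kernel="bubble"):
    """One SOM update: find the best-matching node, then pull its neighbors toward the sample."""
    bmu = min(
        range(len(weights)),
        key=lambda k: sum((w - s) ** 2 for w, s in zip(weights[k], sample)),
    )
    updated = []
    for k, w in enumerate(weights):
        grid_distance = math.dist(grid[k], grid[bmu])
        if kernel == "bubble":
            h = 1.0 if grid_distance <= radius else 0.0  # hard neighborhood
        else:
            h = math.exp(-(grid_distance ** 2) / (2.0 * radius ** 2))  # "gauss" kernel
        updated.append(
            [wi + learning_rate * h * (si - wi) for wi, si in zip(w, sample)]
        )
    return bmu, updated

# Hypothetical 1*3 rectangular grid with one-dimensional node weights.
weights = [[0.0], [1.0], [5.0]]
grid = [(0, 0), (0, 1), (0, 2)]
bmu, new_weights = som_step(weights, grid, sample=[0.0], learning_rate=0.5, radius=1.5)
```

With the Bubble kernel only nodes within the radius of the best-matching node move, whereas the Gauss kernel moves every node by an amount that decays with grid distance.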

FIG. 6H is an example of a result graph obtained from FIG. 6G. It plots the average profile of each cluster.

In FIG. 6H, selecting a cluster opens a screen such as FIG. 6I that shows detailed information. The left side lists the names of the genes (or experiments) belonging to the cluster, and the right side shows a graph of their profiles. Selecting a name on the left displays the corresponding profile on the right. The software can be configured so that several genes can be selected at once.

The Principal Component Analysis (PCA) function runs a principal component analysis on genes or samples and presents the result together with a graph.

The data to be analyzed must be a tab-delimited text file (*.txt). Genes are entered as rows and experiments (samples, arrays) as columns. The first column holds the gene names and the first row holds the names of the experiments (samples, arrays).

On the screen of FIG. 6J, [Import data] shows the uploaded data file. The selection bar next to [PC based on] specifies which matrix the eigenvalues and eigenvectors are computed from when the PCA is run. The available matrices are the R [Correlation] matrix, the V [Covariance] matrix, and the S [SSCP] (sums of squares and cross-products) matrix.
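The three matrix choices can be illustrated as follows. This is a sketch under assumptions, not the program of the invention: the function names are hypothetical, the S matrix is taken to be the raw (uncentered) sums of squares and cross-products, and only the leading principal component is extracted, by power iteration.

```python
import math


def pca_matrix(rows, kind="V"):
    """Build the p x p matrix for PCA: 'S' (SSCP), 'V' (covariance), 'R' (correlation)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    if kind == "S":
        x = [list(r) for r in rows]         # assumption: raw, uncentered SSCP
        scale = 1.0
    else:
        x = [[r[j] - means[j] for j in range(p)] for r in rows]
        scale = 1.0 / (n - 1)
    m = [[scale * sum(xi[a] * xi[b] for xi in x) for b in range(p)]
         for a in range(p)]
    if kind == "R":
        sd = [math.sqrt(m[j][j]) for j in range(p)]
        m = [[m[a][b] / (sd[a] * sd[b]) for b in range(p)] for a in range(p)]
    return m


def leading_component(m, n_iter=500):
    """Power iteration: largest eigenvalue, its unit eigenvector, and its proportion."""
    p = len(m)
    v = [float(a + 1) for a in range(p)]    # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(m[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    mv = [sum(m[a][b] * v[b] for b in range(p)) for a in range(p)]
    eigenvalue = sum(v[a] * mv[a] for a in range(p))
    proportion = eigenvalue / sum(m[j][j] for j in range(p))
    return eigenvalue, v, proportion
```

The proportion returned here is the eigenvalue divided by the trace of the chosen matrix, which is the usual share of variation attributed to a component. Power iteration fails if the start vector happens to be orthogonal to the leading eigenvector; a full eigen-decomposition routine would be used in practice.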

FIG. 6K is an example of a PCA graph obtained from FIG. 6J. It plots each gene with the first and second principal components as axes. The eigenvalue and proportion of each of the first and second principal components are shown at the bottom of the window. Hovering the mouse pointer over a point displays the name of the corresponding gene, and its coordinates on the first and second axes appear on the right. To inspect particular genes, drag a region with the mouse. The software can be configured to display the list and profiles of the genes in that region in a new window, as in FIG. 6L.

To view the profile of one of these genes, select it in the 'Gene list' window; the gene is then shown in red in the 'Gene plot' and 'Gene Profile' windows. Several genes can also be selected.

FIG. 6M is the screen for experiments that appears when View Experiment plot is selected; it is interpreted in the same way as FIG. 6L for genes.

The gene shaving (Geneshaving) function runs a gene shaving analysis on genes and presents the result together with graphs.

On the screen of FIG. 6N, [Data file] shows the uploaded data file. [Number of clusters] sets the desired number of clusters. [Number of permutation] sets the number of permutations used to estimate the gap. FIG. 6N shows the case of 3 clusters and 5 permutations.

FIG. 6O is an example of a result graph screen obtained from FIG. 6N. Three graphs are provided for each cluster.

The graph in the first column shows the experiments and the gene expression levels in the first cluster. The horizontal axis shows the experiments and the vertical axis shows the genes. Within each cluster, the experiments are ordered by the average expression level of the genes. The percent variance (%variance = VB/VT*100), the between-to-within variance ratio (VB/VW), the between-group variance (VB), and the total variance (VT) are reported alongside. A higher percent variance can be read as a more tightly grouped cluster. Preferably, double-clicking the graph opens an enlarged version in a new window.
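The reported quantities can be computed as in the sketch below. The source gives only the ratios, so the definitions of VB and VW used here are an assumption taken from the usual gene shaving formulation: VW is the variance of the genes around the cluster average within each experiment, VB is the variance of the cluster average across experiments, and VT = VB + VW. The function name `cluster_variances` is hypothetical.

```python
def cluster_variances(cluster_rows):
    """Compute VB, VW, VT and the two ratios for one cluster.

    cluster_rows holds one list of expression values per gene in the cluster,
    with one value per experiment.
    """
    k, p = len(cluster_rows), len(cluster_rows[0])
    # Cluster average in each experiment, and the grand mean.
    col_means = [sum(row[j] for row in cluster_rows) / k for j in range(p)]
    grand = sum(col_means) / p
    vb = sum((m - grand) ** 2 for m in col_means) / p
    vw = sum((row[j] - col_means[j]) ** 2
             for row in cluster_rows for j in range(p)) / (k * p)
    vt = vb + vw
    return {
        "VB": vb,
        "VW": vw,
        "VT": vt,
        "percent_variance": 100.0 * vb / vt,
        "VB/VW": vb / vw if vw > 0 else float("inf"),
    }


# Made-up cluster of two genes measured in two experiments.
stats = cluster_variances([[1.0, 3.0], [3.0, 5.0]])
```

For the made-up cluster above, VB and VW are both 1, so the percent variance is 50. A cluster in which every value is identical has VT = 0 and would need a guard before the division.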

The second graph is used to determine the number of genes in each cluster, that is, the cluster size. It shows the percent variance (VB/VT*100) for the actual data (blue curve) and for the permuted data (red curve). The permuted data are used to compute the percent variance that would arise by chance if genes and experiments were unrelated. The cluster size is set where the difference between this value and the percent variance of the actual data is largest. The program determines the number of genes in each cluster automatically from the size computed in this way. In this example, of 1500 genes, 18 were assigned to the first cluster, 2 to the second, and 6 to the third. Selecting this graph also opens it in a separate window. Because the data set contains 1500 genes in total, the horizontal axis runs from 0 to 1500.

The third graph conveys the same information as the second: it shows the gap statistic (the difference between the percent variance of the actual data and that of the permuted data) as a function of cluster size. The size at which this value is largest is therefore taken as the size of each cluster.
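The selection rule described for the second and third graphs can be sketched as follows. The function name `choose_cluster_size` and all numbers are made up for illustration; averaging the permuted percent variance over the permutation runs is an assumption about how the runs are combined.

```python
def choose_cluster_size(sizes, observed, permuted_runs):
    """Pick the cluster size with the largest gap.

    sizes         : candidate cluster sizes
    observed      : percent variance of the actual data at each size
    permuted_runs : one list per permutation, percent variance at each size
    """
    n_runs = len(permuted_runs)
    expected = [sum(run[i] for run in permuted_runs) / n_runs
                for i in range(len(sizes))]
    gaps = [o - e for o, e in zip(observed, expected)]
    best = max(range(len(sizes)), key=gaps.__getitem__)
    return sizes[best], gaps


# Made-up percent variances for four candidate sizes and two permutations.
size, gaps = choose_cluster_size(
    sizes=[2, 6, 18, 50],
    observed=[90.0, 85.0, 80.0, 40.0],
    permuted_runs=[[80.0, 70.0, 40.0, 30.0], [82.0, 72.0, 44.0, 34.0]],
)
```

In this made-up case the gap peaks at the third candidate, so a size of 18 is chosen.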

Classification is described next.

In the analysis of microarray data, the purpose of classification analysis (or discriminant analysis) is to find the significant genes that best discriminate between predefined treatments (for example, samples obtained from normal tissue and from carcinoma tissue).

The classical discriminant methods are as follows. The most representative classical method, FLDA (Fisher's Linear Discriminant Analysis), reduces data spread over many dimensions to one dimension (that is, a linear combination) so that the differences between treatments show as clearly as possible. When the variance differs from treatment to treatment, that is, when the spread of the data differs between treatment groups, DLDA (Diagonal Linear Discriminant Analysis) is a good method to use. Finally, when the variance-covariance of the data takes a quadratic form, the method is called DQDA (Diagonal Quadratic Discriminant Analysis).
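The two diagonal rules can be sketched as below. The source does not give the formulas, so this follows the common textbook definitions as an assumption: DLDA scores a sample by its squared distance to each class mean scaled by a pooled per-gene variance, while DQDA uses a separate per-gene variance for each class plus a log-variance term. Equal class priors are assumed, and the function names are hypothetical.

```python
import math


def fit_diagonal(train_x, train_y):
    """Estimate class means, per-class variances and pooled variances per gene."""
    classes = sorted(set(train_y))
    p = len(train_x[0])
    means, variances = {}, {}
    pooled_ss = [0.0] * p
    for k in classes:
        rows = [x for x, y in zip(train_x, train_y) if y == k]
        mu = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        ss = [sum((r[j] - mu[j]) ** 2 for r in rows) for j in range(p)]
        means[k] = mu
        variances[k] = [s / (len(rows) - 1) for s in ss]  # needs >= 2 samples per class
        pooled_ss = [a + b for a, b in zip(pooled_ss, ss)]
    pooled = [s / (len(train_x) - len(classes)) for s in pooled_ss]
    return classes, means, variances, pooled


def predict(model, x, method="DLDA"):
    """Assign x to the class with the smallest discriminant score."""
    classes, means, variances, pooled = model

    def score(k):
        if method == "DLDA":
            return sum((xj - m) ** 2 / v
                       for xj, m, v in zip(x, means[k], pooled))
        # DQDA: class-specific variances plus a log-variance penalty.
        return sum((xj - m) ** 2 / v + math.log(v)
                   for xj, m, v in zip(x, means[k], variances[k]))

    return min(classes, key=score)
```

A gene whose variance is exactly zero in the training data would cause a division by zero here, so such genes would have to be filtered out or the variance floored before fitting.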

The decision tree method finds the single candidate gene that best explains the treatments and splits the data on that criterion (grows a branch). Within each resulting subset it again selects the most significant candidate gene and grows further branches, repeating the process according to fixed rules until the whole data set has been classified. The tree diagram closely resembles the diagram produced by hierarchical clustering, which lets users grasp the result intuitively, and the method is used in many fields besides microarray data.
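One branching step of this procedure can be sketched as a search over every gene and every candidate threshold. The source does not name its splitting criterion, so Gini impurity is used here purely as an assumption, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (gene_index, threshold, weighted_impurity) of the best single split.

    rows holds one list of gene values per sample; returns None when no gene
    takes more than one distinct value.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# Made-up data: four samples, two genes, two classes.
split = best_split([[1.0, 10.0], [2.0, 50.0], [3.0, 20.0], [4.0, 60.0]],
                   [1, 2, 1, 2])
```

Growing a full tree would apply `best_split` again to the samples on each side of the chosen threshold until a stopping rule is met.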

Variable selection proceeds as follows. Preselecting the candidate genes that are expected to be significant is an essential step in the classification analysis of microarray data. The variable selection method applied in the present invention (BSS/WSS) uses the ratio of the variation between treatment groups (Between SS) to the variation within treatment groups (Within SS). In other words, it selects candidate genes whose values differ widely between treatment groups but are similar, with little variation, within the same treatment group.

Classification analysis of microarray data uses the standardized data and proceeds through the following steps. First, missing values are handled. Depending on the analysis to be performed, missing values must be dealt with beforehand. A decision tree can be analyzed without imputing missing values, but with the other methods a gene whose missing values have not been imputed is excluded from the analysis. Second, candidate genes are chosen by variable selection: BSS/WSS is used to select the candidate genes expected to be significant. The number of candidate genes is kept smaller than the total number of observations. Third, cross validation (CV) is performed. Classification analysis needs an evaluation measure that shows how accurately the result predicts, and the most intuitive one is the misclassification rate, the proportion of individuals that the computed discriminant rule classifies incorrectly. Typically the data are divided into n equal parts, a classification model is built from n-1 of them, and the remaining part is used to compute the misclassification rate of that model. The data used to build the model are called the training set, and the data used for evaluation are called the test set. n-fold cross validation usually refers to repeating this misclassification-rate computation n times. Cross validation must therefore be performed to evaluate the classification analysis, and a suitable value of n must be chosen for it. Finally, the analysis method is selected, and the analysis is run with each method.
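The n-fold procedure can be sketched independently of the classifier. In this sketch the names `cross_validated_error`, `fit_majority` and `predict_majority` are hypothetical, samples are assigned to folds by position without shuffling (an assumption), and a trivial majority-class rule stands in for the real discriminant method.

```python
def cross_validated_error(xs, ys, n_folds, fit, predict):
    """Estimate the misclassification rate by n-fold cross validation.

    fit(train_x, train_y) returns a model; predict(model, x) returns a label.
    """
    wrong = 0
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at `fold`, forms the test set.
        test_idx = set(range(fold, len(xs), n_folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        wrong += sum(1 for i in test_idx if predict(model, xs[i]) != ys[i])
    return wrong / len(xs)


def fit_majority(train_x, train_y):
    """Placeholder classifier: remember the most frequent training label."""
    return max(sorted(set(train_y)), key=train_y.count)


def predict_majority(model, x):
    """Placeholder prediction: always return the remembered label."""
    return model


# Made-up labels; the features are irrelevant for the placeholder classifier.
error = cross_validated_error([[0.0]] * 6, [1, 1, 1, 1, 2, 2], 2,
                              fit_majority, predict_majority)
```

The function pools the errors of all folds and divides by the number of samples, which equals the average of the per-fold rates when the folds have equal size. Any pair of `fit` and `predict` functions with the same signatures can be substituted for the placeholder.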

The BSS/WSS gene selection method selects the candidate genes expected to be significant by using the ratio of the variation between treatment groups (Between SS) to the variation within treatment groups (Within SS).

The input file can be formatted as shown in Table 3a below.

[Gene name]  [class]   [class]   [class]   [class]   [class]   [class]   [class]   [class]
one3         1         1         1         1         2         2         3         3
GENE0000     -0.549    -0.16135  -0.1183   -0.07699  -0.41637  -0.53994  -1.88314  -1.26389
GENE0001     -0.65766  -0.16635  -0.19691  -0.12792  -0.32658  -0.38762  -1.55486  -0.75601
GENE0002     -0.75556  -0.24525  -0.08846  -0.13201  -0.50722  -0.44909  -1.64757  -0.70395
GENE0003     -0.57216  -0.2546   -0.04408  -0.17327  -0.4554   -0.66926  -1.63925  -1.08307
GENE0004     -0.40403  -0.08962  -0.07734  -0.13194  -0.04299  -0.34925  -1.49124  -0.57431
GENE0005     -0.46961  0.04602   0.00438   0.03012   -0.26229  -0.41479  -1.41275  -0.55466
GENE0006     -0.49563  -0.05849  0.09863   -0.07333  -0.48383  -0.34593  -1.39221  -0.31024
GENE0007     -0.51306  -0.01078  0.01036   -0.15065  -0.4402   -0.4574   -1.2619   -0.76481

The analysis result file can be formatted as shown in Table 3b below.

[Gene name]  [BSS/WSS]
GENE0000     7.252297
GENE0001     2.127878
GENE0002     1.411586
GENE0003     4.922214
GENE0004     2.015062
GENE0005     1.860405
GENE0006     0.877066
GENE0007     3.195392
GENE0008     4.270624
GENE0009     1.747254
GENE0010     6.200477
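The ratio can be checked against the tables with a short calculation. The sketch below assumes the usual sums-of-squares definition, with BSS as the class-size-weighted squared deviation of each class mean from the overall mean and WSS as the squared deviation of each value from its class mean; the function name `bss_wss_ratio` is hypothetical. The input is the GENE0000 row and the class row of Table 3a.

```python
def bss_wss_ratio(values, classes):
    """Between-group to within-group sum-of-squares ratio for one gene."""
    overall = sum(values) / len(values)
    bss = 0.0
    wss = 0.0
    for k in set(classes):
        group = [v for v, c in zip(values, classes) if c == k]
        mean_k = sum(group) / len(group)
        bss += len(group) * (mean_k - overall) ** 2   # between treatment groups
        wss += sum((v - mean_k) ** 2 for v in group)  # within treatment groups
    return bss / wss


# Class labels and GENE0000 values as listed in Table 3a.
classes = [1, 1, 1, 1, 2, 2, 3, 3]
gene0000 = [-0.549, -0.16135, -0.1183, -0.07699,
            -0.41637, -0.53994, -1.88314, -1.26389]
ratio = bss_wss_ratio(gene0000, classes)
```

Under this definition the ratio for GENE0000 comes out at roughly 7.2523, which agrees with the value listed for that gene in Table 3b. Ranking all genes by this ratio and keeping the top ones gives the candidate gene list.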

In FIG. 7A, Number of Classes is the number of groups.

Classical method: in the analysis of microarray data, the purpose of classification analysis (or discriminant analysis) is to find the significant genes that best discriminate between predefined treatments (for example, samples obtained from normal tissue and from carcinoma tissue). The discriminant analysis program implemented according to the present invention provides, as classical methods, (D)LDA (linear discriminant analysis) and (D)QDA (quadratic discriminant analysis), each together with CV (cross validation).

Two input files are required, one with training data and one with test data, and each can be formatted as shown in Table 4a below.

[index1]  [index2]  ...  [indexN]  [class]
32.2      43.2      ...  54.3      1
65.4      72.3      ...  24.3      2

The analysis result file reports the misclassification error (misclassification rate) and the group to which each gene was assigned; when the analysis is run together with CV, only the misclassification rate is output.

On the option input screen, DLDA runs the Diagonal Linear Discriminant Analysis method, DLDA(CA) performs cross validation to evaluate the Diagonal Linear Discriminant Analysis model, DQDA runs the Diagonal Quadratic Discriminant Analysis method, and DQDA(CA) performs cross validation to evaluate the Diagonal Quadratic Discriminant Analysis model.

Decision tree is the subprogram that generates a decision tree. A decision tree is an analysis method that charts decision rules in order to classify the class of interest into several subgroups or to perform prediction.

The input can be formatted as shown in Table 4b below. The gene values must be separated by spaces or commas (,), and the last column must contain the class to be predicted (the target variable). The class must be an integer starting from 1.

1st Gene  2nd Gene  ...       class
32.2      43.2      ...  54.3  1
65.4      72.3      ...  24.3  2

On the option input screen of FIG. 7B, Import data shows the uploaded data. Number of class is the number of classes, and Number of Genes is the number of genes, that is, the number of columns in the data excluding the class column (number of columns - 1). N-fold cross validation sets N, the number of parts into which the data are divided for cross validation. Pruning rate controls pruning, which is performed when the tree has grown unnecessarily large. For example, with a pruning rate of 0.3, only 70% of the data are used to build the tree and the remaining 30% are used for pruning.
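The effect of the pruning rate on the data can be sketched as a simple split. The function names are hypothetical, and the random shuffle with a fixed seed is an assumption; the source only states the proportions.

```python
import random


def split_for_pruning(n_samples, pruning_rate, seed=0):
    """Split sample indices into a tree-growing set and a pruning set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # assumed: random assignment
    n_prune = round(n_samples * pruning_rate)   # e.g. 0.3 -> 30% for pruning
    return indices[n_prune:], indices[:n_prune]


def number_of_genes(n_columns):
    """Number of Genes is the column count minus the class column."""
    return n_columns - 1


# With a pruning rate of 0.3, 7 of 10 samples grow the tree and 3 prune it.
grow_idx, prune_idx = split_for_pruning(10, 0.3)
```

The growing set would be passed to the tree-building step and the pruning set to the step that removes branches which do not improve accuracy on unseen samples.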

The result in FIG. 7C is for a case with three classes. The first split is on the 1671st gene, with 59.8 as the split point. This first split alone identifies every member of the first class. The second split occurs on the right-hand side of the first split (where the 1671st gene is greater than 59.8) and is on the 1727th gene, with 324.8 as the split point. In the tree, a circle marks a node where a split occurs, and the number in [] below the circle identifies the gene on which the split is made. The threshold value is written at the lower right together with an inequality sign. Nodes with no further split are drawn as squares. The numbers inside the circles and squares give the distribution of the classes.

When the present invention is implemented in software, the upper menu that displays experimental design, standardization, estimation and testing, cluster analysis, and classification, as well as the other interfaces, can be implemented using JavaServer Pages (JSP).

When the present invention is implemented in software, each component (experimental design, standardization, estimation and testing, cluster analysis, and classification) can be implemented as a separate Java applet sharing a database. The single slide standardization, dye exchange standardization, multiple slide standardization, and single slide batch methods included in standardization can likewise be implemented as separate Java applets. Similarly, the Newton method, the B-statistic method, the adjusted significance probability (adjusted p-value) method (ADP), SAM, and EBAM included in estimation and testing can be implemented as separate Java applets, as can the hierarchical cluster analysis, k-means cluster analysis, principal component analysis, self-organizing map, and gene shaving included in cluster analysis.

The present invention can be executed through software. When executed as software, the constituent means of the present invention are code segments that perform the necessary tasks. The program or code segments can be stored in a processor-readable medium or transmitted as a computer data signal combined with a carrier wave over a transmission medium or communication network.

A computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, DVD±ROM, DVD-RAM, magnetic tape, floppy disks, hard disks, and optical data storage devices. The computer-readable recording medium can also be distributed over computer devices connected by a network, so that computer-readable code is stored and executed in a distributed manner.

The present invention has been described with reference to one embodiment shown in the drawings, but this embodiment is merely illustrative, and those of ordinary skill in the art will understand that various modifications and variations of the embodiment are possible. Such modifications should nevertheless be regarded as falling within the technical scope of protection of the present invention. The true technical scope of protection of the present invention should therefore be determined by the technical spirit of the appended claims.

As described above, according to the present invention, the whole process of microarray data analysis is handled in a single integrated system. Sharing one database makes systematic statistical analysis of microarray scanning image data possible; applying the statistical method best suited to each experimental situation minimizes the false positive and false negative error rates caused by inappropriate methods and raises the reliability of research results; and a convenient, user-friendly interface makes the analysis easy to carry out.

Claims

When experimental design is selected from among the items of an upper menu consisting of experimental design, standardization, estimation and testing, cluster analysis, and classification, displaying an interface for selecting one experimental design from among dye exchange, reference design, and loop design, and an interface for inputting design parameters to be used in the selected experimental design, the design parameters including the number of arrays, whether there is replication between arrays, whether there is replication within an array, and a treatment list; and

Generating an ANOVA design matrix for the selected experimental design using the input design parameters.

The method of claim 1, wherein when the standardization of the items is selected,

A microarray integrated analysis method comprising generating a standardization result by sequentially performing single slide standardization, single batch standardization, and multiple slide standardization using input slide information.

The method of claim 2,

Displaying an interface for selecting any one of average, intensity, print tip, or scale normalization methods when performing the single slide normalization, and normalizing the single slide with the selected method; And

A microarray integrated analysis method comprising the step of performing the standardization for a dye exchange experiment by comparing the results of experiments performed under the same conditions with Cy3 and Cy5 reversed.

The method of claim 3, wherein when the estimation and testing of the items is selected,

Estimating a significant gene in a microarray experiment by performing an analysis of either the Newton analysis or the B-statistic analysis on the standardization result;

Performing either the SAM analysis or the EBAM analysis on the standardization result to search for significant genes in a repeated microarray experiment, and calculating the probability of false detection for the genes found; and

Performing adjusted significance probability (ADP) analysis on the standardization result to estimate significant genes in a microarray experiment, calculating the adjusted significance probability of the significant genes in the repeated microarray experiment, and displaying the adjusted significance probability. Microarray integrated analysis method characterized in that.

The method of claim 3, wherein when the cluster analysis of the items is selected,

Providing an interface for selecting any one of hierarchical clustering, k-means clustering, self-organizing map analysis, principal component analysis, or gene shaving analysis for genes or samples;

If hierarchical clustering is selected, displaying a dendrogram as the cluster analysis result;

Displaying a graph of the average profile of each cluster as the cluster analysis result when either the k-means cluster analysis or the self-organizing map analysis is selected;

If the principal component analysis is selected, displaying a graph in which the respective genes are plotted with the first principal component and the second principal component as the cluster analysis results; And

When the gene shaving analysis is selected, displaying, as the cluster analysis result, a graph showing the experiments and gene expression levels in each cluster, a graph showing the actual data and the permuted data, and a graph showing the gap statistic as a function of cluster size. Microarray integrated analysis method comprising the above steps.

The method of claim 3, wherein when the classification of the items is selected,

Using the standardization result, candidate genes with large variation between treatment groups and small variation within treatment groups are selected, classification is performed on the candidate genes, the group of interest is predicted according to the design parameters, and a misclassification rate is output, and

The classification is any one of linear discriminant analysis, quadratic discriminant analysis, cross validation of linear discriminant analysis, or cross validation of quadratic discriminant analysis.

The method of claim 6,

Wherein the step of predicting the group of interest according to the design parameters and outputting a misclassification rate

Includes charting decision rules for the standardization result, classifying into a plurality of subgroups, and generating a decision tree.