WO2005034003A1 - Novel method for analyzing data collected by microarray experiment and the like - Google Patents

Info

Publication number
WO2005034003A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
genes
log
average
value
Prior art date
Application number
PCT/JP2003/015637
Other languages
French (fr)
Japanese (ja)
Inventor
Toshimichi Ikemura
Shigehiko Kanaya
Ken-Nosuke Wada
Yasushi Masuda
Tatsuya Nishi
Naotake Ogasawara
Kazuo Kobayashi
Original Assignee
Japan As Represented By The President Of National Institute Of Genetics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Japan As Represented By The President Of National Institute Of Genetics
Publication of WO2005034003A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/30 Microarray design
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present invention relates to a novel analysis method for data obtained from gene expression analysis such as a microarray experiment and a macroarray experiment, and a program for executing the method.
  • for example, gene groups expressed in a time-specific manner during the developmental stage or growth/division stage, gene groups expressed in a tissue/organ-specific or disease/pathology-specific manner, and gene groups activated by external stimuli such as chemical substances, heat, and light
  • gene expression can be analyzed comprehensively and exhaustively under these and other conditions, including gene groups regulated downstream of transcription factors.
  • the gene expression information (expression profile) obtained by such microarray experiments and the like contributes to a comprehensive understanding of the gene expression regulation mechanism, and further to the elucidation of life phenomena. Its usefulness is not limited to this: it can also contribute, for example, to the development of new drugs based on the relationship between diseases and genes (genomic drug discovery), to new testing and diagnostic methods, and to the establishment of preventive and therapeutic methods.
  • the most frequently used microarray experiments are broadly classified into (1) a method in which labeled cRNA is hybridized with probes on the chip using an Affymetrix-type chip, and (2) a method in which labeled cDNA is hybridized with probes on a glass slide using a spot-type array.
  • both methods have in common that changes in the expression levels of individual genes are evaluated based on mRNA extracted from cells or the like.
  • fluorescently labeled cDNA is prepared from the mRNA extracted for the control experiment and from the mRNA extracted for the target experiment, and each is hybridized with the large number of probes formed on the glass slide.
  • the measured signal intensity (measured value before subtracting the background intensity) of each gene should ideally be at least the background intensity, but in practice some of the measured values were lower than the background intensity. Even if the intensity is higher than the background intensity, if the measured signal intensity is low, the quantitativeness of the logarithmic ratio log[x_T(k)/x_c(k)] cannot be guaranteed. However, the extent to which quantitativeness is guaranteed was determined subjectively or empirically.
  • the expression level (fluorescence intensity) of most genes was almost the same in the control experiment (c) and the target experiment (T), and the expression level changed for some genes.
  • in terms of the logarithmic ratio log[x_T(k)/x_c(k)], this means that the majority of genes are distributed near log[x_T(k)/x_c(k)] = 0.
  • the logarithmic ratio often includes a bias error depending on the average intensity on the horizontal axis.
  • among the sets of adjacent genes on the same DNA strand in a microbial genome, a set of adjacent genes transcribed into the same mRNA is called a transcription unit, and identifying transcription units is very important for understanding the mechanism of gene expression control in the genome. A method for predicting and estimating transcription units by correlating the expression profiles of a plurality of genes, based on data obtained from a plurality of microarray experiments, would therefore be industrially useful, but no such method has yet been developed.
  • the present invention has been made in view of the above problems, and an object of the present invention is to provide a novel method for analyzing data obtained from a large-scale gene expression analysis such as a microarray experiment or a macroarray experiment.
  • more specifically, the object is to provide (1) a method for statistically evaluating the logarithmic ratio log[x_T(k)/x_c(k)] and clearly separating the range in which its quantitativeness is guaranteed from the range in which it is not; (2) a method for correcting and reducing bias errors in the logarithmic ratio that depend on the average intensity; (3) a method for predicting and estimating transcription units on a microbial genome based on data obtained from multiple microarray experiments; and a program or the like for executing these methods on a computer.
  • the first data analysis method is a method for analyzing data obtained as a result of a microarray experiment, a macroarray experiment, and other similar gene expression analyses to solve the above-mentioned problems.
  • the range in which quantitativeness is recognized in the logarithmic ratio between the signal intensity x_c(k) of gene (k) in the control experiment and the signal intensity x_T(k) of gene (k) in the target experiment is determined by the following steps (a) to (c).
  • the first threshold is preferably set to u×SD_c (where u is an arbitrary positive number and SD_c is a statistic represented by the following equation (1)), and the second threshold is preferably set to u×SD_T (where u is an arbitrary positive number and SD_T is a statistic represented by the following equation (2)).
  • the second data analysis method of the present invention is a method of analyzing data obtained as a result of a microarray experiment, a macroarray experiment, and other similar gene expression analyses to solve the above-mentioned problems.
  • the bias error depending on the average intensity in the logarithmic ratio between the signal intensity x_c(k) of gene (k) in the control experiment and the signal intensity x_T(k) of gene (k) in the target experiment is corrected by the following steps (a) to (c).
  • when the average intensity Av[k] of the k-th gene lies between Av_{s-1} (the mean of the minimum and maximum values of the average intensity in the (s-1)-th section) and Av_s (the mean of the minimum and maximum values of the average intensity in the s-th section), the reference intensity crit(k) at the average intensity Av[k] is obtained by linear interpolation using the average values PAv(s-1) and PAv(s).
  • when the absolute value of the corrected logarithmic ratio LOG[k] of the k-th gene is equal to or greater than the threshold value Th set for that gene, the k-th gene is preferably judged to be a signal pair with a statistically significant change in expression level. Further, it is preferable to set the threshold value Th by the following steps (a) to (c).
  • SD[k] = S_mth[u+1] + (Av[k] - Av_{u+1}) (S_mth[u+1] - S_mth[u]) / (Av_{u+1} - Av_u)
  • let SD[k] be the threshold Th.
  • the first data analysis method and the second data analysis method of the present invention can be used to classify signals obtained by experiments into a plurality of categories.
  • the third data analysis method of the present invention is a method of analyzing data obtained as a result of a microarray experiment, a macroarray experiment, and other similar gene expression analyses to solve the above-mentioned problems.
  • the transcription unit is estimated by the following steps (a) to (d) based on the gene expression profiles and the genome information of these genes.
  • step (a) it is preferable to calculate the correlation coefficient by the following equation (4).
  • N_st is the number of experiments, out of M experiments, for which values were obtained for both of the two genes s and t.
  • each experiment is indexed by j, where X_js is the expression profile of the j-th experiment for gene s, X_jt is the expression profile of the j-th experiment for gene t, and the averages of the expression profiles over the N_st experiments are taken for gene s and for gene t, respectively.
  • the program of the present invention causes a computer to execute at least one of the first to third data analysis methods in order to solve the above problems.
  • the recording medium of the present invention is a computer-readable recording medium on which the program of the present invention is recorded.
  • a data analysis device of the present invention comprises the program of the present invention described above and a computer that executes at least one of the first to third data analysis methods using the program.
  • (3) it becomes possible to predict the transcription units on a microbial genome with high accuracy by utilizing data obtained from microarray experiments and the like.
  • FIG. 1 is a diagram for explaining that the present invention corrects a bias error in a logarithmic ratio of signal intensities of a control experiment and a target experiment.
  • FIG. 2 is a diagram illustrating that signal data obtained by a microarray experiment or the like can be classified into a plurality of groups according to the present invention.
  • FIG. 3 is a diagram illustrating measurement conditions in a microarray experiment of the present example.
  • FIG. 4 is a graph comparing the standard deviations of signals for which quantitativeness of the logarithmic ratio was recognized according to the present invention and signals for which it was not.
  • FIG. 5 is a graph comparing the standard deviation of the logarithmic ratio corrected according to the present invention with that of the uncorrected logarithmic ratio.
  • FIG. 6 is a graph comparing the standard deviation of the logarithmic ratio corrected according to the present invention with that of the uncorrected logarithmic ratio, for signal data obtained under different experimental conditions in the control experiment and the target experiment.
  • negative values of s_c(k) - b_c(k) and s_T(k) - b_T(k) can be regarded as scatter in values that should originally be zero.
  • the statistics SD_c and SD_T for evaluating this variation are represented by the following equations (1) and (2), respectively.
  • Xc (k) is considered to be 0 when
  • XT (k) is considered to be 0 when
  • in these cases the log ratio for the k-th gene cannot be expressed. That is, if any of these conditions is satisfied, it is determined that quantitativeness cannot be guaranteed in the logarithmic ratio.
  • the logarithmic ratio often includes a bias error depending on the average intensity on the horizontal axis.
  • This bias error is reduced by the following method.
  • the average intensity is divided into multiple sections at a fixed step size.
  • the average value PAv(s) of the log ratio is calculated by the following equation (5).
  • N_set = s × (N_start - N_final) / (1 - S_total) + (N_final - N_start × S_total) / (1 - S_total) ... (A)
  • here N_start > N_final. Only when the actual number of samples in a section is equal to or greater than the set value (A) is PAv(s) at the average intensity Av_s = [min(s) + max(s)] / 2 used as a representative value.
  • otherwise, PAv(s) is obtained by linear interpolation between the neighbouring points (Av_q, PAv(q)) and (Av_t, PAv(t)).
  • the reference intensity crit(k) at the average intensity Av[k] is obtained from PAv(s-1) and PAv(s) by linear interpolation using the following equation (6).
  • the result of correcting the bias error of the log ratio using the corrected log ratio LOG[k] is shown in the lower graph of Fig. 1.
  • SD_crit is set by the user based on the data obtained as a result of the experiment.
  • the signals of the microarray experiments are classified into the following four groups A to D.
  • Groups A and B are signal conditions that can be used for quantitative analysis, such as searching for genes with similar expression profiles in multiple microarray experimental data.
  • groups A to D can be used for qualitative analysis when searching for genes that have a significant difference in the two experiments, the control experiment and the target experiment (see also Fig. 2; EF and F).
  • Group A: Signal pairs for which quantitativeness of the log ratio is guaranteed and which show a statistically significant change in expression level.
  • Group B: Signal pairs for which quantitativeness of the log ratio is guaranteed but which do not show a statistically significant change in expression level.
  • Group C: Signal pairs for which quantitativeness of the log ratio is not guaranteed because one of the signals is regarded as 0, but for which a difference between the two signals is recognized.
  • Group D: Signal pairs for which both signals are regarded as 0, so that quantitativeness of the log ratio is not guaranteed and no difference between the two signals is recognized.
  • in bacterial gene expression, among the sets of contiguous genes on the same DNA strand of the genome, a set of contiguous genes transcribed into the same mRNA is called a transcription unit. Predicting this transcription unit is very important from the viewpoint of understanding the control of gene expression in the genome.
  • a method for estimating the transcription unit based on the expression amount or the expression change amount of various genes under various conditions represented by microarray data is described below.
  • a set of genes continuously arranged in the same direction on the genome is called a directon. When multiple genes belong to the same transcription unit, these genes are transcribed as the same mRNA, and therefore have a positive correlation to their expression profiles in theory. Therefore, the correlation of the microarray expression profile between the genes belonging to the same directon is calculated.
  • consider M kinds of microarray experiments, each using a microarray that yields expression profiles for N genes.
  • the expression profile of each gene can be represented by an N×M matrix as follows.
  • the expression profile X_s of the s-th gene can be described as follows using an M-dimensional vector.
  • a transcription unit is estimated based on the expression profile and genomic information by the following steps 1 to 4.
  • [Step 1: Calculation of the correlation coefficient between genes belonging to the same directon]
  • the correlation coefficient r(s, t) between the expression profiles X_s and X_t of the s-th and t-th genes on the same directon is calculated.
  • s = 1, 2, ..., N
  • t = 1, 2, ..., N, where N is the total number of genes belonging to the directon of interest.
  • a directon is a set of genes that are consecutively positioned on the same DNA strand.
  • [Step 3: Search in the 5' direction for genes with correlated expression, taking adjacency into account] Perform the same operation as in Step 2 above for the (s-1)-th, (s-2)-th, ... genes.
  • the group of genes sandwiched between the gene with the lowest rank and the gene with the highest rank is estimated to be one transcription unit.
  • the present invention is not limited to the above-described methods (1) to (3) of the present embodiment, and various changes can be made within the scope of the present invention.
  • the values of the threshold value, the reference value, and the like used in the above methods (1) to (3) are arbitrary, and appropriate values may be set according to the application and purpose. Further, additional steps may be added to the steps (steps) of the above methods (1) to (3).
  • the present invention relates to a novel analysis method for data obtained from gene expression analysis such as a microarray experiment and a macroarray experiment, and a program for executing the method. It enables appropriate evaluation and use of such data, so that the invention can be used not only as a research tool but can also contribute to the development of new drugs based on the relationship between diseases and genes (genomic drug discovery), to the establishment of new testing and diagnostic methods, and to preventive methods.
  • by using the method of the present invention, it is possible to choose an analysis method according to the signal intensity of a microarray experiment. For example, when searching for genes with similar expression profiles in data from several microarray experiments, only genes that satisfy the conditions of Groups A and B described above are analyzed. When simply searching for genes that show a significant change in one microarray experiment, genes that satisfy the conditions of Groups A to D described above are targeted. Thus, the range of the target gene group suitable for quantitative analysis or qualitative analysis can be determined. This can improve the accuracy of the multivariate analysis usually used for microarray analysis.
  • the program of the present invention causes a computer to execute the method of the present invention (for example, any one of the methods (1) to (3)), and the recording medium of the present invention records the program of the present invention.
  • such recording media include, but are not limited to, magnetic recording media such as flexible disks, hard disks, and magnetic tapes; optical recording media such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, and DVD-RW; electrical recording media such as RAM and ROM; and magneto-optical recording media such as MO.
  • the “data analysis device” of the present invention comprises the program of the present invention and a computer that executes the method of the present invention (for example, any one of the methods (1) to (3)) by means of the program.
  • the computer basically has a configuration capable of executing the method of the present invention, and includes an input device, a data storage device, a central processing unit, and an output device.
  • [Example 1: A method for detecting signals whose logarithmic ratio (relative value) is quantitative]
  • the signal values xc (k) and XT (k) in the control experiment and the target experiment are
  • Condition 1 Logarithmic ratio is guaranteed to be quantitative.
  • Condition 2-1 One of the signals is 0, so the logarithmic ratio cannot be guaranteed quantitatively, but a difference is obtained between the two signals.
  • Condition 2-2 Since both signals are 0, quantification of the log ratio is not guaranteed and it is judged that there is no difference between the two signals.
  • FIG. 4 shows the standard deviation of the log ratio under the above three conditions calculated for the same mRNA sample extracted by culturing E. coli in LB medium.
  • (A) and (B) are the results of calculating the standard deviation of the log ratio under the above three conditions for mRNA extracted twice independently in the logarithmic growth phase. In an ideal system without experimental errors, these standard deviations are all zero. In practice, however, errors cause data variations, which are quantified by the standard deviation. The magnitude of this error is clearly larger in Condition 2-1 and Condition 2-2 than in Condition 1. It is therefore concluded that it is difficult to evaluate changes in expression level quantitatively by the logarithmic ratio under Conditions 2-1 and 2-2.
  • Group A Signal pairs that guarantee quantitativeness in the log ratio and that produce statistically significant changes in expression levels.
  • Group B Signal pairs that guarantee quantitativeness in the log ratio but do not produce statistically significant changes in expression levels.
  • the standard deviation of the log ratio and the corrected log ratio with respect to the origin was determined.
  • that the bias error is reduced by the correction method of the present invention can be confirmed by the fact that the standard deviation with respect to the origin of the corrected log ratio becomes smaller than the standard deviation of the log ratio when no correction is performed.
  • Figure 5 shows the standard deviation calculated for the uncorrected log ratio and the corrected log ratio (LOG[k]) of each signal that satisfies condition A or B, for the same mRNA sample extracted by culturing E. coli in LB medium.
  • the standard deviation of the corrected log ratio with respect to the origin was smaller than the standard deviation of the log ratio without correction.
  • (A) and (B) are the results of performing the same experiment on mRNA independently extracted twice in the logarithmic growth phase and calculating the standard deviation.
  • Condition B is a signal pair for which quantitativeness of the log ratio is guaranteed but which shows no statistically significant change in expression. Therefore, even when mRNA is extracted under different conditions in the control experiment and the target experiment, if the bias error is reduced by the correction method of the present invention, the standard deviation of the corrected logarithmic ratio with respect to the origin should be smaller than the standard deviation of the uncorrected logarithmic ratio for signals satisfying Condition B. To confirm this, Condition B was determined based on the expression intensities measured in the comparisons shown in Fig. 3, namely between a defective strain and the wild-type strain, or between a specific time and an arbitrary time.
  • a transcription unit is a group of genes that are transcribed into the same mRNA, and by finding genes that are on the same strand and adjacent to each other on the genome and have a positive correlation in the expression profile, A group of genes in the same transcription unit can be found.
  • when the expression profile is measured by a microarray, the expression profiles of several thousand genes can be measured in one experiment, but there are many genes whose expression profile cannot be measured depending on the conditions.
  • this estimating method can estimate the transcription unit even when the expression data of the i-th and j-th genes are not complete for all M experimental conditions.
  • the transcription unit can be estimated even when the correlation coefficient itself of the adjacent gene is missing. That is, the transcription unit is estimated and predicted as follows while compensating for these two kinds of missing data.
  • genes having the same transcription direction consecutively on the genome are numbered 1, 2, ..., i, j, ..., n in order from the 5' side.
  • here, i < j. It is assumed that the expression profiles of the i-th and j-th genes are measured under s_ij pairs of experimental conditions, and the correlation coefficient r(i, j, s_ij) is obtained. In this method, the following Pearson's correlation equation was used.
  • r(s, t, N_st) indicates that, for the s-th and t-th genes, values could be obtained for N_st pairs out of M experiments.
  • for n genes, correlation coefficients corresponding to n(n-1)/2 pairs are determined.
  • if the correlation coefficient r(i, j, s_ij) is larger than the reference value r(s_ij, α), it is guaranteed that r(i, j, s_ij) represents a statistically significant positive correlation.
  • the reference correlation value r(s_ij, α) is
  • t_α is the t value at the significance level α in the statistical test, and can be obtained from a t-distribution table.
  • if r(i, j, s_ij) is a significant positive correlation, it means that the i-th and j-th genes may be in the same transcription unit. That is, the j-i+1 genes from the i-th to the j-th may be in the same transcription unit.
  • the transcription unit was estimated by the following steps based on these three conditions (1) to (3). (Step 1) Genes having the same transcription direction consecutively on the genome are numbered 1, 2, ..., i, j, ..., n in order from the 5' side. Here, i < j. For the i-th and j-th genes, the correlation coefficient r(i, j, s_ij) based on the expression profiles for the s_ij pairs of experimental conditions is determined. For n genes, correlation coefficients corresponding to n(n-1)/2 pairs are determined.
  • (Step 2) The statistically significant positive correlation coefficients are selected from the correlation coefficients r(i, j, s_ij).
  • if the correlation coefficient r(i, j, s_ij) is larger than the reference value r(s_ij, α), it is guaranteed that there is a statistically significant positive correlation.
  • (Step 3) Gene pairs having a negative correlation coefficient r(i, j, s_ij) are identified. If such a gene pair is (i, j), then even if the expression profiles of a gene pair (x, y) with x ≤ i, j ≤ y have a positive correlation, the two genes x and y are judged to be in different transcription units and are not included in the same transcription unit.
  • (Step 4) Taking the u-th gene as a reference, if a significant positive correlation with the expression profile of the u-th gene is obtained for the k1 genes u-1, u-2, ..., u-k1 and for the k2 genes u+1, u+2, ..., u+k2, and no gene pair with a negative correlation coefficient was found among them in Step 3, these k1+k2+1 genes are estimated to belong to the same transcription unit.
  • the set of genes belonging to the q-th transcription unit is denoted T_q, and its member genes are denoted t(T_q)_1, t(T_q)_2, ...
  • the set of genes belonging to the transcription unit predicted by the present estimation method with the u-th gene as reference is denoted P_u.
  • the genes belonging to P_u are denoted p(P_u)_1, p(P_u)_2, ..., p(P_u)_N[P_u], respectively.
  • the set T_q to which the u-th gene belongs is referred to as T_q(u).
  • when the u-th gene belongs to T_q, the elements of the set P_u and the set T_q(u) ideally coincide. The number of elements common to P_u and T_q(u) is denoted N[P_u ∩ T_q(u)], and the numbers of elements of P_u and T_q(u) are denoted N[P_u] and N[T_q(u)], respectively.
  • E[P_u] and E[T_q(u)] are both greater than or equal to 0.
  • when E[P_u] > 0, the predicted transcription unit contains more genes than the known transcription unit. That is, it is a so-called over-prediction.
  • the present invention relates to a novel analysis method for data obtained from gene expression analysis such as a microarray experiment and a macroarray experiment, and a program for executing the method.
  • appropriate evaluation of data obtained by experiments and the like, and new uses of the data, become possible. The invention can be used not only as a research tool but can also contribute, for example, to the development of new drugs based on the relationship between diseases and genes (genomic drug discovery), and to new testing, diagnostic, preventive, and therapeutic methods.
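
The correlation step of the transcription unit estimation described above (Pearson correlation over only those experiments in which both genes have values, compared against a reference value derived from a t value) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `pairwise_pearson` and `critical_r` are hypothetical, missing measurements are assumed to be encoded as `None`, and the reference value is assumed to follow the standard relation r = t / sqrt(t^2 + n - 2) between a t value with n - 2 degrees of freedom and a correlation coefficient, since the patent's own expression is not reproduced in this text. The t value itself is supplied by the caller, for example from a t-distribution table.

```python
import math

def pairwise_pearson(xs, xt):
    """Pearson correlation of two expression profiles, using only the
    experiments in which both genes have a value (None marks a missing one).
    Returns (r, n_st); r is None when it cannot be computed."""
    pairs = [(a, b) for a, b in zip(xs, xt) if a is not None and b is not None]
    n_st = len(pairs)
    if n_st < 3:  # too few shared experiments for a meaningful test
        return None, n_st
    mean_s = sum(a for a, _ in pairs) / n_st
    mean_t = sum(b for _, b in pairs) / n_st
    cov = sum((a - mean_s) * (b - mean_t) for a, b in pairs)
    var_s = sum((a - mean_s) ** 2 for a, _ in pairs)
    var_t = sum((b - mean_t) ** 2 for _, b in pairs)
    if var_s == 0 or var_t == 0:  # constant profile: correlation undefined
        return None, n_st
    return cov / math.sqrt(var_s * var_t), n_st

def critical_r(t_alpha, n_st):
    """Reference correlation value for n_st shared experiments, assuming the
    standard t-to-r relation with n_st - 2 degrees of freedom."""
    return t_alpha / math.sqrt(t_alpha ** 2 + n_st - 2)

# Example: two genes measured in five experiments, each missing one value.
r, n_st = pairwise_pearson([1, 2, None, 4, 5], [2, 4, 6, None, 10])
significant = r is not None and r > critical_r(6.314, n_st)
```

Computing the reference value per gene pair, rather than once for all pairs, reflects the point made in the text that the number of shared experiments differs from pair to pair when measurements are missing.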

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Signal Processing (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Complex Calculations (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

A method for analyzing data comprising the steps of (1) determining a threshold on the basis of data values for which the measured signal intensity minus the background intensity is negative, and using the threshold to determine the range in which the quantitativeness of the logarithmic ratio between the signal value in a control experiment and that in a target experiment is assured, (2) dividing the average intensity into sections, calculating the average of the logarithmic ratios between the above-mentioned signal values in each section, and correcting the bias that appears when the logarithmic ratio is plotted against the average intensity, and (3) determining whether or not there is any correlation between the expression profiles of adjacent genes obtained by a microarray experiment or the like, and predicting the transcription units on a microorganism genome with high accuracy.
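
Step (2) of the abstract can be illustrated with a short sketch. It assumes the average intensity and log ratio of each gene have already been computed, uses the midpoint of the minimum and maximum average intensity in each section as the representative point (as in the definitions above), and subtracts a linearly interpolated reference value from each log ratio. The function name `correct_bias`, the fixed `min_count` rule (a simplification of the section-dependent sample count in the patent), and the clamping outside the first and last representative points are assumptions made for illustration only.

```python
from bisect import bisect_right

def correct_bias(av, log_ratio, step, min_count=3):
    """Return log ratios corrected for intensity-dependent bias.

    av        : average intensity of each gene
    log_ratio : log ratio of each gene (same order as av)
    step      : fixed width of the average-intensity sections
    min_count : hypothetical minimum number of genes a section needs
                before its mean is used as a representative value
    """
    lo = min(av)
    sections = {}
    for a, r in zip(av, log_ratio):
        sections.setdefault(int((a - lo) // step), []).append((a, r))

    # One representative point (Av_s, PAv(s)) per sufficiently populated section.
    points = []
    for s in sorted(sections):
        members = sections[s]
        if len(members) < min_count:
            continue
        a_vals = [a for a, _ in members]
        av_s = (min(a_vals) + max(a_vals)) / 2
        pav_s = sum(r for _, r in members) / len(members)
        points.append((av_s, pav_s))
    if not points:
        return list(log_ratio)  # nothing to correct against

    xs = [p[0] for p in points]

    def crit(a):
        # Assumption: outside the covered range, reuse the nearest end value.
        if a <= xs[0]:
            return points[0][1]
        if a >= xs[-1]:
            return points[-1][1]
        i = bisect_right(xs, a)
        (x0, y0), (x1, y1) = points[i - 1], points[i]
        return y0 + (a - x0) * (y1 - y0) / (x1 - x0)

    return [r - crit(a) for a, r in zip(av, log_ratio)]

corrected = correct_bias(
    [0.0, 0.2, 0.4, 1.0, 1.2, 1.4],
    [0.5, 0.5, 0.5, 1.0, 1.0, 1.0],
    step=1.0,
)
```

After correction, genes whose expression did not change should scatter around zero regardless of their average intensity, which is the property the standard-deviation comparisons in the figures are meant to check.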

Description

明 細 書 マイクロアレイ実験等から得られるデータの新規解析方法 技術分野  Description New analysis method for data obtained from microarray experiments, etc.
本発明は、 マイクロアレイ実験やマクロアレイ実験等の遺伝子発現解析から得 られるデータの新規解析方法、及び同方法を実行するためのプログラム等に関す るものである。 背景技術  The present invention relates to a novel analysis method for data obtained from gene expression analysis such as a microarray experiment and a macroarray experiment, and a program for executing the method. Background art
Microarray experiments, macroarray experiments, and the like can rapidly yield a vast amount of gene expression information from a single experiment. They allow exhaustive and comprehensive analysis of gene expression under a variety of conditions, for example: gene groups expressed at specific times during developmental stages or growth and division stages; gene groups expressed in a tissue- or organ-specific, or disease- or pathology-specific, manner; gene groups activated by external stimuli such as chemical substances, heat, and light; and gene groups regulated downstream of transcription factors. The gene expression information (expression profiles) obtained from such experiments contributes to a comprehensive understanding of the mechanisms that regulate gene expression and, in turn, to the elucidation of life phenomena. Beyond this, it can also contribute to, for example, the development of new drugs based on the relationship between diseases and genes (genomic drug discovery) and the establishment of new tests, diagnostic methods, and preventive and therapeutic methods.
The microarray experiments most widely used at present fall broadly into two types: (1) methods in which labeled cRNA is hybridized to the probes on an Affymetrix-type chip, and (2) methods in which labeled cDNA is hybridized to probes on a glass slide using a spotted array. Both evaluate changes in the expression level of individual genes on the basis of mRNA extracted from cells or the like. For example, in an experiment using labeled cDNA, fluorescently labeled cDNA is prepared from the mRNA extracted for the control experiment and from the mRNA extracted for the target experiment, and each is hybridized to the large number of probes formed on the glass slide. The fluorescence at each probe position is then measured with a scanner, and the change in expression of each gene k is evaluated by the log ratio log[x_T(k)/x_c(k)] between x_c(k), the expression level (signal intensity) of gene k in the control experiment (c), and x_T(k), the expression level of gene k in the target experiment (T).
Microarray experiments are an effective method that will become increasingly important for rapid, large-scale gene expression analysis. However, the data they produce contain measurement errors and bias errors arising from various experimental causes, which has made it difficult to evaluate changes in gene expression appropriately. For example, in both the control and target experiments, the measured signal intensity of each gene (the measured value before the background intensity is subtracted) should ideally be at least the background intensity, yet in practice the data contain measurements below the background intensity. Moreover, even for signals above the background intensity, the log ratio log[x_T(k)/x_c(k)] is no longer quantitatively reliable when the measured signal intensity is low, and the point up to which it is treated as reliable has been decided subjectively or empirically.
In a microarray experiment, the expression levels (fluorescence intensities) of most genes are nearly identical in the control experiment (c) and the target experiment (T), and a change in expression is observed for only some genes. This means that, when the log ratio is taken, most genes are distributed near log[x_T(k)/x_c(k)] = 0. In practice, however, as shown in the upper graph of Fig. 1, the log ratio often contains a bias error that depends on the average intensity plotted on the horizontal axis.
As a conventional correction, the median MD of log[x_T(k)/x_c(k)] is computed over all genes spotted on the microarray (k = 1, 2, ..., N, where N is the total number of genes), and log[x_T(k)/x_c(k)] - MD is calculated for every gene (see Yoshihide Hayashizaki, "DNA Microarray Experiment Manual" (必ずデータが出る DNA マイクロアレイ実験マニュアル), Yodosha, 2000). This conventional correction, however, cannot reduce the average-intensity-dependent bias error described above.
In microarray experiments, as noted above, the change in gene expression intensity between the control experiment and the target experiment is evaluated as a ratio (log ratio). The quantitative reliability of this ratio affects searches for genes with similar expression profiles across multiple microarray experiments.
Among the sets of adjacent genes on the same DNA strand of a microbial genome, a set of adjacent genes transcribed into a single mRNA is called a transcription unit. Identifying transcription units is very important for understanding the mechanisms that control gene expression in the genome. A method that predicts or estimates transcription units from the correlation between the expression profiles of multiple genes, using data obtained from multiple microarray experiments, would therefore be industrially useful, but no such method has yet been developed.

Disclosure of the Invention
The present invention has been made in view of the above problems. Its object is to provide a novel method for analyzing data obtained from large-scale gene expression analyses such as microarray experiments and macroarray experiments. More specifically, it provides (1) a method that statistically and clearly separates the range in which the log ratio log[x_T(k)/x_c(k)] is quantitatively reliable from the range in which it is not; (2) a method for correcting and reducing the average-intensity-dependent bias error in the log ratio; (3) a method for predicting or estimating transcription units on a microbial genome from data obtained in multiple microarray experiments or the like; and programs and the like for executing these methods on a computer.
To solve the above problems, the first data analysis method of the present invention is a method for analyzing data obtained from a microarray experiment, a macroarray experiment, or a similar gene expression analysis, characterized in that the range in which the log ratio between the signal intensity x_c(k) of gene k in the control experiment and the signal intensity x_T(k) of gene k in the target experiment is regarded as quantitatively reliable is determined by the following steps (a) to (c).
(a) Obtain, for each gene k, the measured signal intensity s_c(k) and background intensity b_c(k) in the control experiment, and the measured signal intensity s_T(k) and background intensity b_T(k) in the target experiment.
(b) From the data values for which s_c(k) - b_c(k) is negative, determine a first threshold defining the range in which the signal intensity x_c(k) is regarded as substantially zero; likewise, from the data values for which s_T(k) - b_T(k) is negative, determine a second threshold defining the range in which the signal intensity x_T(k) is regarded as substantially zero.
(c) Determine that the log ratio between x_c(k) and x_T(k) is quantitatively reliable when x_c(k) is at or above (or strictly above) the first threshold and x_T(k) is at or above (or strictly above) the second threshold.
In step (b) above, it is preferable to set the first threshold to u × SD_c (where u is an arbitrary positive number and SD_c is the statistic given by equation (1) below) and the second threshold to u × SD_T (where u is an arbitrary positive number and SD_T is the statistic given by equation (2) below).
[Equation (1), defining SD_c: image imgf000006_0001, not available as text]
[Equation (2), defining SD_T: image imgf000006_0002, not available as text]
(Here y_c(k) = s_c(k) - b_c(k) < 0 and y_T(k) = s_T(k) - b_T(k) < 0; k in equation (1) indexes the N_c signals for which s_c(k) - b_c(k) is negative, and k in equation (2) indexes the N_T signals for which s_T(k) - b_T(k) is negative. The base of log is an arbitrary positive number greater than 1.)
To solve the above problems, the second data analysis method of the present invention is a method for analyzing data obtained from a microarray experiment, a macroarray experiment, or a similar gene expression analysis, characterized in that the average-intensity-dependent bias error in the log ratio between the signal intensity x_c(k) of gene k in the control experiment and the signal intensity x_T(k) of gene k in the target experiment is corrected by the following steps (a) to (c).
(a) Divide the average intensity Av[k] (= {log[x_c(k)] + log[x_T(k)]}/2) of x_c(k) and x_T(k) into a plurality of intervals according to its value and, for the genes belonging to the s-th interval (u = 1, 2, ..., N_s, where N_s is the total number of genes in the s-th interval), compute the mean PAv(s) of the log ratio between x_c(u) and x_T(u).
(b) When the average intensity Av[k] of the k-th gene lies between Av_{s-1} (the midpoint between the minimum and maximum average intensity in the (s-1)-th interval) and Av_s (the midpoint between the minimum and maximum average intensity in the s-th interval), obtain the reference intensity crit(k) at Av[k] by linear interpolation between the means PAv(s-1) and PAv(s).
(c) Correct the log ratio log(x_T(k)/x_c(k)) of x_c(k) and x_T(k) by defining the corrected log ratio LOG[k] according to equation (3) below.
$$\mathrm{LOG}[k] = \log\frac{x_T(k)}{x_c(k)} - \mathrm{crit}(k) \qquad (3)$$
In the above data analysis method, when the absolute value of the corrected log ratio LOG[k] of the k-th gene is at or above (or strictly above) a threshold Th set for that gene, it is preferable to judge the k-th gene to be a signal pair showing a statistically significant change in expression. It is also preferable to set the threshold Th by the following steps (a) to (c).
(a) For the genes k belonging to the s-th interval, set SD_crit for LOG[k] of these genes, compute the standard deviation over the samples satisfying -SD_crit < LOG[k] < SD_crit, and denote it SD1[s].
(b) From the standard deviations SD1[s], compute the mean Smth[s] over five points, namely s and the two intervals on either side of it (s-2, s-1, s, s+1, s+2), and take it as the representative value at the average intensity Av_s (the midpoint between the minimum and maximum average intensity in the s-th interval).
(c) When Av[k] lies between the average intensities Av_u and Av_{u+1}, compute, by linear interpolation between Smth[u] and Smth[u+1],
$$SD[k] = Smth[u+1] + \frac{(Av[k] - Av_{u+1})\,(Smth[u+1] - Smth[u])}{Av_{u+1} - Av_u}$$
and take 2 × SD[k] as the threshold Th.
It is further preferable to combine the first and second data analysis methods of the present invention to classify the signals obtained in an experiment into a plurality of categories.
To solve the above problems, the third data analysis method of the present invention is a method for analyzing data obtained from a microarray experiment, a macroarray experiment, or a similar gene expression analysis, characterized in that transcription units are estimated from the expression profiles of a plurality of genes and the genome information of those genes by the following steps (a) to (d).
(a) Calculate the correlation coefficient between the expression profiles of two genes that are adjacent on the genome and located on the same nucleic acid strand.
(b) If, on the basis of the correlation coefficient, the expression profile of the s-th gene is judged to correlate significantly with that of the (s+1)-th gene adjacent to it on the 3' side, assign these genes to a transcription unit set. Next, if a significant correlation is obtained between the expression profiles of the s-th and (s+2)-th genes, assign the (s+2)-th gene to the same set. Repeat the same procedure for s+3, s+4, ... and stop when a significant correlation is no longer obtained.
(c) Judge, in the same way as in step (b), whether the expression profile of the s-th gene correlates significantly with those of the genes adjacent to it on the 5' side (the (s-1)-th, (s-2)-th, ...).
(d) Within the set obtained in steps (b) and (c), estimate the group of genes lying between the lowest-ranked gene and the highest-ranked gene to be one transcription unit.
In step (a) above, it is preferable to calculate the correlation coefficient by equation (4) below.
[Equation (4), defining the correlation coefficient: image imgf000009_0001, not available as text]
(Here the two adjacent genes are denoted s and t. N_st is the number of experiments, out of M, in which values were obtained for both genes s and t, and each of these N_st experiments is indexed by j. X_js is the expression profile of gene s in the j-th experiment, X_jt is the expression profile of gene t in the j-th experiment, X̄_s is the mean expression profile of gene s over the N_st experiments, and X̄_t is the mean expression profile of gene t over the N_st experiments.)
To solve the above problems, the program of the present invention is characterized in that it causes a computer to execute at least one of the first to third data analysis methods.
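Steps (a) to (d) of the third method can be sketched in Python as follows. This is only an illustration under stated assumptions: because equation (4) is not available as text, the sketch assumes the ordinary Pearson correlation over the N_st experiments in which both genes have values, which is what the symbol definitions suggest. Missing measurements are assumed to be encoded as NaN, and the significance rule (`r_min`, `n_min`) is a hypothetical placeholder, since the text above does not state how significance is judged.

```python
import numpy as np

def pairwise_correlation(x_s, x_t):
    """Pearson correlation over experiments where both genes have a value.

    Assumption: this stands in for equation (4); NaN marks a missing value.
    Returns (r, n_st).
    """
    x_s = np.asarray(x_s, dtype=float)
    x_t = np.asarray(x_t, dtype=float)
    both = np.isfinite(x_s) & np.isfinite(x_t)   # the N_st usable experiments
    n_st = int(both.sum())
    if n_st < 2:
        return float("nan"), n_st
    d_s = x_s[both] - x_s[both].mean()
    d_t = x_t[both] - x_t[both].mean()
    denom = np.sqrt((d_s ** 2).sum() * (d_t ** 2).sum())
    if denom == 0.0:
        return float("nan"), n_st
    return float((d_s * d_t).sum() / denom), n_st

def estimate_transcription_unit(s, profiles, r_min=0.7, n_min=3):
    """Grow a transcription unit around gene s.

    profiles: 2-D array, one row per gene in 5'->3' order on one strand,
    one column per experiment. r_min and n_min are hypothetical cut-offs.
    Returns the (first, last) gene indices of the estimated unit.
    """
    profiles = np.asarray(profiles, dtype=float)

    def significant(a, b):
        r, n = pairwise_correlation(profiles[a], profiles[b])
        return n >= n_min and np.isfinite(r) and r >= r_min

    members = {s}
    t = s + 1                      # step (b): extend toward the 3' side
    while t < len(profiles) and significant(s, t):
        members.add(t)
        t += 1
    t = s - 1                      # step (c): extend toward the 5' side
    while t >= 0 and significant(s, t):
        members.add(t)
        t -= 1
    return min(members), max(members)   # step (d)
```

Note that, as in steps (b) and (c), every candidate is compared against the starting gene s rather than against its immediate neighbor, so the unit stops growing at the first gene that no longer correlates with s.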
The recording medium of the present invention is a computer-readable recording medium on which the program of the present invention is recorded. The data analysis device of the present invention comprises the program of the present invention and a computer that uses the program to execute at least one of the first to third data analysis methods.
According to the present invention, (1) the range in which the log ratio of signal intensities obtained from microarray experiments and the like is quantitatively reliable can be determined objectively by a statistical method; (2) the average-intensity-dependent bias error in the log ratio can be corrected and reduced, so that changes in gene expression can be evaluated from the log ratio more appropriately than before; and (3) transcription units on a microbial genome can be predicted with high accuracy using data obtained from microarray experiments and the like.
Further objects, features, and advantages of the present invention will be fully understood from the description below, and its benefits will become apparent from the following description with reference to the accompanying drawings.

Brief Description of Drawings
Fig. 1 illustrates how the present invention corrects the bias error in the log ratio of signal intensities between the control experiment and the target experiment.
Fig. 2 illustrates how the present invention classifies signal data obtained from microarray experiments and the like into a plurality of groups.
Fig. 3 illustrates the measurement conditions of the microarray experiments in this example.
Fig. 4 is a graph comparing the standard deviations of signals whose log ratio was found to be quantitatively reliable by the present invention with those of signals for which it was not.
Fig. 5 is a graph comparing the standard deviations of log ratios corrected by the present invention with those of uncorrected log ratios.
Fig. 6 is a graph comparing the standard deviations of log ratios corrected by the present invention with those of uncorrected log ratios, for signal data obtained under different experimental conditions in the control experiment and the target experiment.

Best Mode for Carrying Out the Invention
An embodiment of the present invention is described below.
(1) Method for determining the range in which the log ratio of signal data is quantitatively reliable (method for detecting signals whose log ratio is quantitatively reliable)
Here, the analysis of data obtained from a microarray experiment using cDNA as the sample (a cDNA microarray experiment) is described as an example. In a cDNA microarray experiment, the data for each gene (k = 1, 2, ..., N) normally consist of a measured signal intensity and a background intensity for each of the control experiment (c) and the target experiment (T). Denote these s_c(k), b_c(k) and s_T(k), b_T(k), respectively. Ideally, s_c(k) - b_c(k) and s_T(k) - b_T(k) are zero or greater. Negative values of s_c(k) - b_c(k) and s_T(k) - b_T(k) should by rights be zero, so they represent scatter about a value that should be zero. The statistics SD_c and SD_T that evaluate this scatter are given by equations (1) and (2) below, respectively.
[Equation (1), defining SD_c: image imgf000011_0001, not available as text]
[Equation (2), defining SD_T: image imgf000011_0002, not available as text]
(Here y_c(k) = s_c(k) - b_c(k) < 0 and y_T(k) = s_T(k) - b_T(k) < 0; k in equation (1) indexes the N_c signals for which s_c(k) - b_c(k) is negative, and k in equation (2) indexes the N_T signals for which s_T(k) - b_T(k) is negative. The base of log is an arbitrary positive number greater than 1.)
When (signal intensity) - (background intensity) is zero or greater, on the other hand, it is written x_c(k) = s_c(k) - b_c(k) (≥ 0) and x_T(k) = s_T(k) - b_T(k) (≥ 0). In this method, the signal value x_c(k) of the control experiment is regarded as zero when it satisfies
x_c(k) < u × SD_c   (I)
Here u is an arbitrary positive number; u = 1 statistically corresponds to the region in which 68% of the data are distributed.
Similarly, the signal intensity x_T(k) of the target experiment is regarded as zero when it satisfies
x_T(k) < u × SD_T   (II)
In this method, when x_c(k) satisfies inequality (I) or x_T(k) satisfies inequality (II), the k-th gene cannot be expressed as a log ratio. That is, if either inequality holds, the log ratio is judged not to be quantitatively reliable.
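The decision rule of inequalities (I) and (II) can be sketched as below. This is a minimal illustration, not the patent's implementation: SD_c and SD_T are passed in as precomputed numbers because equations (1) and (2) are not available as text, and the function name and argument names are hypothetical.

```python
import numpy as np

def quantitative_mask(s_c, b_c, s_t, b_t, sd_c, sd_t, u=1.0):
    """Flag the spots whose log ratio is treated as quantitatively reliable.

    sd_c and sd_t are assumed to have been computed elsewhere from the
    negative background-subtracted values (equations (1) and (2)).
    Returns (x_c, x_t, reliable), where reliable is a boolean array.
    """
    x_c = np.asarray(s_c, dtype=float) - np.asarray(b_c, dtype=float)
    x_t = np.asarray(s_t, dtype=float) - np.asarray(b_t, dtype=float)
    zero_c = x_c < u * sd_c   # inequality (I): control signal regarded as 0
    zero_t = x_t < u * sd_t   # inequality (II): target signal regarded as 0
    # The log ratio is usable only when neither signal is regarded as zero.
    return x_c, x_t, ~(zero_c | zero_t)
```

A larger u makes the filter stricter, since more low-intensity spots are treated as indistinguishable from zero.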
(2) Method for reducing the bias error in the log ratio
For the k-th gene, evaluation by the log ratio becomes possible when neither inequality (I) nor inequality (II) is satisfied. If the log ratio log[x_T(k)/x_c(k)] (or log[x_c(k)/x_T(k)]; the base of log is an arbitrary positive number greater than 1, for example 10) is plotted on the vertical axis against the average intensity Av[k] (= {log[x_c(k)] + log[x_T(k)]}/2) on the horizontal axis, most genes are expected to be distributed near log[x_T(k)/x_c(k)] = 0. In practice, however, as shown in the upper graph of Fig. 1, the log ratio often contains a bias error that depends on the average intensity on the horizontal axis. This bias error is reduced as follows. First, the average intensity is divided into a plurality of intervals of fixed width. For the genes k belonging to the s-th interval, the mean PAv(s) of the log ratio is obtained from equation (5) below.
$$PAv(s) = \frac{1}{N_s} \sum_{\min(s) \le Av[k] \le \max(s)} \log\frac{x_T(k)}{x_c(k)} \qquad (5)$$
(Here k indexes the N_s signals belonging to the s-th interval, min(s) is the minimum average intensity in the s-th interval, and max(s) is the maximum average intensity in the s-th interval.)
In microarray experiments, the scatter of the log ratio tends to grow as the average intensity decreases, so it is important to compute the mean from more samples in intervals of lower average intensity. With the intervals numbered s = 1, 2, ..., S_total from low to high average intensity, the number of samples required in interval s is set to
$$s \cdot \frac{N_{start} - N_{final}}{1 - S_{total}} + \frac{N_{final} - N_{start} \cdot S_{total}}{1 - S_{total}} \qquad (A)$$
Here N_start is the number of samples at s = 1 and N_final is the number of samples at s = S_total, with N_start > N_final (for example, N_start = 40 and N_final = 5). Only when this condition is met, that is, when the actual number of samples in the interval is at or above (or strictly above) the value given by (A), is PAv(s) used as the representative value at the average intensity Av_s = [min(s) + max(s)]/2. When too few samples are available, no representative value is obtained at Av_s, so PAv(s) is obtained by linear interpolation between the neighboring points (Av_q, PAv(q)) and (Av_t, PAv(t)).
When the average intensity Av[k] of the k-th gene lies between Av_{s-1} (= [min(s-1) + max(s-1)]/2) and Av_s (= [min(s) + max(s)]/2), the reference intensity crit(k) at Av[k] is obtained from PAv(s-1) and PAv(s) by linear interpolation according to equation (6) below.
$$\mathrm{crit}(k) = PAv(s) + \frac{(Av[k] - Av_s)\,(PAv(s) - PAv(s-1))}{Av_s - Av_{s-1}} \qquad (6)$$
On the basis of this reference intensity, the corrected log ratio LOG[k] is defined by equation (7) below.
$$\mathrm{LOG}[k] = \log\frac{x_T(k)}{x_c(k)} - \mathrm{crit}(k) \qquad (7)$$
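The correction of equations (5) to (7) can be sketched with NumPy as follows. Several simplifications are assumptions of this sketch rather than part of the method: a single constant `min_samples` replaces the interval-dependent requirement (A), base-10 logarithms are used, and `numpy.interp` holds the first or last representative value constant for genes outside the range of interval midpoints. Inputs are assumed to be already filtered so that every x_c and x_T is positive.

```python
import numpy as np

def corrected_log_ratio(x_c, x_t, bin_width=0.1, min_samples=5):
    """Return (LOG, Av): the bias-corrected log ratio and average intensity.

    bin_width and min_samples are illustrative parameters, not values
    prescribed by the text.
    """
    x_c = np.asarray(x_c, dtype=float)
    x_t = np.asarray(x_t, dtype=float)
    log_ratio = np.log10(x_t / x_c)
    av = (np.log10(x_c) + np.log10(x_t)) / 2.0

    # Assign each gene to a fixed-width interval of average intensity.
    bins = np.floor((av - av.min()) / bin_width).astype(int)

    centers, means = [], []
    for s in np.unique(bins):
        members = bins == s
        if members.sum() < min_samples:
            continue  # too few samples: covered by interpolation below
        centers.append((av[members].min() + av[members].max()) / 2.0)  # Av_s
        means.append(log_ratio[members].mean())                        # PAv(s)
    if not centers:
        raise ValueError("no interval has enough samples")

    # Equation (6): piecewise-linear reference intensity crit(k).
    crit = np.interp(av, centers, means)
    # Equation (7): corrected log ratio.
    return log_ratio - crit, av
```

With a data set whose log ratio is shifted by a constant at every intensity, the returned values are all close to zero, which is the behavior the lower graph of Fig. 1 is described as showing.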
The lower graph of Fig. 1 shows the result of correcting the bias error of the log ratio with the corrected log ratio LOG[k].
Let the user-determined range of scatter around LOG[k] = 0 be -SD_crit < LOG[k] < SD_crit. That is, SD_crit is set by the user on the basis of the experimental data and the like. The samples are divided according to their average intensity into intervals (s = 0, 1, ..., S_total - 1; an interval width of 0.1 was used here, but the width is of course not limited to this), and for each interval the standard deviation is computed over the samples satisfying -SD_crit < LOG[k] < SD_crit (SD_crit may be set to a separate value for each interval or to the same value for all intervals). This standard deviation is denoted SD1[s]. SD1[s] is preferably computed from 50 or more samples, but this is not a requirement.
こ の よ う に して得られた標準偏差 S D l [s]をも と に前後 2 点 ( s-2,8-l,8,s+l,s+2)の合計 5点の平均値 S mth[s]を求め、 これを平均強度 Avs=[min(s)+max(s)]/2における代表値とする。  Based on the standard deviation SD l [s] obtained in this way, the average value of a total of 5 points of 2 points before and after (s-2, 8-l, 8, s + l, s + 2) Smth [s] is obtained, and this is set as a representative value at the average intensity Avs = [min (s) + max (s)] / 2.
When the k-th gene has the value LOG[k] at average intensity Av[k], and Av[k] lies between the average intensities Av_u and Av_{u+1}, SD[k] is obtained by linear interpolation from Smth[u] and Smth[u+1]:
SD[k] = Smth[u+1] + (Av[k] - Av_{u+1}) (Smth[u+1] - Smth[u]) / (Av_{u+1} - Av_u)
When 2 × SD[k] < |LOG[k]|, the k-th gene is judged to be a signal pair showing a statistically significant change in expression level.
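The procedure above (per-interval standard deviation, five-point smoothing, linear interpolation and the 2 × SD[k] test) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the standard deviation is taken about the interval mean with divisor n, the smoothing window is truncated at the ends of the intensity range, the interval centre is used as Avs, and intensities outside the covered range are clamped to the nearest representative value. None of these details is fixed by the text.

```python
import math
from bisect import bisect_right


def interval_sd(av, log_ratio, sd_crit, width=0.1):
    """SD1[s]: standard deviation of LOG[k] within each average-intensity interval.

    Only samples with -sd_crit < LOG[k] < sd_crit are used.
    Returns (centers, sd1) for the non-empty intervals, ordered by intensity.
    """
    lo = min(av)
    bins = {}
    for a, v in zip(av, log_ratio):
        if -sd_crit < v < sd_crit:
            bins.setdefault(int((a - lo) / width), []).append(v)
    centers, sd1 = [], []
    for s in sorted(bins):
        vals = bins[s]
        mean = sum(vals) / len(vals)
        # Assumption: population SD about the interval mean.
        sd1.append(math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)))
        centers.append(lo + (s + 0.5) * width)  # interval centre used as Avs
    return centers, sd1


def smooth5(sd1):
    """Smth[s]: mean of SD1 over s-2 .. s+2 (window truncated at the ends)."""
    out = []
    for s in range(len(sd1)):
        window = sd1[max(0, s - 2): s + 3]
        out.append(sum(window) / len(window))
    return out


def sd_at(av_k, centers, smth):
    """SD[k]: linear interpolation between Smth[u] and Smth[u+1]."""
    if av_k <= centers[0]:
        return smth[0]
    if av_k >= centers[-1]:
        return smth[-1]
    u = bisect_right(centers, av_k) - 1
    frac = (av_k - centers[u + 1]) / (centers[u + 1] - centers[u])
    return smth[u + 1] + frac * (smth[u + 1] - smth[u])


def is_significant(log_k, sd_k):
    """True when 2 x SD[k] < |LOG[k]|."""
    return 2.0 * sd_k < abs(log_k)
```

In practice each interval would be expected to contain enough samples (the text suggests 50 or more) before its SD1[s] is trusted.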
With this method, the signals of a microarray experiment are classified into the following four groups A to D. Groups A and B satisfy the signal conditions required for quantitative analysis, such as searching for genes with similar expression profiles across multiple sets of microarray data. Groups A to D can be used for qualitative analysis, such as searching for genes that differ significantly between the two experiments, control and target (see also Fig. 2; in Fig. 2 the signals are classified into six groups, with groups E and F added).
Group A: signal pairs for which the log ratio is guaranteed to be quantitative and a statistically significant change in expression level is obtained. Condition: SD_C < x_C(k) and SD_T < x_T(k), and 2 × SD[k] < |LOG[k]|.
Group B: signal pairs for which the log ratio is guaranteed to be quantitative but no statistically significant change in expression level is obtained. Condition: SD_C < x_C(k) and SD_T < x_T(k), and 2 × SD[k] > |LOG[k]|.
Group C: signal pairs for which the log ratio is not guaranteed to be quantitative because one of the signals is regarded as 0, but a difference between the two signals is obtained. Condition: (i) SD_C > x_C(k) and SD_T < x_T(k), or (ii) SD_C < x_C(k) and SD_T > x_T(k).
Group D: signal pairs for which the log ratio is not guaranteed to be quantitative because both signals are regarded as 0, and the two signals are judged not to differ. Condition: SD_C > x_C(k) and SD_T > x_T(k).
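The four group conditions can be expressed as a small decision function. The sketch below is illustrative only: the function name is hypothetical, and the handling of exact equalities (a signal exactly equal to its threshold, or 2 × SD[k] exactly equal to |LOG[k]|) is an assumption, since the text states only strict inequalities.

```python
def classify_signal_pair(x_c, x_t, sd_c, sd_t, log_k, sd_k):
    """Return 'A', 'B', 'C' or 'D' for one signal pair.

    x_c, x_t : signal intensities in the control and target experiments
    sd_c, sd_t : thresholds below which a signal is regarded as 0
    log_k : corrected log ratio LOG[k]
    sd_k : interpolated standard deviation SD[k]
    """
    c_nonzero = sd_c < x_c  # assumption: equality counts as "regarded as 0"
    t_nonzero = sd_t < x_t
    if c_nonzero and t_nonzero:
        # Log ratio is quantitative; test for a significant change.
        return "A" if 2.0 * sd_k < abs(log_k) else "B"
    if c_nonzero or t_nonzero:
        return "C"  # exactly one signal regarded as 0
    return "D"      # both signals regarded as 0
```

For a quantitative analysis only the pairs returned as "A" or "B" would be retained, while a qualitative comparison could use all four groups.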
(3) Method for estimating transcription units
In bacterial gene expression, a set of adjacent genes on the same DNA strand of the genome that is transcribed into a single mRNA is called a transcription unit. Predicting transcription units is very important from the standpoint of the control of gene expression in the genome. A method for estimating transcription units from the expression levels, or changes in expression level, of individual genes under various conditions, as typified by microarray data, is described below.
A set of genes arranged consecutively in the same direction on the genome is called a directon. When several genes belong to the same transcription unit, they are transcribed as a single mRNA, so in theory their expression profiles are positively correlated. The correlation between the microarray expression profiles of genes belonging to the same directon is therefore calculated. If M kinds of microarray experiments are performed (with microarrays that yield expression profiles for N genes), the expression profiles of the genes can be written as the following N×M matrix.
[Equation image imgf000016_0001: N×M matrix of expression profiles]
Here, the expression profile X_s of the s-th gene can be written as the following M-dimensional vector.
X_s = (x_s1, x_s2, ..., x_sj, ..., x_sM)
Using X_s and X_t written as vectors, the algorithm of this method, which predicts transcription units from the correlation between the expression profiles of two genes (s, t) that are adjacent on the genome and located on the same DNA strand, is described below. The algorithm can estimate a transcription unit even when information on the correlation between adjacent genes is not available.
In this method, transcription units are estimated from the expression profiles and genome information by steps 1 to 4 below.
[Step 1: Calculating the correlation coefficients between genes belonging to the same directon]
The correlation coefficient r(s,t) between the expression profiles X_s and X_t of the s-th and t-th genes on the same directon is calculated. Here s = 1, 2, ..., N and t = 1, 2, ..., N, where N is the total number of genes belonging to the directon of interest. A directon is a set of genes located consecutively on the same DNA strand.
[Step 2: Searching in the 3' direction for genes with correlated expression, taking adjacency into account]
When there is a significant correlation between the expression profile X_s of the s-th gene and the expression profile X_{s+1} of the adjacent gene (s+1), gene s+1 is assigned to the transcription unit set. Next, gene s+2 is assigned to the same set when a significant correlation is obtained between the expression profiles of the s-th and (s+2)-th genes. This operation is repeated for s+3, s+4, ... and ends when a significant correlation is no longer obtained.
[Step 3: Searching in the 5' direction for genes with correlated expression, taking adjacency into account]
The same operation as in step 2 is performed for the (s-1)-th, (s-2)-th, ... genes.
[Step 4: Listing the candidate transcription unit intervals]
Within the set obtained in steps 2 and 3, the group of genes lying between the gene of lowest rank and the gene of highest rank is estimated to be one transcription unit.
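Steps 2 to 4 can be sketched as a scan outward from a reference gene. The following is a minimal illustration, not a definitive implementation: `extend_from_seed` is a hypothetical name, genes are indexed from 0 along the directon, and `is_significant(s, t)` is an assumed caller-supplied predicate that encodes the result of step 1 (whether the correlation between genes s and t is significant).

```python
def extend_from_seed(seed, n_genes, is_significant):
    """Estimate the transcription unit containing gene `seed`.

    seed : 0-based position of the reference gene within the directon
    n_genes : number of genes in the directon
    is_significant : callable (s, t) -> bool, hypothetical result of step 1
    Returns the inclusive (first, last) positions of the estimated unit.
    """
    members = {seed}

    # Step 2: scan in the 3' direction until the correlation is lost.
    t = seed + 1
    while t < n_genes and is_significant(seed, t):
        members.add(t)
        t += 1

    # Step 3: the same scan in the 5' direction.
    t = seed - 1
    while t >= 0 and is_significant(seed, t):
        members.add(t)
        t -= 1

    # Step 4: the unit spans the lowest to the highest position found.
    return min(members), max(members)
```

Calling this function once per gene in a directon would list one candidate interval per reference gene; how overlapping candidates are reconciled is not specified here.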
The specific method of calculating the correlation coefficient r(s,t) in step 1 is described in the examples below.
The present invention is, of course, not limited to methods (1) to (3) of the embodiment described so far, and various modifications are possible within the scope of the invention. For example, the thresholds, reference values and other values used in methods (1) to (3) are arbitrary, and appropriate values may be set according to the application and purpose. Additional steps may also be added to the steps of methods (1) to (3).
(4) Utility of the present invention (fields of application)
The present invention relates to a novel method for analyzing data obtained from gene expression analyses such as microarray and macroarray experiments, and to a program and the like for executing the method. It enables appropriate evaluation of data obtained from microarray experiments and the like, as well as new uses of such data. Beyond its use as a research tool, it can also contribute, for example, to the development of new drugs based on the relationship between diseases and genes, to genome-based drug discovery, and to the establishment of new testing, diagnostic, preventive and therapeutic methods.
As one example of how the present invention can be used, the analysis method can be chosen according to the signal intensity of the microarray experiment. For example, when searching for genes with similar expression profiles across multiple sets of microarray data, only the genes that satisfy the conditions of groups A and B above are analyzed. When simply searching for genes that change significantly in a single microarray experiment, the genes that satisfy the conditions of groups A to D above are used. In this way, the range of target genes suited to quantitative or qualitative analysis can be determined, which improves the accuracy of the multivariate analysis commonly used for microarray data.
The program of the present invention causes a computer to execute the method of the present invention (for example, any of methods (1) to (3) above). The recording medium of the present invention is any recording medium on which the program of the present invention is recorded and which can be accessed and read by a computer. Examples of such recording media include, but are not limited to, magnetic storage media such as flexible disks, hard disks and magnetic tape; optical storage media such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM and DVD-RW; electrical storage media such as RAM and ROM; and magneto-optical storage media such as MO.
The "data analysis device" of the present invention comprises the program of the present invention and a computer that executes the method of the present invention (for example, any of methods (1) to (3) above) by means of the program. Any computer configuration capable of executing the method of the present invention suffices; basically, the computer consists of an input device, a data storage device, a central processing unit and an output device.
The present invention is described more specifically below by way of examples, but it is in no way limited to these examples.
[Analysis example]
The expression profiles of E. coli mutant strains lacking specific genes, and of E. coli over a time course, were measured with a commercially available E. coli microarray. The measurement conditions are shown in Fig. 3.
[Example 1: Method for detecting signals whose log ratio (relative value) is quantitative]
The analysis results shown below are for the case where the signal values x_C(k) and x_T(k) of the control and target experiments are regarded as 0 when they satisfy
x_C(k) < SD_C    (I)
x_T(k) < SD_T    (II)
respectively. When x_C(k) satisfies expression (I), or x_T(k) satisfies expression (II), the k-th gene cannot be represented by a log ratio. That is, when either expression holds, the log ratio is not guaranteed to be quantitative. This is shown by the fact that, when x_C(k) and x_T(k) are measured under identical conditions, the standard deviation of the log ratio log[x_T(k)/x_C(k)] is larger under conditions 2-1 and 2-2 below than under condition 1.
Condition 1: the log ratio is guaranteed to be quantitative. Condition: SD_C < x_C(k) and SD_T < x_T(k).
Condition 2-1: the log ratio is not guaranteed to be quantitative because one of the signals is 0, but a difference between the two signals is obtained. Condition: (i) SD_C > x_C(k) and SD_T < x_T(k), or (ii) SD_C < x_C(k) and SD_T > x_T(k).
Condition 2-2: the log ratio is not guaranteed to be quantitative because both signals are 0, and the two signals are judged not to differ. Condition: SD_C > x_C(k) and SD_T > x_T(k).
Fig. 4 shows the standard deviations of the log ratio under the three conditions above, calculated for identical mRNA samples extracted from E. coli cultured in LB medium. (A) and (B) are the results for mRNA extracted independently on two occasions in the logarithmic growth phase. In an ideal system free of experimental error, these standard deviations would all be 0. In practice, errors cause scatter in the data, and the standard deviation quantifies this scatter. The error is clearly larger under conditions 2-1 and 2-2 than under condition 1. It is therefore concluded that, under conditions 2-1 and 2-2, it is difficult to evaluate changes in expression level quantitatively by the log ratio.
[Example 2: Method for reducing the bias error in the log ratio]
For signals that satisfy the conditions of either group A or group B above, the log ratio is guaranteed to be quantitative.
Group A: signal pairs for which the log ratio is guaranteed to be quantitative and a statistically significant change in expression level is obtained. Condition: SD_C < x_C(k) and SD_T < x_T(k), and 2 × SD[k] < |LOG[k]|.
Group B: signal pairs for which the log ratio is guaranteed to be quantitative but no statistically significant change in expression level is obtained. Condition: SD_C < x_C(k) and SD_T < x_T(k), and 2 × SD[k] > |LOG[k]|.
For the signals satisfying either condition A or B above, the standard deviation about the origin was calculated for both the log ratio and the corrected log ratio. That the correction method of the present invention reduces the bias error can be confirmed by the standard deviation about the origin of the corrected log ratio being smaller than that of the uncorrected log ratio.
Fig. 5 shows the standard deviations calculated for the uncorrected log ratio and the corrected log ratio (LOG[k]) of the signals satisfying condition A or B, for identical mRNA samples extracted from E. coli cultured in LB medium. As the figure shows, the standard deviation about the origin of the corrected log ratio was smaller than that of the uncorrected log ratio. (A) and (B) are the results of performing the same experiment, and calculating the standard deviations, for mRNA extracted independently on two occasions in the logarithmic growth phase.
Condition B describes signal pairs for which the log ratio is guaranteed to be quantitative but no statistically significant change in expression is obtained. Hence, if the correction method of the present invention also reduces the bias error when mRNA is extracted under different conditions in the control and target experiments, the standard deviation about the origin of the corrected log ratio should be smaller than that of the uncorrected log ratio. To confirm this, the expression intensities actually measured in the comparisons shown in Fig. 3, between deletion strains and the wild-type strain or between a specific time point and any other time point, were used to determine, for the signals satisfying condition B, the relationship between the standard deviation of the corrected log ratio (LOG[k]) and that of the uncorrected log ratio. The result is plotted in Fig. 6. As the figure shows, the standard deviation about the origin of the corrected log ratio was smaller than that of the uncorrected log ratio.
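The comparison made in this example rests on a standard deviation taken about the origin rather than about the sample mean. A minimal sketch is given below; the function name is hypothetical, and the divisor n (rather than n-1) is an assumption, since the text does not state which was used.

```python
import math


def sd_about_origin(values):
    """Root-mean-square deviation of the values from 0.

    Unlike the ordinary standard deviation, a constant offset (bias)
    in the values increases this statistic.
    """
    return math.sqrt(sum(v * v for v in values) / len(values))


def bias_reduced(raw_log_ratios, corrected_log_ratios):
    """True when the corrected log ratios scatter less about the origin."""
    return sd_about_origin(corrected_log_ratios) < sd_about_origin(raw_log_ratios)
```

Because the statistic is measured from 0, it captures both random scatter and any systematic offset, which is why a successful bias correction is expected to lower it.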
[Example 3: Method for estimating transcription units]
A transcription unit is a group of genes transcribed into a single mRNA. Genes belonging to the same transcription unit can therefore be found by identifying adjacent genes on the same strand of the genome whose expression profiles are positively correlated. When expression profiles are measured with a microarray, the profiles of several thousand genes can be measured in one experiment, but there are also many genes whose expression cannot be measured under some conditions. This estimation method can estimate transcription units even when expression data for the i-th and j-th genes are not available under all M experimental conditions. In addition, because all gene pairs belonging to the same directon are considered, transcription units can be estimated even when the correlation coefficient between adjacent genes is itself missing. That is, transcription units are estimated and predicted as follows while compensating for these two kinds of missing data.
Genes that lie consecutively on the genome with the same transcription direction are numbered 1, 2, ..., i, j, ..., n in order from the 5' side, where i < j. Suppose that the expression profiles of the i-th and j-th genes have been measured under s_ij pairs of experimental conditions and the correlation coefficient r(i, j, s_ij) has been obtained. In this method the following Pearson correlation formula was used.
[Equation image imgf000022_0001: Pearson correlation coefficient r(s, t, N_st)]
Here, r(s, t, N_st) indicates that, for the s-th and t-th genes, values were obtained experimentally for N_st pairs out of the M kinds of experiments.
For n genes, correlation coefficients corresponding to n(n-1)/2 pairs are obtained.
(1) When the correlation coefficient r(i, j, s_ij) is larger than the reference value r(s_ij, a), r(i, j, s_ij) is guaranteed to be a statistically significant positive correlation. Here, the reference correlation value r(s_ij, a) can be obtained from
r(s_ij, a) = t_a / sqrt(s_ij - 2 + t_a^2)
where t_a is the t value at significance level a in the statistical test and can be obtained from a table of the t distribution. If r(i, j, s_ij) is a significant positive correlation, the i-th and j-th genes may be in the same transcription unit. That is, the j-i+1 genes from the i-th to the j-th may be in the same transcription unit.
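Condition (1) can be illustrated with a correlation computed only over the experiments in which both genes were measured, followed by a comparison with the reference value. The sketch below makes several assumptions that are not taken from the source: the function names are hypothetical, missing measurements are represented by `None`, and the reference value uses the standard conversion from a critical t value to a critical correlation, r = t_a / sqrt(n - 2 + t_a^2), with t_a supplied by the caller from a t table.

```python
import math


def pearson_pairwise(x_s, x_t):
    """Pearson r over the experiments where both profiles have a value.

    Missing measurements are given as None (an assumed convention).
    Returns (r, n_pairs); r is None when it cannot be computed.
    """
    pairs = [(a, b) for a, b in zip(x_s, x_t) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return None, n
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    if s_aa == 0 or s_bb == 0:
        return None, n  # a constant profile has no defined correlation
    return s_ab / math.sqrt(s_aa * s_bb), n


def reference_r(n_pairs, t_a):
    """Reference correlation for n_pairs observations and critical t value t_a."""
    return t_a / math.sqrt(n_pairs - 2 + t_a ** 2)


def is_significant_positive(x_s, x_t, t_a):
    """True when r exceeds the reference value for its own number of pairs."""
    r, n = pearson_pairwise(x_s, x_t)
    return r is not None and r > reference_r(n, t_a)
```

Because each gene pair may have a different number of usable experiments, the reference value is recomputed per pair; the caller would also need to pick t_a for the matching degrees of freedom (n_pairs - 2).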
(2) When the correlation coefficient r(i, j, s_ij) is negative, the i-th and j-th genes are assigned to different transcription units. This further means that, for any gene pair (x, y) with x ≤ i and j ≤ y, the two genes x and y are in different transcription units even if their expression profiles are positively correlated.
(3) Taking the i-th gene as the reference, when the k genes i+1, i+2, ..., i+k are significantly positively correlated with it, these genes are in the same transcription unit. Likewise, when the m genes i-1, i-2, ..., i-m are significantly positively correlated with it, these genes are in the same transcription unit.
Transcription units were estimated by the following steps, based on these three conditions (1) to (3).
(Step 1) Genes that lie consecutively on the genome with the same transcription direction are numbered 1, 2, ..., i, j, ..., n in order from the 5' side, where i < j. For the i-th and j-th genes, the correlation coefficient r(i, j, s_ij) is calculated from the expression profiles under the s_ij pairs of experimental conditions. For n genes, correlation coefficients corresponding to n(n-1)/2 pairs are obtained.
(Step 2) The statistically significant positive correlation coefficients are selected from among the correlation coefficients r(i, j, s_ij). When r(i, j, s_ij) is larger than the reference value r(s_ij, a), a statistically significant positive correlation is guaranteed. Here, the reference correlation value r(s_ij, a) can be obtained from
r(s_ij, a) = t_a / sqrt(s_ij - 2 + t_a^2)
(Step 3) Gene pairs with a negative correlation coefficient r(i, j, s_ij) are identified. If such a pair is (i, j), then for any gene pair (x, y) with x ≤ i and j ≤ y, the two genes x and y are in different transcription units and are not placed in the same transcription unit, even if their expression profiles are positively correlated.
(Step 4) Taking the u-th gene as the reference, if step 2 gives a significant positive correlation with the expression profile of the u-th gene for the k1 genes u-1, u-2, ..., u-k1 and for the k2 genes u+1, u+2, ..., u+k2, and step 3 finds no gene pair with a negative correlation coefficient among them, these k1 + k2 + 1 genes are estimated to form a single transcription unit.
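Steps 2 to 4 can be combined into one scan around a reference gene. The sketch below is one possible reading, not the authors' implementation: the function name and the `corr` dictionary layout are hypothetical, genes are indexed from 0, a pair with no computed correlation is skipped rather than treated as a stop (to reflect the stated tolerance of missing data), and the reference value uses the standard critical-correlation formula with a caller-supplied t_a.

```python
import math


def estimate_unit(u, n_genes, corr, t_a):
    """Estimate the transcription unit around reference gene u.

    corr : dict mapping (i, j) with i < j to (r, n_pairs); pairs whose
           correlation could not be computed are simply absent.
    t_a  : critical t value from a t table (assumed input).
    Returns the inclusive (first, last) positions of the estimated unit.
    """
    def state(i, j):
        key = (min(i, j), max(i, j))
        if key not in corr:
            return "missing"
        r, n = corr[key]
        # Step 2: significant positive correlation for this number of pairs.
        if n > 2 and r > t_a / math.sqrt(n - 2 + t_a ** 2):
            return "significant"
        return "not significant"

    # Step 3: any negative pair (i, j) splits every span [x, y] with
    # x <= i and j <= y.
    negatives = [pair for pair, (r, _) in corr.items() if r < 0]

    def vetoed(x, y):
        return any(x <= i and j <= y for i, j in negatives)

    first = last = u
    # Step 4, 3' side: extend while no measured pair contradicts the unit.
    t = u + 1
    while t < n_genes:
        st = state(u, t)
        if st == "not significant" or vetoed(first, t):
            break
        if st == "significant":
            last = t
        t += 1
    # Step 4, 5' side.
    t = u - 1
    while t >= 0:
        st = state(t, u)
        if st == "not significant" or vetoed(t, last):
            break
        if st == "significant":
            first = t
        t -= 1
    return first, last
```

Skipping missing pairs means a gene with no usable correlation is included only when a gene further out is still significantly correlated with the reference gene; a stricter reading of step 4 would stop at the first gap.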
[Method for evaluating this estimation method]
The transcription units identified to date are numbered by i (i = 1, 2, ..., q, ..., N_q). The set of genes belonging to the q-th transcription unit is denoted T_q, and the genes assigned to it are denoted t(T_q)_1, t(T_q)_2, ..., t(T_q)_N[T_q]. The set of genes belonging to the transcription unit predicted by this estimation method with the u-th gene as the reference is denoted P_u, and the genes assigned to it are denoted p(P_u)_1, p(P_u)_2, ..., p(P_u)_N[P_u]. The fact that the u-th gene belongs to the set T_q is written T_q(u). When the u-th gene belongs to T_q, the elements of the set P_u and the set T_q(u) ideally coincide. Let N[P_u ∩ T_q(u)] be the number of elements common to P_u and T_q(u). The numbers of elements of P_u and T_q(u) are N[P_u] and N[T_q(u)], respectively.
If the predicted transcription unit and the known transcription unit coincide, the following two expressions (equations (8) and (9)) are both 0.
E[P_u] = N[P_u] - N[P_u ∩ T_q(u)]    (8)
E[T_q(u)] = N[T_q(u)] - N[P_u ∩ T_q(u)]    (9)
E[P_u] and E[T_q(u)] are both non-negative. E[P_u] > 0 means that the predicted transcription unit contains more genes than the known transcription unit, that is, over-prediction.
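Equations (8) and (9) are simple set differences, as the following sketch shows. The function name is hypothetical, and the integer gene labels in the usage note are invented purely for illustration.

```python
def prediction_errors(predicted, known):
    """Return (E[P_u], E[T_q(u)]) for one reference gene.

    predicted : genes in the predicted transcription unit P_u
    known     : genes in the known transcription unit T_q(u)
    """
    p, t = set(predicted), set(known)
    common = len(p & t)          # N[P_u ∩ T_q(u)]
    over = len(p) - common       # equation (8): genes predicted in excess
    under = len(t) - common      # equation (9): known genes that were missed
    return over, under
```

For instance, a predicted unit of genes 3, 4 and 5 compared against a known unit of genes 4, 5, 6 and 7 has two genes in common, so one gene is over-predicted and two known genes are missed.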
[Analysis results]
The accuracy of transcription unit prediction was examined against transcription units reported from experiments. E[P_u] = 0 held for 600 genes; that is, for 68% of the genes examined, the predicted transcription unit matched the transcription unit known to date. Ninety percent of all genes fell within an over-prediction of two genes or fewer. E[T_q(u)] > 0 indicates that the known transcription unit contains more genes than the number of elements common to the predicted and known units, that is, that not all genes belonging to the known transcription unit were predicted. The value of E[T_q(u)] was likewise determined for the 877 genes whose transcription units are known. E[T_q(u)] = 0 held for 498 genes, meaning that the known transcription unit was reproduced for 57% of the genes. When an excess or shortfall of up to two genes was allowed, the transcription units of 80% of the genes were predicted.
The specific embodiments and examples given in the section on the best mode for carrying out the invention serve only to clarify the technical content of the present invention. The invention should not be construed narrowly as limited to such specific examples, and can be practiced with various modifications within the scope of the claims set out below.
Industrial applicability
As described above, the present invention relates to a novel method for analyzing data obtained from gene expression analyses such as microarray and macroarray experiments, and to a program and the like for executing the method. As stated earlier, it enables appropriate evaluation of data obtained from microarray experiments and the like, as well as new uses of such data. Beyond its use as a research tool, it can also contribute, for example, to the development of new drugs based on the relationship between diseases and genes, to genome-based drug discovery, and to the establishment of new testing, diagnostic, preventive and therapeutic methods.

Claims

1. A method for analyzing data obtained from a microarray experiment, a macroarray experiment, or any other similar gene expression analysis, wherein the range over which the log ratio between the signal intensity x_C(k) of gene k in a control experiment and the signal intensity x_T(k) of gene k in a target experiment is regarded as quantitative is determined by the following steps (a) to (c):
(a) acquiring, for each gene k, the measured signal intensity s_C(k) and the background intensity b_C(k) in the control experiment, and the measured signal intensity s_T(k) and the background intensity b_T(k) in the target experiment;
(b) determining, from the data values for which s_C(k) - b_C(k) is negative, a first threshold defining the range in which the signal intensity x_C(k) is regarded as substantially 0, and determining, from the data values for which s_T(k) - b_T(k) is negative, a second threshold defining the range in which the signal intensity x_T(k) is regarded as substantially 0; and
(c) determining that the log ratio between x_C(k) and x_T(k) is quantitative when x_C(k) takes a value equal to or greater than, or greater than, the first threshold and x_T(k) takes a value equal to or greater than, or greater than, the second threshold.
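As a non-limiting illustration of step (c), the following Python sketch keeps only the genes whose log ratio is regarded as quantitative. It assumes that the signal intensity is the background-subtracted value x(k) = s(k) - b(k), that both thresholds have already been determined in step (b) and are positive, and that the logarithm is taken to base 2. The function and parameter names are hypothetical.

```python
import math


def quantitative_log_ratios(s_c, b_c, s_t, b_t, thr_c, thr_t):
    """Return {k: log2(x_T(k) / x_C(k))} for genes passing both thresholds.

    thr_c and thr_t stand for the first and second thresholds of step (b);
    how they are derived is outside this sketch (assumption).
    """
    if thr_c <= 0 or thr_t <= 0:
        raise ValueError("thresholds are assumed to be positive")
    result = {}
    for k in range(len(s_c)):
        x_c = s_c[k] - b_c[k]  # background-subtracted control signal (assumption)
        x_t = s_t[k] - b_t[k]  # background-subtracted target signal (assumption)
        # Step (c): both intensities must reach their respective thresholds.
        if x_c >= thr_c and x_t >= thr_t:
            result[k] = math.log2(x_t / x_c)
    return result
```

Genes whose background-subtracted intensity is negative or below a threshold are simply omitted from the result, so a logarithm is never taken of a non-positive value.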
2. The data analysis method according to claim 1, wherein, in step (b), the first threshold is set to u x SD_C (where u is an arbitrary positive number and SD_C is the statistic given by equation (1) below) and the second threshold is set to u x SD_T (where u is an arbitrary positive number and SD_T is the statistic given by equation (2) below):
Figure imgf000027_0001 (equation (1), image only)
Figure imgf000027_0002 (equation (2), image only)
(where y_C(k) = s_C(k) - b_C(k) < 0 and y_T(k) = s_T(k) - b_T(k) < 0; k in equation (1) indexes the N_C signals for which s_C(k) - b_C(k) is negative, and k in equation (2) indexes the N_T signals for which s_T(k) - b_T(k) is negative; the base of the logarithm is an arbitrary positive number greater than 1.)
3. A method for analyzing data obtained from a microarray experiment, a macroarray experiment, or any other similar gene expression analysis, wherein an average-intensity-dependent bias error in the log ratio between the signal intensity x_C(k) of gene k in a control experiment and the signal intensity x_T(k) of gene k in a target experiment is corrected by the following steps (a) to (c):
(a) dividing the average intensity Av[k] (= {log[x_C(k)] + log[x_T(k)]}/2) of x_C(k) and x_T(k) into a plurality of intervals according to its magnitude and, for the genes u belonging to the s-th interval (u = 1, 2, ..., N_s, where N_s is the total number of genes belonging to the s-th interval), obtaining the mean PAv(s) of the log ratios of x_C(u) and x_T(u);
(b) when the average intensity Av[k] of the k-th gene lies between Av_{s-1} (the mean of the minimum and maximum average intensities in the (s-1)-th interval) and Av_s (the mean of the minimum and maximum average intensities in the s-th interval), obtaining the reference intensity crit(k) at that average intensity Av[k] by linear interpolation using the means PAv(s-1) and PAv(s); and
(c) correcting the log ratio log(x_T(k)/x_C(k)) between x_C(k) and x_T(k) by defining the corrected log ratio LOG[k] by equation (3) below:
\[ \mathrm{LOG}[k] = \log\frac{x_T(k)}{x_C(k)} - \mathrm{crit}(k) \tag{3} \]
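Steps (a) to (c) of the bias correction can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes base-2 logarithms, intervals containing roughly equal numbers of genes after sorting by Av[k], and a constant crit(k) outside the first and last interval midpoints. None of these choices is fixed by the claim, and all names are hypothetical.

```python
import math


def corrected_log_ratios(x_c, x_t, n_bins=3):
    """Return LOG[k] = log2(x_T(k)/x_C(k)) - crit(k) for every gene k."""
    n = len(x_c)
    av = [(math.log2(c) + math.log2(t)) / 2 for c, t in zip(x_c, x_t)]
    ratio = [math.log2(t / c) for c, t in zip(x_c, x_t)]

    # Step (a): split genes into intervals by average intensity
    # (equal-count intervals are an assumption) and average the log ratios.
    order = sorted(range(n), key=lambda k: av[k])
    size = math.ceil(n / n_bins)
    centers, pav = [], []
    for start in range(0, n, size):
        members = order[start:start + size]
        lo, hi = av[members[0]], av[members[-1]]
        centers.append((lo + hi) / 2)  # Av_s: midpoint of min and max
        pav.append(sum(ratio[k] for k in members) / len(members))  # PAv(s)

    # Step (b): linear interpolation of PAv between interval midpoints.
    def crit(a):
        if a <= centers[0]:
            return pav[0]   # below the first midpoint: held constant (assumption)
        if a >= centers[-1]:
            return pav[-1]  # above the last midpoint: held constant (assumption)
        for s in range(1, len(centers)):
            if a <= centers[s]:
                w = (a - centers[s - 1]) / (centers[s] - centers[s - 1])
                return pav[s - 1] + w * (pav[s] - pav[s - 1])

    # Step (c): subtract the interpolated reference from each log ratio.
    return [ratio[k] - crit(av[k]) for k in range(n)]
```

With a uniform two-fold bias (every x_T(k) exactly twice x_C(k)), every interval mean equals 1 and every corrected log ratio becomes 0, which is the behaviour the correction is meant to produce.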
4. The data analysis method according to claim 3, wherein, when the absolute value of the corrected log ratio LOG[k] of the k-th gene takes a value equal to or greater than, or greater than, a threshold Th set for that gene, the k-th gene is determined to be a signal pair showing a statistically significant change in expression level.
5. The data analysis method according to claim 4, wherein the threshold Th is set by the following steps (a) to (c):
(a) for the genes k belonging to the s-th interval, setting SDcrit for the values LOG[k] of those genes, obtaining the standard deviation of the samples satisfying -SDcrit < LOG[k] < SDcrit, and denoting it SD1[s];
(b) on the basis of the standard deviations SD1[s], obtaining the mean Smth[s] over five points in total, namely the point itself and the two points on either side (s-2, s-1, s, s+1, s+2), and taking it as the representative value at the average intensity Av_s (the mean of the minimum and maximum average intensities in the s-th interval); and
(c) when Av[k] lies between the average intensities Av_u and Av_{u+1}, obtaining by linear interpolation using Smth[u] and Smth[u+1]
\[ \mathrm{SD}[k] = \mathrm{Smth}[u+1] + \frac{(\mathrm{Av}[k] - \mathrm{Av}_{u+1})\,(\mathrm{Smth}[u+1] - \mathrm{Smth}[u])}{\mathrm{Av}_{u+1} - \mathrm{Av}_u} \]
and setting 2 x SD[k] as the threshold Th.
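The three steps for setting the threshold Th can be sketched in Python as below. The sketch assumes a population standard deviation, a shortened averaging window at the two ends of the interval sequence, and strictly increasing interval midpoints. These details are not specified in the claim, and the function names are hypothetical.

```python
import statistics


def trimmed_sd(log_values, sd_crit):
    """Step (a): SD1[s] from the values with -SDcrit < LOG[k] < SDcrit."""
    kept = [v for v in log_values if -sd_crit < v < sd_crit]
    return statistics.pstdev(kept)  # population SD is an assumption


def five_point_mean(sd1):
    """Step (b): Smth[s] as the mean of SD1 over s-2 .. s+2.

    Near the ends only the available neighbours are averaged (assumption).
    """
    smth = []
    for s in range(len(sd1)):
        window = sd1[max(0, s - 2):s + 3]
        smth.append(sum(window) / len(window))
    return smth


def threshold(av_k, centers, smth):
    """Step (c): Th = 2 * SD[k], with SD[k] linearly interpolated."""
    for u in range(len(centers) - 1):
        if centers[u] <= av_k <= centers[u + 1]:
            slope = (smth[u + 1] - smth[u]) / (centers[u + 1] - centers[u])
            sd_k = smth[u + 1] + (av_k - centers[u + 1]) * slope
            return 2 * sd_k
    raise ValueError("Av[k] lies outside the range of interval midpoints")
```

Here `centers` plays the role of the sequence Av_u, so a gene whose Av[k] sits halfway between two midpoints receives a threshold equal to twice the average of the two neighbouring Smth values.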
6. A data analysis method wherein the method according to claim 1 or 2 is combined with the method according to claim 3, 4, or 5 to classify the signals obtained by an experiment into a plurality of categories.
7. A method for analyzing data obtained from a microarray experiment, a macroarray experiment, or any other similar gene expression analysis, wherein transcription units are estimated, on the basis of the expression profiles of a plurality of genes and the genomic information of those genes, by the following steps (a) to (d):
(a) calculating the correlation coefficient between the expression profiles of two genes that are adjacent on the genome and located on the same nucleic acid strand;
(b) when, on the basis of the correlation coefficient, a significant correlation is found between the expression profile of the s-th gene and that of the (s+1)-th gene adjacent on its 3' side, assigning both genes to a transcription unit set; then, when a significant correlation is obtained between the expression profiles of the s-th and (s+2)-th genes, assigning the (s+2)-th gene to the same set; and repeating the same processing for s+3, s+4, ... until a significant correlation is no longer obtained;
(c) determining, in the same manner as in step (b), whether there is a significant correlation between the expression profile of the s-th gene and those of the genes adjacent on its 5' side (the (s-1)-th, (s-2)-th, ...); and
(d) estimating, as one transcription unit, the group of genes bounded by the lowest-ranked gene and the highest-ranked gene in the set obtained by steps (b) and (c).
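The extension procedure of steps (b) to (d) can be illustrated with the short Python sketch below. It assumes that the genes are already ordered along one strand with the index increasing toward the 3' side, that the significance test is supplied by the caller as a hypothetical predicate `is_correlated(s, t)`, and that a gene with no significantly correlated neighbour forms a unit by itself.

```python
def transcription_unit(s, n_genes, is_correlated):
    """Return the gene indices estimated to share a transcription unit with s.

    is_correlated(s, t) is a caller-supplied predicate (assumption) that
    reports whether the expression profiles of genes s and t are
    significantly correlated.
    """
    # Step (b): compare gene s with s+1, s+2, ... on the 3' side and stop
    # at the first gene that is not significantly correlated with s.
    hi = s
    while hi + 1 < n_genes and is_correlated(s, hi + 1):
        hi += 1
    # Step (c): the same comparison toward the 5' side (s-1, s-2, ...).
    lo = s
    while lo - 1 >= 0 and is_correlated(s, lo - 1):
        lo -= 1
    # Step (d): all genes between the lowest and highest rank form the unit.
    return list(range(lo, hi + 1))
```

Note that every comparison is made against the starting gene s, not against the most recently added gene, which follows the wording of step (b).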
8. The data analysis method according to claim 7, wherein, in step (a), the correlation coefficient is calculated by equation (4) below:
\[ r_{st} = \frac{\sum_{j=1}^{N_{st}} (x_{js} - \bar{x}_s)(x_{jt} - \bar{x}_t)}{\sqrt{\sum_{j=1}^{N_{st}} (x_{js} - \bar{x}_s)^2 \, \sum_{j=1}^{N_{st}} (x_{jt} - \bar{x}_t)^2}} \tag{4} \]
(where the two adjacent genes are denoted s and t; N_st is the number of experiments, out of M experiments, in which values were obtained for both genes s and t, and each of these N_st experiments is indexed by j; x_js is the expression profile of gene s in the j-th experiment; x_jt is the expression profile of gene t in the j-th experiment; \bar{x}_s is the mean of the expression profiles of gene s over the N_st experiments; and \bar{x}_t is the mean of the expression profiles of gene t over the N_st experiments.)
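A correlation coefficient restricted to the N_st experiments in which both genes have values can be computed as in the Python sketch below. It assumes the standard Pearson form and represents a missing measurement as `None`; the function name and this representation are hypothetical.

```python
import math


def profile_correlation(profile_s, profile_t):
    """Pearson correlation over the experiments where both genes have values.

    profile_s and profile_t hold one value per experiment (M in total);
    None marks a missing value (assumption).
    """
    pairs = [(a, b) for a, b in zip(profile_s, profile_t)
             if a is not None and b is not None]
    n_st = len(pairs)  # N_st: experiments with values for both genes
    if n_st < 2:
        raise ValueError("at least two shared experiments are required")
    mean_s = sum(a for a, _ in pairs) / n_st
    mean_t = sum(b for _, b in pairs) / n_st
    num = sum((a - mean_s) * (b - mean_t) for a, b in pairs)
    den = math.sqrt(sum((a - mean_s) ** 2 for a, _ in pairs)
                    * sum((b - mean_t) ** 2 for _, b in pairs))
    if den == 0:
        raise ValueError("a constant profile has no defined correlation")
    return num / den
```

Both means are taken over the shared experiments only, so an experiment in which either gene lacks a value does not influence the result.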
9. A program that causes a computer to execute the method according to any one of claims 1 to 8.
10. A computer-readable recording medium on which the program according to claim 9 is recorded.
11. A data analysis apparatus comprising the program according to claim 9 and a computer that executes, by means of the program, the method according to any one of claims 1 to 8.
PCT/JP2003/015637 2003-10-01 2003-12-05 Novel method for analyzing data collected by microarray experiment and the like WO2005034003A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2003343862A JP2005106755A (en) 2003-10-01 2003-10-01 Novel analyzing method of data obtained through microarray experiment and the like
JP2003-343862 2003-10-01

Publications (1)

Publication Number Publication Date
WO2005034003A1 true WO2005034003A1 (en) 2005-04-14

Family

ID=34419362

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2003/015637 WO2005034003A1 (en) 2003-10-01 2003-12-05 Novel method for analyzing data collected by microarray experiment and the like

Country Status (2)

Country Link
JP (1) JP2005106755A (en)
WO (1) WO2005034003A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007079623A1 (en) 2006-01-12 2007-07-19 Solar Focus Technology Co., Ltd A portable solar power supply system and its applying device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100817103B1 (en) 2006-08-03 2008-03-31 재단법인서울대학교산학협력재단 Method and system for analyzing microarray data
JP4893194B2 (en) * 2006-09-27 2012-03-07 東レ株式会社 Analysis apparatus and correction method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002512367A (en) * 1998-04-22 2002-04-23 イメージング リサーチ, インク. Methods for evaluating chemical and biological assays
JP2002533701A (en) * 1998-12-28 2002-10-08 ロゼッタ・インファーマティクス・インコーポレーテッド Statistical combination of cell expression profiles
JP2003256407A (en) * 2002-02-28 2003-09-12 Japan Science & Technology Corp Multivariate analysis system and expression profile analysis method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SABATTI C. ET AL.: "Co-expression pattern from DNA microarray experiments as a tool for operon prediction", NUCLEIC ACIDS RESEARCH, vol. 30, no. 13, 2002, pages 2886 - 2893, XP002977895 *
SHIMIZU H. ET AL.: "Operon prediction by DNA microarray (an approach with a Bayesian hierarchical model)", TECHNICAL REPORT OF IEICE, vol. 103, no. 150, 26 February 2003 (2003-02-26), pages 23 - 28, XP002977894 *

Also Published As

Publication number Publication date
JP2005106755A (en) 2005-04-21

Similar Documents

Publication Publication Date Title
Zhu et al. Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm
US11435338B2 (en) Fractional abundance of polynucleotide sequences in a sample
JP2008533558A (en) Normalization method for genotype analysis
CN108913776B (en) Screening method and kit for DNA molecular markers related to radiotherapy and chemotherapy injury
JP2016165286A (en) Gene-expression profiling with reduced numbers of transcript measurements
US20230086774A1 (en) Method and system for predicting biological age on basis of various omics data analyses
JP4360479B2 (en) A method of using quality assessment criteria to assess the quality of biochemical separations.
García-Pérez et al. Epigenomic profiling of primate lymphoblastoid cell lines reveals the evolutionary patterns of epigenetic activities in gene regulatory architectures
CN112233722A (en) Method for identifying variety, and method and device for constructing prediction model thereof
Naidu et al. Current knowledge on microarray technology-an overview
US8700381B2 (en) Methods for nucleic acid quantification
US20190250143A1 (en) Multipore determination of fractional abundance of polynucleotide sequences in a sample
Ness Basic microarray analysis: strategies for successful experiments
EP3959331A1 (en) Multipore determination of fractional abundance of polynucleotide sequences in a sample
WO2005034003A1 (en) Novel method for analyzing data collected by microarray experiment and the like
JP7170711B2 (en) Use of off-target sequences for DNA analysis
EP3884502A1 (en) Method and computer program product for analysis of fetal dna by massive sequencing
KR102659915B1 (en) Method of gene selection for predicting medical information of patients and uses thereof
JP2022534236A (en) A method for discovering a marker for predicting depression or suicide risk using multiple omics analysis, a marker for predicting depression or suicide risk, and a method for predicting depression or suicide risk using multiple omics analysis
Held et al. Microarrays in ecological research: a case study of a cDNA microarray for plant-herbivore interactions
Warnat-Herresthal et al. Artificial intelligence in blood transcriptomics
Aßmann et al. Impact of reference design on estimating SARS-CoV-2 lineage abundances from wastewater sequencing data
WO2022015998A1 (en) Gene panels and methods of use thereof for screening and diagnosis of congenital heart defects and diseases
KR20190088037A (en) SNP marker set for predicting of prognosis of rheumatoid arthritis
Kelley et al. Correcting for gene-specific dye bias in DNA microarrays using the method of maximum likelihood

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CN US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase