WO2005034003A1

WO2005034003A1 - Novel method for analyzing data collected by microarray experiment and the like

Info

Publication number: WO2005034003A1
Application number: PCT/JP2003/015637
Authority: WO
Inventors: Toshimichi Ikemura; Shigehiko Kanaya; Ken-Nosuke Wada; Yasushi Masuda; Tatsuya Nishi; Naotake Ogasawara; Kazuo Kobayashi
Original assignee: Japan As Represented By The President Of National Institute Of Genetics
Priority date: 2003-10-01
Filing date: 2003-12-05
Publication date: 2005-04-14
Also published as: JP2005106755A

Abstract

A data analysis method comprising the steps of (1) determining a threshold from the data values for which subtracting the background intensity from the measured signal intensity gives a negative result, and using that threshold to determine the range in which the logarithmic ratio between the signal value of a control experiment and that of a target experiment is assured to be quantitative; (2) dividing the average intensity into sections, calculating the average of the logarithmic ratios of the signal values in each section, and correcting the bias that appears when the logarithmic ratio is plotted against the average intensity; and (3) determining whether the expression profiles of adjacent genes obtained from microarray experiments or the like are correlated, and thereby predicting transcription units on a microbial genome with high accuracy.

Description

New analysis method for data obtained from microarray experiments, etc.

The present invention relates to a novel analysis method for data obtained from gene expression analysis such as a microarray experiment and a macroarray experiment, and a program for executing the method.

Background art

Microarray and macroarray experiments make it possible to obtain a vast amount of gene expression information quickly from a single experiment. For example, gene expression can be analyzed comprehensively under a wide range of conditions: genes expressed at specific times during development or during growth and division; genes expressed in a tissue-specific, organ-specific, or disease-specific manner; genes activated by external stimuli such as chemical substances, heat, and light; and genes controlled by, or regulated downstream of, transcription factors. The gene expression information (expression profiles) obtained from such experiments contributes to a comprehensive understanding of the mechanisms that regulate gene expression and, further, to the elucidation of life phenomena. Beyond this, based on the relationship between diseases and genes, it can also contribute to the development of new drugs (genomic drug discovery) and to the establishment of new testing, diagnostic, preventive, and therapeutic methods.

At present, the most frequently used microarray experiments fall broadly into two types: (1) methods that use an Affymetrix-type chip, in which labeled cRNA is hybridized with probes on the chip, and (2) methods that use a spotted array, in which labeled cDNA is hybridized with probes on a glass slide. Both evaluate changes in the expression level of individual genes on the basis of mRNA extracted from cells or the like. For example, in an experiment using labeled cDNA, fluorescently labeled cDNA is prepared from the mRNA extracted for the control experiment and from the mRNA extracted for the target experiment, and is hybridized with a large number of probes formed on the glass slide. The fluorescence at each probe position is then measured with a scanner, and the change in the expression level of each gene (k) is evaluated from the expression level (signal intensity) x_c(k) of gene (k) in the control experiment (c), the expression level x_T(k) of gene (k) in the target experiment (T), and their logarithmic ratio log[x_T(k) / x_c(k)].

Microarray experiments will become increasingly important and effective for rapidly analyzing gene expression on a large scale. However, the data obtained contain measurement errors and bias errors arising from various experimental causes, so it has been difficult to evaluate the amount of change in gene expression appropriately. For example, in both control and target experiments, the measured signal intensity of each gene (the value before the background intensity is subtracted) should ideally be greater than the background intensity, yet some measured values are in fact lower than the background intensity. Moreover, even when the measured signal intensity exceeds the background intensity, the logarithmic ratio log[x_T(k) / x_c(k)] cannot be guaranteed to be quantitative if the measured signal intensity is low. Nevertheless, the range over which quantitativeness is guaranteed has been decided subjectively or empirically.

In a microarray experiment, the expression level (fluorescence intensity) of most genes is nearly the same in the control experiment (c) and the target experiment (T), and the expression level changes for only some genes. In terms of the logarithmic ratio log[x_T(k) / x_c(k)], this means that the majority of genes are distributed near log[x_T(k) / x_c(k)] = 0. In practice, however, as shown in the upper graph of Fig. 1, the logarithmic ratio often includes a bias error that depends on the average intensity on the horizontal axis.

As a conventional correction method, the median MD of log[x_T(k) / x_c(k)] is obtained over all genes spotted on the microarray (k = 1, 2, ..., N, where N is the total number of genes), and log[x_T(k) / x_c(k)] - MD is calculated for every gene as the corrected value (Yoshihide Hayashizaki, DNA Microarray Experiment Manual, Yodosha, 2000). However, this conventional correction cannot reduce the intensity-dependent bias error described above. As noted, microarray experiments evaluate changes in gene expression between two experiments, a control experiment and a target experiment, in the form of a ratio (logarithmic ratio). Whether this ratio is quantitative affects the search for genes with similar expression profiles across multiple microarray experiments.
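The conventional median correction described above can be sketched in a few lines. This is only an illustration: the function name and the default base-10 logarithm are assumptions, not taken from the cited manual.

```python
import math
import statistics


def median_corrected_log_ratios(x_c, x_t, base=10):
    """Conventional correction: subtract the median log ratio MD from every gene."""
    # Log ratio log[x_T(k) / x_c(k)] for each gene k (signals must be positive).
    log_ratios = [math.log(t / c, base) for c, t in zip(x_c, x_t)]
    md = statistics.median(log_ratios)
    # The same constant MD is subtracted regardless of signal intensity.
    return [r - md for r in log_ratios]
```

Because a single constant is subtracted from every gene, this correction shifts the whole distribution but cannot remove a bias that varies with the average intensity.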

Among the sets of adjacent genes located on the same DNA strand of a microbial genome, a set of adjacent genes transcribed into the same mRNA is called a transcription unit. Identifying transcription units is very important for understanding the mechanisms that control gene expression in the genome. A method that predicts and estimates transcription units by correlating the expression profiles of multiple genes, using data obtained from multiple microarray experiments, would therefore be industrially useful, but no such method has yet been developed.

Disclosure of the invention

The present invention has been made in view of the above problems, and its object is to provide novel methods for analyzing data obtained from large-scale gene expression analysis such as microarray or macroarray experiments. Specifically, it aims to provide (1) a method for evaluating the logarithmic ratio log[x_T(k) / x_c(k)] that separates, on a clear statistical basis, the range in which quantitativeness is guaranteed from the range in which it is not; (2) a method for correcting and reducing the bias error in the logarithmic ratio that depends on the average intensity; (3) a method for predicting and estimating transcription units on a microbial genome from data obtained from multiple microarray experiments; and a program and the like for executing these methods on a computer.

The first data analysis method according to the present invention, in order to solve the above problems, is a method for analyzing data obtained from microarray experiments, macroarray experiments, and other similar gene expression analyses, characterized in that the range in which the logarithmic ratio between the signal intensity x_c(k) of gene (k) in the control experiment and the signal intensity x_T(k) of gene (k) in the target experiment is recognized as quantitative is determined by the following steps (a) to (c).

(a) obtaining, for each gene (k), the measured signal intensity s_c(k) and background intensity b_c(k) in the control experiment, and the measured signal intensity s_T(k) and background intensity b_T(k) in the target experiment,

(b) determining, on the basis of the data values for which s_c(k) - b_c(k) is negative, a first threshold that defines the range in which the signal intensity x_c(k) is regarded as substantially zero, and determining, on the basis of the data values for which s_T(k) - b_T(k) is negative, a second threshold that defines the range in which the signal intensity x_T(k) is regarded as substantially zero,

(c) judging that the logarithmic ratio between x_c(k) and x_T(k) is quantitative when x_c(k) is equal to or greater than the first threshold and x_T(k) is equal to or greater than the second threshold.

In step (b) above, it is preferable to set the first threshold to u × SD_c (where u is an arbitrary positive number and SD_c is the statistic given by equation (1) below) and the second threshold to u × SD_T (where u is an arbitrary positive number and SD_T is the statistic given by equation (2) below).

(2)

(Here, y_c(k) = s_c(k) - b_c(k) < 0 and y_T(k) = s_T(k) - b_T(k) < 0; k in equation (1) runs over the N_c signals for which s_c(k) - b_c(k) is negative, and k in equation (2) runs over the N_T signals for which s_T(k) - b_T(k) is negative. The base of log is any positive number greater than 1.)

The second data analysis method of the present invention, in order to solve the above problems, is a method for analyzing data obtained from microarray experiments, macroarray experiments, and other similar gene expression analyses, characterized in that the bias error depending on the average intensity, contained in the logarithmic ratio between the signal intensity x_c(k) of gene (k) in the control experiment and the signal intensity x_T(k) of gene (k) in the target experiment, is corrected by the following steps (a) to (c).

(a) dividing the average intensity Av[k] of x_c(k) and x_T(k), defined as {log[x_c(k)] + log[x_T(k)]} / 2, into sections according to its magnitude, and calculating, for the genes belonging to the s-th section (u = 1, 2, ..., N_s, where N_s is the total number of genes belonging to the s-th section), the average PAv(s) of the logarithmic ratio between x_c(u) and x_T(u),

(b) when the average intensity Av[k] of the k-th gene lies between Av_{s-1} (the mean of the minimum and maximum average intensity in the (s-1)-th section) and Av_s (the mean of the minimum and maximum average intensity in the s-th section), obtaining the reference intensity crit(k) at Av[k] by linear interpolation between the section averages PAv(s-1) and PAv(s),

(c) correcting the logarithmic ratio log(x_T(k) / x_c(k)) between x_c(k) and x_T(k) by defining the corrected logarithmic ratio LOG[k] according to equation (3) below.

LOG[k] = log(x_T(k) / x_c(k)) - crit(k)    (3)

In the above data analysis method, when the absolute value of the corrected logarithmic ratio LOG[k] of the k-th gene is equal to or greater than a threshold Th set for that gene, it is preferable to judge the k-th gene to be a signal pair showing a statistically significant change in expression level. It is further preferable to set the threshold Th by the following steps (a) to (c).

(a) for the genes (k) belonging to the s-th section, setting a value SD_crit for LOG[k] of these genes, calculating the standard deviation over the samples satisfying -SD_crit < LOG[k] < SD_crit, and defining this as SD1[s],

(b) based on the standard deviations SD1[s], calculating the average Smth[s] over a total of five points, the section itself and the two sections on either side (s-2, s-1, s, s+1, s+2), and taking it as the representative value at the average intensity Av_s (the mean of the minimum and maximum average intensity in the s-th section),

(c) when Av[k] lies between the average intensities Av_u and Av_{u+1}, calculating SD[k] by linear interpolation between Smth[u] and Smth[u+1],

SD[k] = Smth[u+1] + (Av[k] - Av_{u+1}) (Smth[u+1] - Smth[u]) / (Av_{u+1} - Av_u)

and taking SD[k] as the threshold Th.

Further, it is preferable to combine the first data analysis method and the second data analysis method of the present invention to classify signals obtained by experiments into a plurality of categories.

The third data analysis method of the present invention, in order to solve the above problems, is a method for analyzing data obtained from microarray experiments, macroarray experiments, and other similar gene expression analyses, characterized in that transcription units are estimated by the following steps (a) to (d) on the basis of the gene expression profiles obtained from such analyses and the genome information of those genes.

(a) calculating a correlation coefficient in an expression profile of two genes that are adjacent to each other on the genome and located on the same nucleic acid chain,

(b) when, on the basis of the above correlation coefficient, a significant correlation is judged to exist between the expression profile of the s-th gene and that of the (s+1)-th gene adjacent on its 3' side, assigning these genes to a transcription unit set; then assigning the (s+2)-th gene to the same set when a significant correlation is obtained between the expression profiles of the s-th and (s+2)-th genes; and repeating this operation for the (s+3)-th, (s+4)-th, ... genes until no significant correlation is obtained,

(c) judging, in the same manner as in step (b) above, whether there is a significant correlation between the expression profile of the s-th gene and those of the genes adjacent on its 5' side (the (s-1)-th, (s-2)-th, ... genes),

(d) estimating, as one transcription unit, the group of genes lying between the lowest-ranked and highest-ranked genes in the set obtained in steps (b) and (c) above.

In the above step (a), it is preferable to calculate the correlation coefficient by the following equation (4).

(Here, the two adjacent genes are denoted s and t. N_st is the number of experiments, out of M experiments, in which values were obtained for both genes s and t. Each experiment is denoted j; X_js is the expression profile of gene s in the j-th experiment, X_jt is the expression profile of gene t in the j-th experiment, X̄_s is the average of the expression profiles of gene s over the N_st experiments, and X̄_t is the average of the expression profiles of gene t over the N_st experiments.)

The program of the present invention, in order to solve the above problems, causes a computer to execute at least one of the first to third data analysis methods.

The recording medium of the present invention is a computer-readable recording medium on which the program of the present invention is recorded. The data analysis device of the present invention comprises the program of the present invention described above and a computer that executes at least one of the first to third data analysis methods using the program.

According to the present invention, (1) the range in which a pair of signal intensities obtained from a microarray experiment or the like is guaranteed to be quantitative in its logarithmic ratio can be determined objectively by a statistical method; (2) the bias error in the logarithmic ratio that depends on the average intensity can be corrected and reduced; and (3) transcription units on a microbial genome can be predicted with high accuracy using data obtained from microarray experiments and the like.

Further objects, features, and advantages of the present invention will be more fully understood from the following description. Also, the advantages of the present invention will become apparent in the following description with reference to the accompanying drawings.

Brief Description of Drawings

FIG. 1 is a diagram for explaining that the present invention corrects a bias error in a logarithmic ratio of signal intensities of a control experiment and a target experiment.

FIG. 2 is a diagram illustrating that signal data obtained by a microarray experiment or the like can be classified into a plurality of groups according to the present invention.

FIG. 3 is a diagram illustrating measurement conditions in a microarray experiment of the present example.

FIG. 4 is a graph comparing the standard deviations of signals whose logarithmic ratio was recognized as quantitative according to the present invention and signals whose logarithmic ratio was not.

FIG. 5 is a graph comparing the standard deviation of the logarithmic ratio corrected according to the present invention with that of the uncorrected logarithmic ratio.

FIG. 6 is a graph comparing the standard deviation of the logarithmic ratio corrected by the present invention with that of the uncorrected logarithmic ratio, for signal data obtained under different experimental conditions in the control experiment and the target experiment.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, an embodiment of the present invention will be described.

(1) Method for determining the range in which the logarithmic ratio of signal data is guaranteed to be quantitative (signal detection method that guarantees quantitativeness in the logarithmic ratio)

Here, the analysis of data obtained from a microarray experiment using cDNA as the sample (a cDNA microarray experiment) is described as an example. In a cDNA microarray experiment, the measured signal intensity and the background intensity are normally obtained for each gene (k = 1, 2, ..., N) in both the control experiment (c) and the target experiment (T). Let these be s_c(k), b_c(k) and s_T(k), b_T(k), respectively. The values s_c(k) - b_c(k) and s_T(k) - b_T(k) should ideally be zero or greater. Negative values of s_c(k) - b_c(k) and s_T(k) - b_T(k) can therefore be regarded as scatter around values that should originally be zero. The statistics SD_c and SD_T for evaluating this scatter are given by equations (1) and (2) below, respectively.

(Here, y_c(k) = s_c(k) - b_c(k) < 0 and y_T(k) = s_T(k) - b_T(k) < 0; k in equation (1) runs over the N_c signals for which s_c(k) - b_c(k) is negative, and k in equation (2) runs over the N_T signals for which s_T(k) - b_T(k) is negative. The base of log is any positive number greater than 1.)

On the other hand, when (signal intensity) - (background intensity) is zero or greater, the signal values are expressed as x_c(k) = s_c(k) - b_c(k) (≥ 0) and x_T(k) = s_T(k) - b_T(k) (≥ 0). In this method, the signal value x_c(k) in the control experiment is regarded as 0 when

x_c(k) < u × SD_c    (I)

Here, u is an arbitrary positive number; u = 1 corresponds to the region in which statistically 68% of the data are distributed.

Similarly, the signal intensity x_T(k) in the target experiment is regarded as 0 when

x_T(k) < u × SD_T    (II)

In this method, when x_c(k) satisfies formula (I) or x_T(k) satisfies formula (II), the logarithmic ratio for the k-th gene cannot be expressed. That is, when either condition is satisfied, the logarithmic ratio is judged not to be guaranteed quantitative.
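The thresholding of method (1) can be sketched as follows. Equations (1) and (2) are not reproduced in this text, so the sketch assumes, purely for illustration, that SD is the root mean square of the negative (signal - background) differences, that is, their spread around zero. The function names are hypothetical.

```python
import math


def zero_threshold(signal, background, u=1.0):
    """Return u * SD, with SD estimated from the negative (signal - background) values.

    Assumption: SD is the root mean square of the negative differences;
    the source's equations (1) and (2) may define it differently.
    """
    negatives = [s - b for s, b in zip(signal, background) if s - b < 0]
    if not negatives:
        return 0.0  # no negative differences, so no noise estimate is available
    sd = math.sqrt(sum(y * y for y in negatives) / len(negatives))
    return u * sd


def is_quantitative(x_c, x_t, th_c, th_t):
    """The log ratio is treated as quantitative only if neither signal is regarded as 0."""
    return x_c >= th_c and x_t >= th_t
```

In use, `zero_threshold` would be called once with the control data to obtain u × SD_c and once with the target data to obtain u × SD_T; `is_quantitative` then applies the negation of formulas (I) and (II) to each gene.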

(2) Method for reducing bias error in logarithmic ratio

When neither formula (I) nor formula (II) is satisfied for the k-th gene, evaluation by the logarithmic ratio becomes possible. When the logarithmic ratio log[x_T(k) / x_c(k)] (or log[x_c(k) / x_T(k)]; the base of the logarithm is any positive number greater than 1, for example 10) is plotted on the vertical axis against the average intensity Av[k] (= {log[x_c(k)] + log[x_T(k)]} / 2) on the horizontal axis, the majority of genes are expected to be distributed near log[x_T(k) / x_c(k)] = 0. However, as shown in the upper graph in Fig. 1, the logarithmic ratio often includes a bias error that depends on the average intensity on the horizontal axis. This bias error is reduced by the following method. First, the average intensity is divided into multiple sections at a fixed step size. For the genes k belonging to the s-th section, the average value PAv(s) of the logarithmic ratio is calculated by equation (5).

$$\mathrm{PAv}(s) = \frac{1}{N_s} \sum_{k \in \text{section } s} \log\frac{x_T(k)}{x_c(k)}, \qquad \min(s) \le \mathrm{Av}[k] \le \max(s) \tag{5}$$

(Here, k runs over the N_s signals belonging to the s-th section, min(s) denotes the minimum value of the average intensity in the s-th section, and max(s) denotes the maximum value of the average intensity in the s-th section.)

In microarray experiments, the smaller the average intensity, the greater the scatter of the logarithmic ratio. It is therefore important to compute the average from a larger number of samples in sections with lower average intensity. Numbering the sections s = 1, 2, ..., Stotal from the lowest to the highest average intensity, the number of samples required in section s is set to

$$s\,\frac{N_{\mathrm{start}} - N_{\mathrm{final}}}{1 - S_{\mathrm{total}}} + \frac{N_{\mathrm{final}} - N_{\mathrm{start}}\,S_{\mathrm{total}}}{1 - S_{\mathrm{total}}} \tag{A}$$

where Nstart is the number of samples at s = 1 and Nfinal is the number of samples at s = Stotal (for example, Nstart = 40 and Nfinal = 5), with Nstart > Nfinal. Only when this condition is met, that is, when the actual number of samples in a section is equal to or greater than the set value (A), is PAv(s) used as the representative value at the average intensity Av_s = [min(s) + max(s)] / 2. When a sufficient number of samples cannot be obtained, no representative value at Av_s is available, so PAv(s) is instead obtained by linear interpolation between the preceding and following points (Av_q, PAv(q)) and (Av_t, PAv(t)).
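As a quick check of expression (A), the sketch below reads it as a straight line through (1, Nstart) and (Stotal, Nfinal), which is what the stated boundary values imply. The function name and the handling of a single section are assumptions.

```python
def required_samples(s, s_total, n_start=40, n_final=5):
    """Required sample count for section s (1-based), linear from n_start to n_final."""
    if s_total == 1:
        return float(n_start)  # assumption: with one section, require n_start samples
    # Expression (A): slope term plus intercept term, both over (1 - Stotal).
    return (s * (n_start - n_final) + (n_final - n_start * s_total)) / (1 - s_total)
```

With Nstart = 40, Nfinal = 5, and eight sections, the requirement falls by 5 samples per section, from 40 in the lowest-intensity section to 5 in the highest.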

When the average intensity Av[k] of the k-th gene lies between Av_{s-1} (= [min(s-1) + max(s-1)] / 2) and Av_s (= [min(s) + max(s)] / 2), the reference intensity crit(k) at Av[k] is obtained from PAv(s-1) and PAv(s) by linear interpolation according to equation (6).

$$\mathrm{crit}(k) = \mathrm{PAv}(s) + \frac{(\mathrm{Av}[k] - \mathrm{Av}_s)\,(\mathrm{PAv}(s) - \mathrm{PAv}(s-1))}{\mathrm{Av}_s - \mathrm{Av}_{s-1}} \tag{6}$$

Based on this reference intensity, the corrected logarithmic ratio LOG[k] is defined by equation (7).

$$\mathrm{LOG}[k] = \log\frac{x_T(k)}{x_c(k)} - \mathrm{crit}(k) \tag{7}$$
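The bias correction described above (sectioning by average intensity, section averages PAv(s), interpolated reference crit(k), and corrected ratio LOG[k]) can be sketched as follows. This is a simplified illustration under stated assumptions: it omits the minimum-sample requirement of expression (A), the step size of 0.5 is arbitrary, and genes outside the outermost section centres reuse the nearest section average. All names are hypothetical.

```python
import math


def corrected_log_ratios(x_c, x_t, step=0.5, base=10):
    """Return LOG[k] = log(x_T(k) / x_c(k)) - crit(k) for each gene k."""
    av = [(math.log(c, base) + math.log(t, base)) / 2 for c, t in zip(x_c, x_t)]
    ratio = [math.log(t / c, base) for c, t in zip(x_c, x_t)]

    # Group genes into fixed-width sections of average intensity.
    lo = min(av)
    sections = {}
    for k, a in enumerate(av):
        sections.setdefault(int((a - lo) // step), []).append(k)

    # One (Av_s, PAv(s)) point per non-empty section, ordered by intensity.
    points = []
    for s in sorted(sections):
        members = sections[s]
        centre = (min(av[k] for k in members) + max(av[k] for k in members)) / 2
        p_av = sum(ratio[k] for k in members) / len(members)
        points.append((centre, p_av))

    def crit(a):
        # Assumption: outside the outermost centres, reuse the nearest PAv.
        if a <= points[0][0]:
            return points[0][1]
        if a >= points[-1][0]:
            return points[-1][1]
        # Linear interpolation between neighbouring section centres.
        for (a0, p0), (a1, p1) in zip(points, points[1:]):
            if a0 <= a <= a1:
                return p1 + (a - a1) * (p1 - p0) / (a1 - a0)

    return [r - crit(a) for r, a in zip(ratio, av)]
```

If every gene carries the same offset, for example x_T(k) = 2 x_c(k) throughout, every section average equals that offset and the corrected ratios are all zero, which is the behaviour expected of the correction for genes whose expression has not changed.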

The lower graph of Fig. 1 shows the result of correcting the bias error of the logarithmic ratio using the corrected logarithmic ratio LOG[k]. The user specifies the range of scatter around LOG[k] = 0 as -SD_crit < LOG[k] < SD_crit; that is, SD_crit is set by the user on the basis of the experimental data. The samples are divided into multiple sections according to the value of the average intensity (s = 0, 1, ..., Stotal - 1; here the sections are specified in increments of 0.1, although the section width is of course not limited to this), and for each section the standard deviation is calculated over the samples satisfying -SD_crit < LOG[k] < SD_crit (SD_crit may be set to a separate value for each section, or the same value may be used for all sections). This is denoted SD1[s]. SD1[s] is desirably calculated from 50 or more samples, but this is not a requirement.

Based on the standard deviations SD1[s] obtained in this way, the average Smth[s] over a total of five points, the section itself and the two sections on either side (s-2, s-1, s, s+1, s+2), is calculated and taken as the representative value at the average intensity Av_s = [min(s) + max(s)] / 2.
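The five-point averaging of the per-section standard deviations can be sketched as below. The text does not say how the first and last two sections are treated, so the sketch assumes the window is simply truncated at the ends; the function name is hypothetical.

```python
def smooth_five_point(sd1):
    """Smth[s]: mean of SD1 over sections s-2 .. s+2."""
    n = len(sd1)
    smoothed = []
    for s in range(n):
        # Assumption: near the ends, average only the sections that exist.
        window = sd1[max(0, s - 2):min(n, s + 3)]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

For interior sections this is the plain five-point mean described in the text; only the boundary behaviour is a choice made for this sketch.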

When the k-th gene has the value LOG[k] at the average intensity Av[k], and Av[k] lies between the average intensities Av_u and Av_{u+1}, SD[k] is calculated by linear interpolation between Smth[u] and Smth[u+1]:

SD[k] = Smth[u+1] + (Av[k] - Av_{u+1}) (Smth[u+1] - Smth[u]) / (Av_{u+1} - Av_u)

When 2 × SD[k] < |LOG[k]|, the k-th gene is judged to be a signal pair showing a statistically significant change in expression level.

According to this method, the signals of a microarray experiment are classified into the following four groups A to D. Groups A and B are the signal conditions usable for quantitative analysis, such as searching for genes with similar expression profiles across multiple sets of microarray data. Groups A to D can all be used for qualitative analysis, such as searching for genes that differ significantly between the two experiments, the control experiment and the target experiment (see also Fig. 2; EF and F).

Group A: Signal pairs whose logarithmic ratio is guaranteed to be quantitative and which show a statistically significant change in expression level. Condition: SD_c < x_c(k) and SD_T < x_T(k), and 2 × SD[k] < |LOG[k]|

Group B: Signal pairs whose logarithmic ratio is guaranteed to be quantitative but which do not show a statistically significant change in expression level. Condition: SD_c < x_c(k) and SD_T < x_T(k), and 2 × SD[k] > |LOG[k]|

Group C: Signal pairs whose logarithmic ratio is not guaranteed to be quantitative because one of the signals is regarded as 0, but for which a difference between the two signals is obtained. Conditions: (i) SD_c > x_c(k) and SD_T < x_T(k), or (ii) SD_c < x_c(k) and SD_T > x_T(k)

Group D: Signal pairs for which both signals are regarded as 0, so that the logarithmic ratio is not guaranteed to be quantitative and no difference between the two signals is recognized. Condition: SD_c > x_c(k) and SD_T > x_T(k)
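The four-way classification can be expressed as a short decision function. This is a sketch: the function name is hypothetical, and because the group conditions do not say how exact ties are handled, the sketch treats a signal equal to its threshold as non-zero, consistent with formulas (I) and (II) for u = 1.

```python
def classify_signal_pair(x_c, x_t, sd_c, sd_t, log_k=None, sd_k=None):
    """Return 'A', 'B', 'C' or 'D' for one signal pair.

    log_k (corrected log ratio) and sd_k (interpolated SD[k]) are only
    needed when both signals are above their zero thresholds.
    """
    c_ok = x_c >= sd_c  # control signal is not regarded as 0
    t_ok = x_t >= sd_t  # target signal is not regarded as 0
    if c_ok and t_ok:
        # Quantitative log ratio: significant change if |LOG[k]| > 2 * SD[k].
        return "A" if abs(log_k) > 2 * sd_k else "B"
    if c_ok or t_ok:
        return "C"  # exactly one signal regarded as 0
    return "D"      # both signals regarded as 0
```

Genes returned as "A" or "B" are the ones the text treats as suitable for quantitative analysis, while all four groups remain usable for qualitative comparison of the two experiments.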

(3) Estimation method of transcription unit

In bacterial gene expression, among the sets of contiguous genes on the same DNA strand of the genome, a contiguous gene set transcribed into the same mRNA is called a transcription unit. Predicting transcription units is very important from the viewpoint of the control of gene expression in the genome. A method for estimating transcription units from the expression levels, or changes in expression level, of genes under various conditions as represented by microarray data is described below. A set of genes arranged contiguously in the same direction on the genome is called a directon. When multiple genes belong to the same transcription unit, they are transcribed as the same mRNA, so in theory their expression profiles are positively correlated. The correlation of microarray expression profiles is therefore calculated between genes belonging to the same directon. Suppose that M microarray experiments have been performed (each yielding an expression profile for N genes); the expression profiles of the genes can then be represented by the following N × M matrix.

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ x_{21} & x_{22} & \cdots & x_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{pmatrix}$$

Here, the expression profile X_s of the s-th gene can be written as the following M-dimensional vector.

$$X_s = (x_{s1}, x_{s2}, \ldots, x_{sj}, \ldots, x_{sM})$$

An algorithm is described below that predicts transcription units from the correlation between the expression profiles, written as the vectors X_s and X_t, of two genes (s, t) located adjacent to each other on the same DNA strand of the genome. With this algorithm, it is possible to estimate the transcription unit without necessarily obtaining information on the correlation between adjacent genes.

In this method, a transcription unit is estimated based on the expression profile and genomic information by the following steps 1 to 4.

[Step 1: Calculation of correlation coefficients between genes belonging to the same directon] The correlation coefficient r(s, t) between the expression profiles X_s and X_t is calculated for each pair of the s-th and t-th genes on the same directon. Here, s = 1, 2, ..., N and t = 1, 2, ..., N, where N is the total number of genes belonging to the directon of interest. A directon is a set of genes positioned consecutively on the same DNA strand.

[Step 2: Search in the 3' direction for genes with in-phase expression, taking adjacency into account] When there is a significant correlation between the expression profile X_s of the s-th gene and the expression profile X_{s+1} of the adjacent (s+1)-th gene, this gene is assigned to the transcription unit set. Subsequently, the (s+2)-th gene is assigned to the same set when a significant correlation is obtained between the expression profiles of the s-th and (s+2)-th genes. This operation is repeated for the (s+3)-th, (s+4)-th, ... genes until no significant correlation is obtained.

[Step 3: Search in the 5' direction for genes with in-phase expression, taking adjacency into account] The same operation as in Step 2 above is performed for the (s-1)-th, (s-2)-th, ... genes.

[Step 4: List transcription unit section candidates]

In the set obtained in steps 2 and 3, the group of genes sandwiched between the gene with the lowest rank and the gene with the highest rank is estimated to be one transcription unit.

The specific method of calculating the correlation coefficient r (s, t) in step 1 will be described in an embodiment described later.
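Steps 1 to 4 can be sketched as follows. Two points are assumptions made only for this illustration: the correlation coefficient is taken to be the ordinary Pearson coefficient computed over the experiments in which both genes have a value, and "significant correlation" is replaced by a simple cut-off `r_min`, which is a hypothetical parameter. Missing measurements are represented by `None`.

```python
import math


def pearson(xs, xt):
    """Correlation over experiments where both genes have a value (None = missing)."""
    pairs = [(a, b) for a, b in zip(xs, xt) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_s = sum(a for a, _ in pairs) / n
    mean_t = sum(b for _, b in pairs) / n
    cov = sum((a - mean_s) * (b - mean_t) for a, b in pairs)
    var_s = sum((a - mean_s) ** 2 for a, _ in pairs)
    var_t = sum((b - mean_t) ** 2 for _, b in pairs)
    if var_s == 0 or var_t == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / math.sqrt(var_s * var_t)


def transcription_unit(profiles, s, r_min=0.8):
    """Return (first, last) gene indices of the unit grown from seed gene s.

    profiles lists the genes of one directon in 5' to 3' order.
    """
    # Step 2: extend in the 3' direction, always comparing against gene s.
    last = s
    t = s + 1
    while t < len(profiles) and pearson(profiles[s], profiles[t]) >= r_min:
        last = t
        t += 1
    # Step 3: extend in the 5' direction in the same way.
    first = s
    t = s - 1
    while t >= 0 and pearson(profiles[s], profiles[t]) >= r_min:
        first = t
        t -= 1
    # Step 4: everything between the lowest and highest index forms the unit.
    return first, last
```

Each candidate gene is compared with the seed gene s rather than with its immediate neighbour, following the wording of Step 2, so the extension stops at the first gene whose profile is no longer correlated with that of the seed.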

Of course, the present invention is not limited to the above-described methods (1) to (3) of the present embodiment, and various changes can be made within the scope of the present invention. For example, the values of the threshold value, the reference value, and the like used in the above methods (1) to (3) are arbitrary, and appropriate values may be set according to the application and purpose. Further, additional steps may be added to the steps (steps) of the above methods (1) to (3).

(4) Utility of the present invention (field of application)

The present invention relates to a novel analysis method for data obtained from gene expression analysis such as a microarray experiment and a macroarray experiment, and a program for executing the method. It enables such data to be evaluated and used appropriately, and it can be used not only as a research tool but can also contribute to the development of new drugs based on the relationship between diseases and genes (genomic drug discovery) and to the establishment of new testing, diagnostic, and preventive methods.

As an example of how the present invention can be used, the analysis method can be chosen according to the signal intensity of a microarray experiment. For example, when searching for genes with similar expression profiles across several sets of microarray data, only genes that satisfy the conditions of Groups A and B described above are analyzed. When simply searching for genes that change significantly in a single microarray experiment, genes that satisfy the conditions of Groups A to D described above are targeted. In this way, the range of target genes suitable for quantitative analysis or for qualitative analysis can be determined. This can improve the accuracy of the multivariate analysis usually used for microarray analysis.

The program of the present invention causes a computer to execute the method of the present invention (for example, any one of the methods (1) to (3)), and the recording medium of the present invention is any recording medium on which the program of the present invention is recorded and which can be accessed and read by a computer. Such recording media include, but are not limited to, magnetic recording media such as flexible disks, hard disks, and magnetic tapes; optical recording media such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, and DVD-RW; electrical storage media such as RAM and ROM; and magneto-optical storage media such as MO.

The “data analysis device” of the present invention comprises the program of the present invention and a computer that executes the method of the present invention (for example, any one of the methods (1) to (3)) by means of the program. The computer basically has a configuration capable of executing the method of the present invention, and includes an input device, a data storage device, a central processing unit, and an output device.

Hereinafter, the present invention will be described more specifically with reference to examples, but the present invention is not limited to these examples.

[Analysis example]

Expression profiles of Escherichia coli mutants deficient in specific genes and changes over time in Escherichia coli were measured using a commercially available Escherichia coli microarray. Figure 3 shows the measurement conditions.

[Example 1: A method for detecting a signal whose logarithmic ratio (relative value) is quantitative]

This example shows the analysis results obtained when the signal values x_C(k) and x_T(k) in the control experiment and the target experiment are regarded as 0 whenever they satisfy, respectively,

$$x_C(k) < SD_C \tag{I}$$

$$x_T(k) < SD_T \tag{II}$$

When x_C(k) satisfies formula (I), or x_T(k) satisfies formula (II), the log ratio for the k-th gene cannot be expressed. That is, if either formula is satisfied, quantitativeness of the log ratio is not guaranteed. This can be confirmed as follows: when x_C(k) and x_T(k) are measured under the same conditions, the standard deviation of the log ratio log[x_T(k) / x_C(k)] under conditions 2-1 and 2-2 below is larger than that under condition 1.

Condition 1: The logarithmic ratio is guaranteed to be quantitative. Condition: SD_C < x_C(k) and SD_T < x_T(k)

Condition 2-1: One of the signals is 0, so the logarithmic ratio cannot be guaranteed to be quantitative, but a difference is found between the two signals. Condition: (i) SD_C > x_C(k) and SD_T < x_T(k), or (ii) SD_C < x_C(k) and SD_T > x_T(k)

Condition 2-2: Both signals are 0, so the logarithmic ratio cannot be guaranteed to be quantitative, and it is judged that there is no difference between the two signals. Condition: SD_C > x_C(k) and SD_T > x_T(k)
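The three conditions can be illustrated with a minimal Python sketch. The function name `classify_condition` and the numeric values are hypothetical and do not appear in the specification. The text leaves the case of exact equality with a threshold undefined; this sketch assumes that a signal equal to its threshold is treated as 0.

```python
def classify_condition(x_c, x_t, sd_c, sd_t):
    """Classify one signal pair as condition '1', '2-1', or '2-2'.

    x_c, x_t   : signal values in the control and target experiments
    sd_c, sd_t : thresholds SD_C and SD_T
    Assumption: a signal exactly equal to its threshold counts as 0.
    """
    control_ok = x_c > sd_c
    target_ok = x_t > sd_t
    if control_ok and target_ok:
        return "1"      # log ratio is quantitative
    if control_ok or target_ok:
        return "2-1"    # one signal regarded as 0, difference between signals
    return "2-2"        # both signals regarded as 0, no difference


# Illustrative values only.
print(classify_condition(120.0, 300.0, 50.0, 60.0))
print(classify_condition(20.0, 300.0, 50.0, 60.0))
print(classify_condition(20.0, 30.0, 50.0, 60.0))
```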

FIG. 4 shows the standard deviation of the log ratio under the above three conditions, calculated for the same mRNA sample extracted from E. coli cultured in LB medium. (A) and (B) are the results of calculating the standard deviation of the log ratio under the three conditions for mRNA extracted twice independently in the logarithmic growth phase. In an ideal system without experimental error, these standard deviations would all be zero. In practice, errors cause the data to vary, and this variation is quantified by the standard deviation. The magnitude of the error is clearly larger under conditions 2-1 and 2-2 than under condition 1. It is therefore concluded that, under conditions 2-1 and 2-2, it is difficult to evaluate changes in expression level quantitatively by the logarithmic ratio.

[Example 2: Method for reducing bias error in logarithmic ratio]

For signals that satisfy either of the conditions in Groups A and B, quantitativeness is guaranteed in the log ratio.

Group A: Signal pairs for which quantitativeness of the log ratio is guaranteed and which show a statistically significant change in expression level. Condition: SD_C < x_C(k) and SD_T < x_T(k), and 2 × SD[k] < |LOG[k]|

Group B: Signal pairs for which quantitativeness of the log ratio is guaranteed but which do not show a statistically significant change in expression level. Condition: SD_C < x_C(k) and SD_T < x_T(k), and 2 × SD[k] > |LOG[k]|
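The Group A and Group B conditions can be expressed as a short Python sketch. The function name `classify_group` is hypothetical, and the inputs LOG[k] and SD[k] are assumed to have been computed beforehand. The case 2 × SD[k] = |LOG[k]| is not covered by either condition in the text; this sketch assigns it to Group B as an assumption.

```python
def classify_group(x_c, x_t, sd_c, sd_t, log_k, sd_k):
    """Return 'A', 'B', or None for one signal pair.

    log_k : corrected log ratio LOG[k]
    sd_k  : standard deviation SD[k] used for the significance test
    None means that quantitativeness of the log ratio is not guaranteed.
    Assumption: the boundary case 2*SD[k] == |LOG[k]| is placed in Group B.
    """
    if not (x_c > sd_c and x_t > sd_t):
        return None
    return "A" if abs(log_k) > 2 * sd_k else "B"


# Illustrative values only.
print(classify_group(120.0, 300.0, 50.0, 60.0, log_k=1.3, sd_k=0.4))
print(classify_group(120.0, 130.0, 50.0, 60.0, log_k=0.1, sd_k=0.4))
```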

For signals that satisfy either of the above conditions A and B, the standard deviation with respect to the origin was determined for both the uncorrected log ratio and the corrected log ratio. That the correction method of the present invention reduces the bias error can be confirmed by the fact that the standard deviation of the corrected log ratio with respect to the origin is smaller than the standard deviation of the uncorrected log ratio.
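A small Python sketch of this comparison follows. It assumes that "standard deviation with respect to the origin" means the root mean square of the values about zero; the specification does not spell out the formula. The constant offset used here as the "correction" is only a toy stand-in for the correction method of the invention, and all numbers are illustrative.

```python
import math


def sd_about_origin(values):
    """Root mean square deviation of the values from 0 (assumed definition)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Toy log ratios from a same-sample comparison; ideally all would be 0.
raw = [0.30, 0.10, 0.25, 0.15]

# Toy correction: remove a constant bias (NOT the invention's interpolation).
bias = sum(raw) / len(raw)
corrected = [v - bias for v in raw]

print(sd_about_origin(raw), sd_about_origin(corrected))
```

A smaller value for the corrected data indicates that the systematic offset from the origin has been reduced.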

Figure 5 shows the standard deviation calculated for the uncorrected log ratio and the corrected log ratio (LOG[k]) of each signal that satisfies condition A or B, for the same mRNA sample extracted from E. coli cultured in LB medium. As shown in the figure, the standard deviation of the corrected log ratio with respect to the origin was smaller than the standard deviation of the uncorrected log ratio. (A) and (B) are the results of performing the same experiment on mRNA extracted twice independently in the logarithmic growth phase and calculating the standard deviation.

Condition B describes signal pairs for which quantitativeness of the log ratio is guaranteed but which do not show a statistically significant change in expression. Condition B can therefore be used even when mRNA is extracted under different conditions in the control experiment and the target experiment. If the correction method of the present invention reduces the bias error, the standard deviation of the corrected log ratio with respect to the origin should be smaller than the standard deviation of the uncorrected log ratio. To confirm this, signals satisfying condition B were selected from the expression intensities measured in the actual comparisons shown in Fig. 3 (between a deficient strain and the wild-type strain, or between a specific time and an arbitrary time). For these signals, the relationship between the standard deviation of the corrected log ratio (LOG[k]) and the standard deviation of the uncorrected log ratio was obtained and is shown in the graph of the figure. As shown in the figure, the standard deviation of the corrected log ratio with respect to the origin was smaller than the standard deviation of the uncorrected log ratio.

[Example 3: Estimation method of transcription unit]

A transcription unit is a group of genes that are transcribed into the same mRNA. A group of genes in the same transcription unit can be found by identifying genes that lie on the same strand, are adjacent to each other on the genome, and have positively correlated expression profiles. When expression profiles are measured with a microarray, the profiles of several thousand genes can be measured in one experiment, but depending on the conditions there are many genes whose expression cannot be measured. The present estimation method can estimate the transcription unit even when the expression data of the i-th and j-th genes are not complete for all M experimental conditions. In addition, because all gene pairs belonging to the same directon (a run of consecutive genes with the same transcription direction) are considered, the transcription unit can be estimated even when the correlation coefficient between adjacent genes is itself missing. That is, the transcription unit is estimated as follows while compensating for these two kinds of missing data.

Genes that consecutively have the same transcription direction on the genome are numbered 1, 2, ..., i, j, ..., n in order from the 5' side. Here, i < j. The expression profiles of the i-th and j-th genes are assumed to have been measured under s_ij pairs of experimental conditions, and the correlation coefficient r(i, j, s_ij) is obtained. In this method, the following Pearson correlation formula was used.

$$r(s, t, N_{st}) = \frac{\sum_{j=1}^{N_{st}} (x_{js} - \bar{x}_s)(x_{jt} - \bar{x}_t)}{\sqrt{\sum_{j=1}^{N_{st}} (x_{js} - \bar{x}_s)^2 \sum_{j=1}^{N_{st}} (x_{jt} - \bar{x}_t)^2}}$$

Here, r(s, t, N_st) indicates that, for the s-th and t-th genes, values could be obtained in N_st of the M experiments.

For n genes, a correlation coefficient corresponding to n (n-1) / 2 pairs is determined.
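A minimal Python sketch of a Pearson correlation restricted to the experiments in which both genes were measured is given here. The function name `pearson_pairwise`, the use of `None` to mark an unmeasured value, and the requirement of at least three shared experiments are assumptions made for illustration.

```python
import math


def pearson_pairwise(profile_s, profile_t):
    """Return (r, n_st) for two expression profiles of equal length.

    Only experiments where both values are present (not None) are used,
    so n_st may be smaller than the total number of experiments M.
    r is None when fewer than 3 shared experiments exist (assumption)
    or when one of the profiles has zero variance.
    """
    pairs = [(a, b) for a, b in zip(profile_s, profile_t)
             if a is not None and b is not None]
    n_st = len(pairs)
    if n_st < 3:
        return None, n_st
    mean_s = sum(a for a, _ in pairs) / n_st
    mean_t = sum(b for _, b in pairs) / n_st
    num = sum((a - mean_s) * (b - mean_t) for a, b in pairs)
    den = math.sqrt(sum((a - mean_s) ** 2 for a, _ in pairs)
                    * sum((b - mean_t) ** 2 for _, b in pairs))
    if den == 0:
        return None, n_st
    return num / den, n_st


# Two profiles over M = 5 experiments, each with one missing value.
r, n_st = pearson_pairwise([1.0, 2.0, 3.0, None, 5.0],
                           [2.0, 4.0, 6.0, 8.0, None])
print(r, n_st)
```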

(1) When the correlation coefficient r(i, j, s_ij) is larger than the reference value r(s_ij, a), the correlation coefficient r(i, j, s_ij) is guaranteed to represent a statistically significant positive correlation. Here, the reference correlation value r(s_ij, a) can be obtained by

$$r(s_{ij}, a) = \frac{t_a}{\sqrt{t_a^2 + s_{ij} - 2}}$$

Here, t_a is the t value at the significance level a of the statistical test, and can be obtained from a t-distribution table. If r(i, j, s_ij) is a significant positive correlation, the i-th and j-th genes may be in the same transcription unit. That is, the j-i+1 genes from the i-th to the j-th may be in the same transcription unit.
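The reference value can be computed with the standard relation between a t value and a correlation coefficient, r = t / sqrt(t^2 + n - 2), as in the following sketch. The function name `reference_r` is hypothetical. The t value is passed in by the caller, since the text says it is read from a t-distribution table; the example value 1.860 is the tabulated one-sided 5% t value for 8 degrees of freedom, and both the use of n - 2 degrees of freedom and the one-sided test are assumptions.

```python
import math


def reference_r(t_a, n_pairs):
    """Reference correlation value for n_pairs shared experiments.

    t_a     : t value at the chosen significance level, taken from a table
    n_pairs : number of experiment pairs s_ij (must be at least 3)
    """
    if n_pairs < 3:
        raise ValueError("at least 3 experiment pairs are required")
    return t_a / math.sqrt(t_a ** 2 + n_pairs - 2)


# Example: 10 shared experiments, t value 1.860 (assumed one-sided 5%, 8 d.f.).
threshold = reference_r(1.860, 10)
print(round(threshold, 3))
print(0.72 > threshold)  # an observed r of 0.72 would count as significant
```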

(2) When the correlation coefficient r(i, j, s_ij) is negative, the i-th and j-th genes belong to different transcription units. This further implies that, for any gene pair (x, y) with x ≤ i and j ≤ y, the two genes x and y are in different transcription units even if their expression profiles are positively correlated.

(3) With the i-th gene as the reference, when the k genes i+1, i+2, ..., i+k are significantly positively correlated with it, these genes are in the same transcription unit. Likewise, when the m genes i-1, i-2, ..., i-m are significantly positively correlated with it, these genes are in the same transcription unit.

The transcription unit was estimated by the following steps based on these three conditions (1) to (3).

(Step 1) Genes that consecutively have the same transcription direction on the genome are numbered 1, 2, ..., i, j, ..., n in order from the 5' side. Here, i < j. For the i-th and j-th genes, the correlation coefficient r(i, j, s_ij) based on the expression profiles for s_ij pairs of experimental conditions is determined. For n genes, correlation coefficients for n(n-1)/2 pairs are determined.

(Step 2) Statistically significant positive correlation coefficients are selected from the correlation coefficients r(i, j, s_ij). When the correlation coefficient r(i, j, s_ij) is larger than the reference value r(s_ij, a), a statistically significant positive correlation is guaranteed. The reference correlation value r(s_ij, a) is

$$r(s_{ij}, a) = \frac{t_a}{\sqrt{t_a^2 + s_{ij} - 2}}$$

(Step 3) Gene pairs having a negative correlation coefficient r(i, j, s_ij) are identified. If such a gene pair is (i, j), then for any gene pair (x, y) with x ≤ i and j ≤ y, the two genes x and y are in different transcription units and are not included in the same transcription unit, even if their expression profiles are positively correlated.

(Step 4) With the u-th gene as the reference, if significant positive correlations with the expression profile of the u-th gene are obtained for the k_1 genes u-1, u-2, ..., u-k_1 and for the k_2 genes u+1, u+2, ..., u+k_2, and no gene pair with a negative correlation coefficient was found among them in Step 3, these k_1 + k_2 + 1 genes are estimated to be in the same transcription unit.
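Steps 2 to 4 can be sketched in Python as follows. This is one possible reading, not the procedure of the specification itself: the function name `estimate_unit` and the dictionary layout are hypothetical, a missing correlation coefficient is assumed to be skipped so that a later significant gene can still extend the unit, and extension is assumed to stop at the first gene that is measured but not significant, or that would bring a negatively correlated pair inside the unit.

```python
def estimate_unit(u, n, r, r_ref):
    """Estimate the transcription unit around reference gene u (1-based).

    n     : number of consecutive same-direction genes, numbered 1..n
    r     : dict {(i, j): correlation} for i < j; missing pairs are absent
    r_ref : dict {(i, j): reference value} with the same keys as r
    Returns the list of gene numbers in the estimated unit.
    """
    def key(a, b):
        return (a, b) if a < b else (b, a)

    # Step 3: pairs with a negative coefficient split transcription units.
    negatives = [pair for pair, value in r.items() if value < 0]

    def blocked(lo, hi):
        # True if some negative pair (i, j) lies inside the span lo..hi.
        return any(lo <= i and j <= hi for i, j in negatives)

    lo = hi = u
    v = u + 1
    while v <= n:                      # extend toward the 3' side
        k = key(u, v)
        if k in r:                     # assumption: missing pairs are skipped
            if r[k] <= r_ref[k] or blocked(lo, v):
                break
            hi = v
        v += 1
    v = u - 1
    while v >= 1:                      # extend toward the 5' side
        k = key(u, v)
        if k in r:
            if r[k] <= r_ref[k] or blocked(v, hi):
                break
            lo = v
        v -= 1
    return list(range(lo, hi + 1))


# Illustrative data for five genes; all reference values set to 0.55.
r = {(1, 2): 0.9, (2, 3): 0.8, (2, 4): 0.85, (2, 5): 0.1, (4, 5): -0.3}
r_ref = {pair: 0.55 for pair in r}
print(estimate_unit(2, 5, r, r_ref))
```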

(Evaluation method of this estimation method)

The transcription units identified to date are numbered 1, 2, ..., q, ..., N_q. The set of genes belonging to the q-th transcription unit is denoted T_q, and its member genes are denoted t^(Tq)_1, t^(Tq)_2, ..., t^(Tq)_N[Tq].

The set of genes belonging to the transcription unit predicted by the present estimation method with the u-th gene as the reference is denoted P_u, and its member genes are denoted p^(Pu)_1, p^(Pu)_2, ..., p^(Pu)_N[Pu]. The set T_q to which the u-th gene belongs is denoted T_q(u). When the u-th gene is assigned to T_q, the elements of the set P_u ideally coincide with those of the set T_q(u). The number of elements common to P_u and T_q(u) is denoted N[P_u ∩ T_q(u)], and the numbers of elements of P_u and T_q(u) are denoted N[P_u] and N[T_q(u)], respectively.

If the predicted transcription unit and the known transcription unit match, the following two equations (Equations (8) and (9)) are both zero.

$$E[P_u] = N[P_u] - N[P_u \cap T_q(u)] \tag{8}$$

$$E[T_q(u)] = N[T_q(u)] - N[P_u \cap T_q(u)] \tag{9}$$

E[P_u] and E[T_q(u)] are both greater than or equal to 0. When E[P_u] > 0, the predicted transcription unit contains genes that are not in the known transcription unit; that is, over-prediction has occurred.
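Equations (8) and (9) amount to simple set arithmetic, as the following Python sketch illustrates. The function name `prediction_errors` and the example gene numbers are hypothetical.

```python
def prediction_errors(predicted, known):
    """Return (E[P_u], E[T_q(u)]) for a predicted and a known transcription unit.

    predicted : set of genes in P_u
    known     : set of genes in T_q(u)
    """
    common = len(predicted & known)    # N[P_u ∩ T_q(u)]
    return len(predicted) - common, len(known) - common


# Predicted unit has one extra gene (gene 1); every known gene is recovered.
print(prediction_errors({1, 2, 3, 4}, {2, 3, 4}))
```

Both values are 0 only when the predicted and known units contain exactly the same genes.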

[Analysis result]

The prediction accuracy of the transcription unit was examined against transcription units reported experimentally. There were 600 genes for which E[P_u] = 0; that is, for 68% of the genes examined for prediction accuracy, a transcription unit matching the one known to date was predicted. For 90% of the genes, the over-prediction was less than two genes. When E[T_q(u)] > 0, the number of genes in the known transcription unit is larger than the number of elements common to the predicted and known transcription units; that is, not all genes belonging to the known transcription unit could be predicted. The value of E[T_q(u)] was likewise determined for the 877 genes whose transcription units are known.

There were 498 genes with E[T_q(u)] = 0, meaning that the known transcription unit could be reproduced for 57% of the genes. When an excess or deficiency of up to two genes is allowed, the transcription unit could be predicted for 80% of the genes.

It should be noted that the specific embodiments and examples given in the section on the best mode for carrying out the invention merely clarify the technical content of the present invention. The present invention should not be construed narrowly as being limited to such specific examples, and various modifications may be made within the scope of the following claims.

Industrial applicability

As described above, the present invention relates to a novel analysis method for data obtained from gene expression analysis such as microarray and macroarray experiments, and to a program for executing the method. It enables data obtained by such experiments to be evaluated appropriately and used in new ways. Beyond its use as a research tool, it can contribute, for example, to the development of new drugs based on the relationship between diseases and genes (genomic drug discovery), and to the development of new testing, diagnostic, preventive, and treatment methods.

Claims

1. A method for analyzing data obtained as a result of microarray experiments, macroarray experiments, and other similar gene expression analyses, wherein the range in which the logarithmic ratio between the signal intensity x_C(k) of a gene (k) in a control experiment and the signal intensity x_T(k) of the gene (k) in a target experiment is recognized as quantitative is determined by the following steps (a) to (c).

(a) Obtain data on the measured signal intensity s_C(k) and the background intensity b_C(k) of each gene (k) in the control experiment, and on the measured signal intensity s_T(k) and the background intensity b_T(k) of each gene (k) in the target experiment,

(b) Based on the data values for which s_C(k) - b_C(k) is negative, determine a first threshold that defines the range in which the signal intensity x_C(k) is regarded as substantially 0, and, based on the data values for which s_T(k) - b_T(k) is negative, determine a second threshold that defines the range in which the signal intensity x_T(k) is regarded as substantially 0,

(c) If the signal intensity x_C(k) is greater than, or greater than or equal to, the first threshold and the signal intensity x_T(k) is greater than, or greater than or equal to, the second threshold, determine that the log ratio between x_C(k) and x_T(k) is quantitative.

2. The data analysis method according to claim 1, wherein, in step (b), the first threshold is set to u × SD_C (where u is an arbitrary positive number and SD_C is a statistic represented by the following equation (1)), and the second threshold is set to u × SD_T (where u is an arbitrary positive number and SD_T is a statistic represented by the following equation (2)).

(Here, y_C(k) = s_C(k) - b_C(k) < 0 and y_T(k) = s_T(k) - b_T(k) < 0. In equation (1), k is a variable indexing the N_C signals for which s_C(k) - b_C(k) takes a negative value, and in equation (2), k is a variable indexing the N_T signals for which s_T(k) - b_T(k) takes a negative value. The base of log is any positive number greater than 1.)

3. A method for analyzing data obtained as a result of microarray experiments, macroarray experiments, and other similar gene expression analyses, wherein the bias error that depends on the average intensity in the logarithmic ratio between the signal intensity x_C(k) of a gene (k) in a control experiment and the signal intensity x_T(k) of the gene (k) in a target experiment is corrected by the following steps (a) to (c).

(a) The values of x_C(k) and x_T(k) are divided into a plurality of sections according to the magnitude of the average intensity Av[k] (= {log[x_C(k)] + log[x_T(k)]} / 2), and, for the genes belonging to the s-th section (u = 1, 2, ..., N_s, where N_s is the total number of genes belonging to the s-th section), the average PAv(s) of the log ratios of x_C(u) and x_T(u) is obtained,

(b) When the average intensity Av[k] of the k-th gene lies between Av_{s-1} (the average of the minimum and maximum values of the average intensity in the (s-1)-th section) and Av_s (the average of the minimum and maximum values of the average intensity in the s-th section), the value crit(k) at the average intensity Av[k] is obtained by linear interpolation using the average values PAv(s-1) and PAv(s),

(c) The log ratio log(x_T(k) / x_C(k)) of x_C(k) and x_T(k) is corrected by defining the corrected logarithmic ratio LOG[k] by the following equation (3).

$$LOG[k] = \log(x_T(k) / x_C(k)) - crit(k) \tag{3}$$
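The correction in steps (a) to (c) can be sketched in Python as follows. Several details are assumptions made for illustration: sections are formed by sorting genes by average intensity and cutting them into groups of nearly equal size, base-2 logarithms are used, all signals are assumed positive, and genes outside the range of section centers take the value of the nearest section. The function names are hypothetical.

```python
import math


def section_stats(av, ratio, n_sections):
    """Return (centers, means): Av_s and PAv(s) for each section."""
    order = sorted(range(len(av)), key=lambda k: av[k])
    size = math.ceil(len(order) / n_sections)   # assumption: equal-count sections
    centers, means = [], []
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        # Av_s: average of the minimum and maximum average intensity.
        centers.append((av[idx[0]] + av[idx[-1]]) / 2)
        # PAv(s): mean log ratio of the genes in the section.
        means.append(sum(ratio[k] for k in idx) / len(idx))
    return centers, means


def crit(x, centers, means):
    """Linear interpolation of PAv between neighboring section centers."""
    if x <= centers[0]:
        return means[0]                         # assumption: clamp at the ends
    if x >= centers[-1]:
        return means[-1]
    for s in range(1, len(centers)):
        if x <= centers[s]:
            w = (x - centers[s - 1]) / (centers[s] - centers[s - 1])
            return means[s - 1] + w * (means[s] - means[s - 1])


def corrected_log_ratios(x_c, x_t, n_sections):
    """Return LOG[k] = log(x_T(k) / x_C(k)) - crit(k) for every gene."""
    av = [(math.log2(c) + math.log2(t)) / 2 for c, t in zip(x_c, x_t)]
    ratio = [math.log2(t / c) for c, t in zip(x_c, x_t)]
    centers, means = section_stats(av, ratio, n_sections)
    return [r - crit(a, centers, means) for a, r in zip(av, ratio)]


# Toy data: the target signal is uniformly twice the control signal,
# so the whole log ratio is bias and the corrected values vanish.
x_c = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]
x_t = [2 * c for c in x_c]
print(corrected_log_ratios(x_c, x_t, n_sections=3))
```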

4. The data analysis method according to claim 3, wherein, if the absolute value of the corrected log ratio LOG[k] of the k-th gene is greater than or equal to a threshold value Th set for that gene, the signal pair of the k-th gene is determined to have a statistically significant change in expression level.

5. The data analysis method according to claim 4, wherein the threshold value Th is set by the following steps (a) to (c).

(a) For the genes (k) belonging to the s-th section, set SDcrit for LOG[k] of these genes, find the standard deviation of the samples that satisfy -SDcrit < LOG[k] < SDcrit, and call it SD1[s],

(b) Based on the standard deviation SD1[s], calculate the average value Smth[s] over a total of five points consisting of the point itself and the two points before and after it (s-2, s-1, s, s+1, s+2), and take it as the representative value at the average intensity Av_s (the average of the minimum and maximum values of the average intensity in the s-th section),

(c) When Av[k] is located between the average intensities Av_u and Av_{u+1}, by linear interpolation using Smth[u] and Smth[u+1],

6. A data analysis method characterized by classifying signals obtained by an experiment into a plurality of categories by combining the method according to claim 1 or 2 with the method according to claim 3, 4, or 5.

7. A method for analyzing data obtained as a result of microarray experiments, macroarray experiments, and other similar gene expression analyses, wherein a transcription unit is estimated from the expression profiles of a plurality of genes and the genomic information of these genes by the following steps (a) to (d).

(a) A correlation coefficient is calculated between the expression profiles of two genes located adjacent to each other on the same nucleic acid strand on the genome,

(b) If, based on the correlation coefficient, a significant correlation is found between the expression profile of the s-th gene and that of the (s+1)-th gene adjacent on its 3' side, these genes are assigned to the set for the transcription unit; next, if a significant correlation is found between the expression profiles of the s-th and (s+2)-th genes, the (s+2)-th gene is assigned to the same set; the same judgment is repeated for the (s+3)-th, (s+4)-th, ... genes and stops when a significant correlation is no longer obtained,

(c) In the same manner as in step (b), it is judged whether there is a significant correlation between the expression profile of the s-th gene and those of the genes adjacent on its 5' side (the (s-1)-th, (s-2)-th, ...),

(d) In the set obtained by steps (b) and (c), the group of genes sandwiched between the lowest-ranked gene and the highest-ranked gene is estimated to be one transcription unit.

8. The data analysis method according to claim 7, wherein, in step (a), the correlation coefficient is calculated by the following equation (4).

$$r(s, t, N_{st}) = \frac{\sum_{j=1}^{N_{st}} (x_{js} - \bar{x}_s)(x_{jt} - \bar{x}_t)}{\sqrt{\sum_{j=1}^{N_{st}} (x_{js} - \bar{x}_s)^2 \sum_{j=1}^{N_{st}} (x_{jt} - \bar{x}_t)^2}} \tag{4}$$

(Here, the two adjacent genes are s and t. N_st is the number of experiments, out of M experiments, in which values could be obtained for both genes s and t; these experiments are indexed by j. x_js is the expression profile of the j-th experiment for gene s, and x_jt is the expression profile of the j-th experiment for gene t. $\bar{x}_s$ is the average of the expression profiles of the N_st experiments for gene s, and $\bar{x}_t$ is the average of the expression profiles of the N_st experiments for gene t.)

9. A program for causing a computer to execute the method according to any one of claims 1 to 8.

10. A computer-readable recording medium on which the program according to claim 9 is recorded.

11. A data analysis device comprising: the program according to claim 9; and a computer that executes the method according to any one of claims 1 to 8 using the program.