WO2016157473A1 - Genotype determination apparatus and method - Google Patents
Genotype determination apparatus and method
- Publication number
- WO2016157473A1 (PCT/JP2015/060368, JP2015060368W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- genotype
- cluster
- representative value
- clusters
- snps
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
- G16B40/30—Unsupervised data analysis
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
- G16B5/20—Probabilistic models
Definitions
- Embodiments of the present invention relate to a genotyping apparatus and method.
- Organisms hold genetic information as genomic base sequences (DNA), and most of the base sequences are identical in the same species.
- However, a part of the base sequence varies among individuals. In particular, a locus at which different bases occur with a frequency of 1% or more in a population of the same species is called a single nucleotide polymorphism (SNP).
- In an organism having two sets of chromosomes (a diploid organism), such as a human, three combination patterns arise from the difference in bases at an SNP. Such a combination pattern is called a genotype.
- Because the SNP genotype causes individual differences, such as constitution, within a species, the genotype is related to genetic diseases, drug efficacy, and drug side effects. For this reason, by examining the genotype of a specific SNP of an individual, it is possible to predict efficacy and side effects before medication.
- In genotyping using a DNA microarray, a known base sequence of an SNP on the array side is hybridized with the unknown base sequence of an organism (specimen) whose genotype is to be determined, and the signal intensity is measured.
- Next, the signal intensities of a plurality of specimens measured at the same SNP are projected onto a plane, and the specimens are classified into clusters of the same genotype at each SNP.
- A genotype is then assigned to (labeled on) each cluster. In this way, the genotype at the same SNP can be determined at once for a plurality of specimens.
- A genotyping apparatus and method that improve genotyping accuracy in genotyping technology using a DNA microarray are provided.
- The genotype determination device includes a representative value calculation unit, a first labeling unit, a model construction unit, and a second labeling unit.
- For the sample clusters of each SNP, classified based on the signal intensities of a plurality of samples at a plurality of SNPs measured by a DNA microarray, the representative value calculation unit calculates a representative value of each cluster based on the signal intensities of the samples included in that cluster.
- The first labeling unit assigns a genotype to each cluster of those SNPs that are classified into three clusters, based on the representative value of each cluster.
- The model construction unit constructs a model indicating the relationship between the genotype of each cluster of the SNPs classified into three clusters and the representative value of each cluster.
- The second labeling unit assigns a genotype to each cluster of those SNPs that are classified into one or two clusters, based on the representative value of each cluster and the model.
- A diagram showing an example of signal strength data.
- A diagram showing an example of cluster data.
- A diagram showing an example of converted signal strength data.
- A diagram showing an example of converted signal strength data.
- A diagram showing an example of representative value data.
- A diagram showing an example of a probability distribution model.
- A diagram showing an example of genotype determination results.
- A diagram showing the hardware configuration of the genotype determination apparatus according to the first embodiment.
- A diagram explaining the method of calculating a representative value.
- A diagram showing an example of representative value data of SNPs with three clusters.
- A flowchart showing the genotype assignment process for SNPs with three clusters.
- A flowchart showing the process of constructing a probability distribution model.
- A diagram explaining the method of extracting representative values.
- A diagram showing an example of a probability distribution model.
- A flowchart showing the genotype assignment process for SNPs with one or two clusters.
- A diagram explaining the method of assigning genotypes to SNPs with one or two clusters.
- A diagram showing an example of the genotype assignment results for SNPs with one or two clusters.
- A functional block diagram showing the genotype determination apparatus according to the second embodiment.
- A flowchart showing the reassignment process performed by the genotype determination apparatus according to the second embodiment.
- A diagram showing an example of a screen displayed on the display device.
- A diagram showing an example of a screen displayed on the display device.
- FIG. 1 is a schematic diagram showing a DNA microarray.
- The DNA microarray includes a plurality of specimen sections, each corresponding to one specimen. Each specimen section comprises hundreds of thousands to millions of SNP sections, each corresponding to one SNP.
- Each SNP section has two types of probes, A and B, having known base sequences.
- The probes are a mechanism for capturing the two different bases at each SNP, and each probe has a different SNP base corresponding to the SNP section.
- In FIG. 1, a probe whose SNP base is A and a probe whose SNP base is C are shown.
- When hybridization occurs at a probe, a signal intensity such as a fluorescence intensity or a current intensity changes.
- The DNA microarray measures this signal intensity for each type of probe.
- Hereinafter, one probe is referred to as probe A and the other probe is referred to as probe B.
- A signal whose intensity changes according to the hybridization of probe A is referred to as signal A, and the intensity of signal A is referred to as signal intensity A.
- A signal whose intensity changes according to the hybridization of probe B is referred to as signal B, and the intensity of signal B is referred to as signal intensity B.
- genotype AA is a homozygous genotype.
- Genotype AB is a heterozygous genotype.
- genotype BB is a homozygous genotype.
- the DNA microarray simultaneously measures signal intensities A and B for a plurality of samples in a plurality of SNPs. Next, clustering of samples for each SNP is performed based on the signal intensities A and B measured by the DNA microarray.
- FIG. 3 is a diagram in which a plurality of specimens are plotted on a signal intensity plane for a certain SNP i. The horizontal axis indicates the signal intensity A, the vertical axis indicates the signal intensity B, and the broken lines indicate the clusters.
- Each cluster is a set of specimens having the same genotype at SNP i. Sample clustering is performed using an existing clustering method. As a result, three or fewer clusters are generated for each SNP.
- Next, a genotype is assigned to each generated cluster.
- The cluster of the genotype AB is considered to be distributed on a 45° straight line on the signal intensity plane.
- Since the cluster of the genotype AA has a large signal intensity A and a small signal intensity B, it is considered to be distributed on the signal intensity A axis side of the 45° straight line. Since the cluster of the genotype BB has a large signal intensity B and a small signal intensity A, it is considered to be distributed on the signal intensity B axis side of the 45° straight line.
- FIG. 4 is a diagram showing each cluster of FIG. 3 to which a genotype has been assigned by such a method. The genotype AA is assigned to the cluster near the signal intensity A axis, the genotype BB is assigned to the cluster near the signal intensity B axis, and the genotype AB is assigned to the cluster on the 45° straight line.
- the conventional genotyping technique can simultaneously determine the genotypes of a plurality of specimens in a plurality of SNPs. For example, in the example of FIG. 4, the SNPi of the sample 1 is determined as genotype AA, the SNPi of the sample 2 is determined as genotype AB, and the SNPi of the sample 3 is determined as genotype BB.
- In the genotype assignment method using the magnitude relationship of the signal intensities, genotypes can be assigned with high accuracy when the signal intensities A and B are measured accurately. In practice, however, measurement errors may occur in the signal intensities A and B owing to the experimental environment (DNA microarray reagents and the like) at the time of measurement, and the sample distribution may fluctuate.
- For example, the signal intensity A may be measured as relatively larger than the signal intensity B so that the sample distribution becomes asymmetric (fluctuation 1), or the sample distribution as a whole may be translated (fluctuation 2).
- clusters other than the genotype AB may be located on a 45 ° straight line as shown in FIG. Even in such a case, if three clusters are generated for one SNP, it is possible to correctly assign genotypes by assigning genotypes in order of the signal strength of the clusters. As shown in FIG. 4, when only one or two clusters are generated for one SNP, genotype assignment becomes difficult.
- the genotype determination apparatus assigns a genotype to each cluster of each SNP in consideration of fluctuations that occur in the sample distribution.
- FIGS. 7 and 8 are diagrams for explaining the outline of the determination method used by the genotype determination apparatus according to the present embodiment.
- In this example, the signal intensities and cluster IDs of 90 samples at 1 million SNPs are prepared. Of the 1 million SNPs, 500,000 are classified into three clusters, 200,000 into two clusters, and 300,000 into one cluster.
- The genotype determination device assigns a genotype not to each specimen but to each cluster. For this purpose, the device first calculates a representative value of each cluster from the signal intensities of the samples included in that cluster. The representative values are calculated for each SNP.
- The genotype determination device then assigns a genotype to each cluster of the SNPs classified into three clusters, using the magnitude relationship of the representative values.
- For example, suppose the representative values of the clusters of SNP 1 are 10°, 40°, and 80°, respectively.
- In this case, the genotype determination device assigns the genotypes AA, AB, and BB to the three clusters in ascending order of representative value.
- In this way, the genotype determination device assigns a genotype to every cluster of the 500,000 SNPs classified into three clusters.
- As a result, the genotype of every cluster of the 500,000 SNPs is obtained.
- For example, the representative values of the genotypes AA, AB, and BB of SNP 1 are 10°, 40°, and 80°, respectively.
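- The ascending-order assignment for a three-cluster SNP can be sketched as follows. This is a minimal illustration only: the function name `label_three_clusters` and the dictionary layout are assumptions made here, and the angles are the 10°, 40°, and 80° values quoted above.

```python
# Genotypes in ascending order of the representative value (angle from the A axis).
GENOTYPES_ASCENDING = ("AA", "AB", "BB")


def label_three_clusters(rep_values):
    """Map cluster ID -> genotype for one SNP that has exactly three clusters.

    rep_values: dict mapping cluster ID -> representative value in degrees.
    """
    if len(rep_values) != 3:
        raise ValueError("this rule applies only to SNPs with three clusters")
    # Sort the cluster IDs by their representative value, smallest first.
    ordered_ids = sorted(rep_values, key=rep_values.get)
    return dict(zip(ordered_ids, GENOTYPES_ASCENDING))


snp1 = {1: 10.0, 2: 40.0, 3: 80.0}
labels = label_three_clusters(snp1)
print(labels)
```

- Because only the ordering of the three values is used, this rule does not depend on where the 45° line falls, which is why it tolerates the fluctuations described earlier.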
- the genotyping apparatus constructs a probability distribution model using the genotypes and representative values of 500,000 SNPs thus obtained.
- the probability distribution model of genotype AA is expressed as a probability density function of 500,000 representative values of genotype AA.
- the genotype determination apparatus assigns a genotype to each cluster of the SNP classified into one or two clusters using a probability distribution model. Specifically, the genotype determination apparatus applies the representative value of each cluster to the above probability distribution model, and assigns the genotype having the maximum probability density to each cluster.
- For example, suppose that the representative value of cluster 1 of SNP 3, which is classified into two clusters, is 42° and the representative value of cluster 2 is 78°.
- At 42°, the probability density of the genotype AB is the maximum. At 78°, the probability density of the genotype BB is the maximum.
- Therefore, the genotype AB is assigned to cluster 1 of SNP 3, and the genotype BB is assigned to cluster 2.
- the genotype determination apparatus assigns a genotype to all the clusters of 200,000 SNPs classified into two clusters. The same applies to 300,000 SNPs classified into one cluster.
- FIG. 9 is a functional block diagram showing the determination apparatus according to the present embodiment.
- The determination device includes a signal strength DB 1, a clustering unit 2, a cluster DB 3, a representative value calculation unit 4, a representative value DB 5, a first labeling unit 6, a model construction unit 7, a model DB 8, a second labeling unit 9, a determination result DB 10, and a display unit 11.
- The signal strength DB 1 stores the signal intensities A and B (signal strength data) measured by the DNA microarray. The signal intensities A and B may be fluorescence intensities or current intensities.
- The signal strength DB 1 stores the signal intensities of SNPs 1 to n for each of samples 1 to M. In this case, M × n signal intensities A and B are stored in the signal strength DB 1.
- FIG. 10 is a diagram illustrating an example of the signal intensity A stored in the signal strength DB 1. In FIG. 10, the signal intensity A is a fluorescence intensity, and FU denotes fluorescence units. The signal strength DB 1 stores the signal intensities A of SNPs 1 to n of samples 1 to M. For example, the signal intensity A of SNP 1 of sample 1 is 494.20 FU.
- FIG. 11 is a diagram illustrating an example of the signal intensity B stored in the signal strength DB 1. In FIG. 11, the signal intensity B is a fluorescence intensity, and FU denotes fluorescence units. The signal strength DB 1 stores the signal intensities B of SNPs 1 to n of samples 1 to M. For example, the signal intensity B of SNP 1 of sample 1 is 1448.17 FU.
- the clustering unit 2 generates a cluster for each SNP based on the signal strengths A and B stored in the signal strength DB1.
- a cluster is a collection of specimens. Each specimen is classified into one of the clusters generated by the clustering unit 2. When the specimen is human, since there are only three genotypes AA, AB, and BB, three or less clusters are generated for each SNP.
- the clustering unit 2 may perform sample clustering using a known clustering method such as the k-means method.
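- As a rough sketch of what clustering samples on the signal intensity plane could look like, the following implements a plain k-means (Lloyd's algorithm) in pure Python. The text names k-means only as one possible method; the sample points, the fixed starting centroids, and the function name `k_means` are illustrative assumptions, and how the number of clusters is chosen is not covered here.

```python
import math


def k_means(points, centroids, iterations=20):
    """Cluster (A, B) points around the given starting centroids.

    Returns (assignment, centroids), where assignment[i] is the index of
    the centroid closest to points[i].
    """
    centroids = list(centroids)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignment, centroids


# Hypothetical (signal intensity A, signal intensity B) pairs for one SNP.
points = [(1000, 100), (1100, 120), (600, 580), (620, 640), (90, 1000), (120, 1100)]
start = [(1000, 0), (500, 500), (0, 1000)]
assignment, centroids = k_means(points, start)
print(assignment)
```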
- the cluster DB 3 stores the clustering result (cluster data) by the clustering unit 2. That is, the cluster DB 3 stores cluster information of each sample of each SNP.
- FIG. 12 is a diagram illustrating an example of the clustering result stored in the cluster DB 3. In the example of FIG. 12, the cluster of the sample 1 of the SNP 1 is the cluster 1. SNP1 is classified into one cluster, SNP2 is classified into two clusters, and SNP3 is classified into three clusters.
- The determination device can also obtain a clustering result such as that shown in FIG. 12 from an external device.
- In that case, the determination device need not include the clustering unit 2.
- the clustering unit 2 may calculate the converted signal strengths x and y from the signal strengths A and B and perform clustering based on the converted signal strengths x and y.
- the converted signal strengths x and y are calculated by the following formula, for example.
- In this case, samples are plotted on a converted signal intensity plane formed by a converted signal strength x axis and a converted signal strength y axis, and clusters are generated on that plane. As shown in FIG. 13, the clusters generated on the converted signal intensity plane are ordered by the converted signal strength x and correspond to the clusters of genotypes AA, AB, and BB in ascending order of the converted signal strength x.
- the converted signal strengths x and y calculated by the clustering unit 2 may be stored in the signal strength DB1.
- FIG. 14 is a diagram illustrating an example of the converted signal strength x stored in the signal strength DB 1.
- FIG. 15 is a diagram illustrating an example of the converted signal strength y stored in the signal strength DB 1. In FIGS. 14 and 15, the converted signal strengths x and y are dimensionless.
- the determination device may use the converted signal strengths x and y stored in the signal strength DB1 instead of the signal strengths A and B.
- the representative value calculation unit 4 calculates a representative value of each cluster generated by the clustering unit 2.
- the representative value is a value unique to each cluster of each SNP.
- The representative value is calculated based on the signal strengths A and B of the samples included in each cluster of each SNP, or on the converted signal strengths x and y. In the following, it is assumed that the representative value is calculated based on the signal strengths A and B.
- The representative value is, for example, the regression coefficient of the regression line of each cluster, the arctangent of the regression coefficient, or the slope of an approximate straight line passing through the origin, but is not limited thereto.
- The representative value may also be a correlation coefficient of each cluster, a cluster median value, a cluster variance, an average ratio value, or an average difference value.
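- One of the representative values listed above, the slope of an approximate straight line through the origin expressed as an angle, can be sketched as follows. The text does not state how the line is fitted; a least-squares fit of B = s·A is assumed here, and the function names and sample values are hypothetical.

```python
import math


def slope_through_origin(signal_a, signal_b):
    """Least-squares slope s of the line B = s * A constrained through the origin."""
    numerator = sum(a * b for a, b in zip(signal_a, signal_b))
    denominator = sum(a * a for a in signal_a)
    if denominator == 0:
        raise ValueError("signal intensity A is zero for every sample")
    return numerator / denominator


def representative_angle(signal_a, signal_b):
    """Angle of that line measured from the signal intensity A axis, in degrees."""
    return math.degrees(math.atan(slope_through_origin(signal_a, signal_b)))


# A cluster lying on the diagonal gives 45 degrees (within rounding);
# a cluster hugging the A axis gives a small angle.
diagonal_angle = representative_angle([100, 200, 300], [100, 200, 300])
near_a_axis_angle = representative_angle([1000, 1200], [100, 120])
print(diagonal_angle, near_a_axis_angle)
```

- With this convention the angle grows as the signal intensity B grows, so ascending order corresponds to AA, AB, BB, which matches the 10°, 40°, 80° example in the outline.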
- the representative value DB 5 stores the representative value (representative value data) of each cluster of each SNP calculated by the representative value calculation unit 4.
- FIG. 16 is a diagram illustrating an example of representative values stored in the representative value DB 5. In the example of FIG. 16, one value is stored as a representative value of each cluster. In FIG. 16, for example, the representative value of cluster 1 of SNP 1 is 3.31, and the representative value of clusters 2 and 3 is NA (Not Available). NA indicates that no representative value is stored. This corresponds to the fact that only one cluster is generated in SNP1.
- the first labeling unit 6 refers to the representative value DB 5 and extracts the SNP in which three clusters are generated.
- An SNP in which three clusters are generated corresponds to an SNP in which representative values are stored in the three clusters. For example, in the example of FIG. 16, SNP3 is extracted.
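- Selecting SNPs by the number of stored representative values can be sketched as below. The table layout is an assumption, `None` stands in for NA, and only the value 3.31 for cluster 1 of SNP1 comes from the FIG. 16 example; the SNP2 and SNP3 values are made up for illustration.

```python
# Representative value table: SNP -> (cluster 1, cluster 2, cluster 3).
# None plays the role of NA (no representative value stored).
rep_table = {
    "SNP1": (3.31, None, None),
    "SNP2": (12.5, 47.0, None),   # hypothetical values
    "SNP3": (9.8, 44.1, 81.6),    # hypothetical values
}


def cluster_count(values):
    """Number of clusters that actually have a representative value."""
    return sum(v is not None for v in values)


three_cluster_snps = [s for s, v in rep_table.items() if cluster_count(v) == 3]
one_or_two_cluster_snps = [s for s, v in rep_table.items() if cluster_count(v) in (1, 2)]
print(three_cluster_snps, one_or_two_cluster_snps)
```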
- The first labeling unit 6 assigns a genotype to each cluster of each extracted SNP. Genotype assignment is performed using the magnitude relationship of the representative values. More specifically, when the representative value is a value that increases as the signal intensity A of the samples included in the cluster increases, the first labeling unit 6 assigns the genotypes AA, AB, and BB to the three clusters in descending order of the representative value. Similarly, when the representative value is a value that increases as the signal intensity B of the samples included in the cluster increases, the first labeling unit 6 assigns the genotypes BB, AB, and AA to the three clusters in descending order of the representative value. The same applies to the case where the representative value is calculated based on the converted signal strengths x and y.
- In the example of FIG. 16, the first labeling unit 6 assigns the genotypes BB, AB, and AA to the three clusters in descending order of the representative value. Therefore, the genotype AA is assigned to cluster 1, the genotype AB is assigned to cluster 2, and the genotype BB is assigned to cluster 3.
- the first labeling unit 6 applies the allocation result to the cluster data stored in the cluster DB 3, thereby generating the determination result of the genotype of the SNP classified into the three clusters.
- the determination result is stored in the determination result DB 10.
- The model construction unit 7 generates a probability distribution model indicating the relationship between the genotype and the representative value, based on the genotype of each cluster assigned by the first labeling unit 6 and the representative value of each cluster to which a genotype has been assigned.
- The probability distribution model includes a probability density function of the representative values for each genotype.
- The random variable of each probability density function is the representative value.
- A probability density function following an arbitrary probability distribution, such as a Gaussian distribution (normal distribution), a mixed Gaussian distribution, an F distribution, or a beta distribution, can be used.
- The probability density functions may follow a different type of distribution for each genotype.
- For example, the probability density functions of the genotypes AA and BB may follow a mixed Gaussian distribution, while the probability density function of the genotype AB follows a normal distribution.
- FIG. 17 is a diagram illustrating an example of the probability distribution model constructed by the model construction unit 7.
- In FIG. 17, the representative value is the slope of the approximate straight line passing through the origin.
- The probability density functions of the genotypes AA, AB, and BB are shown in order from the left.
- In the absence of fluctuation, the probability distributions of the genotypes AA and BB would be symmetric with respect to the probability distribution of the genotype AB, and the average value of the probability distribution of the genotype AB would be about 45°. In the probability distribution model of FIG. 17, by contrast, the probability distributions of the genotypes AA and BB are asymmetric (fluctuation 1), and the average value of the probability distribution of the genotype AB is shifted from 45° (fluctuation 2).
- In this way, the model construction unit 7 can construct a probability distribution model that reflects the fluctuation of the distribution caused by the experimental environment.
- the model DB 8 stores the probability distribution model constructed by the model construction unit 7. That is, the parameters (average and variance) of the probability density function for each genotype are stored.
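- As a minimal sketch of constructing such a model, the following fits one Gaussian per genotype from labeled representative values using Python's standard `statistics.NormalDist`. A single Gaussian per genotype is just one of the distributions the text allows; the function name `build_model` and the sample angles are assumptions for illustration.

```python
from statistics import NormalDist


def build_model(labeled_values):
    """Fit one normal distribution per genotype.

    labeled_values: iterable of (genotype, representative value) pairs taken
    from the clusters of three-cluster SNPs. Each genotype needs at least
    two values so that a variance can be estimated.
    """
    grouped = {}
    for genotype, value in labeled_values:
        grouped.setdefault(genotype, []).append(value)
    # NormalDist.from_samples estimates the mean and the sample standard deviation.
    return {g: NormalDist.from_samples(vs) for g, vs in grouped.items()}


# Hypothetical representative angles (degrees) from three-cluster SNPs.
training = [
    ("AA", 10.0), ("AA", 12.0), ("AA", 8.0),
    ("AB", 40.0), ("AB", 42.0), ("AB", 44.0),
    ("BB", 80.0), ("BB", 78.0), ("BB", 82.0),
]
model = build_model(training)
for genotype, dist in model.items():
    print(genotype, dist.mean, dist.variance)
```

- In this toy data the AB mean comes out at 42° rather than 45°, which is the kind of shift the model is meant to absorb.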
- the second labeling unit 9 refers to the representative value DB 5 and extracts SNPs in which one or two clusters are generated. SNPs in which one or two clusters are generated correspond to SNPs in which representative values are stored in one or two clusters, respectively. For example, in the example of FIG. 16, SNPs 1 and 2 are extracted.
- the second labeling unit 9 assigns a genotype to each cluster of each extracted SNP. Genotype assignment is performed using a probability distribution model stored in the model DB 8. More specifically, the second labeling unit 9 assigns the representative value of each cluster to the probability density function of each genotype, and assigns the genotype having the maximum probability density to each cluster.
- For example, when the probability density at the representative value of cluster 1 of SNP1 is highest in the probability density function of the genotype AA, the second labeling unit 9 assigns the genotype AA to cluster 1 of SNP1.
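- The maximum-density assignment can be sketched as follows, again assuming one Gaussian per genotype. The means and standard deviations below are hypothetical parameters, not values from the text; the inputs 42° and 78° are the two-cluster SNP 3 values used in the outline.

```python
from statistics import NormalDist

# Hypothetical model parameters (mean, standard deviation) in degrees.
model = {
    "AA": NormalDist(10.0, 4.0),
    "AB": NormalDist(40.0, 5.0),
    "BB": NormalDist(80.0, 4.0),
}


def assign_genotype(rep_value, model):
    """Return the genotype whose density is largest at rep_value."""
    return max(model, key=lambda g: model[g].pdf(rep_value))


cluster1 = assign_genotype(42.0, model)
cluster2 = assign_genotype(78.0, model)
print(cluster1, cluster2)
```

- Each cluster is scored independently here, so nothing in this sketch prevents two clusters of the same SNP from receiving the same genotype.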
- the second labeling unit 9 applies the allocation result to the cluster data stored in the cluster DB 3 to generate the determination result of the genotype of the SNP classified into one or two clusters.
- the determination result is stored in the determination result DB 10.
- the determination result DB 10 stores the determination result of the genotype of each SNP of each specimen.
- the determination result is generated by applying the genotype assigned by the first labeling unit 6 and the second labeling unit 9 to each cluster stored in the cluster DB 3.
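- Applying the per-cluster genotypes to the cluster data to obtain a per-sample result can be sketched as below. The nested-dictionary layout, the function name `apply_labels`, and the SNP3 entries are assumptions; the SNP1 entries follow the text (sample 1 is in cluster 1, which is labeled AA).

```python
# Cluster data: SNP -> sample -> cluster ID (as stored in the cluster DB).
cluster_data = {
    "SNP1": {"sample1": 1, "sample2": 1, "sample3": 1},
    "SNP3": {"sample1": 1, "sample2": 2, "sample3": 3},  # hypothetical
}

# Labels produced by the first and second labeling units: SNP -> cluster ID -> genotype.
cluster_genotypes = {
    "SNP1": {1: "AA"},
    "SNP3": {1: "AA", 2: "AB", 3: "BB"},  # hypothetical
}


def apply_labels(cluster_data, cluster_genotypes):
    """Replace every sample's cluster ID with the genotype assigned to that cluster."""
    return {
        snp: {sample: cluster_genotypes[snp][cid] for sample, cid in samples.items()}
        for snp, samples in cluster_data.items()
    }


results = apply_labels(cluster_data, cluster_genotypes)
print(results["SNP1"]["sample1"])
```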
- FIG. 19 is a diagram illustrating an example of genotype determination results stored in the determination result DB 10.
- In the example of FIG. 19, SNP1 of specimen 1 is genotype AA.
- The display unit 11 converts various types of information generated by the determination device into image data and video data, and causes the display device 103, described later, to display the information.
- In FIG. 9, the display unit 11 is connected only to the determination result DB 10, but it may also be connected to the signal strength DB 1, the cluster DB 3, the representative value DB 5, and the model DB 8. The screen displayed by the display unit 11 will be described later.
- the determination apparatus is configured by a computer 100 as shown in FIG.
- the computer 100 includes a CPU (central processing unit) 101, an input device 102, a display device 103, a communication device 104, and a storage device 105, which are connected to each other via a bus 106.
- The CPU 101 is the control device and arithmetic device of the computer 100.
- The CPU 101 performs arithmetic processing based on data or programs input from each device connected via the bus 106 (for example, the input device 102, the communication device 104, and the storage device 105), and outputs the calculation results and control signals to each device connected via the bus 106 (for example, the display device 103, the communication device 104, and the storage device 105).
- The CPU 101 executes the OS (operating system) of the computer 100, the determination program, and the like, and controls each device constituting the computer 100.
- The determination program is a program that causes the computer 100 to realize the above-described functions of the determination apparatus.
- When the CPU 101 executes the determination program, the computer 100 functions as the determination device.
- the input device 102 is a device for inputting information to the computer 100.
- The input device 102 is, for example, a keyboard, a mouse, or a touch panel, but is not limited thereto.
- a user (operator) of the determination device can use the input device 102 to cause the determination device to start determination processing or to input a probability distribution model parameter.
- the display device 103 is a device for displaying images and videos.
- The display device 103 is, for example, an LCD (liquid crystal display), a CRT (cathode ray tube), or a PDP (plasma display panel), but is not limited thereto.
- the display device 103 displays the image data generated by the display unit 11.
- the communication device 104 is a device for the computer 100 to communicate with an external device wirelessly or by wire.
- The communication device 104 is, for example, a modem, a hub, or a router, but is not limited thereto.
- Information such as the signal intensity measured by the DNA microarray and the clustering result of the sample can be input from an external device via the communication device 104.
- the storage device 105 is a storage medium that stores the OS of the computer 100, a determination program, data necessary for executing the determination program, data generated by executing the determination program, and the like.
- the storage device 105 includes a main storage device and an external storage device.
- the main storage device is, for example, a RAM, a DRAM, or an SRAM, but is not limited thereto.
- The external storage device is, for example, a hard disk, an optical disk, a flash memory, or a magnetic tape, but is not limited thereto.
- the signal strength DB 1, the cluster DB 3, the representative value DB 5, the model DB 8, and the determination result DB 10 can be configured using the storage device 105.
- The computer 100 may include one or more of each of the CPU 101, the input device 102, the display device 103, the communication device 104, and the storage device 105, and may be connected to peripheral devices such as printers and scanners.
- The determination apparatus may be configured by a single computer 100 or may be configured as a system including a plurality of computers 100 connected to each other.
- The determination program may be stored in advance in the storage device 105 of the computer 100, may be recorded on a computer-readable recording medium such as a CD-ROM, or may be distributed via the Internet.
- the determination apparatus can be configured by installing the determination program in the computer 100 and executing it.
- FIG. 21 is a flowchart schematically showing the determination process.
- First, in step S1, the representative value calculation unit 4 calculates the representative value of each cluster of SNPs 1 to n.
- Next, the first labeling unit 6 assigns a genotype to each cluster of the SNPs classified into three clusters, using the magnitude relationship of the representative values.
- Next, the model construction unit 7 constructs a probability distribution model based on the genotypes assigned to the clusters by the first labeling unit 6 and the representative values of the clusters to which the genotypes have been assigned.
- Then, the second labeling unit 9 assigns a genotype to each cluster of the SNPs classified into one or two clusters, using the probability distribution model.
- As a result, a genotype is assigned to each cluster of SNPs 1 to n of specimens 1 to M, and the determination process ends.
- The determination result is stored in the determination result DB 10.
- (Step S1) First, the representative value calculation process in step S1 will be described.
- FIG. 22 is a flowchart showing a representative value calculation process. In the following, it is assumed that the representative value is the slope of the approximate curve passing through the origin on the signal intensity plane.
- In step S10, the representative value calculation unit 4 acquires the signal strength data stored in the signal strength DB 1 and the cluster data stored in the cluster DB 3.
- In step S11, the representative value calculation unit 4 extracts the signal strengths A and B of the cluster j of SNPi, where i is 1 to n and j is 1 to 3.
- the representative value calculation unit 4 first extracts the sample of cluster 1 with reference to the cluster data of SNPi.
- the samples in cluster 1 are samples 1, 3, and M-1.
- the representative value calculation unit 4 refers to the signal strength data and extracts the signal strengths A and B of the samples in the cluster 1. As a result, as shown in FIG. 23, the signal strengths A and B of the cluster 1 of the SNPi are extracted.
- the representative value calculation unit 4 calculates the representative value CLU (i, j) of the cluster j of SNPi.
- the representative value CLU (i, j) is the inclination (angle) of the approximate straight line of the cluster j.
- FIG. 24 is a diagram illustrating an example of the representative value CLU (i, j).
- the representative value CLU (i, 1) of cluster 1 of SNPi and the representative value CLU (i, 2) of cluster 2 are shown.
- the approximate straight line is a straight line passing through the origin of the signal intensity plane and the cluster center of the cluster j.
- The representative value CLU (i, j) is calculated by the following equation.
- $\mathrm{CLU}(i, j) = \tan^{-1}\left(\dfrac{\operatorname{average} B(i, j)}{\operatorname{average} A(i, j)}\right)$ (1)
- Here, B (i, j) is the signal strength B of cluster j of SNPi, and A (i, j) is the signal strength A of cluster j of SNPi.
- The coordinates of the cluster center of cluster j of SNPi are (average A (i, j), average B (i, j)).
- the representative value calculation unit 4 calculates the representative value CLU (i, j) by substituting the signal strengths A and B of the cluster j of SNPi extracted in step S11.
- In step S13, the representative value calculation unit 4 stores the calculated representative value CLU (i, j) in the representative value DB 5.
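- The calculation of equation (1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `representative_value` and the sample intensities are hypothetical, and the angle is returned in degrees on the assumption that the example values in the text (such as 4.32) are degrees. `atan2` is used in place of a plain arctangent of the ratio so that an average A of zero does not cause a division error; for positive averages the two agree.

```python
import math


def representative_value(a_values, b_values):
    """Angle (in degrees) of the line from the origin to the cluster center.

    a_values, b_values: signal strengths A and B of the samples in one cluster.
    """
    mean_a = sum(a_values) / len(a_values)
    mean_b = sum(b_values) / len(b_values)
    # atan2(mean_b, mean_a) equals atan(mean_b / mean_a) when mean_a > 0
    # and stays defined when mean_a == 0.
    return math.degrees(math.atan2(mean_b, mean_a))


# Hypothetical signal strengths for three samples of one cluster.
clu = representative_value([100.0, 120.0, 110.0], [100.0, 120.0, 110.0])
```

- With equal average strengths A and B, the cluster center lies on the diagonal of the signal intensity plane, so the angle is 45 degrees.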
- FIGS. 25 to 27 are diagrams illustrating an example of the representative value CLU (i, j) stored in the representative value DB5.
- FIG. 25 shows the representative values CLU (i, j) of SNPs classified into three clusters, FIG. 26 shows the representative values CLU (i, j) of SNPs classified into two clusters, and FIG. 27 shows the representative values CLU (i, j) of SNPs classified into one cluster.
- the representative value DB 5 may include different tables for each number of SNP clusters, as shown in FIGS. Further, the representative value DB 5 may include one table as shown in FIG. In this case, NA is stored in the representative value of the cluster 3 of the SNPi classified into two clusters as in the SNP 2 of FIG. Further, as in SNP 1 of FIG. 27, NA is stored in the representative value of cluster 2 and the representative value of cluster 3 of SNPi classified into one cluster.
- FIG. 28 is a flowchart showing a genotype assignment process for three cluster SNPs.
- In step S20, the first labeling unit 6 acquires the representative value data of the three-cluster SNPi from the representative value DB 5.
- a table as shown in FIG. 25 storing the representative values CLU (i, 1) to CLU (i, 3) is acquired.
- the first labeling unit 6 refers to the cluster data and assigns a genotype to the clusters 1 to 3 of each SNPi.
- The representative value CLU (i, j) decreases as the signal strength A increases, and increases as the signal strength B increases. Therefore, the first labeling unit 6 assigns the genotypes BB, AB, and AA to the clusters 1 to 3 in descending order of the representative value CLU (i, j). For example, in the example of FIG. 25, genotype AA is assigned to cluster 1 of SNPn, genotype AB is assigned to cluster 2, and genotype BB is assigned to cluster 3.
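- The ordering rule can be sketched as follows, assuming the representative value is the angle described above, so that the largest value corresponds to genotype BB and the smallest to genotype AA. The function name `label_three_clusters` is hypothetical, and apart from 4.32 the representative values are made up for illustration.

```python
def label_three_clusters(clu_by_cluster):
    """Map cluster id -> genotype for a SNP classified into three clusters.

    clu_by_cluster: dict of cluster id -> representative value CLU(i, j).
    """
    if len(clu_by_cluster) != 3:
        raise ValueError("expected exactly three clusters")
    # Sort cluster ids by representative value, largest first.
    ordered = sorted(clu_by_cluster, key=clu_by_cluster.get, reverse=True)
    # Largest value -> BB, middle -> AB, smallest -> AA.
    return dict(zip(ordered, ["BB", "AB", "AA"]))


labels = label_three_clusters({1: 4.32, 2: 45.1, 3: 84.0})
```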
- FIG. 30 is a diagram illustrating an example of a genotype assignment result by the first labeling unit 6. Such an allocation result is held in the first labeling unit 6. In addition, the allocation result may be stored in the determination result DB 10.
- In step S22, the first labeling unit 6 applies the SNPi genotype assignment result to the cluster data. That is, the first labeling unit 6 replaces the cluster of each sample of SNPi stored in the cluster DB 3 with the genotype assigned to each cluster of SNPi.
- FIG. 31 is a diagram for explaining a method of applying the allocation result to the cluster data.
- genotypes AA, AB, and BB are assigned to SNPi clusters 1, 2, and 3, respectively. Therefore, SNPi clusters 1, 2, and 3 in the cluster data are replaced with genotypes AA, AB, and BB, respectively.
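- Applying the assignment result to the cluster data amounts to a lookup per sample. The sketch below assumes the cluster data for one SNP is held as a mapping from sample to cluster; the function name `apply_assignment` and the sample identifiers are hypothetical.

```python
def apply_assignment(cluster_by_sample, genotype_by_cluster):
    """Replace each sample's cluster id with the genotype assigned to that cluster."""
    return {sample: genotype_by_cluster[cluster]
            for sample, cluster in cluster_by_sample.items()}


# Hypothetical cluster data for one SNP: sample id -> cluster id.
cluster_data = {"sample1": 1, "sample2": 3, "sample3": 1, "sample4": 2}
genotypes = apply_assignment(cluster_data, {1: "AA", 2: "AB", 3: "BB"})
```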
- the determination result of the genotype of the SNP of 3 clusters as shown in FIG. 19 is generated.
- In step S23, the generated determination result is stored in the determination result DB 10.
- In step S24, the first labeling unit 6 applies the SNPi genotype assignment result to the representative value data. That is, the first labeling unit 6 replaces the cluster j of each representative value CLU (i, j) stored in the representative value DB 5 with the genotype assigned to each cluster j of the SNPi, and sorts the representative values by genotype.
- FIG. 32 is a diagram for explaining a method of applying the assignment result to the representative value data.
- genotypes AA, AB, and BB are assigned to SNPi clusters 1, 2, and 3, respectively. Therefore, the SNPi clusters 1, 2, and 3 in the representative value data are replaced with the genotypes AA, AB, and BB, respectively.
- FIG. 33 is a diagram illustrating an example of the updated representative value data.
- the representative values of each SNPi are sorted in the order of genotypes AA, AB, and BB.
- the representative value of SNPn genotype AA is 4.32.
- FIG. 34 is a flowchart showing a probability distribution model construction process. In the following, it is assumed that the probability distribution model is constructed using a normal distribution.
- In step S30, the model construction unit 7 acquires the representative value data of the three-cluster SNPs stored in the representative value DB 5. Thereby, the updated representative value data as shown in FIG. 33 is acquired.
- the model construction unit 7 extracts a representative value for each genotype.
- the model construction unit 7 extracts all the representative values of the genotype AA included in the representative value data as the representative values of the genotype AA.
- The set of extracted representative values of the genotype AA is referred to as CLU AA, the set of representative values of the genotype AB is referred to as CLU AB, and the set of representative values of the genotype BB is referred to as CLU BB.
- In step S32, the model construction unit 7 calculates the mean μ and variance σ of each genotype. That is, the model construction unit 7 calculates the mean μ AA and variance σ AA of the set CLU AA, the mean μ AB and variance σ AB of the set CLU AB, and the mean μ BB and variance σ BB of the set CLU BB.
- In step S33, the model construction unit 7 applies the mean μ and variance σ of each genotype to the normal distribution, and generates a probability density function f (x) of each genotype.
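- The probability density function of each genotype is expressed by the following equation, which is the normal density with mean μ and variance σ.
- $f_{g}(x) = \dfrac{1}{\sqrt{2\pi\sigma_{g}}}\exp\left(-\dfrac{(x-\mu_{g})^{2}}{2\sigma_{g}}\right), \quad g \in \{AA, AB, BB\}$
- Here, x is a representative value CLU, f AA (x) is the probability density function of genotype AA, f AB (x) is the probability density function of genotype AB, and f BB (x) is the probability density function of genotype BB.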
- a set of the above three probability density functions is a probability distribution model.
- FIG. 36 is a diagram illustrating an example of the probability distribution model constructed in step S33.
- After constructing the probability distribution model, the model construction unit 7 stores the probability distribution model in the model DB 8 in step S34.
- The model DB 8 stores the mean μ and variance σ of each genotype.
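- Steps S31 to S34 can be sketched as follows. This is an illustration under stated assumptions: the names `build_model` and `normal_pdf` are hypothetical, the input values are made up, and the population variance is used because the text does not say whether the population or the sample variance is intended. As in the text, the second parameter is treated as a variance, not a standard deviation.

```python
import math
from statistics import fmean, pvariance


def build_model(values_by_genotype):
    """Return {genotype: (mean, variance)} from labelled representative values."""
    return {g: (fmean(v), pvariance(v)) for g, v in values_by_genotype.items()}


def normal_pdf(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


# Hypothetical representative values of three-cluster SNPs, grouped by genotype.
model = build_model({
    "AA": [15.0, 17.0, 19.0],
    "AB": [43.0, 45.0, 47.0],
    "BB": [71.0, 73.0, 75.0],
})
```

- Only the two parameters per genotype need to be stored, since each density can be rebuilt from them on demand.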
- FIG. 37 is a flowchart showing a genotype assignment process for one or two clusters of SNPs.
- In step S40, the second labeling unit 9 acquires the representative value data of the one-cluster SNPs and two-cluster SNPs stored in the representative value DB 5. Thereby, representative value data as shown in FIGS. 26 and 27 is acquired.
- In step S41, the second labeling unit 9 acquires the probability distribution model stored in the model DB 8. Thereby, the probability distribution model shown in FIG. 36 is acquired.
- In step S42, the second labeling unit 9 applies the representative value CLU (i, j) to the probability distribution model. That is, as shown in FIG. 38, the second labeling unit 9 substitutes the representative value CLU (i, j) into the probability density function f (x) of each genotype and calculates the probability density f (CLU (i, j)).
- the second labeling unit 9 assigns the genotype having the maximum probability density f (CLU (i, j)) to the cluster j of SNPi.
- For example, when the probability density f AA (CLU (i, j)) is the maximum, the genotype AA is assigned to the cluster j of SNPi.
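- The maximum-density rule can be sketched as follows. The names `normal_pdf` and `most_likely_genotype` are hypothetical. The AA parameters (mean 17, variance 20) are the example values given for the model display later in the text; the AB and BB parameters are made up for illustration.

```python
import math


def normal_pdf(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def most_likely_genotype(clu, model):
    """Return the genotype whose density at the representative value is largest.

    model: dict of genotype -> (mean, variance).
    """
    return max(model, key=lambda g: normal_pdf(clu, *model[g]))


model = {"AA": (17.0, 20.0), "AB": (45.0, 20.0), "BB": (73.0, 20.0)}
genotype = most_likely_genotype(20.0, model)
```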
- FIG. 39 is a diagram illustrating an example of a genotype assignment result by the second labeling unit 9. Such an allocation result is held in the second labeling unit 9.
- the allocation result may be stored in the determination result DB 10.
- In step S44, the second labeling unit 9 applies the SNPi genotype assignment result to the cluster data. That is, the second labeling unit 9 replaces the cluster of each sample of SNPi stored in the cluster DB 3 with the genotype assigned to each cluster of SNPi.
- the method of applying the allocation result is the same as in step S22.
- the determination result of the genotype of one cluster SNP or two clusters SNP as shown in FIG. 19 is generated.
- In step S45, the generated determination result is stored in the determination result DB 10. This completes the determination of the genotypes of SNPs 1 to n of specimens 1 to M.
- the genotype is determined using a probability distribution model that reflects the fluctuation of the distribution due to the influence of the experimental environment. Accordingly, it is possible to suppress genotype assignment errors due to the influence of the experimental environment and improve genotype determination accuracy.
- the second embodiment will be described below with reference to FIGS. 40 to 45.
- In this embodiment, it is determined whether the reliability of the genotype assigned by the second labeling unit 9 is high. If the reliability is low, the genotype is reassigned. Biological knowledge is used for the determination and the reassignment.
- FIG. 40 is a functional block diagram showing the determination apparatus according to the present embodiment. As shown in FIG. 40, the determination device according to this embodiment includes a third labeling unit 12. Other configurations are the same as those in FIG.
- the third labeling unit 12 acquires the genotype assignment result by the second labeling unit 9 and determines whether the assignment result is highly reliable.
- If it is determined that the reliability of the allocation result is high, the third labeling unit 12 outputs the allocation result of the second labeling unit 9 as it is. On the other hand, if it is determined that the reliability of the assignment result is low, the third labeling unit 12 reassigns the genotype. Then, the third labeling unit 12 outputs the reassigned genotype assignment result.
- the determination result of the SNP genotypes of one cluster and two clusters is generated.
- FIG. 41 is a flowchart showing the genotype reliability determination and reassignment process by the third labeling unit 12.
- the third labeling unit 12 acquires the genotype assignment result for the SNPi from the second labeling unit 9.
- the SNPi acquired here is a one-cluster or two-cluster SNP.
- In step S51, the third labeling unit 12 determines whether the acquired SNPi has one cluster or two clusters. If the SNPi has two clusters (Yes), the process proceeds to step S52.
- In step S52, the third labeling unit 12 determines whether the two genotypes assigned to the two clusters of SNPi are different genotypes. If the genotypes are different (Yes), the process proceeds to step S53.
- In step S53, the third labeling unit 12 determines whether the genotype AB is included in the two genotypes assigned to the two clusters of SNPi. When the genotype AB is included (Yes), the third labeling unit 12 outputs the allocation result acquired from the second labeling unit 9 as it is, and the reallocation process ends.
- When the genotype AB is not included in the two genotypes in step S53 (No), the process proceeds to step S54.
- In step S54, the third labeling unit 12 reassigns genotypes to the two clusters 1 and 2 of the SNPi using the assignment method A.
- the allocation method A will be described later. Thereafter, the third labeling unit 12 outputs the reassigned genotype assignment result, and the reassignment process ends.
- If the two genotypes assigned to the two clusters of SNPi are the same in step S52 (No), the process proceeds to step S55.
- In step S55, the third labeling unit 12 determines whether the genotype assigned to the SNPi is AB. If the genotype AB is assigned to the SNPi (Yes), the process proceeds to step S56.
- In step S56, the third labeling unit 12 reassigns genotypes to the two clusters 1 and 2 of the SNPi using the assignment method B.
- the allocation method B will be described later. Thereafter, the third labeling unit 12 outputs the reassigned genotype assignment result, and the reassignment process ends.
- If the genotype AB is not assigned to the SNPi in step S55 (No), the process proceeds to step S57.
- In step S57, the third labeling unit 12 reassigns genotypes to the two clusters 1 and 2 of the SNPi using the assignment method C.
- the allocation method C will be described later. Thereafter, the third labeling unit 12 outputs the reassigned genotype assignment result, and the reassignment process ends.
- When the SNPi has one cluster in step S51 (No), the process proceeds to step S58.
- In step S58, the third labeling unit 12 determines whether the genotype assigned to the SNPi is AB. If the genotype AB is assigned to the SNPi (Yes), the process proceeds to step S59.
- In step S59, the third labeling unit 12 reassigns the genotype to the one cluster 1 of the SNPi using the assignment method D.
- the allocation method D will be described later. Thereafter, the third labeling unit 12 outputs the reassigned genotype assignment result, and the reassignment process ends.
- When the genotype AB is not assigned to the SNPi in step S58 (No), the third labeling unit 12 outputs the assignment result obtained from the second labeling unit 9 as it is, and the reassignment process ends.
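- The branching of steps S51 to S59 can be summarized in a small dispatch function. This is a sketch of the flow as described in the text; the function name `choose_reassignment` and the convention of returning `None` when the result of the second labeling unit is kept are hypothetical.

```python
def choose_reassignment(genotypes):
    """Return 'A', 'B', 'C' or 'D' for the assignment method to apply,
    or None when the existing assignment is output as it is.

    genotypes: list with the genotype of each cluster of a one- or two-cluster SNP.
    """
    if len(genotypes) == 2:                              # S51: two clusters
        first, second = genotypes
        if first != second:                              # S52: different genotypes
            # S53: keep the result if AB is present, otherwise method A (S54).
            return None if "AB" in (first, second) else "A"
        # S55: same genotype on both clusters; AB -> method B (S56), else method C (S57).
        return "B" if first == "AB" else "C"
    if len(genotypes) == 1:                              # S51: one cluster
        # S58: AB -> method D (S59), otherwise keep the result.
        return "D" if genotypes[0] == "AB" else None
    raise ValueError("expected a SNP with one or two clusters")


method = choose_reassignment(["AA", "BB"])
```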
- (Assignment method A) First, the allocation method A will be described. Reassignment by assignment method A is performed when genotypes AA and BB are assigned to the two clusters 1 and 2 of SNPi.
- The possibility that the genotypes of a certain human ethnic group are divided into only genotype AA and genotype BB is extremely low biologically. This is because the child of a mother (father) of genotype AA and a father (mother) of genotype BB becomes genotype AB with a probability of 50%. Therefore, from the biological viewpoint, it is determined that the reliability of the assignment result is low.
- The third labeling unit 12 first acquires the probability distribution model and the SNPi representative value data. Accordingly, the probability density functions f AA (x), f AB (x), and f BB (x), the representative value CLU (i, 1) of the cluster 1, and the representative value CLU (i, 2) of the cluster 2 are acquired.
- The third labeling unit 12 substitutes each representative value into the probability density function f AB (x) to obtain a probability density f AB (CLU (i, 1)) and a probability density f AB (CLU (i, 2)). Then, the third labeling unit 12 reassigns the genotype AB to the cluster having the larger probability density f AB (x). The genotype of the cluster having the smaller probability density f AB (x) remains unchanged.
- FIG. 42 is a diagram illustrating the allocation method A.
- In the example of FIG. 42, genotype AA is assigned to cluster 1 and genotype BB is assigned to cluster 2.
- Since f AB (CLU (i, 1)) < f AB (CLU (i, 2)), the third labeling unit 12 reassigns the genotype AB to the cluster 2.
- As a result, the genotype of cluster 1 is AA and the genotype of cluster 2 is AB.
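- Assignment method A can be sketched as follows. The names `normal_pdf` and `reassign_method_a`, the representative values, and the AB parameters are hypothetical. When the two densities are exactly equal, this sketch picks the first cluster, a tie rule the text does not specify.

```python
import math


def normal_pdf(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def reassign_method_a(labels, clu, ab_params):
    """Reassign AB to the cluster whose representative value has the larger f_AB.

    labels: cluster id -> genotype (AA on one cluster, BB on the other).
    clu: cluster id -> representative value.
    ab_params: (mean, variance) of the AB density.
    """
    target = max(clu, key=lambda c: normal_pdf(clu[c], *ab_params))
    updated = dict(labels)
    updated[target] = "AB"
    return updated


result = reassign_method_a({1: "AA", 2: "BB"}, {1: 10.0, 2: 60.0}, (45.0, 20.0))
```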
- Reassignment by the allocation method B is performed when the genotype AB is allocated to both of the two clusters 1 and 2 of the SNPi. Since the same genotype is assigned to the two clusters, it is determined that the reliability of the assignment result is low.
- The third labeling unit 12 first acquires the probability distribution model and the SNPi representative value data. Accordingly, the probability density functions f AA (x), f AB (x), and f BB (x), the representative value CLU (i, 1) of the cluster 1, and the representative value CLU (i, 2) of the cluster 2 are acquired.
- The third labeling unit 12 substitutes each representative value into the probability density function f AB (x) to obtain a probability density f AB (CLU (i, 1)) and a probability density f AB (CLU (i, 2)). Then, the third labeling unit 12 reassigns one of the genotypes AA and BB to the cluster having the smaller probability density f AB (x). The genotype of the cluster having the larger probability density f AB (x) remains AB.
- The third labeling unit 12 calculates the probability densities f AA (x) and f BB (x) of the cluster having the smaller probability density f AB (x).
- When f AA (x) > f BB (x), the third labeling unit 12 reassigns the genotype AA to the cluster having the smaller probability density f AB (x).
- When f AA (x) ≤ f BB (x), the third labeling unit 12 reassigns the genotype BB to the cluster having the smaller probability density f AB (x).
- FIG. 43 is a diagram for explaining the allocation method B.
- In the example of FIG. 43, genotype AB is assigned to clusters 1 and 2, and the third labeling unit 12 reassigns the genotype BB to the cluster 2.
- As a result, the genotype of cluster 1 is AB and the genotype of cluster 2 is BB.
- The genotype of one cluster is left as AB because, as described above, the possibility that the genotypes are divided into only AA and BB is considered to be extremely low biologically.
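- Assignment method B can be sketched as follows. The names `normal_pdf` and `reassign_method_b` and all numeric values except the AA parameters (mean 17, variance 20, taken from the display example in the text) are hypothetical.

```python
import math


def normal_pdf(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def reassign_method_b(clu, model):
    """Both clusters were labelled AB: keep AB on the cluster with the larger f_AB
    and give AA or BB to the other cluster.

    clu: cluster id -> representative value.
    model: genotype -> (mean, variance).
    """
    def density(genotype, cluster):
        return normal_pdf(clu[cluster], *model[genotype])

    weak = min(clu, key=lambda c: density("AB", c))   # cluster with the smaller f_AB
    updated = {c: "AB" for c in clu}
    # AA when f_AA > f_BB at that cluster, otherwise BB.
    updated[weak] = "AA" if density("AA", weak) > density("BB", weak) else "BB"
    return updated


model = {"AA": (17.0, 20.0), "AB": (45.0, 20.0), "BB": (73.0, 20.0)}
result = reassign_method_b({1: 44.0, 2: 62.0}, model)
```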
- (Assignment method C) Next, the allocation method C will be described.
- Reallocation by the allocation method C is performed when the genotype AA or the genotype BB is allocated to both of the two clusters 1 and 2 of the SNPi. Since the same genotype is assigned to the two clusters, it is determined that the reliability of the assignment result is low.
- The third labeling unit 12 first acquires the probability distribution model and the SNPi representative value data. Accordingly, the probability density functions f AA (x), f AB (x), and f BB (x), the representative value CLU (i, 1) of the cluster 1, and the representative value CLU (i, 2) of the cluster 2 are acquired.
- When the genotype AA is assigned to both clusters, the third labeling unit 12 substitutes each representative value into the probability density function f AA (x) to obtain the probability density f AA (CLU (i, 1)) and the probability density f AA (CLU (i, 2)). Then, the third labeling unit 12 reassigns the genotype AB to the cluster having the smaller probability density f AA (x). The genotype of the cluster having the larger probability density f AA (x) remains AA.
- When the genotype BB is assigned to both clusters, the third labeling unit 12 substitutes each representative value into the probability density function f BB (x) to obtain the probability density f BB (CLU (i, 1)) and the probability density f BB (CLU (i, 2)). Then, the third labeling unit 12 reassigns the genotype AB to the cluster having the smaller probability density f BB (x). The genotype of the cluster having the larger probability density f BB (x) remains BB.
- FIG. 44 is a diagram for explaining the allocation method C.
- In the example of FIG. 44, genotype AA is assigned to clusters 1 and 2, and the third labeling unit 12 reassigns the genotype AB to the cluster 2.
- As a result, the genotype of cluster 1 is AA and the genotype of cluster 2 is AB.
- The genotype of one cluster is reassigned to AB because, as described above, the possibility that the genotypes are separated into only AA and BB is considered to be extremely low biologically.
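- Assignment method C can be sketched as follows. The names `normal_pdf` and `reassign_method_c` and the representative values are hypothetical; the AA parameters reuse the example mean 17 and variance 20 from the text, and the BB parameters are made up.

```python
import math


def normal_pdf(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def reassign_method_c(genotype, clu, model):
    """Both clusters share the homozygous genotype AA or BB: the cluster with the
    smaller density under that genotype is reassigned to AB.

    genotype: "AA" or "BB", the genotype currently on both clusters.
    clu: cluster id -> representative value.
    model: genotype -> (mean, variance).
    """
    if genotype not in ("AA", "BB"):
        raise ValueError("method C applies only to AA/AA or BB/BB")
    weak = min(clu, key=lambda c: normal_pdf(clu[c], *model[genotype]))
    updated = {c: genotype for c in clu}
    updated[weak] = "AB"
    return updated


model = {"AA": (17.0, 20.0), "BB": (73.0, 20.0)}
result = reassign_method_c("AA", {1: 15.0, 2: 28.0}, model)
```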
- (Assignment method D) Next, the allocation method D will be described.
- Reallocation by the allocation method D is performed when the genotype AB is allocated to a one-cluster SNPi.
- The possibility that the genotype of a human ethnic group is only genotype AB is extremely low biologically. This is because, if the parents are of genotype AB, homozygous offspring of genotype AA or BB appear with a probability of about 50%.
- If the genotypes of all members of a large group are AB, the only conceivable parents for each member are a combination of a mother (father) of genotype AA and a father (mother) of genotype BB. Therefore, from the biological viewpoint, it is determined that the reliability of the assignment result is low.
- the third labeling unit 12 first acquires the probability distribution model and the SNPi representative value data. Accordingly, the probability density functions f AA (x), f AB (x), f BB (x) and the representative value CLU (i, 1) of the cluster 1 are acquired.
- The third labeling unit 12 substitutes the representative value CLU (i, 1) into the probability density functions f AA (x) and f BB (x) to obtain the probability densities f AA (CLU (i, 1)) and f BB (CLU (i, 1)). Then, the third labeling unit 12 reassigns the genotype AA to the cluster 1 when f AA (CLU (i, 1)) > f BB (CLU (i, 1)), and reassigns the genotype BB to the cluster 1 when f AA (CLU (i, 1)) ≤ f BB (CLU (i, 1)).
- FIG. 45 is a diagram for explaining the allocation method D.
- In the example of FIG. 45, genotype AB is assigned to cluster 1, and the third labeling unit 12 reassigns the genotype AA to the cluster 1.
- As a result, the genotype of cluster 1 is AA in the allocation result after the reallocation.
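- Assignment method D compares only the two homozygous densities. In the sketch below the names `normal_pdf` and `reassign_method_d` and the representative values are hypothetical; the AA parameters reuse the example mean 17 and variance 20 from the text, and the BB parameters are made up.

```python
import math


def normal_pdf(x, mean, variance):
    """Normal probability density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def reassign_method_d(clu_value, model):
    """A single cluster was labelled AB: return AA or BB, whichever has the
    larger density at the cluster's representative value (BB on a tie).

    model: genotype -> (mean, variance).
    """
    f_aa = normal_pdf(clu_value, *model["AA"])
    f_bb = normal_pdf(clu_value, *model["BB"])
    return "AA" if f_aa > f_bb else "BB"


model = {"AA": (17.0, 20.0), "BB": (73.0, 20.0)}
genotype = reassign_method_d(40.0, model)
```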
- a genotype can be reassigned to a cluster to which a genotype with low reliability is assigned using biological knowledge. Therefore, the reliability of genotype assignment can be improved, and as a result, genotype determination accuracy can be improved.
- In the third embodiment, the third labeling unit 12 reassigns genotypes using a second representative value.
- the second representative value is a representative value of a type different from the representative value used by the first labeling unit 6 and the second labeling unit 9 (hereinafter referred to as “first representative value”). Therefore, in this embodiment, at least two types of representative values including the first representative value and the second representative value are calculated.
- the second representative value may be calculated based on the signal strengths A and B.
- Such representative values include, for example, the regression coefficient of the regression line of each cluster, the arc tangent of the regression coefficient, the slope of the approximate line passing through the origin, the correlation coefficient of each cluster, the cluster center value, the cluster median, the cluster variance, the average ratio, or the average difference.
- the second representative value may not be calculated based on the signal strengths A and B.
- An example of such a representative value is the number of specimens.
- the number of samples is the number of samples included in each cluster.
- the method of determining the reliability of the genotype by the third labeling unit 12 is the same as in the second embodiment (see the flowchart in FIG. 41).
- allocation methods A to C are different from those in the second embodiment. Therefore, allocation methods A to C in the present embodiment will be described.
- the first representative value is the slope of the approximate straight line of the cluster, and the second representative value is the number of samples.
- (Assignment method A) First, the allocation method A will be described. Reassignment by assignment method A is performed when genotypes AA and BB are assigned to the two clusters 1 and 2 of SNPi.
- the third labeling unit 12 reassigns the genotype AB to a cluster having a small number of samples. This is because a cluster with a small number of specimens is considered to have a low genotype assignment reliability. The genotype of a cluster with a large number of specimens remains the same.
- FIG. 46 is a diagram for explaining an allocation method A in the present embodiment.
- genotype AA is assigned to cluster 1 and genotype BB is assigned to cluster 2.
- the number of samples in cluster 1 is 10, and the number of samples in cluster 2 is 100.
- the third labeling unit 12 reassigns the genotype AB to the cluster 1.
- the genotype of cluster 1 is AB and the genotype of cluster 2 is BB.
- the allocation method B is performed when the genotype AB is allocated to the two clusters 1 and 2 of the SNPi.
- the third labeling unit 12 reassigns any of the genotypes AA and BB to the cluster having a small number of samples. This is because a cluster with a small number of specimens is considered to have a low genotype assignment reliability. The genotype of the cluster with a large number of specimens remains AB.
- The third labeling unit 12 may reassign a genotype to the cluster with the smaller number of specimens by the same method as in the second embodiment. That is, the third labeling unit 12 calculates the probability densities f AA (x) and f BB (x), reassigns the genotype AA if f AA (x) > f BB (x), and reassigns the genotype BB if f AA (x) ≤ f BB (x).
- FIG. 47 is a diagram for explaining an allocation method B in the present embodiment.
- genotype AB is assigned to clusters 1 and 2.
- the number of samples in cluster 1 is 10, the number of samples in cluster 2 is 100, and f AA (CLU (i, 1))> f BB (CLU (i, 1)).
- the third labeling unit 12 reassigns the genotype AA to the cluster 1.
- the genotype of cluster 1 is AA and the genotype of cluster 2 is AB.
- the allocation method C is performed when the genotype AA or the genotype BB is allocated to the two clusters 1 and 2 of the SNPi.
- the third labeling unit 12 reassigns the genotype AB to a cluster having a small number of samples. This is because a cluster with a small number of specimens is considered to have a low genotype assignment reliability. The genotype of a cluster with a large number of specimens remains the same.
- FIG. 48 is a diagram for explaining an allocation method C in the present embodiment.
- genotype AA is assigned to clusters 1 and 2.
- the number of samples in cluster 1 is 10, and the number of samples in cluster 2 is 100.
- the third labeling unit 12 reassigns the genotype AB to the cluster 1.
- the genotype of cluster 1 is AB and the genotype of cluster 2 is AA.
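- Allocation methods A and C of this embodiment share one rule: the cluster with fewer samples receives genotype AB. A minimal sketch, with the hypothetical function name `reassign_ab_to_smaller` and the sample counts 10 and 100 taken from the examples above:

```python
def reassign_ab_to_smaller(labels, sample_counts):
    """Reassign AB to the cluster that contains fewer samples.

    labels: cluster id -> genotype before reassignment.
    sample_counts: cluster id -> number of samples in the cluster.
    """
    smaller = min(sample_counts, key=sample_counts.get)
    updated = dict(labels)
    updated[smaller] = "AB"
    return updated


counts = {1: 10, 2: 100}
method_a_result = reassign_ab_to_smaller({1: "AA", 2: "BB"}, counts)
method_c_result = reassign_ab_to_smaller({1: "AA", 2: "AA"}, counts)
```

- The text does not say what happens when both clusters contain the same number of samples; this sketch then picks the first cluster, which is an arbitrary choice.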
- In this embodiment, the genotype is reassigned using the second representative value.
- Reassignment using the second representative value can improve the reliability of genotype reassignment and, as a result, the genotype determination accuracy.
- The method of the present embodiment and the method of the second embodiment can be used in combination. For example, a threshold value may be set for the number of samples. If at least one of the numbers of samples in the clusters 1 and 2 is equal to or less than the threshold value, the genotype is reassigned by the method of the present embodiment, and if both are larger than the threshold value, the genotype is reassigned by the method of the second embodiment.
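- The combined use of the two embodiments can be sketched as a simple selector. The function name `choose_strategy`, the returned strings, and the threshold value 20 are hypothetical.

```python
def choose_strategy(sample_counts, threshold):
    """Select which reassignment method to use for a two-cluster SNP.

    Returns "sample_count" (method of this embodiment) when at least one cluster
    has a number of samples equal to or less than the threshold, and
    "probability_density" (method of the second embodiment) otherwise.
    """
    if any(n <= threshold for n in sample_counts.values()):
        return "sample_count"
    return "probability_density"


strategy = choose_strategy({1: 10, 2: 100}, threshold=20)
```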
- The model construction unit 7 may construct a second probability distribution model based on the second representative value, the model DB 8 may store the second probability distribution model, and the third labeling unit 12 may perform genotype reallocation based on the second representative value and the second probability distribution model.
- The representative value calculation unit 4 may calculate three or more types of representative values for each cluster, and the third labeling unit 12 may reassign genotypes using two or more types of representative values other than the first representative value.
- The display unit 11 acquires the SNPi signal strength data, cluster data, and representative value data from the signal strength DB 1, the cluster DB 3, and the representative value DB 5, respectively, and can use the acquired data to display a screen on the display device 103.
- This screen displays the type of SNP being displayed (SNPi), a plurality of samples plotted on the signal intensity plane, the clusters (clusters 1 and 2) and cluster centers generated for SNPi, and a table showing the representative value (CLU) calculated for each cluster.
- the representative value of cluster 1 is 11.81.
- When the display unit 11 displays such a screen, the user of the determination apparatus can easily grasp the clusters and the representative values. Note that, as in the third embodiment, when multiple types of representative values are calculated, the representative value table in FIG.
- the clustering result and the genotype determination result are visualized and displayed.
- The display unit 11 acquires the SNPi signal strength data, cluster data, and determination result from the signal strength DB 1, the cluster DB 3, and the determination result DB 10, respectively, and can use the acquired data to display a screen on the display device 103.
- the screen of FIG. 50 includes the type of SNP being displayed (SNPi), a plurality of samples plotted on the signal intensity plane, the clusters (clusters 1 and 2) and cluster centers generated for the SNPi, A table showing the genotype assigned to each cluster is displayed.
- the genotype of cluster 1 is AA.
- the user of the determination apparatus can easily grasp the determination result (allocation result) of the cluster and the genotype.
- the display unit 11 can acquire probability distribution model data (such as parameters) from the model DB 8, and can display the screen of FIG. 51 on the display device 103 using the acquired data.
- The screen shown in FIG. 51 shows a graph of the probability distribution model and a table indicating the type (normal distribution) and parameters (μ, σ) of each probability density function constituting the probability distribution model.
- The probability density function f AA (x) follows a normal distribution whose mean μ AA is 17 and whose variance σ AA is 20.
- the probability density calculated to determine the genotype of the cluster is plotted on the graph of FIG. Filled circles are plotted on the probability density functions of the genotypes assigned to the clusters, and hollow circles are plotted on the probability density functions of the other genotypes.
- the user of the determination apparatus can easily grasp the constructed probability distribution model and the basis (probability density) of genotype assignment.
- The probability density used for reassignment may be plotted on the probability density function as shown in FIG. 52. In FIG. 52, the probability density used for reassignment is plotted with a square so that it can be distinguished from the probability density used by the second labeling unit 9 for assignment.
- the present invention is not limited to the above-described embodiments as they are, and can be embodied by modifying the constituent elements without departing from the scope of the invention in the implementation stage.
- various inventions can be formed by appropriately combining a plurality of the constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from the full set shown in each embodiment. Furthermore, constituent elements described in different embodiments may be combined as appropriate.
Abstract
Description
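The first embodiment is described below with reference to FIGS. 7 to 39.
First, the representative value calculation process in step S1 is described. FIG. 22 is a flowchart showing the representative value calculation process. In the following, the representative value is assumed to be the slope of an approximate curve passing through the origin on the signal intensity plane.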
$$CLU(i,j) = \tan^{-1}\left( \frac{\operatorname{average} B(i,j)}{\operatorname{average} A(i,j)} \right) \quad \cdots (1)$$
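Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `representative_value`, the list-based inputs, and the sample intensities are assumptions made here, and the code assumes non-negative signal intensities so that `math.atan2` agrees with the arctangent of the ratio.

```python
import math
from statistics import fmean


def representative_value(a_signals, b_signals):
    """Return arctan(mean(B) / mean(A)) in radians for one cluster.

    a_signals, b_signals: allele A and allele B signal intensities of the
    samples in the cluster (hypothetical input layout).
    """
    mean_a = fmean(a_signals)
    mean_b = fmean(b_signals)
    # For mean_a > 0 this equals atan(mean_b / mean_a); atan2 also avoids a
    # division by zero when mean_a == 0 (assumes intensities are >= 0).
    return math.atan2(mean_b, mean_a)


# Hypothetical cluster with strong A signal and weak B signal.
clu = representative_value([100.0, 120.0, 110.0], [10.0, 12.0, 11.0])
```

Under these assumptions, a cluster dominated by allele A yields a value near 0 and a cluster dominated by allele B yields a value near π/2.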
Next, the genotype assignment process in step S2 for 3-cluster SNPs (SNPs classified into three clusters) is described. FIG. 28 is a flowchart showing the genotype assignment process for 3-cluster SNPs.
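The claims state that the first labeling unit assigns one homozygous genotype, the heterozygous genotype, and the other homozygous genotype in order of the clusters' representative values. The sketch below illustrates that ordering. The function name and dictionary layout are hypothetical, and mapping the smallest value to AA is an assumption that fits an arctangent-style representative value (small B/A ratio for AA).

```python
def label_three_clusters(rep_values):
    """Assign AA, AB, BB to three clusters in ascending order of value.

    rep_values: dict mapping cluster id -> representative value.
    Returns a dict mapping cluster id -> genotype.
    """
    if len(rep_values) != 3:
        raise ValueError("expected exactly three clusters")
    # Sort cluster ids by representative value, smallest first.
    ordered = sorted(rep_values, key=rep_values.get)
    # Assumed direction: smallest value -> AA, middle -> AB, largest -> BB.
    return dict(zip(ordered, ("AA", "AB", "BB")))


labels = label_three_clusters({1: 1.40, 2: 0.15, 3: 0.80})
```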
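Next, the probability distribution model construction process in step S3 is described. FIG. 34 is a flowchart showing the probability distribution model construction process. In the following, the probability distribution model is assumed to be constructed using normal distributions.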
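One way to picture the model construction is to fit one normal distribution per genotype to the representative values of the labeled 3-cluster SNPs. The following is a sketch under that assumption only: the function name `build_model`, the pair-based input, and the numbers are hypothetical, and the fit uses the sample mean and sample standard deviation as provided by Python's `statistics.NormalDist.from_samples`.

```python
from statistics import NormalDist


def build_model(labeled_values):
    """Fit a normal distribution to the representative values of each genotype.

    labeled_values: iterable of (genotype, representative_value) pairs taken
    from clusters of 3-cluster SNPs that already carry a genotype label.
    Returns a dict mapping genotype -> NormalDist.
    """
    by_genotype = {}
    for genotype, value in labeled_values:
        by_genotype.setdefault(genotype, []).append(value)
    # from_samples needs at least two values per genotype.
    return {g: NormalDist.from_samples(v) for g, v in by_genotype.items()}


# Hypothetical representative values from two labeled 3-cluster SNPs.
model = build_model([
    ("AA", 0.10), ("AA", 0.14),
    ("AB", 0.76), ("AB", 0.80),
    ("BB", 1.42), ("BB", 1.46),
])
```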
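Next, the genotype assignment process in step S4 for 1- or 2-cluster SNPs (SNPs classified into one cluster or into two clusters) is described. FIG. 37 is a flowchart showing the genotype assignment process for 1- or 2-cluster SNPs.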
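According to the claims, the second labeling unit assigns to a cluster the genotype whose probability density at the cluster's representative value is greatest. A minimal sketch of that rule follows; the function name and the model parameters are hypothetical values chosen for illustration, not figures from the patent.

```python
from statistics import NormalDist


def assign_genotype(rep_value, model):
    """Return the genotype whose density at rep_value is the largest.

    model: dict mapping genotype -> NormalDist (one density per genotype).
    """
    return max(model, key=lambda genotype: model[genotype].pdf(rep_value))


# Hypothetical model parameters (mean, standard deviation) per genotype.
model = {
    "AA": NormalDist(0.12, 0.05),
    "AB": NormalDist(0.78, 0.08),
    "BB": NormalDist(1.44, 0.05),
}
first = assign_genotype(0.20, model)
second = assign_genotype(0.90, model)
```

Because each cluster is labeled independently, two clusters of the same SNP can receive the same genotype, which is the situation the reassignment methods of the later embodiments address.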
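The second embodiment is described below with reference to FIGS. 40 to 45. In this embodiment, it is determined whether the genotype assigned by the second labeling unit 9 is highly reliable. If the reliability is low, the genotype is reassigned. Biological knowledge is used for both the determination and the reassignment.
First, assignment method A is described. Reassignment by assignment method A is performed when genotypes AA and BB are assigned to the two clusters 1 and 2 of SNPi.
Next, assignment method B is described. Reassignment by assignment method B is performed when genotype AB is assigned to both of the two clusters 1 and 2 of SNPi. Because the same genotype is assigned to two clusters, this assignment result is judged to have low reliability.
Next, assignment method C is described. Reassignment by assignment method C is performed when genotype AA, or genotype BB, is assigned to both of the two clusters 1 and 2 of SNPi. Because the same genotype is assigned to two clusters, this assignment result is judged to have low reliability.
Next, assignment method D is described. Reassignment by assignment method D is performed when genotype AB is assigned to a 1-cluster SNPi.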
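The four trigger conditions above can be summarized as a small dispatch function. This sketch only decides which assignment method applies; it does not perform the reassignment itself, whose details are not given in this passage. The function name and the use of `None` for "no reassignment needed" are assumptions made here.

```python
def select_reassignment_method(genotypes):
    """Return "A", "B", "C", "D", or None for the clusters of one SNP.

    genotypes: list of the genotypes assigned to the SNP's one or two clusters.
    """
    if len(genotypes) == 1:
        # Method D: a single cluster labeled heterozygous.
        return "D" if genotypes[0] == "AB" else None
    if len(genotypes) == 2:
        first, second = genotypes
        if {first, second} == {"AA", "BB"}:
            return "A"  # two different homozygous genotypes
        if first == second == "AB":
            return "B"  # both clusters heterozygous
        if first == second:
            return "C"  # both clusters AA, or both clusters BB
        return None  # e.g. AA and AB: no reassignment condition matches
    raise ValueError("expected one or two clusters")


method = select_reassignment_method(["AA", "BB"])
```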
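The third embodiment is described below with reference to FIGS. 46 to 48. In this embodiment, the third labeling unit 12 reassigns genotypes using a second representative value. The second representative value is a representative value of a different type from the one used by the first labeling unit 6 and the second labeling unit 9 (hereinafter, the "first representative value"). Accordingly, in this embodiment at least two types of representative value are calculated, including the first representative value and the second representative value.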
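The claims name the number of samples in each cluster as one possible second representative value. Assuming that choice, it can be computed as below; the function name and the sample-to-cluster mapping are hypothetical.

```python
from collections import Counter


def second_representative_values(cluster_of_sample):
    """Count the samples in each cluster of one SNP.

    cluster_of_sample: dict mapping sample id -> cluster id.
    Returns a dict mapping cluster id -> number of samples.
    """
    return dict(Counter(cluster_of_sample.values()))


# Hypothetical SNP with four samples split across two clusters.
counts = second_representative_values({"s1": 1, "s2": 1, "s3": 2, "s4": 1})
```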
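First, assignment method A is described. Reassignment by assignment method A is performed when genotypes AA and BB are assigned to the two clusters 1 and 2 of SNPi.
Next, assignment method B is described. Reassignment by assignment method B is performed when genotype AB is assigned to both of the two clusters 1 and 2 of SNPi.
Next, assignment method C is described. Reassignment by assignment method C is performed when genotype AA, or genotype BB, is assigned to both of the two clusters 1 and 2 of SNPi.
The fourth embodiment is described below with reference to FIGS. 49 to 52. The fourth embodiment describes the screens that the display unit 11 causes the display device 103 to display. FIGS. 49 to 52 show examples of such screens.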
Claims (18)
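- A genotype determination device comprising:
a representative value calculation unit that, for clusters of samples for each SNP classified on the basis of the signal intensities of a plurality of samples at a plurality of SNPs measured by a DNA microarray, calculates a representative value of each cluster on the basis of the signal intensities of the samples included in that cluster;
a first labeling unit that assigns a genotype to each cluster of an SNP classified into three clusters, among the SNPs, on the basis of the representative value of each cluster;
a model construction unit that constructs a model indicating the relationship between the genotype of each cluster of the SNPs classified into three clusters, among the SNPs, and the representative value of each cluster; and
a second labeling unit that assigns a genotype to each cluster of an SNP classified into one or two clusters, among the SNPs, on the basis of the representative value of each cluster and the model.
- The genotype determination device according to claim 1, wherein the signal intensity is a fluorescence intensity or a current intensity, or a converted value obtained by conversion based on those values.
- The genotype determination device according to claim 1 or claim 2, wherein the representative value is a regression coefficient of a regression line of the samples included in the cluster, an arctangent of the regression coefficient, a slope of an approximate straight line passing through the origin, a correlation coefficient, a cluster center value, a cluster median, a cluster variance, a mean of ratios, or a mean of differences.
- The genotype determination device according to any one of claims 1 to 3, wherein the first labeling unit assigns, in order of the representative values of the clusters, the genotype of one homozygote, the genotype of the heterozygote, and the genotype of the other homozygote.
- The genotype determination device according to any one of claims 1 to 4, wherein the model is a probability density function that follows a probability distribution of the representative value for each genotype.
- The genotype determination device according to claim 5, wherein the probability distribution is a Gaussian mixture distribution, a normal distribution, a beta distribution, or an F distribution.
- The genotype determination device according to any one of claims 1 to 6, wherein the second labeling unit assigns to the cluster the genotype for which the probability density of the representative value is greatest.
- The genotype determination device according to any one of claims 1 to 7, further comprising a third labeling unit that, when different homozygous genotypes are respectively assigned to the clusters of an SNP classified into two clusters, reassigns a heterozygous genotype to one of the clusters on the basis of the representative values of the clusters.
- The genotype determination device according to any one of claims 1 to 8, further comprising a third labeling unit that, when a heterozygous genotype is assigned to each of the clusters of an SNP classified into two clusters, reassigns a homozygous genotype to one of the clusters on the basis of the representative values of the clusters.
- The genotype determination device according to any one of claims 1 to 9, further comprising a third labeling unit that, when the same homozygous genotype is assigned to each of the clusters of an SNP classified into two clusters, reassigns a heterozygous genotype to one of the clusters on the basis of the representative values of the clusters.
- The genotype determination device according to any one of claims 1 to 10, further comprising a third labeling unit that, when a heterozygous genotype is assigned to the cluster of an SNP classified into one cluster, reassigns a homozygous genotype.
- The genotype determination device according to any one of claims 1 to 11, wherein the representative value calculation unit calculates a second representative value of each cluster for each SNP.
- The genotype determination device according to claim 12, wherein the second representative value is the number of samples included in each cluster.
- The genotype determination device according to claim 12 or claim 13, further comprising a third labeling unit that, when different homozygous genotypes are respectively assigned to the clusters of an SNP classified into two clusters, reassigns a heterozygous genotype to one of the clusters on the basis of the second representative value.
- The genotype determination device according to any one of claims 12 to 14, further comprising a third labeling unit that, when a heterozygous genotype is assigned to each of the clusters of an SNP classified into two clusters, reassigns a homozygous genotype to one of the clusters on the basis of the second representative value.
- The genotype determination device according to any one of claims 12 to 15, further comprising a third labeling unit that, when the same homozygous genotype is assigned to each of the clusters of an SNP classified into two clusters, reassigns a heterozygous genotype to one of the clusters on the basis of the second representative value.
- The genotype determination device according to any one of claims 1 to 16, further comprising a display unit that displays at least one of the model, the determination result, and the representative value.
- A genotype determination method comprising:
a step of calculating, for clusters of samples for each SNP classified on the basis of the signal intensities of a plurality of samples at a plurality of SNPs measured by a DNA microarray, a representative value of each cluster on the basis of the signal intensities of the samples included in that cluster;
a step of assigning a genotype to each cluster of an SNP classified into three clusters, among the SNPs, on the basis of the representative value of each cluster;
a step of constructing a model indicating the relationship between the genotype of each cluster of the SNPs classified into three clusters, among the SNPs, and the representative value of each cluster; and
a step of assigning a genotype to each cluster of an SNP classified into one or two clusters, among the SNPs, on the basis of the representative value of each cluster and the model.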
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201580077795.9A CN107533591A (zh) | 2015-04-01 | 2015-04-01 | Genotype determination device and method |
GB1713894.2A GB2551091A (en) | 2015-04-01 | 2015-04-01 | Genotype determination device and method |
JP2017509089A JP6367473B2 (ja) | 2015-04-01 | 2015-04-01 | Genotype determination device and method |
PCT/JP2015/060368 WO2016157473A1 (ja) | 2015-04-01 | 2015-04-01 | Genotype determination device and method |
US15/693,268 US20170364632A1 (en) | 2015-04-01 | 2017-08-31 | Genotyping device and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2015/060368 WO2016157473A1 (ja) | 2015-04-01 | 2015-04-01 | Genotype determination device and method |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/693,268 Continuation US20170364632A1 (en) | 2015-04-01 | 2017-08-31 | Genotyping device and method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016157473A1 (ja) | 2016-10-06 |
Family
ID=57004114
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/060368 WO2016157473A1 (ja) | Genotype determination device and method | 2015-04-01 | 2015-04-01 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20170364632A1 (ja) |
JP (1) | JP6367473B2 (ja) |
CN (1) | CN107533591A (ja) |
GB (1) | GB2551091A (ja) |
WO (1) | WO2016157473A1 (ja) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110033829B (zh) * | 2019-04-11 | 2021-07-23 | 北京诺禾心康基因科技有限公司 | Fusion detection method for homologous genes based on differential SNP markers |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005531853A (ja) * | 2002-06-28 | 2005-10-20 | アプレラ コーポレイション | System and method for SNP genotype clustering |
JP2006107396A (ja) * | 2004-10-08 | 2006-04-20 | Institute Of Physical & Chemical Research | SNP genotype classification method, SNP genotype classification device, and SNP genotype classification program |
JP2008533558A (ja) * | 2005-02-10 | 2008-08-21 | アプレラ コーポレイション | Normalization method for genotype analysis |
WO2013073929A1 (en) * | 2011-11-15 | 2013-05-23 | Acgt Intellectual Limited | Method and apparatus for detecting nucleic acid variation(s) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005071594A1 (en) * | 2004-01-23 | 2005-08-04 | King Faisal Specialist Hospital & Research Center | Estimation of signal thresholds for microarray data using mixture modeling |
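CN101570788A (zh) * | 2009-06-09 | 2009-11-04 | 华东师范大学 | Method for identifying genotypes using an oligonucleotide polymorphism chip |
CN102952854B (zh) * | 2011-08-25 | 2015-01-14 | 深圳华大基因科技有限公司 | Single-cell classification and screening method and device therefor |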
2015
- 2015-04-01 CN CN201580077795.9A patent/CN107533591A/zh active Pending
- 2015-04-01 GB GB1713894.2A patent/GB2551091A/en not_active Withdrawn
- 2015-04-01 WO PCT/JP2015/060368 patent/WO2016157473A1/ja active Application Filing
- 2015-04-01 JP JP2017509089A patent/JP6367473B2/ja active Active
2017
- 2017-08-31 US US15/693,268 patent/US20170364632A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
JPWO2016157473A1 (ja) | 2017-12-21 |
JP6367473B2 (ja) | 2018-08-01 |
CN107533591A (zh) | 2018-01-02 |
US20170364632A1 (en) | 2017-12-21 |
GB201713894D0 (en) | 2017-10-11 |
GB2551091A (en) | 2017-12-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15887620; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2017509089; Country of ref document: JP; Kind code of ref document: A |
 | ENP | Entry into the national phase | Ref document number: 201713894; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20150401 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: PCT application non-entry in European phase | Ref document number: 15887620; Country of ref document: EP; Kind code of ref document: A1 |
 | ENPC | Correction to former announcement of entry into national phase, PCT application did not enter into the national phase | Ref country code: GB |