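WO2012091093A1 - Integrated glaucoma determination method using a glaucoma diagnostic chip and modified proteomics cluster analysis - Google Patents
Integrated glaucoma determination method using a glaucoma diagnostic chip and modified proteomics cluster analysis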
- Publication number
- WO2012091093A1 (application PCT/JP2011/080393)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- individual
- physiological state
- unit
- analysis
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/40—Population genetics; Linkage disequilibrium
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Definitions
- The present invention relates to an apparatus for discriminating an attribute of the physiological state of an individual mammal, a method for discriminating such an attribute, an apparatus for generating a discriminator used in the method, and a program for discriminating an attribute of an individual's physiological state.
- Glaucoma is a disease that causes characteristic optic disc cupping and visual field impairment resulting from retinal ganglion cell death. Elevated intraocular pressure is the main cause of optic disc cupping and visual field impairment in glaucoma. There is also glaucoma in which the intraocular pressure stays within the statistically calculated normal range, but even in this case glaucoma is considered to develop because the intraocular pressure is high enough to cause visual field impairment in that individual.
- The basic treatment of glaucoma is to keep the intraocular pressure low, which requires considering the cause of the elevated pressure. For this reason, in glaucoma diagnosis it is important to classify the type of glaucoma according to the level of intraocular pressure and its cause.
- An important cause of elevated intraocular pressure is blockage of the angle, which is the main drainage route of the aqueous humor filling the eye. From these viewpoints, primary glaucoma is broadly divided into angle-closure glaucoma, in which the angle is blocked, and open-angle glaucoma, in which it is not. Open-angle glaucoma is further classified into open-angle glaucoma in the narrow sense, that is, primary open-angle glaucoma accompanied by elevated intraocular pressure, and normal-tension glaucoma, in which the intraocular pressure is within the normal range.
- Single nucleotide polymorphisms (SNPs) are substitution mutations in which one base changes to another in the base sequence of an individual's genome. Such mutations are present in the population of the species at a certain frequency, typically about 1% or more. SNPs exist in the introns or exons of genes, or in other genomic regions.
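- The frequency criterion above can be illustrated with a minimal Python sketch. The function names, the diploid assumption, and the exact 0.01 cut-off are illustrative choices, not taken from the application; the text only says "about 1% or more".

```python
from typing import Sequence


def allele_frequency(allele_counts: Sequence[int], ploidy: int = 2) -> float:
    """Frequency of the counted allele, given each individual's copy number (0..ploidy)."""
    if not allele_counts:
        raise ValueError("at least one individual is required")
    return sum(allele_counts) / (ploidy * len(allele_counts))


def meets_snp_frequency(allele_counts: Sequence[int], threshold: float = 0.01) -> bool:
    """True if the rarer of the two alleles reaches the (assumed) 1% threshold."""
    freq = allele_frequency(allele_counts)
    minor = min(freq, 1.0 - freq)
    return minor >= threshold


# Hypothetical population: 95 individuals with 0 copies, 5 heterozygotes.
common_variant = [0] * 95 + [1] * 5
# Hypothetical population: a single heterozygote among 1000 individuals.
rare_variant = [0] * 999 + [1]
```

Here `meets_snp_frequency(common_variant)` is true and `meets_snp_frequency(rare_variant)` is false, because the rarer allele accounts for 2.5% and 0.05% of all allele copies respectively.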
- Patent Document 2 describes a comprehensive analysis of known polymorphic sites on the genome (autosomes) of glaucoma patients and of non-patients with no family history of glaucoma, which found SNPs related to the development of glaucoma.
- Patent Document 3 describes a comprehensive analysis of known polymorphic sites on the genome of glaucoma patients with rapid progression and those with slow progression, which found SNPs related to the progression of glaucoma.
- Patent Document 4 describes a transgenic mouse expressing a mutant mouse WDR36 polypeptide that carries a mutation corresponding to the deletion of amino acid residues 657 to 659, including the aspartic acid residue at position 658, of the human WDR36 polypeptide. The mouse is described as reproducibly showing a phenotype reflecting glaucoma, in which damage occurs in the peripheral retina. Furthermore, Patent Document 5 describes a comprehensive analysis of known polymorphic sites on the genome (in particular, autosomes) of glaucoma patients and non-glaucoma patients, which found SNPs related to glaucoma.
- Patent Document 6 describes that some known and previously unknown SNPs were shown to be associated with the onset of optic neuropathies, including glaucoma and Leber's disease. Furthermore, Patent Document 7 describes a comparison of the genomic DNA of patients with open-angle glaucoma (OAG) with that of healthy individuals, which found that a specific SNP of PTGIR is closely related to the onset of glaucoma.
- Glaucoma diagnostic methods have also been described that use an antibody specifically recognizing TIGR, a glucocorticoid-induced protein produced by trabecular meshwork cells (Patent Document 8), or that quantify TGF-β in aqueous humor (Non-Patent Document 2).
- Patent Document 9 describes that a blood protein marker specifically detected in glaucoma patients was found by proteomic analysis of blood samples of glaucoma patients and other eye disease patients. In addition, various novel marker candidates have been reported by proteome analysis using ocular tissues (Non-patent Documents 3 and 4).
- However, the techniques of Patent Documents 2, 3, 5, and 6 cite only SNPs, which are congenital factors, as factors of glaucoma. Because many acquired factors are also related to glaucoma, these conventional techniques leave room for further improvement in the accuracy of determining the onset and progression of glaucoma.
- In Non-Patent Document 3, only proteome-level factors are cited as factors of glaucoma. These conventional techniques therefore also leave room for further improvement in the accuracy of determining the onset, progression, and prognosis of glaucoma.
- The present invention has been made in view of the above circumstances, and its object is to provide a technique for accurately determining attributes of a mammal's physiological state, including the onset, infection, progression, and prognosis of various diseases.
- According to the present invention, an apparatus for discriminating an attribute of the physiological state of an individual mammal is provided.
- This apparatus includes a learning data set acquisition unit that acquires a learning data set for a group of individuals used for the machine learning described later, the group being obtained from a population of individuals of the same species as the test individual. The learning data set includes, for each individual, a combination of the attribute of the individual's physiological state, discrete data on the base sequence of the individual's genome, and continuous data on the amount of a specific substance in the individual's body.
- The apparatus includes a resampling unit that extracts, by random resampling from the learning data set, sub-data sets for a plurality of different sub-populations. Each sub-data set includes, for each individual in the sub-population, a combination of the physiological state attribute, the discrete data on the base sequence of the genome, and the continuous data on the amount of the specific substance in the body.
- The apparatus includes a first machine learning unit that performs machine learning on the patterns of physiological state attributes and discrete data in the plurality of sub-data sets, and obtains a plurality of different first discriminators that discriminate the physiological state attribute of each individual in the sub-data sets on the basis of the discrete data.
- The apparatus includes a second machine learning unit that performs machine learning on the patterns of physiological state attributes and continuous data in the plurality of sub-data sets, and obtains a plurality of different second discriminators that discriminate the physiological state attribute of each individual in the sub-data sets on the basis of the continuous data.
- The apparatus includes a subject data acquisition unit that acquires subject data for the test individual, consisting of a combination of discrete data on the base sequence of the genome obtained from the test individual and continuous data on the amount of the specific substance in the body of that individual.
- The apparatus includes a test data analysis unit that performs pattern analysis on the subject data with each of the plurality of first discriminators and each of the plurality of second discriminators, thereby generating a plurality of first discrimination results and a plurality of second discrimination results for the physiological state attribute of the test individual.
- The apparatus includes an integrated determination unit that integrates the first discrimination results and the second discrimination results for each physiological state attribute, and determines that the attribute discriminated most often across the first and second discrimination results is the attribute of the physiological state of the test individual.
- The apparatus includes an output unit that outputs the result of the integrated determination.
- According to the present invention, a method for discriminating an attribute of the physiological state of an individual mammal is also provided.
- This method includes a step of acquiring a learning data set for a group of individuals used for the machine learning described later, the group being obtained from a population of individuals of the same species as the test individual. The learning data set includes, for each individual, a combination of the attribute of the individual's physiological state, discrete data on the base sequence of the individual's genome, and continuous data on the amount of a specific substance in the individual's body.
- The method includes a step of extracting, by random resampling from the learning data set, sub-data sets for a plurality of different sub-populations. Each sub-data set includes, for each individual in the sub-population, a combination of the physiological state attribute, the discrete data on the base sequence of the genome, and the continuous data on the amount of the specific substance in the body.
- The method includes a step of performing machine learning on the patterns of physiological state attributes and discrete data in the plurality of sub-data sets to obtain a plurality of different first discriminators that discriminate the physiological state attribute of each individual on the basis of the discrete data. It also includes a step of performing machine learning on the patterns of physiological state attributes and continuous data in the plurality of sub-data sets to obtain a plurality of different second discriminators that discriminate the physiological state attribute of each individual on the basis of the continuous data.
- The method includes a step of acquiring subject data for the test individual, consisting of a combination of discrete data on the base sequence of the genome obtained from the test individual and continuous data on the amount of the specific substance in the body of that individual. It also includes a step of performing pattern analysis on the subject data with each of the plurality of first discriminators and each of the plurality of second discriminators, thereby generating a plurality of first discrimination results and a plurality of second discrimination results for the physiological state attribute of the test individual.
- The method includes a step of integrating the first discrimination results and the second discrimination results for each physiological state attribute, and determining that the attribute discriminated most often across the first and second discrimination results is the attribute of the physiological state of the test individual.
- The method also includes a step of outputting the result of the integrated determination.
- According to the present invention, an apparatus for generating the discriminators used in the above method is also provided.
- This apparatus includes a learning data set acquisition unit that acquires a learning data set for a group of individuals used for the machine learning described later, the group being obtained from a population of individuals of the same species as the test individual. The learning data set includes, for each individual, a combination of the attribute of the individual's physiological state, discrete data on the base sequence of the individual's genome, and continuous data on the amount of a specific substance in the individual's body.
- The apparatus includes a resampling unit that extracts, by random resampling from the learning data set, sub-data sets for a plurality of different sub-populations. Each sub-data set includes, for each individual in the sub-population, a combination of the physiological state attribute, the discrete data on the base sequence of the genome, and the continuous data on the amount of the specific substance in the body.
- The apparatus includes a first machine learning unit that performs machine learning on the patterns of physiological state attributes and discrete data in the plurality of sub-data sets, and obtains a plurality of different first discriminators that discriminate the physiological state attribute of each individual in the sub-data sets on the basis of the discrete data.
- The apparatus includes a second machine learning unit that performs machine learning on the patterns of physiological state attributes and continuous data in the plurality of sub-data sets, and obtains a plurality of different second discriminators that discriminate the physiological state attribute of each individual in the sub-data sets on the basis of the continuous data.
- The apparatus includes an output unit that outputs the first discriminators and the second discriminators.
- This apparatus first creates a plurality of different sub-data sets, each constituting a part of the initially obtained learning data set. For each sub-data set it then creates two types of discriminators by machine learning of data from different viewpoints: the discrete data on the base sequences of the genomes of the individuals in the sub-data set, and the continuous data on the amount of a specific substance in the bodies of those individuals. It is therefore possible to obtain a set of two types of discriminators with which the above method can accurately determine the attribute of the physiological state of a mammal.
- According to the present invention, an apparatus for discriminating an attribute of the physiological state of an individual mammal is also provided that includes a discriminator parameter acquisition unit, which acquires the first discriminators and the second discriminators generated by the above-described apparatus.
- The apparatus includes a subject data acquisition unit that acquires subject data for the test individual, consisting of a combination of discrete data on the base sequence of the genome obtained from the test individual and continuous data on the amount of the specific substance in the body of that individual.
- The apparatus includes a test data analysis unit that performs pattern analysis on the subject data with each of the plurality of first discriminators and each of the plurality of second discriminators, thereby generating a plurality of first discrimination results and a plurality of second discrimination results for the physiological state attribute of the test individual.
- The apparatus includes an integrated determination unit that integrates the first discrimination results and the second discrimination results for each physiological state attribute, and determines that the attribute discriminated most often across the first and second discrimination results is the attribute of the physiological state of the test individual.
- The apparatus includes an output unit that outputs the result of the integrated determination.
- In this apparatus, the two types of discriminators generated by the above-described apparatus are acquired, and the subject data for the test individual is subjected to pattern analysis using both types.
- The two types of discrimination results, obtained for the plurality of different sub-data sets, are each subtotaled.
- The subtotals are then integrated, and the physiological state attribute with the largest total is determined to be the attribute of the physiological state of the test individual. According to this apparatus, the attribute of the physiological state of a mammal can therefore be determined accurately.
- According to the present invention, the attribute of the physiological state of a mammal can be determined with high accuracy.
- A conceptual diagram explaining the details of the method by which the physiological state discriminating apparatus of this embodiment converts genotype data into numerical values usable for various types of analysis, and the mathematical expressions used for normalization. A functional block diagram explaining the configuration of the learning data set acquisition unit of the physiological state discriminating apparatus of this embodiment. A functional block diagram explaining the configuration of the resampling unit of the physiological state discriminating apparatus of this embodiment. Visual data explaining the principle of the principal component analysis used in the physiological state discriminating apparatus of this embodiment. Visual data explaining the principle of the principal component analysis used in the physiological state discriminating apparatus of this embodiment. Visual data explaining an analysis example of genotype data by the principal component analysis used with the physiological state discrimination
- FIG. 1 is a conceptual diagram for explaining the outline of the physiological state discriminating apparatus according to the present embodiment.
- First, a learning data set obtained from a group of individuals, such as glaucoma patients and healthy persons, is prepared.
- For each individual, this learning data set includes a combination of the attribute of the physiological state (such as the onset, progression, or prognosis of glaucoma), discrete data on the base sequence of the genome (such as the genotype expressed as the number of alleles), and continuous data on the amount of a specific substance in the body (such as blood cytokine concentration).
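- One entry of such a learning data set might be represented as in the following Python sketch. The class name, field names, SNP identifiers, and cytokine names are hypothetical and only illustrate the three components: attribute, discrete genotype data, and continuous substance amounts. The application's own numeric conversion and normalization are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IndividualRecord:
    """One individual in the learning data set (field names are illustrative)."""
    attribute: str                      # physiological state label, e.g. "onset" or "healthy"
    allele_counts: Dict[str, int]       # discrete data: SNP id -> copies of a chosen allele (0, 1 or 2)
    substance_levels: Dict[str, float]  # continuous data: substance name -> measured amount


def count_allele(genotype: str, allele: str) -> int:
    """Count how many copies of `allele` a diploid genotype such as "AG" carries."""
    if len(genotype) != 2:
        raise ValueError("expected a two-letter diploid genotype")
    return genotype.count(allele)


# Hypothetical individual with two genotyped SNPs and two measured cytokines.
record = IndividualRecord(
    attribute="onset",
    allele_counts={
        "snp_1": count_allele("AG", "G"),
        "snp_2": count_allele("TT", "T"),
    },
    substance_levels={"cytokine_a": 12.5, "cytokine_b": 3.1},
)
```

Encoding each genotype as an allele count gives the discrete feature vector, while the measured concentrations form the continuous feature vector for the same individual.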
- Next, a plurality of sub-data sets resampled from the learning data set are prepared.
- The plurality of sub-data sets are input to the first machine learning unit and the second machine learning unit, and machine learning such as principal component analysis, discriminant analysis, or SVM (support vector machine) is performed.
- The first machine learning unit performs machine learning on the relationship between the discrete data on the genome base sequence of each individual and the attribute of the physiological state.
- The second machine learning unit performs machine learning on the relationship between the continuous data on the amount of the specific substance in the body of each individual and the attribute of the physiological state. This learning is repeated N times (N being the number of input sub-data sets) to obtain N first discriminators and N second discriminators.
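- The resampling and training steps described for FIG. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the application: a nearest-centroid rule stands in for the principal component analysis, discriminant analysis, or SVM named in the text; each sub-data set is drawn without replacement as 80% of the learning data set; and all function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# (attribute, discrete features, continuous features) for one individual
Sample = Tuple[str, List[float], List[float]]
Discriminator = Callable[[Sequence[float]], str]


def train_centroid(rows: List[Tuple[str, Sequence[float]]]) -> Discriminator:
    """Stand-in learner (assumption): classify by the nearest per-attribute mean vector."""
    grouped: Dict[str, List[Sequence[float]]] = {}
    for label, features in rows:
        grouped.setdefault(label, []).append(features)
    centroids = {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

    def discriminate(x: Sequence[float]) -> str:
        def squared_distance(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
        # Sorting the labels makes tie-breaking deterministic.
        return min(sorted(centroids), key=squared_distance)

    return discriminate


def train_ensemble(
    learning_set: List[Sample], n: int, fraction: float = 0.8, seed: int = 0
) -> Tuple[List[Discriminator], List[Discriminator]]:
    """Draw n random sub-data sets and train one first and one second discriminator on each."""
    rng = random.Random(seed)
    size = max(1, int(len(learning_set) * fraction))  # sub-data set size (assumed fraction)
    firsts: List[Discriminator] = []
    seconds: List[Discriminator] = []
    for _ in range(n):
        sub = rng.sample(learning_set, size)  # random resampling without replacement
        firsts.append(train_centroid([(a, d) for a, d, _ in sub]))   # discrete data only
        seconds.append(train_centroid([(a, c) for a, _, c in sub]))  # continuous data only
    return firsts, seconds


# Hypothetical toy learning set: allele counts for two SNPs, levels of two substances.
learning_set: List[Sample] = [
    ("onset", [2, 1], [9.0, 4.0]),
    ("onset", [2, 2], [8.5, 4.2]),
    ("onset", [1, 2], [9.2, 3.9]),
    ("healthy", [0, 0], [2.0, 1.0]),
    ("healthy", [0, 1], [2.4, 1.1]),
    ("healthy", [1, 0], [1.9, 0.8]),
]
firsts, seconds = train_ensemble(learning_set, n=5)
```

Because each of the N rounds sees a different random subset, the N first discriminators and N second discriminators generally differ from one another, which is what makes the later vote over their results meaningful.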
- FIG. 2 is a conceptual diagram for explaining the outline of the physiological state discriminating apparatus according to this embodiment.
- First, subject data is prepared for a test individual whose physiological state attribute is unknown (such as a patient who has visited a hospital and is suspected of developing glaucoma).
- The subject data includes a combination of discrete data on the base sequence of the genome obtained from the test individual (such as the number of alleles at each SNP) and continuous data on the amount of a specific substance in the body of that individual (such as blood cytokine concentration).
- This subject data is analyzed by the N first discriminators and N second discriminators obtained by the machine learning explained with FIG. 1, yielding N first discrimination results and N second discrimination results.
- Each discrimination result assigns an attribute (onset / healthy, progressive / non-progressive, good prognosis / poor prognosis, etc.) of the physiological state (glaucoma onset, progression, prognosis, etc.).
- These discrimination results are subtotaled for each attribute of the physiological state, and the subtotals are then integrated for each attribute to calculate an integrated result.
- The attribute of the physiological state with the largest number of determinations is taken as the attribute of the physiological state of the test individual.
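- The subtotaling and integration described for FIG. 2 amount to counting votes. The following Python sketch assumes the discrimination results are plain attribute labels; the function name and the example results are hypothetical, and the text does not specify how ties are resolved.

```python
from collections import Counter
from typing import List, Tuple


def integrate(first_results: List[str], second_results: List[str]) -> Tuple[str, Counter]:
    """Subtotal each discriminator type per attribute, integrate, and pick the top attribute."""
    first_subtotal = Counter(first_results)    # votes from the N first discriminators
    second_subtotal = Counter(second_results)  # votes from the N second discriminators
    integrated = first_subtotal + second_subtotal
    # most_common(1) raises IndexError on empty input; among tied attributes it
    # returns the one that entered the counter first (tie handling is an assumption).
    attribute, _ = integrated.most_common(1)[0]
    return attribute, integrated


# Hypothetical results for N = 5 sub-data sets.
first_results = ["onset", "onset", "healthy", "onset", "healthy"]
second_results = ["onset", "healthy", "onset", "onset", "onset"]
attribute, totals = integrate(first_results, second_results)
```

In this example the first discriminators vote 3 to 2 and the second discriminators 4 to 1 for "onset", so the integrated totals are 7 for "onset" and 3 for "healthy", and "onset" is the determined attribute.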
- An operator who sees this determination can advise the subject to obtain a definitive diagnosis from a specialist ophthalmologist.
- Among the attributes of physiological states, "progressive" and "non-progressive" are defined as follows. Progressive: an individual in whom a particular disease progresses particularly quickly. Non-progressive: an individual who has the disease but in whom it is not progressive. Needless to say, the attributes of the physiological state are not limited to the forms listed above and may, for example, take the form progressive / healthy.
- When the learning data set is constructed for the physiological state discriminating apparatus according to this embodiment, analysis results from a glaucoma diagnostic chip, a custom DNA chip carrying SNPs related to glaucoma, are preferably used as the discrete data on the base sequence of the genome of each individual. As the continuous data on the amount of a specific substance in the body of each individual, analysis results from an exhaustive measurement method for blood cytokines are preferably used. The physiological state discriminating apparatus according to this embodiment is thus suitably used in predictive diagnosis of the onset, progression, and prognosis of glaucoma.
- The present inventors acquired candidate SNPs by genome-wide association analysis of primary open-angle glaucoma (in the broad sense), then used a custom chip to select optimal SNPs and delimited regions using LD blocks in order to identify disease-related genes ("Three susceptible loci associated with primary open-angle glaucoma identified by genome-wide association study in a Japanese population", Masakazu Nakano et al., PNAS, August 4, 2009, vol. 106, no. 31, 12838-12842).
- Having obtained candidate SNPs by genome-wide association analysis of primary open-angle glaucoma (in the broad sense), the present inventors also applied their know-how in SNP analysis to whole-genome and candidate-gene association analysis.
- The present inventors succeeded in developing the above-mentioned glaucoma diagnostic chip by drawing on these research results. By using this glaucoma diagnostic chip, the physiological state discriminating apparatus according to this embodiment is therefore suitably used for predictive diagnosis of the onset, progression, and prognosis of glaucoma.
- CBA: cytometric bead array
- The physiological state determination device is suitably used in predictive diagnosis of glaucoma, such as its onset, progression, and prognosis.
- The present inventors have developed an algorithm for predictive diagnosis of glaucoma onset, progression, prognosis, and the like by integrating genotype data obtained with a DNA chip and blood cytokine data obtained with modified proteomics.
- At the examination stage of this algorithm, the present inventors applied a wide variety of existing statistical analysis and machine learning methods (principal component analysis, discriminant analysis, SVM, etc.) in order to select useful methods and grasp the characteristics of the data.
- The present inventors examined effective analysis methods for genotype data and for cytokine data separately, and finally integrated the results to examine whether the overall diagnostic accuracy could be improved.
- FIG. 3 is a conceptual diagram illustrating the data input and output of the physiological state discriminating apparatus 1000 according to the present embodiment.
- The physiological state discriminating apparatus 1000 is configured to receive a learning data set and subject data as input and to output the result of the integrated determination.
- The physiological state discriminating apparatus 1000 can perform this operation because it has the specific configuration described below.
- FIG. 4 is a functional block diagram for explaining the configuration of the physiological state discriminating apparatus 1000 according to the present embodiment.
- the physiological state discriminating apparatus 1000 is an apparatus for discriminating attributes of physiological states such as the onset, progression, and prognosis of glaucoma in mammals including humans.
- the physiological state discriminating apparatus 1000 acquires a learning data set for acquiring a learning data set related to an individual group consisting of a plurality of individuals used for machine learning, which will be described later, acquired from a population consisting of individuals of the same type as the test individual. Part 102 is provided.
- This population date set includes a combination of attributes of an individual's physiological state, discrete data regarding the base sequence of the individual's genome, and continuous data regarding the amount of a specific substance in the body of the individual.
- the physiological state discriminating apparatus 1000 further includes a resampling unit 106 that extracts sub-data sets relating to a plurality of different sub-individual groups that constitute a part of the individual group from the learning data set.
- This sub-data set includes a combination of attributes of the physiological state of each individual included in the sub-population, discrete data regarding the base sequence of each individual's genome, and continuous data regarding the amount of a specific substance in each individual's living body.
- the physiological state discriminating apparatus 1000 includes a first machine learning unit 108 that performs machine learning on the physiological state attributes and discrete data patterns included in the plurality of sub-data sets.
- the first machine learning unit 108 is configured to obtain a plurality of different first discriminators for discriminating the attribute of the physiological state of each individual included in the plurality of sub-data sets based on discrete data.
- the physiological state discriminating apparatus 1000 includes a second machine learning unit 110 that performs machine learning on physiological state attributes and continuous data patterns included in the plurality of sub-data sets.
- the second machine learning unit 110 is configured to obtain a plurality of different second discriminators for discriminating the attribute of the physiological state of each individual included in the plurality of sub-data sets based on continuous data.
- the physiological state discriminating apparatus 1000 includes a test data set acquisition unit 104 that acquires subject data consisting of discrete data and continuous data related to the test individual.
- This subject data includes a combination of discrete data relating to the base sequence of the individual's genome and continuous data relating to the amount of a specific substance in the individual's body.
- the subject data acquired by the test data set acquisition unit 104 is sent to the test data analysis unit 112 described later.
- the physiological state discriminating apparatus 1000 includes a test data analyzing unit 112 that performs pattern analysis on the subject data a plurality of times using a plurality of first discriminators and second discriminators.
- the test data analysis unit 112 is configured to generate the first discrimination result and the second discrimination result of the physiological state attribute of the test individual a plurality of times.
- the physiological state discriminating apparatus 1000 includes an integrated determination unit 114 that tallies the first discrimination result and the second discrimination result for each physiological state attribute and makes an integrated determination that the physiological state attribute discriminated most frequently across the first and second discrimination results is the attribute of the physiological state of the test individual.
- the physiological state determination apparatus 1000 includes an output unit 116 that outputs the result of the integrated determination.
- the physiological state discriminating apparatus 1000 is provided with an image display unit 122 such as a liquid crystal display and an operation unit 124 such as a keyboard / mouse. Therefore, the operator of the physiological state determination device 1000 can input various data or commands to the physiological state determination device 1000 while referring to the image data displayed on the image display unit 122.
- the physiological state discriminating apparatus 1000 is connected, via a network 118 such as the Internet, LAN, WAN, or VPN, to a server 126 such as a file server and to a measuring apparatus 128 such as a DNA sequencer, a DNA chip, a PCR, an antibody chip, or a flow cytometer. Therefore, the physiological state discriminating apparatus 1000 can read the learning data set and the subject data from the server 126, or can read them directly as the measurement results of the measuring apparatus 128.
- the physiological state determination apparatus 1000 is connected, via a network 118 such as the Internet, LAN, WAN, or VPN, to an image display unit 130 such as a liquid crystal display, a printer 132 such as a laser printer or an inkjet printer, and a server 134 such as a file server. Therefore, the physiological state discriminating apparatus 1000 can output the result of the integrated determination from the output unit 116 and display it on the image display unit 130 as image data, print the image data with the printer 132, or store it in the server 134 as data in various formats.
- since the physiological state discriminating apparatus 1000 has the specific configuration described above, the resampling unit 106 can create a plurality of different sub-data sets, each constituting a part of the learning data set obtained via the learning data set acquisition unit 102. Then, for each sub-data set, the physiological state discriminating apparatus 1000 can create two types of discriminators by machine learning in the first machine learning unit 108 and the second machine learning unit 110, using data from two different points of view: discrete data relating to the base sequences of the genomes of the individuals constituting the sub-data set, and continuous data relating to the amount of a specific substance in the living bodies of those individuals.
- the physiological state discriminating apparatus 1000 receives the two types of subject data relating to the test individual, separately acquired via the test data set acquisition unit 104, in a state where two types of discriminators exist for each of the plurality of different sub-data sets.
- the test data analysis unit 112 can then perform pattern analysis on the subject data using these discriminators. As a result, two types of discrimination results are obtained for the test individual for each of the plurality of different sub-data sets, and each type of discrimination result is subtotaled over the plurality of different sub-data sets.
- the integrated determination unit 114 combines these subtotals and makes an integrated determination that the physiological state attribute with the largest combined value is the attribute of the physiological state of the test individual.
- the physiological state determination apparatus 1000 outputs the integrated determination result from the output unit 116. Therefore, according to the physiological state discriminating apparatus 1000, it is possible to accurately determine physiological state attributes such as the onset, progression, and prognosis of glaucoma in mammals including humans.
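- The subtotal-and-combine step described above can be sketched as a simple vote count. This is a minimal illustration and not the apparatus's actual implementation: the function name `integrated_determination`, the attribute labels, and the example results are all hypothetical, and ties are resolved arbitrarily (by first occurrence), which the text does not specify.

```python
from collections import Counter

def integrated_determination(first_results, second_results):
    """Return the attribute discriminated most often across both result lists."""
    # Subtotal each type of discrimination result over the sub-data sets.
    first_subtotal = Counter(first_results)
    second_subtotal = Counter(second_results)
    # Combine the two subtotals per physiological state attribute.
    combined = first_subtotal + second_subtotal
    # most_common(1) yields the (attribute, count) pair with the largest count.
    attribute, _votes = combined.most_common(1)[0]
    return attribute, dict(combined)

# Hypothetical discrimination results from five sub-data sets.
first = ["onset", "onset", "non-onset", "onset", "non-onset"]   # genotype-based
second = ["onset", "non-onset", "onset", "onset", "onset"]      # cytokine-based
label, tally = integrated_determination(first, second)
print(label, tally)
```

- In this example the combined tally is 7 votes for "onset" and 3 for "non-onset", so "onset" is the integrated result.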
- FIG. 5 is a conceptual diagram for explaining the genotype data used in the physiological state discriminating apparatus of the present embodiment.
- as the genotype data (discrete data relating to the base sequence of the individual's genome), data relating to gene polymorphisms or variants are used. This is because, as shown in the examples described later, when determining physiological state attributes such as the onset, progression, and prognosis of glaucoma, comprehensively examining the gene polymorphisms related to those attributes and using them as genotype data improves the discrimination accuracy of the physiological state attributes.
- gene polymorphism refers to a mutation of a gene present at a frequency of 1% or more of the population.
- variant refers to a mutation of a gene that exists at a frequency of less than 1% of the population.
- causes of gene polymorphisms or variants include, for example, various mutations that occur within a species, such as “substitutions” in which a base is replaced with another base, “deletions” in which bases are lost, “insertions” in which bases are added, and “duplications”, as well as genetic recombination.
- SNPs in which one base is replaced with another base are considered useful as individualized markers for genetic background.
- this genotype data is data relating to SNPs. This is because, as shown in the examples described later, SNPs are the most efficient and effective of the mammalian polymorphisms related to physiological state attributes such as the onset, progression, and prognosis of glaucoma. Therefore, exhaustively examining SNPs and using them as genotype data further improves the accuracy of determining the physiological state attribute.
- genome analysis using the Affymetrix GeneChip (R) Human Mapping 500K Array chip (Affy500k) is performed as the first stage analysis of genotype data.
- as the second stage, a reproducibility check analysis is performed using a custom chip (iSelect) that is built with the iSelect (TM) Custom Infinium (TM) Genotyping system of Illumina and is organized around the SNPs that were significant in the first stage.
- Quality Control filtering narrowed the 500,568 SNPs of Affy500k down to 331,838 SNPs. SNPs with P < 0.001 in the allele frequency chi-square test were then extracted, narrowing the set to 255 SNPs. Subsequently, the 223 SNPs that were successfully mounted on iSelect were filtered by Quality Control and narrowed down to 216 SNPs. Further, SNPs with a Cochran-Mantel-Haenszel chi-square test P value < 0.01 and a Heterogeneity (Cochran's Q) chi-square test P value ⁇ 0.05 were extracted, narrowing the set to 40 SNPs. Then, SNPs with D′ > 0.9 in Haploview 4.1, which is linkage disequilibrium analysis software, were excluded as belonging to the same LD, and finally 29 SNPs were selected as analysis targets.
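- One step of this narrowing, the allele frequency chi-square test, can be sketched as follows. This is only an illustration under stated assumptions: the allele counts and SNP names are invented, the test is a plain Pearson chi-square on a 2x2 table of allele counts (case/control by allele), and selecting SNPs with a P value below 0.001 is assumed from context. The Quality Control, Cochran-Mantel-Haenszel, heterogeneity, and LD steps are not reproduced.

```python
import math

def allele_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = ((case_a + case_b) * (control_a + control_b)
                   * (case_a + control_a) * (case_b + control_b))
    chi2 = numerator / denominator
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def select_snps(allele_counts, threshold=0.001):
    """Keep SNPs whose allele frequency differs between groups at P < threshold."""
    return [snp for snp, counts in allele_counts.items()
            if allele_chi_square(*counts)[1] < threshold]

# Hypothetical counts: (case allele A, case allele B, control allele A, control allele B).
counts = {
    "snp_a": (120, 80, 80, 120),   # clear frequency difference between groups
    "snp_b": (100, 100, 102, 98),  # almost no difference
}
selected = select_snps(counts)
print(selected)
```

- With these invented counts, only `snp_a` (chi-square statistic of 16) passes the threshold. The sketch does not guard against a zero margin in the table, which a real pipeline would need to handle.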
- FIG. 6 is a conceptual diagram for explaining the digitization of genotype data used in the physiological state discriminating apparatus of the present embodiment.
- the genotype data used in the physiological state discriminating apparatus of the present embodiment is data that is normalized for each individual based on the gene polymorphism or SNP allele frequency.
- This normalization method follows Price et al., Nat Genet. 2006 Aug; 38(8): 904-9, as shown in detail in FIG. 6.
- correction of missing values can be performed during the normalization.
- This genotype data is derived from the results of analysis by molecular biological methods using DNA sequencers (including conventional DNA sequencers based on the Sanger method (1980 Nobel Prize in Chemistry) and next-generation sequencers based on sequencing technology that is completely different from the Sanger method), DNA microarrays, or nucleic acid amplification methods including the PCR method (for example, the TaqMan PCR method, RFLP, etc.). This is because, when attempting to comprehensively examine the above gene polymorphisms or SNPs on a genome-wide basis, it is advantageous in terms of efficiency, accuracy, and cost to examine them using these measuring devices.
- the analysis results obtained from these measuring devices may be read directly into the physiological state discriminating apparatus 1000, or may be read into the physiological state discriminating apparatus 1000 after first being stored in a server or a storage medium. It is preferable to store the data once in a server or a storage medium.
- genotype data is acquired in the above-described manner, and first, appropriate SNPs are selected from basic statistical analysis results.
- the obtained genotype data is digitized to create a matrix of (number of samples) × (number of SNPs).
- various analyses (principal component analysis, discriminant analysis, SVM, etc.) are performed on the digitized genotype data matrix. For details, refer to the following description.
- FIG. 7 is a functional block diagram for explaining the configuration of the learning data set acquisition unit 102 of the physiological state determination apparatus 1000 of the present embodiment.
- the learning data set acquisition unit 102 is provided with a genotype data digitizing unit 802 that converts genotype data into numeric data.
- the genotype data digitizing unit 802 is provided with a numerical value converting unit 804 that converts the acquired genotype data into a preset numerical value.
- the numerical value conversion unit 804 is connected to the risk allele data storage unit 806.
- the risk allele data storage unit 806 stores a risk allele database including information related to risk alleles and non-risk alleles.
- the numerical value conversion unit 804 refers to the genotype data and the risk allele database and assigns a numerical value to each genotype. For example, for a predetermined site included in the genotype data, a numerical value of 2 is assigned if the individual is homozygous for the risk allele, 1 if the individual is heterozygous, and 0 if the individual is homozygous for the non-risk allele. In this case, the correction of missing values can be handled by the normalization method already described in FIG.
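- The 2 / 1 / 0 coding described above amounts to counting risk alleles in each genotype. The following is a minimal sketch; the function name `encode_genotype`, the two-letter genotype strings, and the contents of `risk_allele_db` are hypothetical stand-ins for the genotype data and the risk allele database.

```python
def encode_genotype(genotype, risk_allele):
    """Count risk alleles: 2 = risk homozygote, 1 = heterozygote, 0 = non-risk homozygote."""
    if genotype is None:
        return None  # missing call, left for the normalization step to correct
    return sum(1 for allele in genotype if allele == risk_allele)

# Hypothetical risk allele database and one individual's genotype calls.
risk_allele_db = {"snp_1": "A", "snp_2": "T", "snp_3": "C"}
sample = {"snp_1": "AG", "snp_2": "TT", "snp_3": None}

encoded = {snp: encode_genotype(call, risk_allele_db[snp]) for snp, call in sample.items()}
print(encoded)
```

- Repeating this for every individual and SNP gives the (number of samples) by (number of SNPs) numeric matrix used in the later analyses.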
- the learning data set acquisition unit 102 is provided with an allele frequency calculation unit 808 that calculates the appearance frequency of each allele in the genotype data included in the learning data set.
- the allele frequency calculation unit 808 calculates the allele frequency at each SNP so that the appearance frequencies of its alleles sum to 1, and determines which allele at each SNP is the major allele.
- the appearance frequency of each allele calculated in this way is temporarily stored in the allele frequency storage unit 807 and can be referred to from the outside as needed.
- the learning data set acquisition unit 102 is also provided with an average value calculation unit 809 that calculates an average value at which each allele appears in the genotype data included in the learning data set.
- the average value of the appearance of each allele thus calculated is temporarily stored in the average value storage unit 811 and can be referred to from the outside as needed.
- the learning data set acquisition unit 102 is provided with a normalization unit 810 that normalizes numerical data obtained by the numerical value conversion unit 804 based on the allele frequency calculated by the allele frequency calculation unit 808.
- a risk allele may be determined with reference to the difference in the allele frequency between the onset group and the control group or between the onset group and the non-onset group.
- the allele frequency basically changes as the learning data set used for analysis changes, for example when the total amount of learning data increases through changes, updates, or additions.
- the allele frequency calculation unit 808 is configured so that the risk allele can be updated in accordance with the update of the learning data set.
- normalization includes transforming what is not a normal form (a fixed form having desirable properties for operations such as comparison and calculation) into a normal form.
- proportional conversion can be performed so that the mean square is 1, or linear conversion can be performed so that the average is 0 and the variance is 1.
- the normalization method shown in FIG. 6 is particularly excellent.
- the genotype data used in the physiological state discriminating apparatus 1000 of the present embodiment is preferably data in which the above-described gene polymorphism or SNP alleles are converted to numerical values by the numerical value conversion unit 804 and then normalized for each individual by the normalization unit 810 based on the allele frequency calculated by the allele frequency calculation unit 808. This is because calculating the allele frequency of a gene polymorphism or SNP and quantifying the appearance frequency of each allele makes it possible to evaluate quantitatively how much the pattern of SNPs in the genome of the individual deviates from the general pattern.
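- A frequency-based normalization of the 0 / 1 / 2 codes might look like the sketch below. Because FIG. 6 is not reproduced in this text, the exact formula is an assumption modeled on the normalization in the cited Price et al. paper: each SNP is centered on its mean, divided by sqrt(p(1 - p)) where p is a corrected allele frequency estimate, and missing values are set to 0 after centering. The function name `normalize_snp` is hypothetical.

```python
import math

def normalize_snp(codes):
    """Normalize one SNP's 0/1/2 codes across individuals (None marks a missing value)."""
    observed = [c for c in codes if c is not None]
    n = len(observed)  # assumed to be at least 1
    mean = sum(observed) / n
    # Allele frequency estimate with a small-sample correction (assumed form).
    p = (1 + sum(observed)) / (2 + 2 * n)
    scale = math.sqrt(p * (1 - p))
    # Missing values become 0, which is the mean of the centered data.
    return [0.0 if c is None else (c - mean) / scale for c in codes]

codes = [2, 1, 0, None, 1]   # hypothetical codes for five individuals at one SNP
normalized = normalize_snp(codes)
print(normalized)
```

- With this scaling, a given difference in risk allele count weighs more at SNPs whose allele frequency is far from 0.5, which is one way of expressing how far an individual's pattern deviates from the general pattern.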
- the learning data set acquisition unit 102 is provided with a cytokine data standardization unit 812 that converts cytokine data into standardized data.
- the cytokine data standardization unit 812 is provided with a control group data extraction unit 814 that extracts control group data (for example, healthy person data) from cytokine data.
- the control group data extraction unit 814 is connected to a log conversion unit 816 that performs log conversion of the blood cytokine concentration for each type of cytokine.
- the Log conversion unit 816 prepares two types of data of only the control group of each cytokine, that is, the original value and the value obtained by performing Log conversion.
- the control group data extraction unit 814 and the log conversion unit 816 are connected to a normality test unit 818 that tests the normality of the original value and the Log conversion value and adopts the value closer to the normal distribution.
- the normality test unit 818 performs a normality test on each of the original value and the Log conversion value, and determines which value to use with reference to the p value of each cytokine.
- as the test of normality in the normality test unit 818, a method of comparing with a normal distribution curve, a method of evaluating by kurtosis and skewness, and the like can be suitably used.
- as the normality test method, for example, a skewness test, a kurtosis test, a skewness and kurtosis test, a Kolmogorov-Smirnov test, and the like can be used.
- the normality test unit 818 is connected to a standardization unit 820 that, for each cytokine, calculates an average value and a standard deviation from the control-group-only data of the adopted value (the original value or the Log conversion value) and performs standardization using the following equation.
- Standardized conversion value = (original value or Log conversion value - average value of control group) / (standard deviation of control group)
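- The choice between the original and Log conversion values, followed by standardization against the control group, can be sketched as below. Several details are assumptions made for illustration: the Jarque-Bera statistic stands in for the "skewness and kurtosis test", its P value uses the large-sample chi-square approximation (rough for small groups), the logarithm is base 10, the standard deviation is the sample (n - 1) form, and all concentrations are invented positive values.

```python
import math

def jarque_bera_p(values):
    """Approximate P value of a skewness-and-kurtosis (Jarque-Bera) normality test."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Chi-square with 2 degrees of freedom has upper tail exp(-x / 2).
    return math.exp(-jb / 2.0)

def standardize_cytokine(control_values, all_values):
    """Adopt the scale whose control data look more normal, then standardize."""
    log_control = [math.log10(v) for v in control_values]
    use_log = jarque_bera_p(log_control) > jarque_bera_p(control_values)
    reference = log_control if use_log else control_values
    mean = sum(reference) / len(reference)
    sd = math.sqrt(sum((v - mean) ** 2 for v in reference) / (len(reference) - 1))
    converted = [math.log10(v) if use_log else v for v in all_values]
    # (value - control mean) / control standard deviation, as in the equation above.
    return use_log, [(v - mean) / sd for v in converted]

control = [1.0, 10.0, 100.0, 1000.0, 10000.0]   # hypothetical control concentrations
used_log, z_scores = standardize_cytokine(control, [100.0, 1000.0])
print(used_log, z_scores)
```

- Because the mean and standard deviation come from the control group only, values measured on different days can be compared on the common scale of "distance from the controls".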
- Cytokine data used in the physiological state discriminating apparatus 1000 of the present embodiment is preferably acquired using a technique that can measure a large number of cytokines simultaneously, such as CBA.
- with a technique such as CBA, the overall tendency of the measured values may shift slightly from one measurement to another.
- in CBA, since the standard curve is reset every time measurement is performed, the range of possible values may change. Therefore, even for the same cytokine, it is not desirable to simply compare the concentration values obtained by such measurement between specimens with different experimental dates or experimental conditions. It is preferable to standardize against a reference that can be compared stably (for example, control group data) rather than to use the measured concentration values as they are for analysis, and the standardization method described above is adopted for this reason.
- the learning data set acquisition unit 102 may read out the learning data set from a population database that is provided inside or outside the physiological state discriminating apparatus 1000 and stores learning data sets relating to individual groups.
- the population database may be stored in the server 126 installed in a hospital or the like, and the learning data set acquisition unit 102 may read out the learning data set via the network 118 such as the Internet line.
- combinations of an attribute of an individual's physiological state, discrete data regarding the base sequence of the individual's genome, and continuous data regarding the amount of a specific substance in the individual's living body, obtained for new individuals of the same type as the test individual, may be added to the population database described above and updated as needed. That is, even when the population database is stored in a server 126 installed in a hospital or the like, genotype data, cytokine data, definitive diagnosis data, and the like acquired in the hospital or the like may be added and updated as needed.
- FIG. 8 is a functional block diagram for explaining the configuration of the resampling unit 106 of the physiological state discriminating apparatus 1000 of the present embodiment.
- the resampling unit 106 is provided with a random extraction unit 902 that randomly extracts sub-data sets from the learning data set. Therefore, the resampling unit 106 can randomly generate a large number of sub data sets including data of some individuals from a learning data set including data of a large number of individuals.
- learning by the first machine learning unit 108 and the second machine learning unit 110 which will be described later, can be performed using a large number of random sub data sets, so that the accuracy of these machine learning is improved.
- when the resampling unit 106 generates random sub-data sets, the same sub-data set may be generated more than once with a low probability. The resampling unit 106 may be configured to eliminate such duplicate sub-data sets.
- the resampling unit 106 is provided with an extraction counter 904 that controls the extraction process by the random extraction unit 902 so that it is repeated a predetermined number of times (for example, 10 times, 20 times, 30 times, 50 times, 100 times) set in advance according to the size of the learning data set. That is, the resampling unit 106 does not derive, from a statistical viewpoint, the number of times that is preferable for improving the accuracy of machine learning by the first machine learning unit 108 and the second machine learning unit 110. Instead, an appropriate number of times is set in advance according to the size of the learning data set to be input.
- the extraction counter 904 may be configured to terminate the extraction process by the random extraction unit 902 when the determination accuracy based on the test sample data described later is equal to or higher than a predetermined threshold value (if the predetermined threshold value cannot be reached, the process is terminated at a predetermined maximum number of extractions).
- the extraction can also be controlled so that a predetermined number of samples (for example, 10 samples, 20 samples, 30 samples, 50 samples, 100 samples, etc.), set in advance according to the size of the learning data set, is extracted each time. That is, by controlling the number of extractions and the number of extracted samples in this manner, an arbitrary resampling process can be performed, for example one in which 50 samples are resampled from 100 samples 20 times.
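- The example of resampling 50 samples from 100 samples 20 times, with duplicate sub-data sets eliminated, can be sketched as follows. The function name `resample`, the fixed random seed, and drawing without replacement inside each sub-data set are assumptions made for illustration.

```python
import random

def resample(sample_ids, n_extractions=20, n_samples=50, seed=0):
    """Draw n_extractions distinct random sub-data sets of n_samples individuals each."""
    rng = random.Random(seed)
    seen = set()
    sub_data_sets = []
    # Note: this loop assumes enough distinct subsets exist to fill the request.
    while len(sub_data_sets) < n_extractions:
        subset = tuple(sorted(rng.sample(sample_ids, n_samples)))
        if subset in seen:
            continue  # the same sub-data set was drawn again, so discard it
        seen.add(subset)
        sub_data_sets.append(subset)
    return sub_data_sets

learning_ids = list(range(100))      # hypothetical individual identifiers
subs = resample(learning_ids)
print(len(subs), len(subs[0]))
```

- Each returned tuple lists the individuals of one sub-data set, from which the first and second discriminators for that sub-data set would then be learned.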
- the resampling unit 106 is provided with a test sample extraction unit 906 for extracting test sample data.
- This test sample data is used for verifying the accuracy of determining the attribute of the physiological state by a first discriminator and / or a second discriminator described later. Therefore, the test sample extraction unit 906 can verify the accuracy of the determination of the attribute of the physiological state by the first discriminator and / or the second discriminator described later.
- it is thereby also possible to verify which analysis engine, such as a principal component analysis engine, a discriminant analysis engine, or an SVM engine, is the optimum analysis engine for the first discriminator and / or the second discriminator.
- as the test sample data extracted by the test sample extraction unit 906, a sample that is included in all sub-data sets generated by the random extraction unit 902 for learning by the first machine learning unit 108 and the second machine learning unit 110 may be extracted.
- when attempting to discriminate physiological state attributes for a human disease such as glaucoma, samples that have both discrete data relating to the base sequence of the individual's genome and continuous data relating to the amount of a specific substance in the individual's living body are generally scarce, so diagnostic ability must be improved with a limited amount of data. Therefore, in the physiological state discriminating apparatus 1000 of the present embodiment, the resampling unit 106 repeats resampling to create a large number of sub-data sets and analyzes the sub-data sets individually. Capturing the data from various angles in this way improves the discrimination ability.
- FIG. 9 is visual data for explaining the principle of principal component analysis used in the physiological state discriminating apparatus of this embodiment.
- the principal component analysis includes an analysis method for obtaining the overall characteristics of a plurality of variables.
- This principal component analysis eliminates the correlation between the variables of quantitative data described by many variables and reduces the data to a small number of uncorrelated composite variables with as little information loss as possible.
- This principal component analysis technique was proposed by Hotelling around 1933 (from Satoshi Kinmei, “Data Science by R” p. 66, Morikita Publishing). There are many types of functions (analysis engines) that perform principal component analysis.
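- To make the idea concrete, the first principal component can be obtained as the dominant eigenvector of the covariance matrix. The sketch below uses plain power iteration in Python rather than any of the existing analysis engines; the data, the function name `first_principal_component`, and the iteration count are illustrative assumptions, and the all-ones starting vector would fail in the rare case where it is orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the leading principal axis and the score of each row along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Power iteration: multiply by the covariance matrix and renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical data: two strongly correlated variables.
rows = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
axis, scores = first_principal_component(rows)
print(axis, scores)
```

- Here the two correlated variables collapse onto a single composite variable (the scores) with little loss of information, which is the kind of reduction plotted in a principal component scatter diagram.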
- FIG. 10 is visual data for explaining an analysis example of genotype data by principal component analysis used in the physiological state discriminating apparatus of this embodiment.
- an examination example of application of principal component analysis to the onset discrimination use is shown. That is, a two-dimensional scatter diagram and a three-dimensional distribution diagram in the case where the principal component analysis is performed using SNPs having a significant difference between the specimen groups of the glaucoma onset group and the non-onset group are shown.
- the analysis results are shown as an onset group: ⁇ and a non-onset group: +.
- FIG. 11 is a conceptual diagram illustrating the principle of discriminant analysis used in the physiological state discriminating apparatus of the present embodiment.
- discriminant analysis includes an analysis method in which a criterion for separating data whose group membership is known in advance is learned, and newly given data are then discriminated using the learned criterion.
- discriminators include a linear discriminator and a nonlinear discriminator (such as a function using the Mahalanobis distance).
- there are various functions (analysis engines) that perform discriminant analysis. For example, “lda” and “qda” provided in the R package “MASS”, and “mahalanobis” provided in the R library “stats”, can be preferably used.
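- The learn-then-discriminate flow can be illustrated with a two-group linear discriminant function. The sketch below is written in plain Python for two variables and is not a port of the R functions named above; the data, function names, and the equal-prior assumption (boundary at the midpoint of the group means) are illustrative only.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def fit_lda_2d(group_a, group_b):
    """Learn a linear discriminant function for two groups of 2-D samples."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    # Pooled within-group covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for rows, m in ((group_a, ma), (group_b, mb)):
        for x, y in rows:
            sxx += (x - m[0]) ** 2
            sxy += (x - m[0]) * (y - m[1])
            syy += (y - m[1]) ** 2
    dof = len(group_a) + len(group_b) - 2
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # w = inverse(pooled covariance) * (mean_a - mean_b)
    w = [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]
    midpoint = [(ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2]
    bias = -(w[0] * midpoint[0] + w[1] * midpoint[1])
    return w, bias

def discriminant_value(model, point):
    """Positive values fall on the group_a side, negative on the group_b side."""
    w, bias = model
    return w[0] * point[0] + w[1] * point[1] + bias

# Hypothetical learning data for an onset group and a control group.
onset = [(2.0, 2.5), (3.0, 3.5), (2.5, 2.0), (3.5, 3.0)]
control = [(0.0, 0.5), (1.0, 1.5), (0.5, 0.0), (1.5, 1.0)]
model = fit_lda_2d(onset, control)
print(discriminant_value(model, (3.0, 3.0)), discriminant_value(model, (0.5, 1.0)))
```

- The sign of the discriminant function value for a new specimen then indicates which group it is assigned to.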
- FIG. 12 is visual data for explaining an analysis example of genotype data by discriminant analysis used in the physiological state discriminating apparatus of this embodiment.
- This figure shows an example of applying discriminant analysis to onset discrimination. That is, for each specimen in the glaucoma onset group and the control group, the measurement result by Affy500k is prepared as the first stage and the measurement result by iSelect is prepared as the second stage. A discriminant function is created from the first-stage data as “learning data”, the discriminant function value of each specimen is calculated from the second-stage data as “test data” for confirmation, and each case is discriminated based on the calculated value.
- FIG. 13 is a conceptual diagram for explaining the principle of SVM used in the physiological state discriminating apparatus of the present embodiment.
- the SVM includes an analysis method for mapping data that is difficult to classify into a space that can be classified by a kernel function and calculating a discriminant plane that maximizes the margin (distance) between the data.
- with this SVM, it becomes possible to cope with data having any pattern by choosing an appropriate kernel function.
- there are various functions (analysis engines) for performing SVM. For example, “ksvm” provided in the R package “kernlab” can be used with its kernel set to “rbfdot”, “polydot”, “vanilladot”, or the like.
- FIG. 14 is visual data for explaining an analysis example of genotype data by SVM used in the physiological state discriminating apparatus of this embodiment.
- an example of calculation by SVM is shown. Specifically, the measurement result of Affy500k is learned, and the measurement result of iSelect is used as a test sample for estimation.
- the SVM was trained so that the score of each specimen is near -1 for the Case (onset group) and near +1 for the Control (control group); the closer the score is to 0, the closer the specimen is to the Control pattern.
- when SVM is used, the first-stage data, which are difficult to classify as they are, can be transformed, and the discrimination boundary surface that separates them best can be learned. Then, using the second-stage data, the distance from the discrimination boundary surface can be used as a score, and the two groups can be discriminated based on whether the score is positive or negative.
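- Scoring by signed distance from the boundary can be illustrated with a small linear SVM. This sketch is not the “ksvm” engine: it uses no kernel mapping (roughly the linear case) and fits the boundary by simple subgradient descent on the hinge loss. The data, function names, and hyperparameters are assumptions, with Case labeled -1 and Control labeled +1 as in the text.

```python
def train_linear_svm(samples, labels, epochs=20, learning_rate=0.1, regularization=0.01):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term always shrinks w slightly.
            w = [wi * (1.0 - learning_rate * regularization) for wi in w]
            if margin < 1.0:
                # Inside the margin or misclassified: move the boundary.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b

def svm_score(model, x):
    """Signed score: negative suggests Case, positive suggests Control."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical first-stage (learning) data.
cases = [(-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
controls = [(2.0, 2.0), (3.0, 1.0), (1.0, 3.0)]
model = train_linear_svm(cases + controls, [-1] * 3 + [1] * 3)

# Hypothetical second-stage (test) specimens.
print(svm_score(model, (-2.5, -2.0)), svm_score(model, (2.0, 2.5)))
```

- The first test specimen receives a negative score and the second a positive one, so they are discriminated as Case and Control respectively.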
- FIG. 15 is a functional block diagram illustrating the configuration of the first machine learning unit 108 of the physiological state determination apparatus 1000 according to the present embodiment.
- the first machine learning unit 108 is provided with a first statistical analysis unit 602 that performs one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, SVM, factor analysis, cluster analysis, multiple regression analysis, decision tree, naive Bayes classifier, artificial neural network, Markov chain Monte Carlo method, Gibbs sampler, and SOM.
- the first statistical analysis unit 602 preferably performs at least one statistical analysis method selected from the group consisting of principal component analysis, discriminant analysis, and SVM among these.
- the first machine learning unit 108 is provided with a statistical analysis engine storage unit 208 that stores various statistical analysis engines, such as a principal component analysis engine 210, a discriminant analysis engine 212, an SVM engine 214, and other engines (engines for performing factor analysis, cluster analysis, multiple regression analysis, decision tree, naive Bayes classifier, artificial neural network, Markov chain Monte Carlo method, Gibbs sampler, and SOM).
- the first statistical analysis unit 602 performs SVM 100 times, once for each of 100 resampled data sets.
- the number of types of statistical analysis methods is not limited as long as it is one or more; it may be 2, 3, 4, 5, 6, 7, 8, 9, 10, or 11 or more, or any number within a range between two of these values.
- the first machine learning unit 108 is provided with a first accuracy verification unit 606 that verifies the discrimination accuracy of the test data discrimination result based on, for example, 100 SVM learning results.
- the test sample data can be acquired from a test sample extraction unit 906 provided in the resampling unit 106.
- it is thereby possible to determine how accurately discrimination is possible when using each of the engines for performing the statistical analysis methods described above: the principal component analysis engine 210, the discriminant analysis engine 212, the SVM engine 214, and other engines (engines for performing factor analysis, cluster analysis, multiple regression analysis, a decision tree, a naive Bayes classifier, an artificial neural network, a Markov chain Monte Carlo method, a Gibbs sampler, an SOM, and the like).
- the first machine learning unit 108 is provided with a first statistical analysis method selection unit 614.
- based on the verification result by the first accuracy verification unit 606, the first statistical analysis method selection unit 614 adopts the one statistical analysis method with the highest discrimination accuracy among one or more statistical analysis methods selected from the group consisting of the principal component analysis engine 210, the discriminant analysis engine 212, the SVM engine 214, and other engines (engines for factor analysis, cluster analysis, multiple regression analysis, decision trees, naive Bayes classifiers, artificial neural networks, Markov chain Monte Carlo methods, Gibbs samplers, and SOM).
- the number of types of statistical analysis methods is not limited as long as it is one or more; it may be 2, 3, 4, 5, 6, 7, 8, 9, 10, or 11 or more, or any number within a range between two of these values.
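The verification and selection described above can be sketched as follows. This is an illustrative outline only: the function names, the test-sample labels, and the per-method predictions are hypothetical, and discrimination accuracy is assumed here to be the simple fraction of test samples discriminated correctly.

```python
def accuracy(predicted, actual):
    """Fraction of test samples whose predicted attribute matches the true one."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

def select_best_method(predictions_by_method, actual):
    """Return (method name, accuracy) for the most accurate method."""
    scores = {name: accuracy(pred, actual)
              for name, pred in predictions_by_method.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical test-sample attributes and per-method discrimination results.
actual = ["Case", "Case", "Control", "Control"]
predictions_by_method = {
    "principal component analysis": ["Case", "Control", "Case", "Control"],
    "discriminant analysis": ["Case", "Case", "Case", "Control"],
    "SVM": ["Case", "Case", "Control", "Control"],
}
best_method, best_accuracy = select_best_method(predictions_by_method, actual)
```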
- the first machine learning unit 108 is provided with a first discriminator parameter generation unit 616 that generates a discriminator based on, for example, 100 SVM learning results.
- using the statistical analysis method with the highest discrimination accuracy, selected by the first statistical analysis method selection unit 614 from among the various statistical analysis methods performed by the first statistical analysis unit 602, the first discriminator parameter generation unit 616 generates a mathematical first discriminator.
- the plurality of first discriminators obtained for each of the plurality of sub-data sets is transmitted to the test data analysis unit 112 described later and used for analysis of the test data.
- the continuous data of this embodiment is data regarding the blood cytokine concentration of an individual, as will be described later. That is, the results of blood cytokine concentration measurement by CBA are used as the continuous data. The measurement principle of this blood cytokine concentration measurement is as follows.
- a large number of beads whose surfaces are coated with capture antibodies, each specific to a soluble protein such as a target cytokine, are used, and the beads have a different fluorescence intensity for each capture antibody. This enables simultaneous multi-item measurement of cytokines.
- 1. A blood sample collected from a specimen is centrifuged to obtain a plasma sample. 2. The plasma sample is reacted with the capture antibodies on the bead surfaces. Furthermore, the detection antibodies, each labeled with phycoerythrin dye (PE), are reacted. Using a flow cytometer, the type of antigen is determined from the fluorescence intensity of the beads, and the amount of each antigen is measured from the fluorescence intensity of the PE labeling the detection antibody.
- such measurement can be performed by labeling the beads of two colors at various ratios and determining the positions of the beads.
- by acquiring data derived from the blood analysis results of the individual and using it as continuous data, the continuous data necessary for the analysis can be acquired quickly, accurately, and efficiently.
- the continuous data necessary for the analysis can be acquired quickly, accurately, and efficiently in the same way by acquiring data derived from the individual's blood analysis results using an antibody chip carrying an array of antibodies that specifically bind to cytokines, and using it as continuous data.
- FIGS. 16 and 17 are conceptual diagrams for explaining the cytokine data used in the physiological state discriminating apparatus of the present embodiment.
- sample information used for acquiring cytokine data is shown. That is, in order to obtain cytokine data, 42 specimens were prepared as a glaucoma onset group, and 42 specimens were prepared as a control group.
- the concentration of 29 cytokines in plasma was measured using CBA.
- the number of types of the above cytokines in blood is not limited as long as it is one or more; it may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, or 28 or more, all 29 may be used, or it may be any number within a range between two of these values.
- FIG. 18 is a conceptual diagram for explaining cytokine data used in the physiological state discriminating apparatus of the present embodiment. That is, the three items considered useful for diagnosis as a result of the first stage were measured in the same way in another sample group for reproducibility confirmation, Case (onset group) 73 vs. Control (control group) 52, and statistical analysis was performed.
- samples used for the measurement of these cytokines are all included in the samples used for the genotype Affy500k.
- the cytokine data thus obtained is standardized from the original data, based on the data of the control group, in the cytokine data standardization unit 812 of the learning data set acquisition unit 102 already described with reference to FIG. Then, the cytokines to be used for the analysis are selected, and in the second machine learning unit 110 described later, the same various analyses as for the genotype data (principal component analysis, discriminant analysis, SVM (support vector machine), factor analysis, cluster analysis, multiple regression analysis, decision tree, naive Bayes classifier, artificial neural network, Markov chain Monte Carlo method, Gibbs sampler, and SOM) are performed on the standardized cytokine data. For details, refer to the following description.
- FIG. 19 is visual data for explaining an analysis example of cytokine data by principal component analysis used in the physiological state discriminating apparatus of the present embodiment.
- the blood cytokine concentration shown in FIG. 19 is measured, and the principal component analysis is also performed together with the result of classification of the attribute of the physiological state about the presence or absence and progress of glaucoma by the definitive diagnosis by the doctor.
- This figure is prepared based on the first-stage and second-stage specimen data described above, the onset group versus the control group, and the three items of cytokines.
- FIGS. 20 and 21 are visual data for explaining an example of analysis of cytokine data by discriminant analysis / SVM used in the physiological state discriminating apparatus of the present embodiment.
- FIGS. 20 and 21 show results of test data estimation using patterns extracted from discriminant analysis / SVM learning data. Specifically, in discriminant analysis, a discriminant function is created from the 1st Stage data, the discriminant function value of each specimen is calculated from the 2nd Stage data, and the case is discriminated based on that value.
- with the SVM, the first-stage data is learned, and the SVM parameter settings for discriminating the second-stage data are determined by grid search.
- a discriminator is created from the first-stage data, the discriminator value of each sample is calculated from the second-stage data, and the case is discriminated based on that value.
- the first-stage data is learned to determine the second-stage data.
- the parameter setting of the SVM is determined by grid search.
- any of principal component analysis, discriminant analysis, SVM, and the other engines (engines for performing factor analysis, cluster analysis, multiple regression analysis, a decision tree, a naive Bayes classifier, an artificial neural network, a Markov chain Monte Carlo method, a Gibbs sampler, an SOM, and the like) can be suitably used.
- FIG. 22 is a functional block diagram illustrating the configuration of the second machine learning unit 110 of the physiological state determination apparatus 1000 according to the present embodiment.
- the second machine learning unit 110 is provided with a second statistical analysis unit 702 that performs one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, SVM, factor analysis, cluster analysis, multiple regression analysis, decision tree, naive Bayes classifier, artificial neural network, Markov chain Monte Carlo method, Gibbs sampler, and SOM.
- the second machine learning unit 110 is provided with a statistical analysis engine storage unit 208 for storing a principal component analysis engine 210, a discriminant analysis engine 212, an SVM engine 214, and other engines (engines for performing factor analysis, cluster analysis, multiple regression analysis, a decision tree, a naive Bayes classifier, an artificial neural network, a Markov chain Monte Carlo method, a Gibbs sampler, an SOM, and the like).
- the second statistical analysis unit 702 reads any one analysis engine among the principal component analysis engine 708, the discriminant analysis engine 710, the SVM engine 712, and other engines (engines for performing factor analysis, cluster analysis, multiple regression analysis, a decision tree, a naive Bayes classifier, an artificial neural network, a Markov chain Monte Carlo method, a Gibbs sampler, an SOM, and the like) from the statistical analysis engine storage unit 208, and machine-learns the patterns of the physiological state attributes and the discrete data included in the plurality of sub-data sets.
- the number of types of statistical analysis methods is not limited as long as it is one or more; it may be 2, 3, 4, 5, 6, 7, 8, 9, 10, or 11 or more, or any number within a range between two of these values.
- the second machine learning unit 110 is also provided with a second accuracy verification unit 706 that verifies the discrimination accuracy of the sample analysis result obtained by pattern-analyzing, using the second discriminator, the test sample data randomly extracted from the learning data set.
- the test sample data can be acquired from a test sample extraction unit 906 provided in the resampling unit 106.
- it is thereby possible to determine which analysis engine discriminates with the highest accuracy among the principal component analysis engine 210, the discriminant analysis engine 212, the SVM engine 214, and other engines (engines for performing factor analysis, cluster analysis, multiple regression analysis, a decision tree, a naive Bayes classifier, an artificial neural network, a Markov chain Monte Carlo method, a Gibbs sampler, an SOM, and the like).
- the second machine learning unit 110 is provided with a second statistical analysis method selection unit 714.
- based on the verification result by the second accuracy verification unit 706, the second statistical analysis method selection unit 714 adopts the statistical analysis method with the highest discrimination accuracy among one or more statistical analysis methods selected from the group consisting of the principal component analysis engine 210, the discriminant analysis engine 212, the SVM engine 214, and other engines (engines for factor analysis, cluster analysis, multiple regression analysis, decision trees, naive Bayes classifiers, artificial neural networks, Markov chain Monte Carlo methods, Gibbs samplers, and SOM).
- the number of types of statistical analysis methods is not limited as long as it is one or more; it may be 2, 3, 4, 5, 6, 7, 8, 9, 10, or 11 or more, or any number within a range between two of these values.
- the second machine learning unit 110 is provided with a second discriminator parameter generation unit 716.
- using the statistical analysis method with the highest discrimination accuracy, selected by the second statistical analysis method selection unit 714 from among the various statistical analysis methods performed by the second statistical analysis unit 702, the second discriminator parameter generation unit 716 generates a mathematical second discriminator.
- the plurality of second discriminators obtained for each of the plurality of sub data sets are transmitted to the test data analysis unit 112 described later and used for analysis of the test data.
- FIG. 23 is a functional block diagram illustrating the configuration of the test data set acquisition unit 104 of the physiological state determination apparatus 1000 according to the present embodiment.
- the test data set acquisition unit 104 is configured to acquire subject data related to the test individual, including a combination of discrete data related to the individual's genetic polymorphism and continuous data related to the blood cytokine concentration of the individual.
- the test data set acquisition unit 104 is provided with a data conversion unit 401 that digitizes and / or normalizes the test subject data in the same manner as the learning data set.
- the data conversion unit 401 is provided with a genotype data conversion unit 402 that digitizes and / or normalizes genotype data included in the acquired subject data.
- the genotype data conversion unit 402 is provided with a learning data set conversion formula acquisition unit 404 that acquires the method of quantification and / or normalization in the learning data set from the learning data set acquisition unit 102.
- the genotype data conversion unit 402 is provided with a digitizing / normalizing conversion unit 410 that digitizes and / or normalizes the genotype data included in the subject data, using the method of quantification and / or normalization in the learning data set acquired in this way.
- the learning data set conversion formula acquisition unit 404 acquires, from the allele frequency calculation unit 808 in the learning data set acquisition unit 102, the data necessary for normalization according to the distribution of the learning data set (allele frequency information and average value information in the learning data set for each SNP).
- the data conversion unit 401 is provided with a cytokine data conversion unit 412 that digitizes and / or normalizes cytokine data included in the acquired subject data.
- because continuous values such as cytokine concentrations can be handled in the various analyses in the same manner as the values of the learning data set once they have been normalized as values on the standard normal distribution, there is no need to obtain any data or conversion formula from the learning data set. That is, in CBA, measurement is performed at one time on a plurality of samples (basically several tens of samples) rather than on a single sample. For this reason, in each measurement experiment, data of at least a control group serving as a reference should be obtained at the same time. Therefore, normalization can be performed without using the learning data set.
- the cytokine data conversion unit 412 does not need to acquire anything from the learning data set. Instead, it is provided with a control group data extraction unit 414 that extracts the control group data in the test data set and an extracted data processing unit 420 that calculates the average value and the standard deviation of the control group.
- an extracted data storage unit may also be provided that extracts only the control group once from the test data set (multiple individuals), calculates the standard deviation and the average value, and temporarily stores them locally. By doing so, the average value and standard deviation stored in advance can be loaded to normalize a single individual input to the cytokine data conversion unit 412, and in the process of normalizing all the individuals of a certain test data set (multiple individuals), the step of calculating the standard deviation and the average value each time can be omitted.
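The control-group-based normalization with stored statistics can be sketched as follows. The function names and concentration values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population form is used.

```python
import statistics

def control_stats(values, groups):
    """Mean and standard deviation of the control-group measurements only."""
    control = [v for v, g in zip(values, groups) if g == "Control"]
    return statistics.mean(control), statistics.stdev(control)

def normalize(value, mean, sd):
    """Express one measurement as a z-score relative to the control group."""
    return (value - mean) / sd

# Hypothetical concentrations of one cytokine measured in a single CBA run.
values = [10.0, 12.0, 14.0, 20.0, 8.0]
groups = ["Control", "Control", "Control", "Case", "Case"]

# Computed once per test data set and kept, then reused for each individual.
stored_mean, stored_sd = control_stats(values, groups)
z_scores = [normalize(v, stored_mean, stored_sd) for v in values]
```

Storing `stored_mean` and `stored_sd` once corresponds to the temporary local storage described above: each individual is then normalized without recomputing the control statistics.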
- this series of flows may be further expanded so that all the test data sets used in the past are stored (after anonymization in view of ethics), and the standard deviation and average value for normalization, calculated empirically from those test data sets, are loaded depending on the range of the input values.
- for example, one normalization parameter set may be used if the input value of cytokine A is less than 50, and another normalization parameter set may be used if the value is 50 or more and less than 100.
- FIG. 24 is a conceptual diagram illustrating the function of the integrated determination unit 114 of the physiological state determination device 1000 according to the present embodiment.
- determination of physiological state attributes is performed by integrating individual determination results instead of numerical values.
- the two analyzes are integrated with reference to the bagging technique.
- as step 1, processing modeled on the bagging technique is performed using genotype data; as step 2, processing modeled on the bagging technique is performed using cytokine data; and as step 3, the results of steps 1 and 2 are integrated and a final decision is made by majority vote. The experiment was performed under the following conditions.
- as learning data, the same number of specimens (20 each) was randomly selected from the 42 glaucoma specimens and the 42 healthy-subject specimens of the 1st stage (the same group as the first stage of cytokine, having Affy500k genotype data).
- as test data, the 73 glaucoma specimens and 52 healthy-subject specimens of the 2nd stage were used (the same group as the second stage of cytokine, having Affy500k genotype data).
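The balanced random selection of learning data can be sketched as follows. The specimen identifiers are hypothetical, sampling without replacement within each draw is an assumption, and the fixed seed is used only to make the sketch repeatable. The count of 501 draws follows the number of genotype resamplings used in the integration experiment.

```python
import random

def resample_balanced(case_ids, control_ids, n_per_group, rng):
    """Draw the same number of specimens, without replacement, from each group."""
    return rng.sample(case_ids, n_per_group), rng.sample(control_ids, n_per_group)

# Hypothetical specimen identifiers for the 1st-stage groups (42 each).
case_ids = [f"case_{i:02d}" for i in range(42)]
control_ids = [f"control_{i:02d}" for i in range(42)]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sub_data_sets = [resample_balanced(case_ids, control_ids, 20, rng)
                 for _ in range(501)]
```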
- FIG. 25 is visual data for explaining the result of integrating genotype data and cytokine data by the integrated determination unit 114 of the physiological state determination apparatus 1000 of the present embodiment.
- resampling followed by learning and estimation was performed 501 times for genotype and 500 times for cytokine, and discrimination was performed by majority over the two sets of results. That is, among the 1001 resampling determination results, the attribute determined more often is taken as the final physiological state attribute.
- the individual diagnosis rates for genotype (501 times) and cytokine (500 times) were both 67.2%, whereas the diagnosis rate after integration improved to 74.4%.
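The majority decision over the pooled resampling results can be sketched as follows. The vote counts for the single test individual are hypothetical and chosen only for illustration. Because the pooled total is 1001, an odd number, a tie between two attributes cannot occur.

```python
from collections import Counter

def integrate_by_majority(genotype_votes, cytokine_votes):
    """Pool per-resampling determinations and return the most frequent attribute."""
    counts = Counter(genotype_votes) + Counter(cytokine_votes)
    attribute, _ = counts.most_common(1)[0]
    return attribute, counts

# Hypothetical determinations for one test individual.
genotype_votes = ["Case"] * 240 + ["Control"] * 261   # 501 genotype results
cytokine_votes = ["Case"] * 300 + ["Control"] * 200   # 500 cytokine results
final_attribute, counts = integrate_by_majority(genotype_votes, cytokine_votes)
```

In this illustration the genotype results alone would favor Control, but the pooled count favors Case, which shows how integration can change the final determination.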
- FIG. 26 is visual data for explaining an integration result of genotype data and cytokine data by the integration determination unit 114 of the physiological state determination apparatus 1000 according to the present embodiment.
- when the proportions correctly discriminated in the resampling process are plotted, it can be seen that the density of the plot is highest near the vertex where the integrated discrimination rate is 100%. That is, integrating the individual discrimination results for genotype (501 times) and cytokine (500 times) improves the discrimination accuracy.
- FIG. 27 is a functional block diagram illustrating the configuration of the test data analysis unit 112 of the physiological state determination apparatus 1000 according to the present embodiment.
- the test data analysis unit 112 includes a first discriminator parameter acquisition unit 202 that acquires a first discriminator from the first machine learning unit 108.
- the test data analysis unit 112 is provided with a second discriminator parameter acquisition unit 204 that acquires a second discriminator from the second machine learning unit 110.
- the test data analysis unit 112 is provided with an optimal analysis method application unit 206 that applies, for each of the first discriminators and second discriminators, the statistical analysis method having the highest discrimination accuracy among one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, SVM, factor analysis, cluster analysis, multiple regression analysis, decision tree, naive Bayes classifier, artificial neural network, Markov chain Monte Carlo method, Gibbs sampler, and SOM.
- the number of types of statistical analysis methods is not limited as long as it is one or more; it may be 2, 3, 4, 5, 6, 7, 8, 9, 10, or 11 or more, or any number within a range between two of these values.
- the test data analysis unit 112 is provided with a statistical analysis engine storage unit 208 for storing a principal component analysis engine 210, a discriminant analysis engine 212, an SVM engine 214, and other engines (engines for performing factor analysis, cluster analysis, multiple regression analysis, a decision tree, a naive Bayes classifier, an artificial neural network, a Markov chain Monte Carlo method, a Gibbs sampler, an SOM, and the like).
- the optimal analysis method application unit 206 reads, from the statistical analysis engine storage unit 208, the analysis engine necessary for analysis using the first discriminator and the second discriminator acquired by the first discriminator parameter acquisition unit 202 and the second discriminator parameter acquisition unit 204 (the principal component analysis engine 210, the discriminant analysis engine 212, the SVM engine 214, or another engine for performing factor analysis, cluster analysis, multiple regression analysis, a decision tree, a naive Bayes classifier, an artificial neural network, a Markov chain Monte Carlo method, a Gibbs sampler, an SOM, or the like), and transfers it to the discriminator application unit 218.
- the test data analysis unit 112 is provided with a converted data set acquisition unit 216 that acquires the converted subject data acquired by the test data set acquisition unit 104 and digitized and normalized by the same method as the learning data set. In addition, the test data analysis unit 112 is provided with a discriminator application unit 218 that performs pattern analysis on the subject data using a plurality of different first discriminators and second discriminators one or more times each, and generates a first discrimination result and a second discrimination result for the physiological state attribute of the test individual.
- the test data analysis unit 112 obtains the first discrimination result based on the genotype data and the second discrimination result based on the cytokine data for any of the many subset data provided.
- the first discrimination results based on the genotype data and the second discrimination results based on the cytokine data for these many subset data are collected into two types of data sets by the first discrimination result generation unit 220 and the second discrimination result generation unit 222, respectively, and transferred to the integrated determination unit 114 described later.
- FIG. 28 is a functional block diagram illustrating the configuration of the integrated determination unit 114 of the physiological state determination apparatus 1000 according to the present embodiment.
- the integrated determination unit 114 is provided with a first determination result acquisition unit 302 that acquires a first determination result based on genotype data from the test data analysis unit 112.
- the integrated determination unit 114 includes a second determination result acquisition unit 304 that acquires a second determination result based on cytokine data from the test data analysis unit 112.
- the integrated determination unit 114 is provided with a subtotal calculation unit 306 that subtotals the number of times the test data is determined to have a specific physiological state attribute in the first determination result and the second determination result.
- the subtotal calculation unit 306 is provided with a first subtotal calculation unit 308 that calculates a subtotal of the first determination result based on the genotype data.
- the subtotal calculation unit 306 is provided with a second subtotal calculation unit 310 that calculates a subtotal of the second determination result based on the cytokine data.
- the integrated determination unit 114 is provided with a total calculation unit 314 that calculates, for each physiological state attribute, the sum of the respective subtotal results of the first determination result based on the genotype data and the second determination result based on the cytokine data.
- the integrated determination unit 114 is further provided with a weight parameter application unit 312 that weights each subtotal result of the first determination result based on the genotype data and the second determination result based on the cytokine data with a predetermined parameter before the total is obtained. Further, the integrated determination unit 114 is provided with an integrated parameter storage unit 318 connected to the weight parameter application unit 312.
- the integrated parameter storage unit 318 stores a weight parameter database 320 that stores the weight parameters considered optimal at the present time, based on information on discrimination accuracy such as test results based on test sample data or past discrimination results. Further, the integrated parameter storage unit 318 stores an integrated calculation formula database 322 that stores an integrated calculation formula for integrating the subtotal results of the first subtotal calculation unit 308 and the second subtotal calculation unit 310 using the weight parameter.
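The weighting and totaling of the subtotals can be sketched as follows. The text does not give the integrated calculation formula, so a plain weighted sum is assumed here; the subtotal counts and the weight values are hypothetical.

```python
def weighted_totals(first_subtotals, second_subtotals, w_first, w_second):
    """Weighted sum of genotype and cytokine subtotals per physiological state attribute."""
    attributes = set(first_subtotals) | set(second_subtotals)
    return {a: w_first * first_subtotals.get(a, 0)
               + w_second * second_subtotals.get(a, 0)
            for a in attributes}

def integrated_determination(totals):
    """Attribute with the largest weighted total."""
    return max(totals, key=totals.get)

# Hypothetical subtotals (counts of determinations) and weight parameters.
first_subtotals = {"Case": 240, "Control": 261}    # genotype-based
second_subtotals = {"Case": 300, "Control": 200}   # cytokine-based
totals = weighted_totals(first_subtotals, second_subtotals, 1.0, 0.5)
result = integrated_determination(totals)
```

Changing the weights can change the outcome: with a cytokine weight of 0.0, the same subtotals would yield Control instead of Case.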
- the integrated determination unit 114 is provided with a test sample data acquisition unit that acquires a sample analysis result obtained by having the test data analysis unit 112 process the test sample data randomly extracted from the learning data set.
- the integrated determination unit 114 is provided with a sample subtotal calculation unit 328 that obtains a subtotal result based on genotype data and a subtotal result based on cytokine data for the sample analysis result thus obtained.
- the integrated determination unit 114 is provided with a random parameter calculation unit 324 that randomly calculates a plurality of weight parameters. The integrated determination unit 114 is also provided with a sample total calculation unit that, after weighting by the random weight parameters calculated in this manner, calculates the total of the respective sample subtotal results for each physiological state attribute. Further, the integrated determination unit 114 is provided with a sample integration determination unit 332 that determines, for each sample individual included in the test sample data, that the physiological state attribute counted most often in the sample total result is the physiological state attribute of that sample individual. The integrated determination unit 114 is also provided with a weight parameter selection unit 334 that aggregates, for each weight parameter, the determination accuracy of the integration determination results for the sample individuals and adopts the weight parameter with the highest determination accuracy.
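The random weight search can be sketched as follows. This is an illustration under stated assumptions: the two weights are constrained to sum to 1, the integration is a weighted sum, and the per-individual sample subtotals, given as ((genotype Case, genotype Control), (cytokine Case, cytokine Control)), are hypothetical.

```python
import random

def decide(sub, w_first, w_second):
    """Weighted determination for one sample individual."""
    (g_case, g_control), (c_case, c_control) = sub
    case = w_first * g_case + w_second * c_case
    control = w_first * g_control + w_second * c_control
    return "Case" if case > control else "Control"

def select_weights(sample_subtotals, true_attributes, n_candidates, rng):
    """Try random weight pairs and keep the one with the highest accuracy."""
    best = None
    for _ in range(n_candidates):
        w_first = rng.random()
        w_second = 1.0 - w_first
        hits = sum(decide(s, w_first, w_second) == t
                   for s, t in zip(sample_subtotals, true_attributes))
        acc = hits / len(true_attributes)
        if best is None or acc > best[0]:
            best = (acc, w_first, w_second)
    return best

# Hypothetical sample subtotals and known attributes of three sample individuals.
sample_subtotals = [
    ((400, 101), (100, 400)),
    ((100, 401), (300, 200)),
    ((260, 241), (450, 50)),
]
true_attributes = ["Case", "Control", "Case"]
best_accuracy, best_w_first, best_w_second = select_weights(
    sample_subtotals, true_attributes, 50, random.Random(0))
```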
- the integrated determination unit 114 selects the weighting parameter that appears optimal based on the test sample discrimination results obtained using the test sample extraction unit 906 of the resampling unit 106, and the total calculation unit 314 can perform the integrated determination by applying that weighting parameter. And the attribute of the physiological state with the largest number of discrimination
- FIG. 29 is a functional block diagram illustrating the configuration of the output unit 116 of the physiological state determination apparatus 1000 according to the present embodiment.
- the output unit 116 is provided with an output data generation unit 500 that generates a data set for the integration determination result by the integration determination unit 114.
- the output data generation unit 500 is provided with a test individual specifying data generation unit 502 for generating data for specifying the test individual. Further, the output data generation unit 500 is provided with an integration determination data generation unit 504 for generating data indicating the result of integration determination.
- the output data generation unit 500 is provided with a prediction determination accuracy data generation unit 506 for generating data indicating predicted determination accuracy.
- the output unit 116 is provided with an image data generation unit 508 that generates image data indicating the contents of the data set for the integrated determination result generated by the output data generation unit 500.
- the image data generated by the image data generation unit 508 may be displayed on the image display unit 130 via the network 120 such as a LAN or the Internet, may be printed by the printer 132, or may be written to the server 134.
- FIG. 30 is a flowchart for explaining the genotype data analysis operation of the physiological state determination apparatus 1000 according to this embodiment.
- the learning data set acquisition unit 102 first receives genotype data input (S102).
- the genotype data input in this way is simply digitized in A, T, C, and G in the genotype data digitizing unit 802 of the learning data set acquisition unit 102 (S104).
- similarly, in the normalization unit 810, the genotype data of the SNP is normalized and missing values are corrected (S110). The processes of S108 and S110 are repeated for the number of SNPs (S106).
- the genotype data normalized in this way is resampled by the resampling unit 106 from Case (glaucoma) and Control (healthy person), respectively (S114).
- the first machine learning unit 108 performs pattern learning (discriminant analysis, SVM, etc.) on the plurality of sub-data sets resampled in this way (S116). Further, the learning result subjected to the pattern learning in this way is transferred from the first machine learning unit 108 to the test data analysis unit 112 and stored once (S118). Further, the processes of S114, S116, and S118 are repeated N + 1 times (S112), and the series of operations is terminated.
- FIG. 31 is a flowchart for explaining the cytokine data analysis operation of the physiological state discriminating apparatus of this embodiment.
- the learning data set acquisition unit 102 first receives an input of cytokine data (S202).
- the cytokine data input in this way is subjected to log conversion in the log conversion unit 816 of the learning data set acquisition unit 102 (S206).
- in the log conversion unit 816, conversion by common logarithm may be performed, conversion by natural logarithm may be performed, or conversion using another base may be performed.
- the original value and the Log value of the cytokine data obtained in this way are subjected to normality test in the normality determination unit 818 of the learning data set acquisition unit 102 (S208), and the original value is more normal. If it is higher, the original value is adopted (S210), and if the Log value has higher normality, the Log value is adopted (S212).
- the processing of S206, S208, S210, and S212 is repeated for each cytokine (S204), and the control group data extraction unit 814 of the learning data set acquisition unit 102 extracts only the control group data from the population data. Then, the average value and standard deviation are calculated (S214). Using the average value and standard deviation obtained in this way, the standardization unit of the learning data set acquisition unit 102 normalizes (standardizes) all data (S216).
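The choice between the original value and the Log value (S206 to S212) can be sketched as follows. The text does not name the normality test, so the Jarque-Bera statistic is used here purely as a stand-in measure (smaller means closer to normal); the common logarithm and the concentration values are likewise assumptions made for illustration.

```python
import math
import statistics

def jarque_bera(values):
    """Jarque-Bera statistic; smaller values indicate a shape closer to normal."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

def choose_scale(values):
    """Return ('original' or 'log', values on the scale that looks more normal)."""
    logged = [math.log10(v) for v in values]
    if jarque_bera(values) <= jarque_bera(logged):
        return "original", list(values)
    return "log", logged

# Hypothetical, right-skewed concentrations of one cytokine (all positive).
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
scale, chosen = choose_scale(raw)
```

The values on the chosen scale would then be standardized with the control-group average and standard deviation, as in S214 and S216.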
- the resampling unit 106 resamples the same number from Case (glaucoma) and Control (healthy person) (S220). Then, pattern learning (discriminant analysis, SVM, etc.) is performed in the second machine learning unit 110 for the plurality of sub-data sets resampled in this way (S222). Further, the learning result subjected to pattern learning in this way is transferred from the second machine learning unit 110 to the test data analysis unit 112 and stored once (S224). Further, the processes of S220, S222, and S224 are repeated N times (S218), and the series of operations is terminated.
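The cytokine preprocessing in S206 to S216 can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: the text does not say which normality test unit 818 uses, so absolute skewness is used here as a crude stand-in, and the function names and sample values are hypothetical.

```python
import math
import statistics


def skewness(values):
    """Population skewness; a crude stand-in for a formal normality test."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def choose_scale(values):
    """Return ("log", log10 values) if they look more normal, else ("raw", values)."""
    if any(v <= 0 for v in values):
        return "raw", list(values)  # the log is undefined for non-positive values
    logged = [math.log10(v) for v in values]
    if abs(skewness(logged)) < abs(skewness(values)):
        return "log", logged
    return "raw", list(values)


def standardize_by_control(values, is_control):
    """Z-score every value with the mean and SD of the control group only."""
    control = [v for v, c in zip(values, is_control) if c]
    mean = statistics.fmean(control)
    sd = statistics.stdev(control)
    return [(v - mean) / sd for v in values]


# Hypothetical concentrations for one cytokine: four controls, then two cases.
concentrations = [1.0, 10.0, 100.0, 1000.0, 10.0, 100.0]
is_control = [True, True, True, True, False, False]
scale, values = choose_scale(concentrations)
z_scores = standardize_by_control(values, is_control)
```

Standardizing against the control group alone means a case value is expressed as a distance from the healthy baseline, which is what lets the same transformation be reused later on subject data.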
- FIG. 32 is a flowchart for explaining the test data analysis operation of the physiological state discriminating apparatus of this embodiment.
- The test data acquisition unit 104 receives an input of genotype data (S302).
- The genotype data input in this way are normalized in the genotype data conversion unit 402 of the test data acquisition unit 104, using the allele frequencies and mean values obtained in S108 (S304). Reusing the S108 values rests on the approximation that the characteristics of the learning data set are nearly the same as those of the genome in general.
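The reuse of learning-set statistics in S304 can be sketched as below. This is an assumption-laden illustration in the spirit of allele-frequency normalization as published by Price et al.; the exact formula used in S108 is not given in this section, and the function names and genotype values are hypothetical. Genotypes are coded as allele counts 0, 1, or 2, with `None` for a missing call.

```python
import math


def fit_snp_stats(train_counts):
    """Per-SNP mean and smoothed allele frequency from learning-set genotypes."""
    observed = [g for g in train_counts if g is not None]
    n = len(observed)
    mean = sum(observed) / n
    # Smoothed estimate keeps p strictly between 0 and 1 (assumed, per Price et al.).
    p = (1 + sum(observed)) / (2 + 2 * n)
    return mean, p


def normalize(count, mean, p):
    """Normalize one genotype; a missing call is imputed as 0 after centering."""
    if count is None:
        return 0.0
    return (count - mean) / math.sqrt(p * (1 - p))


# Statistics come from the learning data set only ...
train = [0, 1, 2, 1, None, 2]
mean, p = fit_snp_stats(train)
# ... and are then applied unchanged to a subject's genotype.
subject_value = normalize(2, mean, p)
```

Setting a missing call to 0 after centering is equivalent to imputing the learning-set mean, which is one simple way to correct missing values while keeping the data numeric.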
- The test data acquisition unit 104 receives an input of cytokine data (S306).
- The cytokine data input in this way are converted into numerical data by the cytokine data conversion unit 412 of the test data acquisition unit 104, using the same quantification and normalization technique as for the learning data set (S308).
- The discriminator application unit 218 of the test data analysis unit 112 discriminates the genotype data of the subject data (S312).
- It is then determined whether the result is Case (glaucoma) (S314).
- If the result is a glaucoma determination (Case determination), +1 point is given to the Case determination (S316).
- If the result is a healthy determination (Control determination), +1 point is given to the Control determination (S318).
- Steps S312, S314, S316, and S318 are repeated N + 1 times (S310).
- The discriminator application unit 218 of the test data analysis unit 112 then discriminates the cytokine data of the subject data (S322).
- It is determined whether the result is Case (glaucoma) (S324).
- If the result is a glaucoma determination, +1 point is given to the Case determination (S326).
- If the result is a Control determination, +1 point is given to the Control determination (S328).
- Steps S322, S324, S326, and S328 are repeated N times (S320).
- The genotype analysis is repeated N + 1 times and the cytokine analysis N times for the following reason. If both processes were weighted 1:1 and each repeated N times, the final tally could be N:N, in which case neither Case nor Control could be determined. The total number of trials is therefore made odd rather than even so that either a Case or a Control determination is always obtained, and the extra trial is assigned to the genotype analysis, which is considered the more reliable of the two.
- The genotype data determination results and the cytokine data determination results are integrated by the integrated determination unit 114, which compares the Case determination count with the Control determination count (S330). If the Case count is larger, the subject is determined to be Case (glaucoma); if the Control count is larger, the subject is determined to be Control (healthy).
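The tally in S310 to S330 can be sketched as a plain majority vote. This is a minimal sketch with hypothetical names; it assumes each individual determination is recorded as the string "Case" or "Control".

```python
from collections import Counter


def integrate_votes(genotype_calls, cytokine_calls):
    """Majority vote over N + 1 genotype calls and N cytokine calls."""
    tally = Counter(genotype_calls) + Counter(cytokine_calls)
    case, control = tally["Case"], tally["Control"]
    if case == control:
        # Cannot happen when the total number of calls is odd.
        raise ValueError("tie: total number of calls must be odd")
    return ("Case" if case > control else "Control"), case, control


# N = 2: three genotype calls and two cytokine calls, five votes in total.
label, case_count, control_count = integrate_votes(
    ["Case", "Control", "Case"],
    ["Control", "Case"],
)
```

With 2N + 1 votes and two possible labels, the two counts can never be equal, which is the property the N + 1 / N split is designed to guarantee.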
- FIG. 33 is a functional block diagram for explaining a modification of the present embodiment.
- The physiological state discriminator parameter generation device 1100 is a device that generates the discriminators used in the physiological state discrimination method described in the above flowcharts.
- The physiological state discriminator parameter generation device 1100 includes a learning data set acquisition unit 1102. This unit acquires a learning data set on a group of individuals that is obtained from a population of individuals of the same species as the test individual and is used for the machine learning described later.
- The learning data set includes a combination of the physiological state attribute of each individual, discrete data on the base sequence of the individual's genome, and continuous data on the amount of a specific substance in the individual's body.
- The physiological state discriminator parameter generation device 1100 includes a resampling unit 1106 that extracts, from the learning data set, sub-data sets for a plurality of different sub-groups of individuals, each forming part of the group.
- Each sub-data set extracted by the resampling unit 1106 includes a combination of the physiological state attribute of each individual in the sub-group, discrete data on the genome base sequence of each individual, and continuous data on the amount of a specific substance in the body of each individual.
- The physiological state discriminator parameter generation device 1100 includes a first machine learning unit 1108 that performs machine learning on the patterns of physiological state attributes and discrete data included in the plurality of sub-data sets.
- The first machine learning unit 1108 obtains a plurality of different first discriminators, each of which discriminates the physiological state attribute of the individuals in a sub-data set on the basis of the discrete data.
- The physiological state discriminator parameter generation device 1100 includes a second machine learning unit 1110 that performs machine learning on the patterns of physiological state attributes and continuous data included in the plurality of sub-data sets.
- The second machine learning unit 1110 obtains a plurality of different second discriminators, each of which discriminates the physiological state attribute of the individuals in a sub-data set on the basis of the continuous data.
- The physiological state discriminator parameter generation device 1100 includes an output unit 1111 that outputs the first discriminators and the second discriminators.
- The physiological state discriminator parameter generation device 1100 is provided with an image display unit 1122 such as a liquid crystal display and an operation unit 1124 such as a keyboard and mouse. The operator can therefore input various data or commands to the device while referring to the image data displayed on the image display unit 1122.
- The physiological state discriminator parameter generation device 1100 is connected, via a network 1118 such as the Internet, a LAN, a WAN, or a VPN, to a server 1126 such as a file server and to a measuring device 1128 such as a DNA sequencer, a DNA chip, a PCR instrument, an antibody chip, or a flow cytometer.
- The physiological state discriminator parameter generation device 1100 can therefore read the learning data set and the subject data from the server 1126, or read the measurement results of the measuring device 1128 directly.
- A physiological state discriminating apparatus 1200 is connected to the physiological state discriminator parameter generation device 1100 via a network 1119 such as the Internet, a LAN, a WAN, or a VPN. The device 1100 can therefore output the first and second discriminators from the output unit 1111 and pass them to the discriminator parameter acquisition unit 1121 of the physiological state discriminating apparatus 1200.
- In the physiological state discriminator parameter generation device 1100, a plurality of different sub-data sets, each forming part of the initially acquired learning data set, are created first. Then, for each sub-data set, two types of discriminators are created by machine learning on data from two different viewpoints: discrete data on the genome base sequences of the individuals in the sub-data set, and continuous data on the amount of a specific substance in the bodies of those individuals. In this way a set of two types of discriminators that can accurately determine the physiological state attribute of a mammal is obtained.
- The physiological state discriminating apparatus 1200 is an apparatus for discriminating the physiological state attribute of an individual mammal.
- The physiological state discriminating apparatus 1200 includes a discriminator parameter acquisition unit 1121 that acquires the first and second discriminators generated by the physiological state discriminator parameter generation device 1100 described above.
- The physiological state discriminating apparatus 1200 also includes a subject data acquisition unit 1104 that acquires subject data on the test individual. The subject data comprise a combination of discrete data on the base sequence of the individual's genome and continuous data on the amount of the specific substance in the individual's body, both acquired from the test individual.
- The physiological state discriminating apparatus 1200 includes a test data analysis unit 1112 that performs pattern analysis on the subject data multiple times, using each of the plurality of first discriminators and second discriminators, and generates a first discrimination result and a second discrimination result for the physiological state attribute of the test individual multiple times each.
- The physiological state discriminating apparatus 1200 includes an integrated determination unit 1114 that aggregates the first and second discrimination results for each physiological state attribute and determines that the attribute most often discriminated across the first and second discrimination results is the physiological state attribute of the test individual.
- The physiological state discriminating apparatus 1200 includes an output unit 1116 that outputs the result of the integrated determination.
- The physiological state discriminating apparatus 1200 is provided with an image display unit 1142 such as a liquid crystal display and an operation unit 1144 such as a keyboard and mouse. The operator can therefore input various data or commands to the apparatus while referring to the image data displayed on the image display unit 1142.
- The physiological state discriminating apparatus 1200 is connected, via a network 1120 such as the Internet, a LAN, a WAN, or a VPN, to an image display unit 1130 such as a liquid crystal display, a printer 1132 such as a laser printer or an inkjet printer, and a server 1134 such as a file server. The apparatus can therefore output the result of the integrated determination from the output unit 1116 and display it as image data on the image display unit 1130, print the image data on the printer 1132, or store it on the server 1134 as data in various formats.
- The physiological state discriminating apparatus 1200 acquires the two types of discriminators generated by the physiological state discriminator parameter generation device 1100 and uses them to perform pattern analysis on the subject data of the test individual. Because this yields two types of discrimination results for each of the plurality of different sub-data sets, each type of result is subtotaled across the sub-data sets. The subtotals are then summed and integrated using an appropriate calculation formula, and the physiological state attribute with the largest total is determined to be the attribute of the test individual. The apparatus can therefore accurately determine the physiological state attribute of a mammal.
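The subtotal-then-total step can be illustrated with a weighted sum. The text only says that "an appropriate calculation formula" is used, so the linear weighting below, the parameter names `w_first` and `w_second`, and the attribute labels are all assumptions made for illustration.

```python
def integrated_determination(first_results, second_results, w_first=1.0, w_second=1.0):
    """Subtotal each result type per attribute, weight, sum, and pick the largest total."""
    attributes = sorted(set(first_results) | set(second_results))
    totals = {}
    for attr in attributes:
        first_subtotal = first_results.count(attr)    # from first discriminators
        second_subtotal = second_results.count(attr)  # from second discriminators
        totals[attr] = w_first * first_subtotal + w_second * second_subtotal
    # Ties, if any, fall to the alphabetically first attribute (arbitrary choice).
    best = max(attributes, key=totals.get)
    return best, totals


first = ["onset"] * 3 + ["healthy"] * 2    # hypothetical first discrimination results
second = ["healthy"] * 4 + ["onset"] * 1   # hypothetical second discrimination results
equal_weight_label, equal_weight_totals = integrated_determination(first, second)
weighted_label, _ = integrated_determination(first, second, w_first=4.0)
```

With equal weights this reduces to a simple majority vote; raising one weight expresses greater trust in that data type.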
- The analysis methods used in the first machine learning unit 108 and the second machine learning unit 110 are principal component analysis, discriminant analysis, and SVM.
- The methods are not limited to these three.
- Other analysis methods may be used.
- For example, factor analysis, cluster analysis, multiple regression analysis, and the like can be suitably used as multivariate analysis techniques other than principal component analysis.
- As pattern recognition and classification methods, a decision tree, a naive Bayes classifier, an artificial neural network, a Markov chain Monte Carlo method, a Gibbs sampler, a SOM (self-organizing map), and the like can be suitably used.
- In the above embodiment, the onset of glaucoma in humans was determined.
- The present invention is not limited to this disease; it can be suitably used to discriminate the onset, progression, prognosis, and so on of various other non-infectious diseases in humans. It can also be suitably used for various other kinds of discrimination.
- In the above embodiment, the onset / healthy attribute is determined for the physiological state of onset.
- The attribute is not limited to this one. The apparatus described in the above embodiment can also be suitably used to discriminate attributes of other physiological states such as infection, progression, and prognosis, namely infected / non-infected, progressive / non-progressive, good prognosis / poor prognosis, and the like.
- Even when the physiological state attribute included in the learning data set is infected / non-infected, progressive / non-progressive, or good prognosis / poor prognosis instead of onset / healthy, the determination can likewise be made with high accuracy.
- Diagnosis of glaucoma onset by the integrated determination method using genotype data and cytokine data. Glaucoma is one of the main causes of blindness, and both congenital genetic factors and acquired environmental factors are thought to contribute to its onset. The diagnostic performance of the method was therefore examined for primary open-angle glaucoma (POAG), a representative type of glaucoma, using genotype data, which are genetic information, and cytokine data, which reflect the acquired condition of the body.
- POAG: Primary Open-Angle Glaucoma
- Two independent data sets were prepared: Stage 1, with 42 POAG specimens and 42 healthy control specimens, and Stage 2, with 73 POAG specimens and 52 healthy control specimens. All of these specimens have both genotype data and cytokine data. Stage 1 is used to capture the characteristics of the disease by machine learning, and the Stage 2 specimens are diagnosed on the basis of the result.
- The SNPs extracted in the first stage were further analyzed in 409 POAG specimens and 448 healthy control specimens with a custom chip (iSelect) using Illumina's iSelect (TM) Custom Infinium (TM) Genotyping system. As the final stage, a combined analysis of the data from both stages was performed. SNPs with P < 0.01 in the Cochran-Mantel-Haenszel chi-square test and P ≥ 0.05 in the heterogeneity (Cochran's Q) chi-square test were extracted, finally yielding 40 SNPs strongly suggested to be associated with POAG.
- Blood cytokine concentration data were acquired in two stages using the Cytometric Bead Array (CBA) Flex Set System manufactured by Becton Dickinson, which can measure a large number of cytokines simultaneously.
- CBA: Cytometric Bead Array
- In the first stage, blood concentration data were obtained for the 42 POAG specimens and 42 healthy control specimens on a total of 29 items: IL-1β, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-10, IL-12p70, IL-13, MCP-1 (CCL2), MIP-1α (CCL3), MIP-1β (CCL4), RANTES (CCL5), Eotaxin (CCL11), MIG (CXCL9), basic-FGF, VEGF, G-CSF, GM-CSF, IFN-γ, Fas Ligand, TNF, IP-10, angiogenin, OSM, and LT-α.
- Items were excluded if the proportion of specimens with measurement failure was 5% or more (7 items), if the proportion of specimens with a measured value of 0.0 was 5% or more (14 items), or if the p-value of the t-test between the two groups was 5% or more (5 items). This finally narrowed the set down to 3 items.
- In the second stage, 73 newly prepared POAG specimens and 52 healthy control specimens were measured in order to perform additional analysis on the three items considered useful for diagnosis.
- The specimens used for cytokine data acquisition are the same as the specimens used in this experiment.
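The first two exclusion criteria can be sketched as a simple filter. This is a hedged illustration with hypothetical names and data; the third criterion (t-test p-value of 5% or more) is deliberately left out because it needs a statistics routine beyond this short example. A failed measurement is represented as `None`.

```python
def keep_item(measurements, max_fail_rate=0.05, max_zero_rate=0.05):
    """Keep a cytokine item only if failures and zero readings are both below 5%."""
    n = len(measurements)
    fail_rate = sum(m is None for m in measurements) / n
    zero_rate = sum(m == 0.0 for m in measurements if m is not None) / n
    # "5% or more" is excluded, so both rates must be strictly below the limit.
    return fail_rate < max_fail_rate and zero_rate < max_zero_rate


# 84 specimens, as in the first stage (42 + 42); readings are made up.
four_failures = [None] * 4 + [12.5] * 80   # 4/84 is about 4.8%: kept
five_failures = [None] * 5 + [12.5] * 79   # 5/84 is about 6.0%: excluded
five_zeros = [0.0] * 5 + [12.5] * 79       # 5/84 zero readings: excluded
```

Applying the quality filters before the between-group t-test means the significance test only sees items that were measured reliably.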
- The genotype data of the SNPs used for analysis were converted into discrete numerical values, with missing values corrected, with reference to a method of normalizing per individual based on SNP allele frequency (Price et al., Nat Genet. 2006 Aug;38(8):904-9). The cytokine data were likewise standardized by an original procedure that refers to the blood cytokine concentrations of the healthy control group, and were quantified as continuous values. These data were input to the statistical processing software "R" together with various library software. The developer of "R" is the "R Development Core Team", and version 2.10.1 was used. The version of the library "e1071" used for SVM is 1.5-22 (the same applies to the other examples described later).
- The diagnosis rate was higher with the integrated determination method than when the genotype data or the cytokine data were used individually to diagnose the specimens.
- Example 2: Diagnosis of progressive glaucoma by the integrated determination method using genotype data and cytokine data. Glaucoma is considered to have progressive and non-progressive types. For progressive / non-progressive glaucoma, the diagnostic performance of the method can be examined using genotype data, which are genetic information, and cytokine data, which reflect the acquired condition of the body.
- In this example, the physiological state attributes "progressive" and "non-progressive" are defined as follows. Progressive type: among individuals affected by a disease, those in whom the disease progresses particularly fast.
- Non-progressive type: among individuals affected by a disease, those that are not of the progressive type.
- As in Example 1, two independent data sets are prepared: Stage 1, with dozens of progressive glaucoma specimens and dozens of non-progressive glaucoma specimens, and Stage 2, likewise with dozens of progressive glaucoma specimens and dozens of non-progressive glaucoma specimens. All of these specimens have both genotype data and cytokine data. Stage 1 is used to capture the characteristics of the disease by machine learning, and the Stage 2 specimens are diagnosed on the basis of the result.
- SNPs: Single Nucleotide Polymorphisms
- Affy500k: Human Mapping 500K Array chip
- The SNPs extracted in the first stage are further analyzed in hundreds of progressive glaucoma specimens and hundreds of non-progressive glaucoma specimens with a custom chip (iSelect) using Illumina's iSelect (TM) Custom Infinium (TM) Genotyping system. As the final stage, a combined analysis of the data from both stages is performed. SNPs with P < 0.01 in the Cochran-Mantel-Haenszel chi-square test and P ≥ 0.05 in the heterogeneity (Cochran's Q) chi-square test are extracted, finally yielding SNPs strongly suggested to be associated with progressive glaucoma.
- Blood cytokine concentration data are acquired in two stages using the Cytometric Bead Array (CBA) Flex Set System manufactured by Becton Dickinson, which can measure a large number of cytokines simultaneously.
- CBA: Cytometric Bead Array
- In the first stage, blood concentration data are obtained for dozens of progressive glaucoma specimens and dozens of non-progressive glaucoma specimens on a total of 29 items, the largest set this CBA can measure simultaneously with good accuracy: IL-1β, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-10, IL-12p70, IL-13, MCP-1 (CCL2), MIP-1α (CCL3), MIP-1β (CCL4), RANTES (CCL5), Eotaxin (CCL11), MIG (CXCL9), basic-FGF, VEGF, G-CSF, GM-CSF, IFN-γ, Fas Ligand, TNF, IP-10, angiogenin, OSM, and LT-α.
- Items are excluded if the proportion of specimens with measurement failure is 5% or more, if the proportion of specimens with a measured value of 0.0 is 5% or more, or if the p-value of the t-test between the two groups is 5% or more. The set is finally narrowed down, preferably to a few items or fewer.
- In the second stage, dozens of newly prepared progressive glaucoma specimens and dozens of non-progressive glaucoma specimens are measured in order to perform additional analysis on the several items considered useful for diagnosis.
- The specimens used for cytokine data acquisition are the same as the specimens used in this experiment.
- The genotype data of the SNPs used for the analysis are converted into discrete numerical values, with missing values corrected, in the same manner as in Example 1.
- The cytokine data are likewise standardized with reference to the blood cytokine concentrations in non-progressive glaucoma and quantified as continuous values. These data are input to the statistical processing software "R" together with various library software.
- In the above examples, the onset / healthy attribute is determined for the physiological state of onset, and the progressive / non-progressive attribute is determined for the physiological state of progression.
- The attributes are not limited to these. As in the above embodiment, attributes of other physiological states such as infection and prognosis, namely infected / non-infected and good prognosis / poor prognosis, can be determined in the same way.
- Even when the physiological state attribute included in the learning data set is infected / non-infected or good prognosis / poor prognosis instead of onset / healthy or progressive / non-progressive, the determination can likewise be made with high accuracy.
Abstract
Description
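First, it is difficult to explain all of the genetic factors of glaucoma with only the genes described in Patent Document 1, Patent Document 4, Patent Document 7, and Non-Patent Document 1, and the existence of still unknown glaucoma-related genes is expected. These conventional techniques therefore leave room for further improvement in explaining the genetic factors of glaucoma.
FIG. 1 is a conceptual diagram outlining the physiological state discriminating apparatus according to this embodiment. To use this apparatus, a learning data set acquired from a group of individuals, including glaucoma patients and healthy persons, is first prepared. This learning data set contains, for each individual, a combination of the physiological state attribute (such as the onset, progression, or prognosis of glaucoma), discrete data on the base sequence of the genome (such as genotypes defined by the number of alleles at SNPs), and continuous data on the amount of a specific substance in the body (such as blood cytokine concentration). Next, a plurality of sub-data sets resampled from this learning data set are prepared.
FIG. 3 is a conceptual diagram explaining the data input and output of the physiological state discriminating apparatus 1000 according to this embodiment. As shown in this figure, the physiological state discriminating apparatus 1000 receives the learning data set and the subject data as input and outputs the result of the integrated determination. The apparatus can operate in this way because it has the specific configuration described below.
FIG. 5 is a conceptual diagram for explaining the genotype data used in the physiological state discriminating apparatus of this embodiment. As shown in this figure, data on genetic polymorphisms or variants are used as the genotype data (discrete data on the base sequence of an individual's genome). This is because, as shown in the examples described later, when determining physiological state attributes such as the onset, progression, or prognosis of glaucoma, comprehensively testing the genetic polymorphisms associated with the attribute and using them as genotype data improves the determination accuracy. In this specification, a "genetic polymorphism" means a genetic mutation present at a frequency of 1% or more of the population, whereas a "variant" means a genetic mutation present at a frequency of less than 1% of the population. Causes of genetic polymorphisms or variants include, for example, the various mutations that arise within a species, namely "substitution", in which a base is replaced by another base, "deletion", in which a base is lost, and "insertion" and "duplication", in which bases are added, as well as genetic recombination. Among genetic polymorphisms, SNPs, in which a single base is replaced by another base, are regarded as useful markers for individualizing genetic background.
FIG. 7 is a functional block diagram for explaining the configuration of the learning data set acquisition unit 102 of the physiological state discriminating apparatus 1000 of this embodiment. As shown in this figure, the learning data set acquisition unit 102 is provided with a genotype data quantification unit 802 that converts genotype data into numerical data. The genotype data quantification unit 802 is provided with a numerical conversion unit 804 that converts the acquired genotype data into preset numerical values.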
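FIG. 8 is a functional block diagram for explaining the configuration of the resampling unit 106 of the physiological state discriminating apparatus 1000 of this embodiment. As shown in this figure, the resampling unit 106 is provided with a random extraction unit 902 that randomly extracts sub-data sets from the learning data set. The resampling unit 106 can therefore randomly generate a large number of sub-data sets, each containing the data of some of the individuals, from a learning data set containing the data of many individuals. As a result, the first machine learning unit 108 and the second machine learning unit 110 described later can learn from a large number of random sub-data sets, which improves the accuracy of the machine learning. When sub-data sets are generated at random, the same sub-data set may be generated more than once with low probability, so the resampling unit 106 may be configured to eliminate such duplicates.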
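Random extraction with duplicate elimination can be sketched as follows. This is an illustrative sketch, not the resampling unit's actual logic: the function name, the identifiers, and the choice to draw the same number from each group are assumptions.

```python
import math
import random


def draw_subdatasets(case_ids, control_ids, per_group, n_sets, seed=0):
    """Draw n_sets distinct sub-data sets, each with per_group cases and controls."""
    possible = math.comb(len(case_ids), per_group) * math.comb(len(control_ids), per_group)
    if n_sets > possible:
        raise ValueError("not enough distinct sub-data sets exist")
    rng = random.Random(seed)
    seen = set()
    subsets = []
    while len(subsets) < n_sets:
        cases = tuple(sorted(rng.sample(case_ids, per_group)))
        controls = tuple(sorted(rng.sample(control_ids, per_group)))
        key = (cases, controls)
        if key in seen:
            continue  # the same sub-data set was drawn again: discard it
        seen.add(key)
        subsets.append(key)
    return subsets


# Hypothetical specimen identifiers: 42 cases and 42 controls.
case_ids = list(range(42))
control_ids = list(range(100, 142))
subsets = draw_subdatasets(case_ids, control_ids, per_group=20, n_sets=5)
```

Sorting each draw before storing it makes two draws of the same individuals compare equal regardless of the order in which they were sampled.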
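FIG. 9 is visual data explaining the principle of the principal component analysis used in the physiological state discriminating apparatus of this embodiment. Here, principal component analysis includes analysis methods that determine the overall characteristics of a set of variables. Principal component analysis removes the correlation among the variables of quantitative data described by many variables and reduces them, with as little loss of information as possible, to a small number of uncorrelated composite variables for analysis. The technique was proposed by Hotelling around 1933 (金明哲 (Mingzhe Jin), "Rによるデータサイエンス", p. 66, Morikita Publishing). Many functions (analysis engines) perform principal component analysis. For example, "prcomp" and "princomp", written for the statistical analysis software "R" that implements the R language, a method that obtains the eigenvectors directly from "eigen" using the standard matrix calculation of "R", and eigenvector calculation using "LAPACK", a numerical library for C and Fortran, can all be suitably used. In genetics, principal component analysis is applied to the evaluation of population structure, for example to evaluate the stratification of a specimen population (detecting differences in genomic information caused by ethnicity, region, and so on). Specifically, when principal component analysis with two or three principal components is performed on a population consisting of people of African, Western, and Asian descent, the population separates into three groups as shown in FIG. 9.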
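The eigenvector route mentioned above can be illustrated without any library. The sketch below finds only the first principal component by power iteration on the covariance matrix; it is a stand-in written in Python for illustration, not the "prcomp" or "eigen" routines of R named in the text, and the function name and data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the first principal axis and the projected scores of each row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]  # converges to the dominant eigenvector
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Two strongly correlated variables: almost all variance lies along one axis.
rows = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0), (4.0, 4.1)]
axis, scores = first_principal_component(rows)
```

Because the two variables rise together, the recovered axis points roughly along the diagonal, and the single score per row carries nearly all of the information in the original pair.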
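The continuous data of this embodiment are, as described later, data on the blood cytokine concentrations of an individual. Specifically, the results of blood cytokine concentration measurement by CBA are used as the continuous data. The measurement principle is as follows.
1. Blood collected from the specimen is centrifuged to obtain a plasma sample.
2. The plasma sample is reacted with the capture antibodies on the bead surface.
3. The detection antibodies, each labeled with the dye phycoerythrin (PE), are then reacted.
4. Using a flow cytometer, the type of antigen is determined from the fluorescence intensity of the bead, and the amount of each antigen is measured from the fluorescence intensity of the PE labeling the detection antibody.
FIG. 19 is visual data explaining an example of cytokine data analysis by the principal component analysis used in the physiological state discriminating apparatus of this embodiment. In this figure, the blood cytokine concentrations shown in FIG. 19 were measured, and principal component analysis was performed together with the classification of physiological state attributes, namely the presence or absence of glaucoma and its state of progression according to a physician's definitive diagnosis. The figure is based on the first-stage and second-stage specimen data described above, the onset group versus the control group, and three cytokine items. Because three cytokine items are used, plotting PC1 to PC3 in three dimensions shows all of the principal components, so the figure is drawn as a 3D plot. The figure shows that, when the analysis uses the three principal components PC1, PC2, and PC3, the control group data are on the whole relatively tightly clustered while the onset group data are scattered, which indicates that the physiological state attribute for glaucoma onset (onset / healthy) can be determined with high accuracy.
FIG. 23 is a functional block diagram explaining the configuration of the test data set acquisition unit 104 of the physiological state discriminating apparatus 1000 of this embodiment. The test data set acquisition unit 104 is configured to acquire subject data on the test individual, including a combination of discrete data on the individual's genetic polymorphisms and continuous data on the individual's blood cytokine concentrations.
FIG. 24 is a conceptual diagram explaining the function of the integrated determination unit 114 of the physiological state discriminating apparatus 1000 of this embodiment. As described above, it is difficult simply to add together two types of data with different numerical characteristics: genotype data are discrete values, whereas cytokine data are continuous values. In this embodiment, therefore, the physiological state attribute is determined by integrating the individual discrimination results rather than the numerical values. Specifically, the two analyses are integrated with reference to the bagging technique. In step 1, processing modeled on bagging is performed using the genotype data. In step 2, processing modeled on bagging is performed using the cytokine data. In step 3, the results of steps 1 and 2 are integrated and the final determination is made by majority vote. In practice, the experiment was carried out under the following conditions.
FIG. 28 is a functional block diagram explaining the configuration of the integrated determination unit 114 of the physiological state discriminating apparatus 1000 of this embodiment. As shown in this figure, the integrated determination unit 114 is provided with a first discrimination result acquisition unit 302 that acquires, from the test data analysis unit 112, the first discrimination result based on the genotype data. The integrated determination unit 114 is also provided with a second discrimination result acquisition unit 304 that acquires, from the test data analysis unit 112, the second discrimination result based on the cytokine data.
FIG. 30 is a flowchart explaining the genotype data analysis operation of the physiological state discriminating apparatus 1000 of this embodiment. As shown in this figure, when the physiological state discriminating apparatus 1000 starts the series of genotype data analysis operations, the learning data set acquisition unit 102 first receives an input of genotype data (S102). The genotype data input in this way are then simply converted from A, T, C, and G into numerical values in the genotype data quantification unit 802 of the learning data set acquisition unit 102 (S104).
FIG. 33 is a functional block diagram for explaining a modification of this embodiment. The physiological state discriminator parameter generation device 1100 according to this embodiment is a device that generates the discriminators used in the physiological state discrimination method described in the above flowcharts. The physiological state discriminator parameter generation device 1100 includes a learning data set acquisition unit 1102 that acquires a learning data set. The learning data set concerns a group of individuals that is obtained from a population of individuals of the same species as the test individual and is used for the machine learning described later, and it includes a combination of the physiological state attribute of each individual, discrete data on the base sequence of the individual's genome, and continuous data on the amount of a specific substance in the individual's body.
Glaucoma is one of the main causes of blindness, and both congenital genetic factors and acquired environmental factors are thought to contribute to its onset. The diagnostic performance of the method was therefore examined for primary open-angle glaucoma (POAG), a representative type of glaucoma, using genotype data, which are genetic information, and cytokine data, which reflect the acquired condition of the body.
Two independent data sets were prepared: Stage 1, with 42 POAG specimens and 42 healthy control specimens, and Stage 2, with 73 POAG specimens and 52 healthy control specimens. All of these specimens have both genotype data and cytokine data. Stage 1 is used to capture the characteristics of the disease by machine learning, and the Stage 2 specimens are diagnosed on the basis of the result.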
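The single nucleotide polymorphisms (SNPs) used for determination were selected on the basis of data that the present inventors have already published (Nakano et al., Proc Natl Acad Sci U S A. 2009 Aug 4;106(31):12838-42). Specifically, in the first stage, a genome-wide analysis was performed on 418 POAG specimens and 300 healthy control specimens using the Affymetrix GeneChip(R) Human Mapping 500K Array chip (Affy500k). A chi-square test was performed on the 331,838 SNPs that remained for analysis after quality control, and 255 SNPs with P < 0.001, considered significant, were extracted. In the second stage, the SNPs extracted in the first stage were further analyzed in 409 POAG specimens and 448 healthy control specimens with a custom chip (iSelect) using Illumina's iSelect(TM) Custom Infinium(TM) Genotyping system. As the final stage, a combined analysis of the data from both stages was performed. SNPs with P < 0.01 in the Cochran-Mantel-Haenszel chi-square test and P ≥ 0.05 in the heterogeneity (Cochran's Q) chi-square test were extracted, finally yielding 40 SNPs strongly suggested to be associated with POAG. However, combinations of SNPs that are in linkage disequilibrium (LD) may cause malfunctions during analysis. SNPs showing D' > 0.9 in Haploview 4.1, a software package for linkage disequilibrium analysis, were therefore excluded as belonging to the same LD, and 29 SNPs were finally selected for analysis. The present inventors have already obtained a patent on these SNPs (International Publication No. 2008/130008).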
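The final-stage thresholds and the LD exclusion can be sketched as two small filters. The p-values and D' values are assumed to have been computed elsewhere (for example by the tests and by Haploview as described above); the field names, SNP identifiers, and the greedy keep-first rule for LD groups are hypothetical, since the text does not say which SNP of an LD group is retained.

```python
def select_snps(snps, cmh_max=0.01, het_min=0.05):
    """Keep SNPs with CMH P < 0.01 and heterogeneity (Cochran's Q) P >= 0.05."""
    return [s["id"] for s in snps if s["p_cmh"] < cmh_max and s["p_het"] >= het_min]


def prune_ld(snp_ids, d_prime, threshold=0.9):
    """Greedily keep a SNP only if it is not in LD (D' > 0.9) with one already kept."""
    kept = []
    for snp in snp_ids:
        if all(d_prime.get(frozenset((snp, k)), 0.0) <= threshold for k in kept):
            kept.append(snp)
    return kept


# Made-up association results for four hypothetical SNPs.
snps = [
    {"id": "snpA", "p_cmh": 0.002, "p_het": 0.40},
    {"id": "snpB", "p_cmh": 0.004, "p_het": 0.30},
    {"id": "snpC", "p_cmh": 0.200, "p_het": 0.50},  # fails the CMH threshold
    {"id": "snpD", "p_cmh": 0.001, "p_het": 0.01},  # fails the heterogeneity threshold
]
d_prime = {frozenset(("snpA", "snpB")): 0.95}       # snpA and snpB are in LD

candidates = select_snps(snps)
selected = prune_ld(candidates, d_prime)
```

Requiring a non-significant heterogeneity test keeps only SNPs whose effect is consistent across the two stages, and the LD step leaves one representative per correlated group.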
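To obtain the cytokine data used in the integrated determination method, blood cytokine concentration data were acquired in two stages using the Cytometric Bead Array (CBA) Flex Set System manufactured by Becton Dickinson, which can measure a large number of cytokines simultaneously. In the first stage, blood concentration data were obtained for 42 POAG specimens and 42 healthy control specimens on a total of 29 items, the largest set this CBA can measure simultaneously with good accuracy: IL-1β, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-10, IL-12p70, IL-13, MCP-1 (CCL2), MIP-1α (CCL3), MIP-1β (CCL4), RANTES (CCL5), Eotaxin (CCL11), MIG (CXCL9), basic-FGF, VEGF, G-CSF, GM-CSF, IFN-γ, Fas Ligand, TNF, IP-10, angiogenin, OSM, and LT-α. From these results, items were excluded if the proportion of specimens with measurement failure was 5% or more (7 items), if the proportion of specimens with a measured value of 0.0 was 5% or more (14 items), or if the p-value of the t-test between the two groups was 5% or more (5 items), finally narrowing the set down to 3 items. In the second stage, 73 newly prepared POAG specimens and 52 healthy control specimens were measured in order to perform additional analysis on the three items considered useful for diagnosis. The specimens used for cytokine data acquisition are the same as the specimens used in this experiment.
The genotype data of the SNPs used for analysis were converted into discrete numerical values, with missing values corrected, with reference to a method of normalizing per individual based on SNP allele frequency (Price et al., Nat Genet. 2006 Aug;38(8):904-9). The cytokine data were likewise standardized by an original procedure that refers to the blood cytokine concentrations of the healthy control group, and were quantified as continuous values. These data were input to the statistical processing software "R" together with various library software. The developer of "R" is the "R Development Core Team", and version 2.10.1 was used. The version of the library "e1071" used for SVM is 1.5-22 (the same applies to the other examples described later).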
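From the 42 POAG specimens and the 42 healthy control specimens of Stage 1, 20 specimens each are sampled at random. The features of the genotype data are machine-learned using the support vector machine (SVM) in the "e1071" library of "R", each of the 73 POAG specimens and 52 healthy control specimens of Stage 2 is judged glaucoma-positive or glaucoma-negative by the SVM, and the judgment is stored. After this series of operations has been repeated 501 times, the same operations are repeated 500 times for the cytokine data. In the end, a total of 1001 judgments are obtained for every Stage 2 specimen, so the positive and negative judgments are counted for each specimen and a majority vote is taken. The more frequent judgment is the final judgment for that specimen.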
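The whole loop can be sketched end to end as follows. To stay self-contained, the sketch swaps the SVM of "e1071" for a nearest-centroid classifier, which is a deliberate simplification and not the method of the example; all function names and the tiny data set are hypothetical, and the round counts are passed as parameters (the example above uses 501 and 500).

```python
import random
from collections import Counter


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bagged_votes(train_case, train_control, test_rows, rounds, per_group, rng):
    """Resample, fit a stand-in classifier, and judge every test row, per round."""
    votes = [Counter() for _ in test_rows]
    for _ in range(rounds):
        c_case = centroid(rng.sample(train_case, per_group))
        c_ctrl = centroid(rng.sample(train_control, per_group))
        for tally, row in zip(votes, test_rows):
            label = "Case" if sq_dist(row, c_case) < sq_dist(row, c_ctrl) else "Control"
            tally[label] += 1
    return votes


def diagnose(geno, cyto, n, per_group, seed=0):
    """n + 1 genotype rounds plus n cytokine rounds, then a per-specimen majority."""
    rng = random.Random(seed)
    g_votes = bagged_votes(*geno, rounds=n + 1, per_group=per_group, rng=rng)
    c_votes = bagged_votes(*cyto, rounds=n, per_group=per_group, rng=rng)
    final = []
    for g, c in zip(g_votes, c_votes):
        total = g + c
        final.append("Case" if total["Case"] > total["Control"] else "Control")
    return final


# Tiny, well-separated toy data: (training cases, training controls, test rows).
geno = ([[2.0 + 0.01 * i] for i in range(5)], [[0.01 * i] for i in range(5)], [[1.9], [0.1]])
cyto = ([[3.0 + 0.01 * i] for i in range(5)], [[1.0 + 0.01 * i] for i in range(5)], [[2.8], [1.2]])
result = diagnose(geno, cyto, n=4, per_group=3)
```

Because every round sees a different random subset of Stage 1, each stored judgment comes from a slightly different classifier, and the majority vote smooths out the variation between them.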
このようにまとめた判定結果を表1に示す。
緑内障には進行型/非進行型があると考えられている。そして、緑内障の進行型/非進行型について、遺伝情報であるジェノタイプデータと後天的な生体の状況を反映するサイトカインデータを用いて、本方法による診断性能を検討できる。
なお、本実施例において、生理状態の属性「進行型」「非進行型」の定義は以下のとおりとする。
進行型: ある疾患に罹患した個体のうち、特にその疾患の進行が早いもの
非進行型: ある疾患に罹患した個体のうち、進行型でないもの
実施例1の場合と同様にして、独立した2つのデータセットとして、進行型緑内障の数十の検体と非進行型緑内障の数十の検体によるStage1、進行型緑内障の数十の検体と非進行型緑内障の数十の検体によるStage2をそれぞれ用意する。これらの検体は全てジェノタイプデータとサイトカインデータの両方を持っており、Stage1は機械学習で疾患の特徴を捉えるために使用し、その結果を元にStage2の検体を診断する。
実施例1の場合と同様にして、判定に使用する一塩基多型(SNPs、Single Nucleotide Polymorphisms)を選定する。具体的には、第一段階としてAffymetrix社のGeneChip(R) Human Mapping 500K Arrayチップ(Affy500k)を用いた進行型緑内障の数百の検体と非進行型緑内障の数百の検体による全ゲノム解析を行い、Quality-Controlの後に解析対象となったSNPsについてカイ二乗検定を実施した結果、有意と思われるP<0.001のSNPsを抽出する。続いて第二段階として、第一段階で抽出したSNPsについてillumina社のiSelect(TM) Custom Infinium(TM) Genotyping systemを用いたカスタムチップ(iSelect)により、進行型緑内障の数百以上の検体と非進行型緑内障の数百以上の検体による追加解析を行う。さらに最終段階として両段階のデータの組合せ解析を行い、Cochran-Mantel-Haenszel chi-square testによってP値<0.01であり、かつHeterogeneity (Cochran's Q)chi-square testでP値≧0.05のものを抽出して、最終的に進行型緑内障との関連が非常に示唆されるSNPsを得る。ただし、各SNPsの組合せのうち連鎖不平衡(LD、Linkage disequilibrium)の状態にあるものは解析時に誤作動を引き起こす恐れがあるため、連鎖不平衡解析用ソフトウェアであるHaploview4.1でD'>0.9の値を示したSNPsを同一LDのものとして除外して、最終的に好ましくは数十以下のSNPsを解析対象として選択する。
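To obtain the cytokine data used in this integrated determination method, blood cytokine concentrations are measured in two stages using the Cytometric Bead Array (CBA) Flex Set System from Becton, Dickinson and Company, which can measure many cytokines simultaneously. In the first stage, several tens of progressive glaucoma samples and several tens of non-progressive glaucoma samples are assayed for the 29 analytes that this CBA can measure simultaneously with good accuracy: IL-1β, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-10, IL-12p70, IL-13, MCP-1 (CCL2), MIP-1α (CCL3), MIP-1β (CCL4), RANTES (CCL5), Eotaxin (CCL11), MIG (CXCL9), basic-FGF, VEGF, G-CSF, GM-CSF, IFN-γ, Fas Ligand, TNF, IP-10, angiogenin, OSM, and LT-α. From these results, analytes for which 5% or more of the samples failed measurement, analytes for which 5% or more of the samples had a measured value of 0.0, and analytes for which the p-value of the t-test between the two groups is 5% (0.05) or greater are excluded, preferably leaving only a few analytes. In the second stage, a newly prepared set of several tens of progressive glaucoma samples and several tens of non-progressive glaucoma samples is measured so that these few analytes, considered useful for diagnosis, can be analyzed further. The samples used to acquire the cytokine data are the same as those used in this experiment.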
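As in Example 1, the SNP genotype data used in the analysis are converted to discrete numerical values, with correction of missing values. The cytokine data are likewise converted to continuous numerical values by an original standardization that takes the blood cytokine concentrations of the non-progressive glaucoma group as a reference. These data are loaded into the statistical software R together with the required libraries.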
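The "original standardization" is not specified in this passage. One simple reading, offered here only as an assumption, is a z-score computed against the reference group (healthy controls in Example 1, non-progressive glaucoma in this example), optionally after a log transform. The function name `standardize` and the `use_log` flag are hypothetical.

```python
import math
import statistics


def standardize(values, reference, use_log=False):
    """Express values as z-scores relative to a reference group.

    values: concentrations to convert; reference: concentrations of the
    reference group for the same cytokine. With use_log=True all
    concentrations must be strictly positive.
    """
    transform = math.log if use_log else (lambda v: v)
    ref = [transform(v) for v in reference]
    mu = statistics.mean(ref)
    sd = statistics.stdev(ref)  # sample standard deviation of the reference
    return [(transform(v) - mu) / sd for v in values]
```

Whether to set `use_log` could be decided per cytokine, for example by checking which scale looks closer to a normal distribution; that check is outside this sketch.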
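From each of the several tens of progressive glaucoma samples and several tens of non-progressive glaucoma samples of Stage 1, 20 samples are drawn at random, the features of the genotype data are machine-learned using the support vector machine (SVM) in the e1071 library for R, each of the several tens of progressive glaucoma samples and several tens of non-progressive glaucoma samples of Stage 2 is classified by the SVM as glaucoma-positive or glaucoma-negative, and the results are stored. After this sequence has been repeated 501 times, the same procedure is repeated 500 times for the cytokine data. This yields a total of 1001 calls for every Stage 2 sample, so the positive and negative calls are tallied for each sample and the majority call is taken as that sample's final determination.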
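104 Subject data acquisition unit
106 Resampling unit
108 First machine learning unit
110 Second machine learning unit
112 Test data analysis unit
114 Integrated determination unit
116 Output unit
118 Network
120 Network
122 Image display unit
124 Operation unit
126 Server
128 Measuring device
130 Image display unit
132 Printer
134 Server
202 First discriminator parameter acquisition unit
204 Second discriminator parameter acquisition unit
206 Optimal analysis method application unit
208 Statistical analysis engine storage unit
210 Principal component analysis engine
212 Discriminant analysis engine
214 SVM engine
216 Converted test data acquisition unit
218 Discriminator application unit
220 First discrimination result generation unit
222 Second discrimination result generation unit
302 First discrimination result acquisition unit
304 Second discrimination result acquisition unit
306 Subtotal calculation unit
308 First subtotal calculation unit
310 Second subtotal calculation unit
312 Weight parameter application unit
314 Total calculation unit
316 Physiological state determination unit
318 Integration parameter storage unit
320 Weight parameter database
322 Integration formula database
324 Random parameter calculation unit
326 Test sample data acquisition unit
328 Sample subtotal calculation unit
330 Sample total calculation unit
332 Sample integrated determination unit
334 Weight parameter selection unit
401 Data conversion unit
402 Genotype data conversion unit
404 Training data set conversion formula acquisition unit
410 Conversion unit
412 Cytokine data conversion unit
414 Unit for extracting control group data within the test data set
420 Extracted data processing unit
500 Output data generation unit
502 Test individual identification data generation unit
504 Integrated determination data generation unit
506 Predicted determination accuracy data generation unit
508 Image data generation unit
602 First statistical analysis unit
208 Statistical analysis engine storage unit
606 First accuracy verification unit
210 Principal component analysis engine
212 Discriminant analysis engine
214 SVM engine
614 First statistical analysis method selection unit
616 First discriminator parameter generation unit
702 Second statistical analysis unit
208 Statistical analysis engine storage unit
706 Second accuracy verification unit
210 Principal component analysis engine
212 Discriminant analysis engine
214 SVM engine
714 Second statistical analysis method selection unit
716 Second discriminator parameter generation unit
802 Genotype data quantification unit
804 Numerical conversion unit
806 Risk allele data storage unit
808 Allele frequency calculation unit
810 Normalization unit
812 Cytokine data standardization unit
814 Control group data extraction unit
816 Log transformation unit
818 Normality determination unit
820 Standardization unit
902 Random extraction unit
904 Extraction counter
906 Test sample extraction unit
1000 Physiological state discrimination device
1100 Physiological state discriminator parameter generation device
1102 Training data set acquisition unit
1104 Test data acquisition unit
1106 Resampling unit
1108 First machine learning unit
1110 Second machine learning unit
1111 Output unit
1112 Test data analysis unit
1114 Integrated determination unit
1116 Output unit
1118 Network
1120 Network
1121 Discriminator parameter acquisition unit
1122 Image display unit
1124 Operation unit
1126 Server
1128 Measuring device
1130 Image display unit
1132 Server
1134 Printer
1142 Image display unit
1144 Operation unit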
Claims (37)
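- A device for discriminating an attribute of a physiological state of an individual mammal, the device comprising:
a training data set acquisition unit that acquires a training data set on a group of a plurality of individuals used for machine learning, obtained from a population of individuals of the same species as a test individual, the training data set including a combination of the attribute of the physiological state of each individual, discrete data on the nucleotide sequence of the genome of the individual, and continuous data on the amount of a specific substance in the body of the individual;
a resampling unit that extracts sub-data sets on a plurality of mutually different sub-groups of individuals, obtained by random resampling from the training data set, each sub-data set including a combination of the attribute of the physiological state of each individual in the sub-group, discrete data on the nucleotide sequence of the genome of each individual, and continuous data on the amount of the specific substance in the body of each individual;
a first machine learning unit that machine-learns the patterns of the physiological state attributes and the discrete data included in the plurality of sub-data sets to obtain a plurality of mutually different first discriminators for discriminating, on the basis of the discrete data, the attribute of the physiological state of each individual included in the sub-data sets;
a second machine learning unit that machine-learns the patterns of the physiological state attributes and the continuous data included in the plurality of sub-data sets to obtain a plurality of mutually different second discriminators for discriminating, on the basis of the continuous data, the attribute of the physiological state of each individual included in the sub-data sets;
a subject data acquisition unit that acquires subject data on the test individual, obtained from the test individual and consisting of discrete data on the nucleotide sequence of the genome of the individual and continuous data on the amount of the specific substance in the body of the individual;
a test data analysis unit that pattern-analyzes the subject data a plurality of times with each of the plurality of first discriminators and second discriminators to generate, a plurality of times each, a first discrimination result and a second discrimination result for the attribute of the physiological state of the test individual;
an integrated determination unit that integrates the first discrimination results and the second discrimination results for each attribute of the physiological state and determines, as an integrated determination, that the attribute discriminated most often in the first discrimination results and the second discrimination results is the attribute of the physiological state of the test individual; and
an output unit that outputs the result of the integrated determination.
- The device according to claim 1, wherein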
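the discrete data are data on genetic polymorphisms or variants.
- The device according to claim 2, wherein
the discrete data are data on SNPs.
- The device according to claim 2 or 3, wherein
the discrete data are data normalized for each individual based on the allele frequencies of the genetic polymorphisms or SNPs.
- The device according to any one of claims 1 to 4, wherein
the discrete data are data derived from the results of analysis by a DNA sequencer, a DNA microarray, or a nucleic acid amplification method.
- The device according to any one of claims 1 to 5, wherein
the continuous data are data on the blood cytokine concentrations of the individual.
- The device according to claim 6, wherein
the cytokines are one or more cytokines selected from the group consisting of IL-1β, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8, IL-9, IL-10, IL-12p70, IL-13, MCP-1 (CCL2), MIP-1a (CCL3), MIP-1b (CCL4), RANTES (CCL5), Eotaxin (CCL11), MIG (CXCL9), b-FGF, VEGF, G-CSF, GM-CSF, IFN-g, Fas L, TNF, IP-10, angiogenin, OSM, and LT-α.
- The device according to claim 6 or 7, wherein
the device has, for the continuous data, a normality test unit that log-transforms the blood cytokine concentrations for each type of cytokine, tests the normality of the original values and of the log values, and adopts whichever values are closer to a normal distribution.
- The device according to any one of claims 6 to 8, wherein
the continuous data are data derived from the results of analyzing the blood of the individual with an antibody chip having an array of antibodies that bind specifically to the cytokines, or by flow cytometry using a bead set to which antibodies that bind specifically to the cytokines are bound.
- The device according to any one of claims 1 to 9, wherein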
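the training data set acquisition unit is configured to read the training data set from a population database that is provided inside or outside the device and stores the training data set on the group of individuals.
- The device according to claim 10, wherein
the population database is configured so that combinations of the attribute of the physiological state of the individual, discrete data on the nucleotide sequence of the genome of the individual, and continuous data on the amount of the specific substance in the body of the individual, for new individuals of the same species as the test individual, are added and updated as needed.
- The device according to any one of claims 1 to 11, wherein
the resampling unit has a random extraction unit that randomly extracts the sub-data sets from the training data set.
- The device according to claim 12, wherein
the resampling unit has an extraction counter that controls the extraction by the random extraction unit so that it is repeated a predetermined number of times, the number being 10 or more.
- The device according to claim 12 or 13, wherein
the resampling unit has a test sample extraction unit for extracting test sample data used to verify the accuracy with which the first discriminators and/or the second discriminators discriminate the attribute of the physiological state.
- The device according to any one of claims 1 to 14, wherein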
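the first machine learning unit has a first statistical analysis unit that performs one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, SVM, factor analysis, cluster analysis, multiple regression analysis, decision trees, naive Bayes classifiers, artificial neural networks, Markov chain Monte Carlo methods, Gibbs samplers, and SOM.
- The device according to claim 15, wherein
the first statistical analysis unit is configured to perform one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, and SVM.
- The device according to claim 15 or 16, wherein
the first machine learning unit has a first accuracy verification unit that verifies the discrimination accuracy of sample analysis results obtained by pattern-analyzing, with the first discriminators, test sample data randomly extracted from the training data set.
- The device according to claim 17, wherein
the first machine learning unit has a first statistical analysis method selection unit that, based on the verification results of the first accuracy verification unit, adopts the statistical analysis method with the highest discrimination accuracy from among the one or more statistical analysis methods.
- The device according to any one of claims 1 to 18, wherein
the second machine learning unit has a second statistical analysis unit that performs one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, SVM, factor analysis, cluster analysis, multiple regression analysis, decision trees, naive Bayes classifiers, artificial neural networks, Markov chain Monte Carlo methods, Gibbs samplers, and SOM.
- The device according to claim 19, wherein
the second statistical analysis unit is configured to perform one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, and SVM.
- The device according to claim 20, wherein
the second machine learning unit has a second accuracy verification unit that verifies the discrimination accuracy of sample analysis results obtained by pattern-analyzing, with the second discriminators, test sample data randomly extracted from the training data set.
- The device according to claim 21, wherein
the second machine learning unit has a second statistical analysis method selection unit that, based on the verification results of the second accuracy verification unit, adopts the statistical analysis method with the highest discrimination accuracy from among the one or more statistical analysis methods.
- The device according to any one of claims 1 to 22, wherein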
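the subject data acquisition unit is configured to acquire subject data on the test individual that include a combination of discrete data on the genetic polymorphisms of the individual and continuous data on the blood cytokine concentrations of the individual.
- The device according to claim 23, wherein
the subject data acquisition unit has a data conversion unit that converts the subject data to numerical values and/or normalizes them by the same method as used for the training data set.
- The device according to any one of claims 1 to 24, wherein
the test data analysis unit has an optimal analysis method application unit that uses, as each of the plurality of first discriminators and second discriminators, the statistical analysis method with the highest discrimination accuracy from among one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, SVM, factor analysis, cluster analysis, multiple regression analysis, decision trees, naive Bayes classifiers, artificial neural networks, Markov chain Monte Carlo methods, Gibbs samplers, and SOM.
- The device according to claim 25, wherein
the optimal analysis method application unit is configured to perform one or more statistical analysis methods selected from the group consisting of principal component analysis, discriminant analysis, and SVM.
- The device according to any one of claims 1 to 26, wherein
the test data analysis unit has a discriminator application unit that pattern-analyzes the subject data using each of the plurality of mutually different first discriminators and second discriminators at least once to generate the first discrimination results and the second discrimination results for the attribute of the physiological state of the test individual.
- The device according to any one of claims 1 to 27, wherein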
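the integrated determination unit has:
a subtotal calculation unit that subtotals, separately for the first discrimination results and the second discrimination results, the number of times the test data were discriminated as having a particular attribute of the physiological state; and
a total calculation unit that obtains, for each attribute of the physiological state, the total of the subtotal results for the first discrimination results and the second discrimination results.
- The device according to claim 28, wherein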
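the integrated determination unit further has a weight parameter application unit for weighting each of the subtotal results for the first discrimination results and the second discrimination results with a predetermined parameter before the total is obtained.
- The device according to claim 29, wherein
the integrated determination unit has:
a sample subtotal calculation unit that obtains sample subtotal results for the sample analysis results obtained by processing, in the test data analysis unit, test sample data randomly extracted from the training data set;
a random parameter calculation unit that randomly calculates a plurality of the weight parameters;
a sample total calculation unit that obtains, for each attribute of the physiological state, the total of the sample subtotal results after weighting with the random weight parameters;
a sample integrated determination unit that determines, as an integrated determination, that the attribute of the physiological state discriminated most often in the sample total results for each sample individual included in the test sample data is the attribute of the physiological state of that sample individual; and
a weight parameter selection unit that tallies the determination accuracy of the integrated determination results of the sample individuals for each weight parameter and adopts the weight parameter with the highest determination accuracy.
- The device according to any one of claims 1 to 30, wherein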
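the output unit is configured to output together:
information for identifying the test individual;
the result of the integrated determination; and
the predicted determination accuracy.
- The device according to any one of claims 1 to 31, wherein
the mammal is a human.
- The device according to claim 32, wherein
the test individual is a patient who has visited a medical institution.
- A method for discriminating an attribute of a physiological state of an individual mammal, the method comprising the steps of: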
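acquiring a training data set on a group of a plurality of individuals used for machine learning, obtained from a population of individuals of the same species as a test individual, the training data set including a combination of the attribute of the physiological state of each individual, discrete data on the nucleotide sequence of the genome of the individual, and continuous data on the amount of a specific substance in the body of the individual;
extracting sub-data sets on a plurality of mutually different sub-groups of individuals, obtained by random resampling from the training data set, each sub-data set including a combination of the attribute of the physiological state of each individual in the sub-group, discrete data on the nucleotide sequence of the genome of each individual, and continuous data on the amount of the specific substance in the body of each individual;
machine-learning the patterns of the physiological state attributes and the discrete data included in the plurality of sub-data sets to obtain a plurality of mutually different first discriminators for discriminating, on the basis of the discrete data, the attribute of the physiological state of each individual included in the sub-data sets;
machine-learning the patterns of the physiological state attributes and the continuous data included in the plurality of sub-data sets to obtain a plurality of mutually different second discriminators for discriminating, on the basis of the continuous data, the attribute of the physiological state of each individual included in the sub-data sets;
acquiring subject data on the test individual, obtained from the test individual and including a combination of discrete data on the nucleotide sequence of the genome of the individual and continuous data on the amount of the specific substance in the body of the individual;
pattern-analyzing the subject data a plurality of times with each of the plurality of first discriminators and second discriminators to generate, a plurality of times each, a first discrimination result and a second discrimination result for the attribute of the physiological state of the test individual;
integrating the first discrimination results and the second discrimination results for each attribute of the physiological state and determining, as an integrated determination, that the attribute discriminated most often in the first discrimination results and the second discrimination results is the attribute of the physiological state of the test individual; and
outputting the result of the integrated determination.
- A device for generating the discriminators used in the method according to claim 34, the device comprising: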
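a training data set acquisition unit that acquires a training data set on a group of a plurality of individuals used for machine learning, obtained from a population of individuals of the same species as a test individual, the training data set including a combination of the attribute of the physiological state of each individual, discrete data on the nucleotide sequence of the genome of the individual, and continuous data on the amount of a specific substance in the body of the individual;
a resampling unit that extracts sub-data sets on a plurality of mutually different sub-groups of individuals, obtained by random resampling from the training data set, each sub-data set including a combination of the attribute of the physiological state of each individual in the sub-group, discrete data on the nucleotide sequence of the genome of each individual, and continuous data on the amount of the specific substance in the body of each individual;
a first machine learning unit that machine-learns the patterns of the physiological state attributes and the discrete data included in the plurality of sub-data sets to obtain a plurality of mutually different first discriminators for discriminating, on the basis of the discrete data, the attribute of the physiological state of each individual included in the sub-data sets;
a second machine learning unit that machine-learns the patterns of the physiological state attributes and the continuous data included in the plurality of sub-data sets to obtain a plurality of mutually different second discriminators for discriminating, on the basis of the continuous data, the attribute of the physiological state of each individual included in the sub-data sets; and
an output unit that outputs the first discriminators and the second discriminators.
- A device for discriminating an attribute of a physiological state of an individual mammal, the device comprising: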
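a discriminator parameter acquisition unit that acquires the first discriminators and the second discriminators generated by the device according to claim 35;
a subject data acquisition unit that acquires subject data on the test individual, obtained from the test individual and consisting of discrete data on the nucleotide sequence of the genome of the individual and continuous data on the amount of the specific substance in the body of the individual;
a test data analysis unit that pattern-analyzes the subject data a plurality of times with each of the plurality of first discriminators and second discriminators to generate, a plurality of times each, a first discrimination result and a second discrimination result for the attribute of the physiological state of the test individual;
an integrated determination unit that integrates the first discrimination results and the second discrimination results for each attribute of the physiological state and determines, as an integrated determination, that the attribute discriminated most often in the first discrimination results and the second discrimination results is the attribute of the physiological state of the test individual; and
an output unit that outputs the result of the integrated determination.
- A program for discriminating an attribute of a physiological state of an individual mammal,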
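the program being configured to cause a computer to execute the steps of:
acquiring a training data set on a group of a plurality of individuals used for machine learning, obtained from a population of individuals of the same species as a test individual, the training data set including a combination of the attribute of the physiological state of each individual, discrete data on the nucleotide sequence of the genome of the individual, and continuous data on the amount of a specific substance in the body of the individual;
extracting sub-data sets on a plurality of mutually different sub-groups of individuals, obtained by random resampling from the training data set, each sub-data set including a combination of the attribute of the physiological state of each individual in the sub-group, discrete data on the nucleotide sequence of the genome of each individual, and continuous data on the amount of the specific substance in the body of each individual;
machine-learning the patterns of the physiological state attributes and the discrete data included in the plurality of sub-data sets to obtain a plurality of mutually different first discriminators for discriminating, on the basis of the discrete data, the attribute of the physiological state of each individual included in the sub-data sets;
machine-learning the patterns of the physiological state attributes and the continuous data included in the plurality of sub-data sets to obtain a plurality of mutually different second discriminators for discriminating, on the basis of the continuous data, the attribute of the physiological state of each individual included in the sub-data sets;
acquiring subject data on the test individual, obtained from the test individual and including a combination of discrete data on the nucleotide sequence of the genome of the individual and continuous data on the amount of the specific substance in the body of the individual;
pattern-analyzing the subject data a plurality of times with each of the plurality of first discriminators and second discriminators to generate, a plurality of times each, a first discrimination result and a second discrimination result for the attribute of the physiological state of the test individual;
integrating the first discrimination results and the second discrimination results for each attribute of the physiological state and determining, as an integrated determination, that the attribute discriminated most often in the first discrimination results and the second discrimination results is the attribute of the physiological state of the test individual; and
outputting the result of the integrated determination.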
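As an illustration of the subtotal, weighting, and weight-selection steps recited in the claims on the integrated determination unit, the following Python sketch combines per-attribute call counts from the two discriminator families and picks, from randomly drawn weight pairs, the pair that classifies labelled test samples best. It is a sketch under assumptions: the names `integrated_call` and `select_weights`, the uniform sampling of weights from [0, 1), and the number of candidates are all hypothetical and are not specified by the claims.

```python
import random


def integrated_call(first_counts, second_counts, w1=1.0, w2=1.0):
    """Weighted total of the two subtotals, then the most frequent attribute.

    first_counts / second_counts: dict attribute -> number of calls.
    """
    attributes = set(first_counts) | set(second_counts)
    totals = {a: w1 * first_counts.get(a, 0) + w2 * second_counts.get(a, 0)
              for a in attributes}
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(totals), key=totals.get)


def select_weights(samples, n_candidates=100, rng=None):
    """Pick the random weight pair with the best accuracy on test samples.

    samples: list of (first_counts, second_counts, true_attribute).
    """
    rng = rng or random.Random(0)
    best, best_acc = (1.0, 1.0), -1.0
    for _ in range(n_candidates):
        w = (rng.random(), rng.random())
        hits = sum(integrated_call(f, s, *w) == t for f, s, t in samples)
        acc = hits / len(samples)
        if acc > best_acc:
            best, best_acc = w, acc
    return best, best_acc
```

With equal weights the sketch reduces to the plain majority vote of claim 1; unequal weights let the more reliable data type dominate the total.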
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP11853383.5A EP2660310A4 (en) | 2010-12-28 | 2011-12-28 | COMPREHENSIVE GLAUCOMOTIC PROCESS WITH A GLAUKOM DIAGNOSIS CHIP AND A CLUSTER ANALYSIS OF A DEFORMED PROTEOMIC |
JP2012551042A JPWO2012091093A1 (ja) | 2010-12-28 | 2011-12-28 | Comprehensive Glaucoma Determination Method Utilizing Glaucoma Diagnosis Chip And Deformed Proteomics Cluster Analysis |
US13/976,967 US20130275349A1 (en) | 2010-12-28 | 2011-12-28 | Comprehensive Glaucoma Determination Method Utilizing Glaucoma Diagnosis Chip And Deformed Proteomics Cluster Analysis |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010-294176 | 2010-12-28 | ||
JP2010294176 | 2010-12-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012091093A1 (ja) | 2012-07-05 |
Family
ID=46383181
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2011/080393 WO2012091093A1 (ja) | Comprehensive Glaucoma Determination Method Utilizing Glaucoma Diagnosis Chip And Deformed Proteomics Cluster Analysis | 2010-12-28 | 2011-12-28 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20130275349A1 (ja) |
EP (1) | EP2660310A4 (ja) |
JP (1) | JPWO2012091093A1 (ja) |
TW (1) | TW201248425A (ja) |
WO (1) | WO2012091093A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
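WO2018003523A1 (ja) * | 2016-06-30 | 2018-01-04 | Kyoto Prefectural Public University Corporation | Method for determining risk of onset of broad-sense primary open-angle glaucoma |
WO2018139303A1 (ja) * | 2017-01-24 | 2018-08-02 | Angel Playing Cards Co., Ltd. | Chip recognition system |
JP2022003521A (ja) * | 2020-06-23 | 2022-01-11 | Beijing Baidu Netcom Science Technology Co., Ltd. | Traffic accident identification method, apparatus, device, and computer storage medium |
WO2022254858A1 (ja) * | 2021-06-03 | 2022-12-08 | Konica Minolta, Inc. | Inspection device, inspection method, and inspection program |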
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
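JP2016006626A (ja) * | 2014-05-28 | 2016-01-14 | Denso IT Laboratory, Inc. | Detection device, detection program, detection method, vehicle, parameter calculation device, parameter calculation program, and parameter calculation method |
JP6996413B2 (ja) * | 2018-04-27 | 2022-01-17 | Toyota Motor Corporation | Analysis device and analysis program |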
US11941513B2 (en) * | 2018-12-06 | 2024-03-26 | Electronics And Telecommunications Research Institute | Device for ensembling data received from prediction devices and operating method thereof |
US11475239B2 (en) * | 2019-11-21 | 2022-10-18 | Paypal, Inc. | Solution to end-to-end feature engineering automation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003500766A (ja) * | 1999-05-25 | 2003-01-07 | Stephen D. Barnhill | Enhancing knowledge discovery from multiple data sets using multiple support vector machines |
JP2003529131A (ja) * | 1999-10-27 | 2003-09-30 | BioWulf Technologies, LLC | Methods and devices for identifying patterns in biological systems and methods of use thereof |
JP2003529561A (ja) * | 2000-02-08 | 2003-10-07 | Martin B. Wax | Method for treating glaucoma |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060099624A1 (en) * | 2004-10-18 | 2006-05-11 | Wang Lu-Yong | System and method for providing personalized healthcare for alzheimer's disease |
JP2007310860A (ja) * | 2005-10-31 | 2007-11-29 | Sony Corp | Learning device and method |
-
2011
- 2011-12-28 EP EP11853383.5A patent/EP2660310A4/en not_active Withdrawn
- 2011-12-28 US US13/976,967 patent/US20130275349A1/en not_active Abandoned
- 2011-12-28 TW TW100149241A patent/TW201248425A/zh unknown
- 2011-12-28 WO PCT/JP2011/080393 patent/WO2012091093A1/ja active Application Filing
- 2011-12-28 JP JP2012551042A patent/JPWO2012091093A1/ja active Pending
Non-Patent Citations (2)
Title |
---|
IKUMITSU NAGASAKI: "Togoteki Shindan Algorithm Kaihatsu ni Kansuru Kenkyu, Ryokunaisho Shindan SNP Chip to Henkei Proteomics Cluster Kaiseki ni yoru Ryokunaisho Togoteki Shindanho no Kaihatsu ni Kansuru Kenkyu", HEISEI 20 NENDO SOKATSU BUNTAN KENKYU NENDO SHURYO HOKOKUSHO, 2009, pages 30 - 32, XP008170060 * |
NAKANO M. ET AL.: "Three susceptible loci associated with primary open-angle glaucoma identified by genome-wide association study in a Japanese population", PNAS, vol. 106, no. 31, 2009, pages 12838 - 12842, XP055119766 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
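JP7072803B2 (ja) | 2016-06-30 | 2022-05-23 | Kyoto Prefectural Public University Corporation | Method for determining risk of onset of broad-sense primary open-angle glaucoma |
WO2018003523A1 (ja) * | 2016-06-30 | 2018-01-04 | Kyoto Prefectural Public University Corporation | Method for determining risk of onset of broad-sense primary open-angle glaucoma |
JPWO2018003523A1 (ja) * | 2016-06-30 | 2019-04-18 | Kyoto Prefectural Public University Corporation | Method for determining risk of onset of broad-sense primary open-angle glaucoma |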
US11398129B2 (en) | 2017-01-24 | 2022-07-26 | Angel Group Co., Ltd. | Chip recognition system |
JP2023040267A (ja) * | 2017-01-24 | 2023-03-22 | Angel Group Co., Ltd. | Chip recognition system |
US11049359B2 (en) | 2017-01-24 | 2021-06-29 | Angel Playing Cards Co., Ltd. | Chip recognition system |
US11954966B2 (en) | 2017-01-24 | 2024-04-09 | Angel Group Co., Ltd. | Chip recognition system |
JPWO2018139303A1 (ja) * | 2017-01-24 | 2019-11-21 | Angel Playing Cards Co., Ltd. | Chip recognition system |
WO2018139303A1 (ja) * | 2017-01-24 | 2018-08-02 | Angel Playing Cards Co., Ltd. | Chip recognition system |
JP2022169652A (ja) * | 2017-01-24 | 2022-11-09 | Angel Group Co., Ltd. | Chip recognition system |
US11941942B2 (en) | 2017-01-24 | 2024-03-26 | Angel Group Co., Ltd. | Chip recognition system |
US11842595B2 (en) | 2017-01-24 | 2023-12-12 | Angel Group Co., Ltd. | Chip recognition system |
JP2023040266A (ja) * | 2017-01-24 | 2023-03-22 | Angel Group Co., Ltd. | Chip recognition system |
US10861281B2 (en) | 2017-01-24 | 2020-12-08 | Angel Playing Cards Co., Ltd. | Chip recognition system |
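JP7204826B2 (ja) | 2020-06-23 | 2023-01-16 | Beijing Baidu Netcom Science Technology Co., Ltd. | Traffic accident identification method, apparatus, device, and computer storage medium |
JP2022003521A (ja) * | 2020-06-23 | 2022-01-11 | Beijing Baidu Netcom Science Technology Co., Ltd. | Traffic accident identification method, apparatus, device, and computer storage medium |
WO2022254858A1 (ja) * | 2021-06-03 | 2022-12-08 | Konica Minolta, Inc. | Inspection device, inspection method, and inspection program |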
Also Published As
Publication number | Publication date |
---|---|
EP2660310A1 (en) | 2013-11-06 |
US20130275349A1 (en) | 2013-10-17 |
EP2660310A4 (en) | 2015-09-30 |
JPWO2012091093A1 (ja) | 2014-06-05 |
TW201248425A (en) | 2012-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2012091093A1 (ja) | Comprehensive Glaucoma Determination Method Utilizing Glaucoma Diagnosis Chip And Deformed Proteomics Cluster Analysis | |
US10354747B1 (en) | Deep learning analysis pipeline for next generation sequencing | |
US7653491B2 (en) | Computer systems and methods for subdividing a complex disease into component diseases | |
EP2864920B1 (en) | Systems and methods for generating biomarker signatures with integrated bias correction and class prediction | |
CA3096678A1 (en) | Multi-assay prediction model for cancer detection | |
KR101542529B1 (ko) | Method for discovering allele biomarkers | |
US20120310539A1 (en) | Predicting gene variant pathogenicity | |
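JP2005512175A (ja) | Method for identifying genetic features of complex genetic classifiers | |
KR101460520B1 (ko) | Method for detecting disease variant markers in next-generation sequencing data | |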
EP2864918B1 (en) | Systems and methods for generating biomarker signatures | |
JP2005531853A (ja) | Systems and methods for SNP genotype clustering | |
JP2016099901A (ja) | Method for creating trait prediction model and method for predicting traits | |
US20220277811A1 (en) | Detecting False Positive Variant Calls In Next-Generation Sequencing | |
US7640113B2 (en) | Methods and apparatus for complex genetics classification based on correspondence analysis and linear/quadratic analysis | |
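CN117423451B (zh) | Intelligent molecular diagnosis method and system based on big data analysis | |
KR102111820B1 (ko) | Detection device, detection method, and detection program for dynamic network biomarkers | |
KR20140090296A (ko) | Method and apparatus for analyzing genetic information | |
KR20140023607A (ko) | Method and apparatus for analyzing an individual's genetic information | |
JP7064215B2 (ja) | Method for determining risk of onset of exfoliation syndrome or exfoliation glaucoma | |
KR102441856B1 (ko) | Multi-variant association study method using importance sampling | |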
Barrett et al. | Linkage analysis | |
US20230207132A1 (en) | Covariate correction including drug use from temporal data | |
CN110475874A (zh) | Use of off-target sequences in DNA analysis | |
AU2022429829A1 (en) | Rare variant polygenic risk scores | |
WO2023129622A1 (en) | Covariate correction for temporal data from phenotype measurements for different drug usage patterns |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 11853383 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2012551042 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13976967 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
REEP | Request for entry into the european phase |
Ref document number: 2011853383 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2011853383 Country of ref document: EP |