CN112102884A - Multi-gene locus combined disease risk analysis and evaluation platform and method - Google Patents
- Publication number
- CN112102884A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H15/00—ICT specially adapted for medical reports, e.g. generation or transmission thereof
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Abstract
The invention provides a platform and a method for multi-gene-locus joint disease risk analysis and evaluation. Aimed at healthy people, it calculates and preliminarily evaluates the risk of genetic diseases using information from a GWAS database and the 1000 Genomes Project database, supplemented by literature data. As the volume of target-user data grows, a client feedback system is used to learn and correct the risk evaluation standard for specific populations. The invention can remind healthy people to avoid or reduce disease risk, thereby guiding individuals toward a healthy life through interpretation of their genes.
Description
Technical Field
The invention relates to the processing of genotype data for risk loci of multiple diseases, in particular to a polygenic risk assessment system covering multiple diseases, and more particularly to a multi-gene-locus joint disease risk analysis and assessment platform and method.
Background
A single nucleotide polymorphism (SNP) is a variation in DNA sequence caused by a change in a single nucleotide (A, T, C or G), and it contributes to the diversity of chromosomal genomes within species, including humans. For example, the DNA fragments AAGCCTA and AAGCTTA from two different individuals differ at a single position and are alleles. Almost all common SNP sites have only two alleles. SNP allele frequencies vary among populations, so an allele that is common in one region or ethnic group may be rare in another. In the human genome there is one SNP site roughly every 100 to 300 bases, and 2 out of every 3 SNPs are interconversions of cytosine (C) and thymine (T). About 90% of human genetic differences are attributable to variation caused by SNPs.
Many SNPs are associated with disease, and a few are associated with disease risk. Genes associated with disease risk are called risk genes; their alleles may raise or lower disease risk and are collectively called risk alleles, so the number of risk alleles an individual carries at a risk locus may be 0, 1 or 2. Risk alleles are often relatively common in the population, and the influence of a single risk allele on disease is small. The odds ratio (OR) of a risk-increasing site is greater than 1, and the OR of a risk-reducing site lies between 0 and 1.
A genetic risk score (GRS), or polygenic risk assessment, computes the cumulative effect of multiple independent risk SNPs. The following are the three most common genetic risk scoring methods used to assess individual disease risk:
1) The earliest polygenic risk assessment method is the unweighted risk allele count (GRS-RAC), which simply sums the number of risk alleles across multiple independent risk SNP sites without regard to the OR value of any particular site; the score therefore varies with the number of SNPs included.
2) Because risk loci are largely independent and their OR values differ, corresponding weights were introduced, giving the weighted risk allele count (GRS-wrAC). SNPs with higher OR values receive higher weights in this method, and the score still increases with the number of SNPs. It is also commonly referred to as a polygenic risk score (PRS), an extension of the genetic risk score (GRS) (Igo, R.P., Kinzy, T.G., & Cooke Bailey, J.N. (2019). Genetic Risk Scores. Current Protocols in Human Genetics, 104(1). doi:10.1002/cphg.95).
3) The population-standardized genetic risk score (GRS-PS) (Conran CA, Na R, Chen H, et al. Population-standardized genetic risk score: the SNP-based method of choice for inherited risk assessment of prostate cancer. Asian J Androl. 2016;18(4):520-524. doi:10.4103/1008-682X.179527) introduces, in addition to the weights, the average contribution of each SNP in the population at that site. This score does not increase with the number of SNPs, and it is therefore increasingly used in risk calculations.
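The simplest of the three methods, the unweighted risk allele count, can be sketched as follows. This is only an illustration: the two-letter genotype strings, the function names, the SNP identifiers and the risk alleles are all hypothetical and are not taken from the patent.

```python
# Hypothetical representation: a genotype is a two-letter string such as "CT".
def risk_allele_count(genotype: str, risk_allele: str) -> int:
    """Return how many copies (0, 1 or 2) of the risk allele a genotype carries."""
    return sum(1 for allele in genotype if allele == risk_allele)


def grs_rac(genotypes: dict, risk_alleles: dict) -> int:
    """Unweighted score: total risk-allele count over all listed SNPs."""
    return sum(
        risk_allele_count(genotypes[snp], allele)
        for snp, allele in risk_alleles.items()
        if snp in genotypes  # assumption: SNPs without a genotype call are skipped
    )


# Invented example data, for illustration only.
risk_alleles = {"snp_a": "T", "snp_b": "G", "snp_c": "A"}
sample = {"snp_a": "CT", "snp_b": "GG", "snp_c": "CC"}
score = grs_rac(sample, risk_alleles)  # 1 + 2 + 0
```

Because the score is a plain sum, adding more SNPs to `risk_alleles` can only raise the maximum attainable value, which is the dependence on SNP number noted for this method.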
However, current risk assessment methods have the following problems. First, early-stage assessment depends heavily on the curation and collection work behind the databases, so conclusions may shift slightly as the databases are updated, and their credibility is subject to population bias. Second, a more accurate assessment of a given population still requires validated phenotypes, so a large amount of client genotype data and feedback data must be collected to correct the credibility of the risk assessment for that population.
Disclosure of Invention
Based on the above, the invention provides a multi-gene-locus joint disease risk analysis and assessment platform and method, which aim to perform rapid risk analysis on the genotypes of the independent risk loci of the designed disease items and to issue a risk assessment report. The invention can remind healthy people to avoid or reduce disease risk, thereby guiding individuals toward a healthy life through interpretation of their genes.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention provides a multi-gene locus joint disease risk analysis and evaluation platform, which comprises:
a sample preprocessing module, used for performing chip sequencing on each sample and preprocessing the sequencing data to obtain genotype data of disease risk sites;
a multi-gene risk evaluation module, used for locating the disease risk sites of a target item and performing a preliminary evaluation, and for performing later-stage data learning and optimized evaluation on the preliminary evaluation result according to the data fed back by the data sub-management processing module;
a report presentation and user feedback module, used for presenting the evaluation report, collecting user feedback and transmitting it to the data sub-management processing module; and
a data sub-management processing module, used for managing the user data and feeding it back to the multi-gene risk evaluation module.
Preferably, the preprocessing comprises: performing quality control on the sequencing data using the default thresholds of the chip sequencing platform, which ensure that valid data are produced; re-sequencing samples that fail quality control; passing qualified samples on for subsequent processing; and finally performing site filtering on the VCF file to obtain the disease risk sites.
Preferably, the preliminary evaluation calculates the disease risk value corresponding to the disease risk sites according to an initial evaluation database.
More preferably, the initial evaluation database comprises a GWAS database and the 1000 Genomes Project database, and the calculation methods comprise the GRS-wrAC and GRS-PS methods.
Preferably, the data sub-management processing module comprises a user registration information storage sub-module and a data classification sub-module. The user registration information storage sub-module stores the basic information of registered users, and the data classification sub-module performs feature classification on the preliminary evaluation result data according to the users' basic information and feeds the result back to the multi-gene risk evaluation module.
More preferably, the basic information of the user includes the user's family, gender, age and surname.
Preferably, the multi-gene risk evaluation module obtains a preliminary evaluation report through the preliminary evaluation and presents it in the report presentation and user feedback module; the multi-gene risk evaluation module likewise obtains a later optimized evaluation report through the later optimized evaluation and presents it in the report presentation and user feedback module.
The invention also provides a multi-gene locus joint disease risk analysis and evaluation method, which comprises the following steps:
1) performing chip sequencing on each sample and preprocessing the sequencing data to obtain genotype data of disease risk sites;
2) locating the disease risk sites of a target item and performing a preliminary evaluation to obtain a preliminary evaluation result;
3) performing feature classification on the preliminary evaluation result data of step 2) according to the basic information filled in by the user, and optimizing the preliminary evaluation result in combination with user feedback.
Preferably, in step 2), the preliminary evaluation comprises: calculating the disease risk value corresponding to the disease risk sites using the GRS-wrAC and GRS-PS methods, according to the GWAS database and the 1000 Genomes Project database.
Preferably, the method further comprises: presenting the preliminary evaluation result obtained in step 2) and the optimized evaluation result of step 3) to the user as evaluation reports through a mobile phone APP.
It will be appreciated that the above assessment method is not suitable for diagnosis: its conclusion does not tell the subject whether a disease is present, and a high or low score does not indicate whether the disease will definitely occur in the future. The method is used only for disease risk assessment, that is, for assessing the risk or probability that an individual will develop a disease in the future. The assessment can show the disease risk relative to the general population, but whether the disease will actually occur remains uncertain. The result of the evaluation therefore has no necessary relationship with the individual's future outcome, although the reliability of the evaluation improves greatly as the number of population samples evaluated increases; its conclusion only provides a reminder and reference for the individual.
The invention has the following beneficial effects:
The invention is built around a custom chip designed for targeted sites in healthy people, and therefore offers low detection cost, high detection speed, high detection throughput and coverage of many disease types.
Its most distinctive feature is that healthy people can be reminded of their risk for certain diseases. In particular, as the number of analyzed samples grows, the method can classify populations by their characteristics, greatly improve discrimination of at-risk groups and provide targeted risk warnings, so that disease risk is avoided more precisely across the population and quality of life is improved.
The invention develops two server-side service systems: the first handles acquisition of users' basic registration data and presentation of reports in the APP; the second handles sequencing of user samples, analysis of genotyping data and the like. Their functions are mutually independent while the overall service forms a unified system, which keeps the division of work clear, ensures efficient sample processing and timely release of reports, and allows reports to be looked up efficiently from the mobile phone client APP. The analysis platform integrates chip sequencing, bioinformatics, computer and wireless communication technologies to realize the whole process from sample to report. It thus optimizes and integrates multiple existing technologies and gradually extends existing knowledge, so that combinations of individual risk sites gain increasing practical value.
Drawings
FIG. 1 is a diagram of the overall technical architecture for the multi-gene locus joint disease risk analysis and assessment of the present invention.
FIG. 2 is a schematic structural diagram of the multi-gene locus joint disease risk analysis and assessment platform of the present invention.
FIG. 3 is a flow chart of data processing in the multi-gene locus joint disease risk analysis and assessment process of the present invention.
FIG. 4 is a schematic diagram of the sample pretreatment process of the present invention.
Detailed Description
In order to facilitate understanding of the present invention, the present application will be further described with reference to the accompanying drawings and examples.
With reference to FIGS. 1-3, the implementation of the multi-gene locus joint disease risk analysis and assessment of the present invention mainly comprises the following parts:
1. Sample pretreatment, including sample collection and library construction, on-machine sequencing, data quality control and other steps, as shown in FIG. 4.
Wherein:
Individual sample collection, transport and storage, target DNA library construction and other pre-sequencing treatments follow standard operating procedures (SOPs) that are routine in the art and are not described in detail here.
DNA sequencing: this part uses an Affymetrix GeneTitan™ chip sequencer. The sequencing chip is a custom chip (Rosta_v1) developed in cooperation with Affymetrix, covering cardiovascular and cerebrovascular diseases, hereditary tumors and other chronic diseases. Its functional panels comprise:
1) cardiovascular and cerebrovascular disease risk sites, specific items: abdominal aortic aneurysm [rs7025486 …], hypertriglyceridemia [rs7016880/rs1260326 …], hypertension [rs9810888/rs5051/rs4757391 …], coronary heart disease [rs7136259/rs3782889/rs3782886 …], venous embolism [rs146922325], cerebral aneurysm [rs12413409/rs 70039651 64], stroke [rs556621/rs529565 …], migraine [rs2078371/rs 72113/rs9349379 …], atrial fibrillation [rs2106261/rs6843082 …], myocardial infarction [rs4618210/rs3803915 …];
2) a partial monogenic disease site;
3) tumor genetic risk loci, specific items: lung cancer [rs753955/rs4488809/rs36600 …], liver cancer [rs7574865/rs455804 …], thyroid cancer [rs966423/rs965513 …], chronic myeloid leukemia [rs4869742/rs4795519 …], bladder cancer [rs798766/rs401681 …], breast cancer [rs4951011/rs10474352/rs9485372 …], kidney cancer [rs7105934 …], stomach cancer [rs80142782/rs9841504 …], pancreatic cancer [rs372883/rs1547374/rs5768 5768709 …], cervical cancer [rs13117307/rs4282438/rs9277952 …], and the like;
4) other disease genetic risk sites, including: type 1 diabetes [rs1893217/rs3184504/rs3741208 …], type 2 diabetes [rs10229583/rs10886471/rs10906115 …], Alzheimer's disease (late onset) [rs11218343/rs429358 …], Behcet's disease [rs1495965/rs17810546/rs897200 …], cleft lip and cleft palate [rs 10512212248/rs 12543318/rs227731], narcolepsy [rs10995245/rs1551570 …], toxic diffuse goiter [rs1024161/rs12101261/rs12658 …], polycystic ovary syndrome [rs 108188108188601/rs 124124124601/rs 134728 4], systemic lupus erythematosus [rs10845606/rs 1097790/rs 7736 …], non-obstructive pulmonary fibrosis [ rs 10842976397639748/rs 1086326/614626 ], lung fibrosis [ rs 1084126/4248,9748,48,48,9748 ], lung parotid 4248,9748,48,48,9748,48,9748 ], lung fibrosis [ lung 1089/429/us 1169626 ], [ lung 969/us 116969 ];
5) nutritional, sports, skin, genetic talent and other trait loci;
6) a drug guideline site;
7) ancestry typing analysis sites.
By specially designing the item sites, the efficiency of item evaluation and the relative accuracy of the report can be greatly improved. SNP sites are selected according to strict GWAS significance standards (p < 5e-8 or p < 5e-6).
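Site selection by significance threshold can be illustrated with a short sketch. It assumes that candidate associations are available as simple records with a p-value and an odds ratio; the `Association` type, the function name and the example records are hypothetical, and only the two thresholds come from the text.

```python
from typing import NamedTuple

GENOME_WIDE = 5e-8  # strict threshold named in the text
SUGGESTIVE = 5e-6   # looser threshold also named in the text


class Association(NamedTuple):
    rsid: str
    p_value: float
    odds_ratio: float


def select_sites(records, threshold=GENOME_WIDE):
    """Keep only associations whose p-value is below the chosen threshold."""
    return [r for r in records if r.p_value < threshold]


# Invented records, for illustration only.
records = [
    Association("rs_example_1", 3e-9, 1.25),
    Association("rs_example_2", 2e-7, 1.10),
    Association("rs_example_3", 4e-4, 1.05),
]
strict = select_sites(records)
relaxed = select_sites(records, SUGGESTIVE)
```

With these invented records, only the first passes the strict threshold, while the first two pass the relaxed one.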
Chip operation and DNA extraction are performed strictly in accordance with the SOPs for the instrument.
Compared with traditional high-throughput sequencing and qPCR detection: qPCR is fast, but its throughput is lower than that of a chip and each assay covers a single item; NGS panel detection likewise offers high throughput and large data volumes, but both the up-front cost of designing the panel and the per-use detection cost are relatively high, and the cost rises sharply as the number of samples increases. Weighing these factors, this application adopts custom chip detection.
The following is a comparison of the currently used detection means:
TABLE 1 comparison of different detection means
The custom chip is cheaper than current commercial chips such as the Affymetrix APMRA and PMDA chips, and it carries more probes per target site than the commercial chips, which improves detection accuracy. It can also include sites that other commercial chips cannot detect, such as the APOE sites rs429358 and rs7412 (these are detection-only sites, not risk sites), for which relatively more probes are designed to ensure detection.
The raw data obtained from chip sequencing are preprocessed and quality-controlled to finally produce a VCF file. The quality control metrics for the data coming off the chip instrument currently use the default thresholds of the chip sequencing platform, which ensure that valid data are produced. Data that fail quality control must be re-sequenced; qualified samples proceed to subsequent processing, and finally the disease risk sites are filtered and extracted from the VCF file.
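The final site-filtering step can be sketched without any VCF library by matching the ID column of each record against the set of target sites. This is a minimal sketch rather than the platform's actual pipeline: the function name is hypothetical, and the coordinates, alleles and genotypes in the example lines are placeholders. The two rsIDs are taken from the site list above.

```python
def filter_vcf_lines(lines, target_rsids):
    """Yield header lines plus the records whose ID column names a target site."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column-header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # The third VCF column (ID) may hold several semicolon-separated IDs.
        ids = fields[2].split(";") if len(fields) > 2 else []
        if any(i in target_rsids for i in ids):
            yield line


# Placeholder coordinates, alleles and genotypes; only the rsIDs come from the text.
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "1\t1000\trs7025486\tG\tA\t.\tPASS\t.\tGT\t0/1\n",
    "1\t2000\trs146922325\tC\tT\t.\tPASS\t.\tGT\t0/0\n",
    "1\t3000\trs_not_on_panel\tA\tG\t.\tPASS\t.\tGT\t1/1\n",
]
kept = list(filter_vcf_lines(vcf_lines, {"rs7025486", "rs146922325"}))
```

Header lines are passed through unchanged so that the filtered output remains a valid VCF file.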
2. For the extracted disease risk sites, the disease item is located and it is confirmed that genotypes are available for the risk sites of that item; the genotypes are then evaluated comprehensively. The sites used for evaluation refer to the latest GWAS database and 1000 Genomes Project database, and the specific calculation methods are GRS-wrAC and GRS-PS.
1) The first core algorithm: GRS-PS (population-standardized genetic risk score)
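Let the OR values corresponding to the three genotypes $aa_i$, $ab_i$ and $bb_i$ at any site $i$ be $OR_{aa_i}$, $OR_{ab_i}$ and $OR_{bb_i}$, and let $P(aa_i)$, $P(ab_i)$ and $P(bb_i)$ be the population frequencies of those genotypes. The expectation $E_i$ of the OR value at that site can then be determined as:
$$E_i = P(aa_i)\,OR_{aa_i} + P(ab_i)\,OR_{ab_i} + P(bb_i)\,OR_{bb_i}$$
Suppose that the genotype of a sample at site $i$ is $G_i$ with OR value $OR_i$; clearly $G_i \in \{aa_i, ab_i, bb_i\}$ and $OR_i \in \{OR_{aa_i}, OR_{ab_i}, OR_{bb_i}\}$. Normalizing $OR_i$ by the expectation gives the normalized OR value:
$$OR_i' = \frac{OR_i}{E_i}$$
The risk for this phenotype over the $n$ sites of the item is then calculated by the following formula:
$$Risk = \prod_{i=1}^{n} \frac{OR_i}{E_i}$$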
Here OR is the odds ratio, which is a good estimate of the relative risk for diseases with low incidence. An OR value equal to 1 indicates that the factor does not contribute to the onset of the disease; an OR value greater than 1 indicates that the factor is a risk factor; an OR value less than 1 indicates that the factor is a protective factor.
Reference for the above algorithm: Shi Z, Yu H, Wu Y, et al. Systematic evaluation of cancer-specific genetic risk score for 11 types of cancer in The Cancer Genome Atlas and Electronic Medical Records and Genomics cohorts. Cancer Med. 2019;8(6):3196-3205. doi:10.1002/cam4.2143.
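A minimal sketch of the GRS-PS calculation may make the normalization step concrete. It assumes that the risk is the product, over all sites, of the sample's OR value divided by the population expectation of the OR at that site, with the expectation taken over genotype frequencies. The function names, site names and all numeric values below are hypothetical and chosen only for illustration.

```python
from math import prod
from typing import Dict, Tuple

# Per-site table: (genotype frequencies, genotype OR values), keyed by "aa"/"ab"/"bb".
SiteTable = Dict[str, Tuple[Dict[str, float], Dict[str, float]]]


def expected_or(freq: Dict[str, float], ors: Dict[str, float]) -> float:
    """Population expectation of the OR at one site: sum of frequency x OR."""
    return sum(freq[g] * ors[g] for g in ors)


def grs_ps(sample_genotypes: Dict[str, str], site_table: SiteTable) -> float:
    """Product over sites of (sample OR / expected OR)."""
    ratios = []
    for site, genotype in sample_genotypes.items():
        freq, ors = site_table[site]
        ratios.append(ors[genotype] / expected_or(freq, ors))
    return prod(ratios)


# Illustrative values only; not taken from any database.
demo_table: SiteTable = {
    "site1": ({"aa": 0.25, "ab": 0.50, "bb": 0.25}, {"aa": 1.0, "ab": 1.5, "bb": 2.0}),
    "site2": ({"aa": 0.64, "ab": 0.32, "bb": 0.04}, {"aa": 1.0, "ab": 1.2, "bb": 1.44}),
}
demo_score = grs_ps({"site1": "ab", "site2": "aa"}, demo_table)
print(round(demo_score, 4))
```

Because each per-site ratio has a population mean of 1 by construction, a score above 1 reads as above-average risk and a score below 1 as below-average risk, regardless of how many sites the item contains.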
2) The second core algorithm: GRS-wRAC (weighted genetic risk score)
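For the WGRS (weighted genetic risk score), let $C_i$ be the number of risk alleles carried at site $i$, with $C_i \in \{0, 1, 2\}$. The odds ratio of the risk allele is log-transformed (base $e$), i.e. $\ln OR_i$, and used as the weight to calculate the genetic risk WGRS of the item:

$$WGRS = \sum_{i=1}^{n} C_i \ln OR_i$$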
Reference for the above algorithm: the SNP-based method of choice for inherited risk assessment of prostate cancer. Asian J Androl. 2016;18(4):520-524. doi:10.4103/1008-682X.179527.
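The weighted score can be sketched in a few lines, assuming it is the sum over sites of the risk allele count multiplied by the natural logarithm of the risk allele's odds ratio. The function name `grs_wrac`, the site names and the OR values are hypothetical.

```python
from math import log
from typing import Dict


def grs_wrac(risk_allele_counts: Dict[str, int], odds_ratios: Dict[str, float]) -> float:
    """Sum over sites of (risk allele count) x ln(odds ratio of the risk allele)."""
    score = 0.0
    for site, count in risk_allele_counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"risk allele count at {site} must be 0, 1 or 2")
        score += count * log(odds_ratios[site])
    return score


# Illustrative values only.
demo_ors = {"site1": 1.5, "site2": 1.2, "site3": 2.0}
demo_wgrs = grs_wrac({"site1": 2, "site2": 1, "site3": 0}, demo_ors)
print(round(demo_wgrs, 4))
```

Since the score is an unnormalized sum, every additional site at which the sample carries a risk allele with OR greater than 1 raises it, so scores from items with different numbers of sites are not directly comparable.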
Finally, the evaluation results for each specific item are checked against the literature to confirm that every conclusion originates from publications on the corresponding item.
3. The site frequency data currently come from the 1000 Genomes database, in which the sample size for Asian populations is small and samples for individual Asian populations are not available. At a later stage, the collected data can therefore be classified according to specific population information, reclassified by various kinds of classification information, and fed back to the evaluation system of step 2, so that the evaluation results are refined, upgraded and iterated as the population frequency data improve. This step involves decision-making across the various items: the core data sources (sample information collection, sequencing data collation and APP feedback) are gathered, and the collected data are classified, analyzed and used for re-decision.
In this application, what is optimized is the population genotype data. As the sample size grows, the evaluation values may change, mainly along group characteristics recorded in the sample information, such as gender, ethnicity, age and surname. As the sample library is built up and expanded, the evaluation values can therefore be upgraded and improved with each finer subdivision of the population, so that they carry their own group labels (gender, ethnicity, age, surname and so on) instead of referring to the initial, loosely defined population. For example, the initial evaluations all carry CHB/CHS (northern Han/southern Han) risk value labels. Subsequent data collection may reveal differences, for example that the risk of a certain disease in the Han population is higher than the initial evaluation concluded, that the risk of a certain disease is higher in males than in females, or that disease risk differs by age group or by surname. The risk is thereby narrowed to a more specific population, and the evaluation is upgraded, optimized, improved and expanded.
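One way to picture the refinement of population frequency data is to re-estimate genotype frequencies per group from the collected samples. The sketch below is an assumption-laden illustration: the record layout, the function name `genotype_freq_by_group` and the `min_samples` threshold (below which a group would keep using the initial reference frequencies) are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def genotype_freq_by_group(
    samples: List[dict], site: str, group_key: str, min_samples: int = 30
) -> Dict[str, Dict[str, float]]:
    """Estimate genotype frequencies at one site separately for each group.

    Each sample is a dict holding group labels (e.g. "gender") and a
    "genotypes" dict mapping site -> genotype. Groups with fewer than
    min_samples genotyped samples are left out of the result.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sample in samples:
        genotype = sample["genotypes"].get(site)
        if genotype is None:              # sample not genotyped at this site
            continue
        counts[sample[group_key]][genotype] += 1

    freqs: Dict[str, Dict[str, float]] = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        if total >= min_samples:
            freqs[group] = {g: n / total for g, n in counter.items()}
    return freqs
```

The resulting per-group frequencies could then replace the reference frequencies when the expectation of the OR value is recomputed for that group.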
4. Finally, the report conclusion is obtained from the report platform, with a corresponding evaluation conclusion for each item. The conclusion first presents evaluation values based on the CHB and CHS frequencies from the 1000 Genomes database, and then progressively presents evaluation values weighted by frequency data distinguished by population group, such as ethnicity, surname and age, so that a risk evaluation conclusion for a specific group can be presented.
Examples
In this application, data analysis and evaluation are carried out on a bioinformatics data server. The preparatory work includes collating the evaluation database and installing the evaluation templates, collating the initial database, and compiling, configuring and installing the evaluation analysis program. Once everything is in place, the system is tested against set conclusions: test sites with known, definite conclusions are strictly specified and used to verify the evaluation system, and system initialization is considered complete when the evaluation falls within the specified range. An initial evaluation report is then produced under the two algorithms from the basic knowledge base. As the sample size increases, population-specific gene frequency data are generated, and the risk evaluation value for the specific population is finally reported. That is, an initial evaluation report is generated first, and a population-specific evaluation report is generated as the number of samples increases. Phenotype calibration data are also collected through the feedback system to calibrate the population evaluation values.
The invention is applicable to many diseases. The site design, evaluation and analysis workflow for one disease, hypertension, is shown below as a specific embodiment:
1) This step begins where data preprocessing ends, that is, with the genotype data of the disease risk sites obtained for each sample by chip sequencing.
2) The sites of the item are defined first. Taking the hypertension risk item as an example, the designed hypertension risk sites and demonstration genotype test results (actual genotypes differ from person to person) are shown in the following table.
TABLE 2 detection results of site genotypes of hypertension risk items
3) The risk evaluation scheme of the first core algorithm (GRS-PS) gives an evaluation result of 0.8559. Prior data show that the evaluation value is 0.3364 when all sites are normal. After the data are normalized, the evaluation conclusion is "higher than the average risk of the general population", expressed numerically as "5.65% above average risk". Professional health advice addressing the indicated risk is given in the report on the basis of the relevant literature.
The second core algorithm (GRS-wRAC) gives a calculated risk evaluation result of 3.7352. The result of this weighted formula increases as independent sites are added, and genotypes with significantly high risk contribute disproportionately, so the values are poorly comparable and inconvenient to use. Compared with the first method it is therefore of limited significance here, and it is included in the report for reference only.
The results of this evaluation may be more meaningful for some single-site items.
4) The results in 3) are calculated once system initialization is complete. For early samples they are based on genotype frequency data from the 1000 Genomes database, so the population covered is narrow and the credibility is limited. A user data management and report upgrading system is therefore designed on the data sub-management processing platform. It classifies users by the basic information they fill in, in prescribed forms for ethnicity, gender, surname and age group, progressively opens up group-specific risk report conclusions, and gradually improves the credibility of the report system's conclusions (user information is of course encrypted out of respect for client privacy).
5) In this embodiment, the report presentation form is mainly a mobile phone APP program, and the APP part has applied for computer software copyright registration (registration number: 2020SR 0056897).
For the risk analysis and evaluation platform, the APP mainly comprises two functional modules: first, a user data management interface, including a user registration information storage database and background management; second, a report presentation interface and a feedback interface, through which the report is finally presented for reading in the APP. In addition, items can later be upgraded with respect to popularity and accuracy according to user feedback.
Claims (10)
1. A multi-gene locus combined disease risk analysis and evaluation platform, comprising:
the sample preprocessing module is used for performing chip sequencing on each sample and preprocessing sequencing data to obtain genotype data of disease risk sites;
the multi-gene risk evaluation module is used for positioning disease risk sites and carrying out primary evaluation aiming at the target project, and carrying out later-stage data learning and optimized evaluation on the primary evaluation result according to the data fed back by the data sub-management processing module;
the report presenting and user feedback module is used for presenting the evaluation report, collecting the user feedback and transmitting the user feedback to the data sub-management processing module;
and the data sub-management processing module is used for managing the user data and feeding back the user data to the multi-gene risk assessment module.
2. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the preprocessing comprises performing quality control on the sequencing data using the default values of chip sequencing so as to ensure that valid data are produced, re-sequencing samples that fail quality control, performing subsequent processing on qualified samples, and finally performing site filtering on the VCF file to obtain the disease risk sites.
3. The multi-gene locus joint disease risk analysis and assessment platform according to claim 1, wherein the preliminary assessment is to calculate the disease risk value corresponding to the disease risk locus according to the initial assessment database.
4. The multi-gene locus combined disease risk analysis and assessment platform according to claim 3, wherein said initial assessment database comprises a GWAS database and the 1000 Genomes database, and the calculation methods comprise the GRS-wRAC and GRS-PS methods.
5. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the data sub-management processing module comprises a user registration information storage sub-module and a data classification sub-module, the user registration information storage sub-module is used for storing basic information of registered users, and the data classification sub-module is used for performing feature classification on result data of preliminary assessment according to the basic information of the users and feeding back the result data to the multi-gene risk assessment module.
6. The multi-gene-locus combined disease risk analysis and assessment platform according to claim 5, wherein the basic information of the user includes the user's ethnicity, gender, age and surname.
7. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the multi-gene risk assessment module obtains a preliminary assessment report through preliminary assessment and presents the preliminary assessment report in a report presentation and user feedback module; and the multi-gene risk assessment module obtains a later optimization assessment report through later optimization assessment and presents the report in the report presentation and user feedback module.
8. A multi-gene locus combined disease risk analysis and evaluation method, comprising the following steps:
1) chip sequencing is carried out on each sample, and sequencing data are preprocessed to obtain genotype data of disease risk sites;
2) positioning disease risk sites aiming at a target project and performing preliminary evaluation to obtain a preliminary evaluation result;
3) performing feature classification on the result data of the preliminary evaluation in step 2) according to the basic information filled in by the user, and optimizing the preliminary evaluation result in combination with user feedback.
9. The multi-gene-site combined disease risk analysis and assessment method according to claim 8, wherein in step 2), said preliminary assessment comprises: calculating a disease risk value corresponding to the disease risk sites by the GRS-wRAC and GRS-PS methods according to the GWAS database and the 1000 Genomes database.
10. The multi-gene-site combined disease risk analysis and assessment method according to claim 8, further comprising: presenting the preliminary evaluation result obtained in the step 2) and the optimized evaluation result in the step 3) to a user in an evaluation report mode through a mobile phone APP.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010932647.3A CN112102884A (en) | 2020-09-08 | 2020-09-08 | Multi-gene locus combined disease risk analysis and evaluation platform and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112102884A true CN112102884A (en) | 2020-12-18 |
Family
ID=73751963
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010932647.3A Pending CN112102884A (en) | 2020-09-08 | 2020-09-08 | Multi-gene locus combined disease risk analysis and evaluation platform and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112102884A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113403380A (en) * | 2021-06-11 | 2021-09-17 | 中国科学院北京基因组研究所(国家生物信息中心) | Complex disease related SNP site primer composition and application |
CN117542526A (en) * | 2024-01-08 | 2024-02-09 | 深圳市早知道科技有限公司 | Disease risk prediction method and system based on biological genetic information |
CN117542526B (en) * | 2024-01-08 | 2024-04-26 | 深圳市早知道科技有限公司 | Disease risk prediction method and system based on biological genetic information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20201218 |