CN112102884A - Multi-gene locus combined disease risk analysis and evaluation platform and method - Google Patents

Multi-gene locus combined disease risk analysis and evaluation platform and method

Info

Publication number
CN112102884A
Authority
CN
China
Prior art keywords
assessment
data
risk
disease risk
gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010932647.3A
Other languages
Chinese (zh)
Inventor
董子平
周亚军
梁凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Rosetta Biotechnology Co ltd
Original Assignee
Suzhou Rosetta Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Rosetta Biotechnology Co ltd
Priority to CN202010932647.3A
Publication of CN112102884A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Primary Health Care (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a multi-gene locus joint disease risk analysis and evaluation platform and method. For healthy people, the risk of genetically influenced diseases is calculated and preliminarily evaluated using information from a GWAS database and the 1000 Genomes Project database, supplemented by literature data. As the volume of target-user data grows, a client feedback system is used to learn and correct the risk evaluation standard for specific populations. The invention can effectively alert healthy people so that they can avoid or reduce disease risk, achieving the aim of guiding an individual's healthy living through interpretation of their genes.

Description

Multi-gene locus combined disease risk analysis and evaluation platform and method
Technical Field
The invention relates to the field of genotype data processing for the risk loci of multiple diseases, in particular to a polygenic risk assessment system for multiple diseases, and more particularly to a multi-gene locus joint disease risk analysis and assessment platform and method.
Background
A single nucleotide polymorphism (SNP) is a variation in DNA sequence caused by a change in a single nucleotide (A, T, C or G), which gives rise to genomic diversity among individuals of a species, including humans. For example, in the DNA fragments AAGCCTA and AAGCTTA from two different individuals, the differing bases C and T are alleles. Almost all common SNP sites have only two alleles. SNP allele frequencies vary among populations, so an allele that is common in one region or ethnic group may be rare in another. In the human genome there is roughly one SNP site every 100 to 300 bases, and 2 out of every 3 SNPs are interconversions of cytosine (C) and thymine (T). About 90% of human genetic differences are attributable to variation caused by SNPs.
Many SNPs are associated with disease, and a few are associated with disease risk. Genes associated with disease risk are called risk genes; their alleles may confer a higher or a lower risk of disease and are collectively called risk alleles, so the number of risk alleles an individual carries at a risk locus may be 0, 1 or 2. Risk alleles are relatively common in the population, and the influence of a single risk allele on a disease is small. The odds ratio (OR) of a risk-increasing site is greater than 1, and the OR of a risk-reducing site is between 0 and 1.
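As a small illustration of the statement that a risk locus carries 0, 1 or 2 risk alleles, the following Python sketch counts the copies of a risk allele in a diploid genotype. The function name, the genotype string format and the example alleles are assumptions made for illustration and do not come from the patent.

```python
def count_risk_alleles(genotype: str, risk_allele: str) -> int:
    """Return how many copies (0, 1 or 2) of the risk allele a diploid genotype carries."""
    # Accept "C/T", "C|T" or "CT" style genotype strings (an assumed input format).
    alleles = genotype.replace("/", "").replace("|", "").upper()
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid single-base genotype, got {genotype!r}")
    return alleles.count(risk_allele.upper())


# Hypothetical site whose risk allele is "T".
for gt in ("C/C", "C/T", "TT"):
    print(gt, count_risk_alleles(gt, "T"))
```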
Polygenic risk assessment using a genetic risk score (GRS) calculates the cumulative effect of multiple independent risk SNPs. The following are the three most common genetic risk assessment methods used to assess individual disease risk:
1) The earliest polygenic risk assessment method is the unweighted risk allele count (GRS-RAC), which simply adds up the risk alleles at multiple independent risk SNP sites regardless of the OR value at each site; the score therefore varies with the number of SNPs.
2) Considering that the risk loci are independent to some degree and have different OR values, corresponding weights are introduced; this method is called the weighted risk allele count (GRS-wRAC). SNPs with higher OR values are given higher weights, and the score still increases with the number of SNPs. It is also commonly referred to as the polygenic risk score (PRS), an extension of the genetic risk score (GRS) (Igo, R.P., Kinzy, T.G., & Cooke Bailey, J.N. (2019). Genetic Risk Scores. Current Protocols in Human Genetics, 104(1). doi:10.1002/cphg.95).
3) The population-standardized risk allele count (GRS-PS) (Conran CA, Na R, Chen H, et al. Population-standardized genetic risk score: the SNP-based method of choice for inherited risk assessment of prostate cancer. Asian J Androl. 2016;18(4):520-524. doi:10.4103/1008-682X.179527) introduces, in addition to the weights, the contribution of each SNP in the population. This score does not increase with the number of SNPs and is therefore increasingly used in risk gene calculations.
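The three scoring approaches listed above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes a per-allele multiplicative OR and Hardy-Weinberg genotype proportions when computing the population-expected OR, and the function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def grs_rac(counts: Sequence[int]) -> int:
    """Unweighted score: total number of risk alleles across loci."""
    return sum(counts)


def grs_wrac(counts: Sequence[int], odds_ratios: Sequence[float]) -> float:
    """Weighted score: risk-allele counts weighted by ln(OR) of each locus."""
    return sum(c * math.log(o) for c, o in zip(counts, odds_ratios))


def grs_ps(counts: Sequence[int], odds_ratios: Sequence[float],
           risk_allele_freqs: Sequence[float]) -> float:
    """Population-standardized score: product over loci of OR**count / expected OR.

    Assumption: genotype frequencies follow Hardy-Weinberg proportions and the
    OR acts multiplicatively per risk allele.
    """
    score = 1.0
    for c, o, f in zip(counts, odds_ratios, risk_allele_freqs):
        expected = (1 - f) ** 2 + 2 * f * (1 - f) * o + f ** 2 * o ** 2
        score *= o ** c / expected
    return score


# Hypothetical two-locus example: counts, per-allele ORs, risk allele frequencies.
print(grs_rac([1, 2]))
print(grs_wrac([1, 2], [1.2, 1.5]))
print(grs_ps([1, 2], [1.2, 1.5], [0.3, 0.1]))
```

Because each locus is divided by its population expectation, the population average of the standardized score stays at 1 regardless of how many loci are included, which is the property the third method is valued for.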
However, current risk assessment methods have the following problems. First, early-stage evaluation depends heavily on the curation and collection of databases, so conclusions may change slightly as the databases are updated, and their credibility is subject to population bias. Second, a more accurate assessment for a given population still requires validated phenotypes; therefore, a large amount of client genotype data and feedback data must be collected to correct the credibility of the risk assessment for a specific population.
Disclosure of Invention
Based on the above, the invention provides a multi-gene locus joint disease risk analysis and assessment platform and method, which aim to perform rapid risk analysis on the genotypes of the independent risk loci of the various designed disease projects and to issue a risk assessment report. The invention can effectively alert healthy people so that they can avoid or reduce disease risk, achieving the aim of guiding an individual's healthy living through interpretation of their genes.
To achieve this purpose, the invention adopts the following technical scheme:
The invention provides a multi-gene locus joint disease risk analysis and evaluation platform, which comprises:
a sample preprocessing module, used for performing chip sequencing on each sample and preprocessing the sequencing data to obtain genotype data of disease risk sites;
a multi-gene risk assessment module, used for locating the disease risk sites of a target project and performing a preliminary evaluation, and for performing later-stage data learning and optimized evaluation of the preliminary evaluation result according to the data fed back by the data sub-management processing module;
a report presentation and user feedback module, used for presenting the evaluation report, collecting user feedback and transmitting it to the data sub-management processing module; and
a data sub-management processing module, used for managing the user data and feeding it back to the multi-gene risk assessment module.
Preferably, the preprocessing comprises: performing quality control on the sequencing data using the default thresholds of chip sequencing, which ensure that valid data are produced; re-sequencing samples that fail quality control; carrying qualified samples forward to subsequent processing; and finally performing site filtering on the VCF file to obtain the disease risk sites.
Preferably, the preliminary evaluation calculates the disease risk value corresponding to the disease risk sites according to an initial assessment database.
More preferably, the initial assessment database comprises a GWAS database and the 1000 Genomes Project database, and the calculation methods comprise the GRS-wRAC and GRS-PS methods.
Preferably, the data sub-management processing module comprises a user registration information storage sub-module and a data classification sub-module. The user registration information storage sub-module stores the basic information of registered users; the data classification sub-module performs feature classification on the result data of the preliminary evaluation according to the users' basic information and feeds the classified data back to the multi-gene risk assessment module.
More preferably, the basic information of the user includes the user's ethnic group, gender, age and surname.
Preferably, the multi-gene risk assessment module obtains a preliminary assessment report through the preliminary evaluation and a later optimized assessment report through the later optimized evaluation, and both reports are presented in the report presentation and user feedback module.
The invention also provides a multi-gene locus joint disease risk analysis and evaluation method, which comprises the following steps:
1) performing chip sequencing on each sample and preprocessing the sequencing data to obtain genotype data of disease risk sites;
2) locating the disease risk sites of a target project and performing a preliminary evaluation to obtain a preliminary evaluation result;
3) performing feature classification on the result data of the preliminary evaluation of step 2) according to the basic information filled in by the user, and optimizing the preliminary evaluation result in combination with user feedback.
Preferably, in step 2), the preliminary evaluation comprises: calculating the disease risk value corresponding to the disease risk sites using the GRS-wRAC and GRS-PS methods according to the GWAS database and the 1000 Genomes Project database.
Preferably, the method further comprises: presenting the preliminary evaluation result obtained in step 2) and the optimized evaluation result obtained in step 3) to the user as evaluation reports through a mobile phone APP.
It will be appreciated that the above assessment method is not suitable for diagnosis: its conclusion does not tell the subject whether a disease is present, nor does a high or low score indicate whether the disease will definitely occur in the future. The method is used only for disease risk assessment, that is, for assessing the risk or probability that an individual will develop a disease in the future. The assessment can show the disease risk relative to the general population, but whether the individual will actually develop the disease remains uncertain. The evaluation result therefore has no necessary relationship with the individual's future outcome, although the reliability of the evaluation improves considerably as the evaluated population sample grows; the conclusion serves only as a reminder and reference for the individual.
The invention has the following beneficial effects:
The invention is mainly directed at detection with a customized chip designed for targeted sites in healthy people, and therefore has the characteristics of low detection cost, high detection speed, high detection throughput, coverage of many disease types, and the like.
The most prominent feature of the method is that healthy people can be alerted to their risk of certain diseases. In particular, as the number of analyzed samples increases, the method can classify populations by their characteristics, greatly improve the discrimination of at-risk groups and provide targeted risk warnings, so that population disease risk is avoided more precisely and the quality of life of the population is improved.
The invention in particular develops two server-side service systems: the first handles acquisition of users' basic registration data and presentation of reports in the APP; the second handles sequencing of user samples, genotyping data analysis and the like. Their specific functions are independent of each other while the overall service forms a unified system, which ensures a clear division of work, efficient sample processing and timely release of reports, and allows reports to be looked up efficiently from the mobile phone client APP. The analysis platform integrates chip sequencing technology, bioinformatics technology, computer technology, wireless communication technology and the like to realize the process from sample to report; that is, it optimizes and integrates multiple existing technologies and gradually expands existing knowledge, so that combinations of single risk sites acquire increasing practical application value.
Drawings
FIG. 1 is a diagram of the overall technical architecture for the multi-gene locus joint disease risk analysis and assessment of the present invention.
FIG. 2 is a schematic structural diagram of the multi-gene locus joint disease risk analysis and assessment platform of the present invention.
FIG. 3 is a flow chart of data processing in the multi-gene locus joint disease risk analysis and assessment process of the present invention.
FIG. 4 is a schematic diagram of the sample pretreatment process of the present invention.
Detailed Description
In order to facilitate understanding of the present invention, the present application will be further described with reference to the accompanying drawings and examples.
With reference to FIGS. 1-3, the implementation of the multi-gene locus joint disease risk analysis and assessment of the present invention mainly comprises the following parts:
1. Sample preprocessing, including sample collection and library construction, on-machine sequencing, data quality control and other steps, as shown in FIG. 4.
Individual sample collection, transport, storage, target DNA library construction and other pre-sequencing treatments can be handled by standard SOP procedures using routine skill in the art and are not described in detail herein.
DNA sequencing: this part uses an Affymetrix GeneTitan™ chip sequencer. The sequencing chip used is a chip (Rosta_v1) customized in cooperation with Affymetrix, and covers cardiovascular and cerebrovascular diseases, hereditary tumor diseases and other chronic diseases. Its functional panels comprise:
1) cardiovascular and cerebrovascular disease risk sites, specific items: abdominal aortic aneurysm [ rs7025486 … ], hypertriglyceridemia [ rs7016880/rs1260326 … ], hypertension [ rs9810888/rs5051/rs4757391 … ], coronary heart disease [ rs7136259/rs3782889/rs3782886 … ], venous embolism [ rs146922325], cerebral aneurysm [ rs12413409/rs 70039651 64 ], cerebral apoplexy [ rs556621/rs529565 … ], migraine [ rs2078371/rs 72113/rs9349379 … ], atrial fibrillation [ rs2106261/rs6843082 … ], myocardial infarction [ rs4618210/rs3803915 … ];
2) a partial monogenic disease site;
3) tumor genetic risk loci, specific items: lung cancer [ rs753955/rs4488809/rs36600 … ], liver cancer [ rs7574865/rs455804 … ], thyroid cancer [ rs966423/rs965513 … ], chronic granulocytic leukemia [ rs4869742/rs4795519 … ], bladder cancer [ rs798766/rs401681 … ], breast cancer [ rs4951011/rs10474352/rs9485372 … ], kidney cancer [ rs7105934 … ], stomach cancer [ rs80142782/rs9841504 … ], pancreatic cancer [ rs372883/rs1547374/rs5768 5768709 … ], cervical cancer [ rs13117307/rs4282438/rs9277952 … ], and the like;
4) other disease genetic risk sites, including: diabetes mellitus type 1 [ rs1893217/rs3184504/rs3741208 … ], diabetes mellitus type 2 [ rs10229583/rs10886471/rs10906115 … ], alzheimer's disease (late onset) [ rs11218343/rs429358 … ], behcet's disease [ rs1495965/rs17810546/rs897200 … ], cleft lip and cleft palate [ rs 10512212248/rs 12543318/rs227731 ], sudden sleep [ rs10995245/rs1551570 … ], toxic diffuse thyroid tumor [ rs1024161/rs12101261/rs12658 … ], polycystic ovary syndrome [ rs 108188108188601/rs 124124124601/rs 134728 4 ], systemic lupus erythematosus [ rs10845606/rs 1097790/rs 7736 … ], non-obstructive pulmonary fibrosis [ rs 10842976397639748/rs 1086326/614626 ], lung fibrosis [ rs 1084126/4248,9748,48,48,9748 ], lung parotid 4248,9748,48,48,9748,48,9748 ], lung fibrosis [ lung 1089/429/us 1169626 ], [ lung 969/us 116969 ];
5) nutrition, sports, skin, inherited talent and other trait loci;
6) medication guidance sites;
7) ancestry typing analysis sites.
Specially designing the project sites greatly improves the efficiency of project evaluation and the relative accuracy of the report; the sites are selected as SNPs that meet strict GWAS significance standards (p < 5e-8 or 5e-6).
The specific chip operation and DNA extraction are performed strictly in accordance with the SOP of the instrument.
Compared with conventional high-throughput sequencing and qPCR detection: qPCR detection is fast, but its throughput is lower than that of a chip and each assay covers a single item; NGS panel detection offers high throughput and large data volume, but the up-front research and development cost of designing the panel and the per-test cost are both relatively high, and the cost rises sharply as the sample volume increases. Taking these factors together, this application adopts a customized chip for detection.
The following is a comparison of the currently used detection means:
TABLE 1 comparison of different detection means
Figure BDA0002670776890000061
The advantages of the custom chip are as follows: it is cheaper than current commercial chips such as the APMRA and PMDA chips of Affymetrix; it carries more probes for the target sites than the commercial chips, which improves detection accuracy; and sites that other commercial chips cannot detect can be added, such as the APOE sites rs429358 and rs7412 (these are sites for pure detection, not risk sites), for which relatively more probes are designed to ensure detection.
The raw data obtained from chip sequencing undergo preprocessing, quality control and the like to finally yield a VCF file. The quality-control metrics for data coming off the chip instrument currently use the default thresholds of chip sequencing, which ensure that valid data are produced; samples that fail must be sequenced again, qualified samples proceed to subsequent processing, and finally the disease risk sites are filtered and extracted from the VCF file.
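The final filtering step, extracting disease risk sites from the VCF file, might look like the following Python sketch. It relies only on the standard VCF layout (tab-separated records with the variant ID in the third column); the function name, the demo records, their positions and genotypes, and the placeholder ID rs0000001 are made up for illustration.

```python
from typing import Iterable, Iterator, Set


def filter_vcf_sites(lines: Iterable[str], target_ids: Set[str]) -> Iterator[str]:
    """Yield VCF header lines plus the records whose ID matches a target rsID."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 of a VCF record is ID; multiple IDs are separated by ';'.
        if len(fields) > 2 and target_ids.intersection(fields[2].split(";")):
            yield line


# Hypothetical three-record VCF; positions and genotypes are invented.
demo_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "1\t1000\trs5051\tC\tT\t.\tPASS\t.\tGT\t0/1",
    "2\t2000\trs0000001\tG\tA\t.\tPASS\t.\tGT\t0/0",
    "3\t3000\trs9810888\tG\tT\t.\tPASS\t.\tGT\t1/1",
]
hypertension_sites = {"rs5051", "rs9810888"}  # two of the hypertension rsIDs listed above
kept = list(filter_vcf_sites(demo_vcf, hypertension_sites))
print(len(kept))
```

A generator is used so that a large VCF file can be streamed line by line rather than loaded into memory.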
2. Disease items are located for the extracted disease risk sites, and it is confirmed that genotypes are available for the risk sites of each disease item; the genotypes are then evaluated comprehensively. The sites used for evaluation refer to the latest GWAS database and the 1000 Genomes Project database, and the specific calculation methods are GRS-wRAC and GRS-PS.
1) Core algorithm one: GRS-PS (population-standardized genetic risk score)
The OR values corresponding to the three genotypes at any site i are
Figure BDA0002670776890000071
Figure BDA0002670776890000072
and
Figure BDA0002670776890000073
The expectation E_i of the OR value at that site can then be determined as:
Figure BDA0002670776890000074
Suppose that the genotype of a sample at site i is G_i and its OR value is OR_i; clearly G_i ∈ {aa_i, ab_i, bb_i}, and OR_i takes the corresponding value:
Figure BDA0002670776890000075
We normalize OR_i by the expectation E_i to obtain the normalized OR value
Figure BDA0002670776890000076
Figure BDA0002670776890000077
The risk (Risk) for this phenotype is then calculated by the following formula:
Figure BDA0002670776890000078
Here the OR value is the odds ratio, which is a good estimate of relative risk for diseases with low incidence. An OR value equal to 1 indicates that the factor does not contribute to the onset of the disease; an OR value greater than 1 indicates that the factor is a risk factor; an OR value less than 1 indicates that the factor is a protective factor.
The above algorithm references: Shi Z, Yu H, Wu Y, et al. Systematic evaluation of cancer-specific genetic risk score for 11 types of cancer in The Cancer Genome Atlas and Electronic Medical Records and Genomics cohorts. Cancer Med. 2019;8(6):3196-3205. doi:10.1002/cam4.2143.
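The formulas of core algorithm one survive in this text only as figure references. Based on the surrounding definitions (E_i, G_i, OR_i) and the usual form of the population-standardized score, they plausibly read as follows. The genotype frequency symbols f are an assumption introduced here, standing for the frequencies of genotypes aa_i, ab_i and bb_i in the reference population; the original notation may differ.

```latex
\begin{aligned}
E_i &= f_{aa_i}\,OR_{aa_i} + f_{ab_i}\,OR_{ab_i} + f_{bb_i}\,OR_{bb_i}, \\
\widetilde{OR}_i &= \frac{OR_i}{E_i}, \qquad OR_i \in \{OR_{aa_i},\, OR_{ab_i},\, OR_{bb_i}\}, \\
\mathrm{Risk} &= \prod_{i=1}^{n} \widetilde{OR}_i .
\end{aligned}
```

Under this reading, a sample whose genotypes match the population average at every site has a Risk close to 1, and adding further sites does not by itself inflate the score.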
2) Core algorithm two: GRS-wRAC (weighted genetic risk score, WGRS)
For the WGRS, let C_i be the number of risk alleles at site i, C_i ∈ {0, 1, 2}, and take the natural logarithm (log_e) of the odds ratio of the risk allele as the weight, i.e.
Figure BDA0002670776890000079
The project genetic risk WGRS is then calculated by weighting:
Figure BDA00026707768900000710
The above algorithm references: Conran CA, Na R, Chen H, et al. Population-standardized genetic risk score: the SNP-based method of choice for inherited risk assessment of prostate cancer. Asian J Androl. 2016;18(4):520-524. doi:10.4103/1008-682X.179527.
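The two formulas of core algorithm two are likewise present only as figure references. From the text (risk allele count C_i and the natural logarithm of the risk allele's odds ratio as weight), a likely reconstruction is given below; the weight symbol w_i and the site count n are notational assumptions.

```latex
w_i = \ln\left(OR_i\right), \qquad
\mathrm{WGRS} = \sum_{i=1}^{n} w_i\, C_i, \qquad C_i \in \{0, 1, 2\}
```

Since every risk-increasing site contributes a non-negative term, this sum grows as sites are added, which matches the limitation of the weighted method noted in the Background.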
Finally, the evaluation result of each specific item is checked against the literature to confirm that all conclusions originate from the literature of the corresponding item.
3. The current site frequency data come from the 1000 Genomes Project database, in which the sample size for Asian populations is small and samples for each specific Asian population are not available. Therefore, at a later stage the data can be reclassified according to the information of specific populations and various classification information and fed back to the evaluation system of step 2, so that the evaluation results are refined, upgraded and iterated as the population frequency data improve. This step involves decisions on the various projects: sample information is collected from the core data sources, sequencing data are organized, APP feedback is gathered, and the collected data undergo classification analysis and renewed decision-making.
In this application, what is optimized is the population genotype data. As the sample expands at a later stage, the evaluation values may change, mainly along group feature variables in the sample information such as gender, ethnicity, age and surname. Therefore, as the sample library is built up and the population is subdivided more specifically, the evaluation values can be upgraded and improved, so that each evaluation value carries its own group feature labels (gender, ethnicity, age, surname and the like) rather than the initial, loosely defined population. For example, the initial assessments all carry CHB/CHS (northern Han/southern Han) risk value labels. Subsequent data collection can reveal differences, for example that the risk of a certain disease in the Han population is higher than concluded in the initial evaluation, that the risk of a certain disease is higher in males than in females, or that disease risk differs across age groups and surnames; the risk is thereby narrowed to more specific populations and is upgraded, optimized, improved and extended.
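The idea of replacing a single reference-population frequency with group-specific frequencies can be sketched as follows. This is only an illustration under assumed inputs: each sample is reduced to a hypothetical (group label, genotype) pair for one site, and the grouping key could equally be ethnicity, age band or surname.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, str]  # (group label, genotype at one site); format is hypothetical


def genotype_frequencies_by_group(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """Return, for each group, the observed frequency of each genotype."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for group, genotype in samples:
        counts[group][genotype] += 1
    frequencies: Dict[str, Dict[str, float]] = {}
    for group, genotype_counts in counts.items():
        total = sum(genotype_counts.values())
        frequencies[group] = {gt: n / total for gt, n in genotype_counts.items()}
    return frequencies


# Invented user samples for one site, grouped by gender.
demo_samples = [
    ("male", "CT"), ("male", "CC"), ("male", "CT"), ("male", "TT"),
    ("female", "CC"), ("female", "CC"),
]
group_freqs = genotype_frequencies_by_group(demo_samples)
print(group_freqs)
```

Frequencies computed this way could stand in for the reference-population genotype frequencies when the expected OR of a site is recalculated for a specific group.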
4. Finally, the report conclusions are obtained from the report platform, with a corresponding evaluation conclusion for each item. The conclusions first present evaluation values based on the CHB and CHS frequencies of the 1000 Genomes Project, and then gradually present evaluation values weighted by frequency data that distinguish groups by ethnicity, surname, age and the like, so that risk evaluation conclusions for specific groups can be presented.
Examples
In this application, the data analysis and evaluation performed on the bioinformatics data server proceed as follows. The evaluation database and evaluation templates are first organized and installed, the initial database is compiled, and the evaluation analysis program is compiled, configured and installed. Once everything is in place, the system is tested: test sites with known, definite conclusions are strictly specified to verify the evaluation system, and system initialization is considered complete once the evaluations fall within the specified range. An initial evaluation report is then produced under the two algorithms from the basic knowledge base. Subsequently, as the sample size grows, gene frequency data for individual populations are generated, and finally risk evaluation values for specific populations are reported. That is, an initial assessment report is generated first, and a population-specific assessment report is generated as the number of samples increases. Phenotype calibration data are also collected through the feedback system to calibrate the population evaluation values.
The invention addresses many diseases; the site evaluation and analysis flow designed for one disease, hypertension, is shown through a specific embodiment:
1) This step begins where data preprocessing ends, that is, with the genotype data of disease risk loci obtained for each sample through chip sequencing.
2) Project sites are first defined for the project in question, for example the hypertension risk project. The designed hypertension risk sites and demonstration genotype test results (actual test genotypes differ from person to person) are shown in the following table.
TABLE 2 detection results of site genotypes of hypertension risk items
Figure BDA0002670776890000091
3) The risk evaluation scheme of core algorithm one (GRS-PS) gives an evaluation result of 0.8559. From prior data, the evaluation value when all sites are normal is known to be 0.3364. After homogenization of the data, the evaluation conclusion is "higher than the average risk of the general population", expressed numerically as "5.65% above average risk". In the report, professional health advice is given for the risk prompt according to the relevant literature.
The risk assessment result calculated with core algorithm two (GRS-wRAC) is 3.7352. The result of this weighted formula increases as independent sites are added, and genotypes with markedly high risk contribute disproportionately, so the data are poorly comparable and inconvenient to use. The result is therefore of limited significance here and, compared with method one, is used only for reference in the report.
The result of this evaluation may be more meaningful for some single-site items.
4) The result in 3) is calculated right after system initialization, so for early samples the report is based on genotype frequency data from the 1000 Genomes Project database; the population considered is narrow and the credibility is biased. Therefore, a user data management and report upgrading system is designed on the data sub-management processing platform. It can classify users according to the basic information they fill in, standardized as specific ethnic group, gender, surname and age group, gradually release group-specific risk report conclusions, and gradually improve the credibility of the reporting system's conclusions (user information is, of course, encrypted out of respect for client privacy).
5) In this embodiment, reports are presented mainly through a mobile phone APP, for which a computer software copyright registration has been applied (registration number: 2020SR 0056897).
For the risk analysis and evaluation platform, the APP mainly comprises two functional modules: first, a user data management interface, comprising a user registration information database and background management; second, a report presentation interface and a feedback interface, with the report finally presented for reading in the APP. In addition, items can later be upgraded according to user feedback on their popularity and accuracy.
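The group-wise classification described in step 4) can be sketched as follows. This is only an illustration under stated assumptions: the record fields, the ten-year age bands, the use of a plain group mean as the reference value, and the relative-difference comparison are all hypothetical choices, not the platform's documented procedure. Surname is omitted from the grouping key for brevity and could be added in the same way.

```python
from collections import defaultdict
from statistics import mean


def age_band(age, width=10):
    """Map an age to a band label such as '30-39' (band width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def group_key(record):
    """Grouping key built from the basic information filled in by the user."""
    return (record["ethnicity"], record["gender"], age_band(record["age"]))


def group_baselines(records):
    """Mean preliminary score of each user group."""
    groups = defaultdict(list)
    for record in records:
        groups[group_key(record)].append(record["score"])
    return {key: mean(scores) for key, scores in groups.items()}


def relative_to_group(record, baselines):
    """Relative difference between a user's score and the group mean."""
    baseline = baselines[group_key(record)]
    return (record["score"] - baseline) / baseline


# Hypothetical user records; the scores are illustrative preliminary results.
records = [
    {"ethnicity": "group_a", "gender": "F", "age": 34, "score": 0.25},
    {"ethnicity": "group_a", "gender": "F", "age": 38, "score": 0.75},
    {"ethnicity": "group_a", "gender": "M", "age": 52, "score": 0.40},
]

baselines = group_baselines(records)
difference = relative_to_group(records[1], baselines)
```

As more users register, each group's reference value rests on more samples, which is the sense in which group-specific conclusions could be released gradually.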

Claims (10)

1. A multi-gene locus combined disease risk analysis and evaluation platform, comprising:
a sample preprocessing module, used for performing chip sequencing on each sample and preprocessing the sequencing data to obtain genotype data of disease risk sites;
a multi-gene risk assessment module, used for locating disease risk sites and performing a preliminary evaluation for the target item, and for performing later-stage data learning and optimized evaluation on the preliminary evaluation result according to the data fed back by the data sub-management processing module;
a report presentation and user feedback module, used for presenting the evaluation report, collecting user feedback and transmitting the user feedback to the data sub-management processing module;
and a data sub-management processing module, used for managing the user data and feeding the user data back to the multi-gene risk assessment module.
2. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the preprocessing comprises performing quality control on the sequencing data using the default values of chip sequencing to ensure that valid data can be produced, resequencing if the quality control is not passed, performing subsequent processing on qualified samples, and finally performing site filtering on the VCF file to obtain the disease risk sites.
3. The multi-gene locus joint disease risk analysis and assessment platform according to claim 1, wherein the preliminary assessment is to calculate the disease risk value corresponding to the disease risk locus according to the initial assessment database.
4. The multi-gene locus combined disease risk analysis and assessment platform according to claim 3, wherein the initial assessment database comprises the GWAS database and the 1000 Genomes Project database, and the calculation uses the GRS-wRAC and GRS-PS methods.
5. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the data sub-management processing module comprises a user registration information storage sub-module and a data classification sub-module, the user registration information storage sub-module is used for storing basic information of registered users, and the data classification sub-module is used for performing feature classification on result data of preliminary assessment according to the basic information of the users and feeding back the result data to the multi-gene risk assessment module.
6. The multi-gene-locus combined disease risk analysis and assessment platform according to claim 5, wherein the basic information of the user includes the user's ethnicity, gender, age and surname.
7. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the multi-gene risk assessment module obtains a preliminary assessment report through preliminary assessment and presents the preliminary assessment report in a report presentation and user feedback module; and the multi-gene risk assessment module obtains a later optimization assessment report through later optimization assessment and presents the report in the report presentation and user feedback module.
8. A multi-gene locus combined disease risk analysis and evaluation method, comprising the following steps:
1) chip sequencing is carried out on each sample, and sequencing data are preprocessed to obtain genotype data of disease risk sites;
2) positioning disease risk sites aiming at a target project and performing preliminary evaluation to obtain a preliminary evaluation result;
3) performing feature classification on the result data of the preliminary evaluation in step 2) according to the basic information filled in by the user, and optimizing the preliminary evaluation result in combination with user feedback.
9. The multi-gene-site combined disease risk analysis and assessment method according to claim 8, wherein in step 2), the preliminary assessment comprises: calculating the disease risk value corresponding to the disease risk site by the GRS-wRAC and GRS-PS methods according to the GWAS database and the 1000 Genomes Project database.
10. The multi-gene-site combined disease risk analysis and assessment method according to claim 8, further comprising: presenting the preliminary evaluation result obtained in step 2) and the optimized evaluation result obtained in step 3) to the user in the form of an evaluation report through a mobile phone APP.
CN202010932647.3A 2020-09-08 2020-09-08 Multi-gene locus combined disease risk analysis and evaluation platform and method Pending CN112102884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010932647.3A CN112102884A (en) 2020-09-08 2020-09-08 Multi-gene locus combined disease risk analysis and evaluation platform and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010932647.3A CN112102884A (en) 2020-09-08 2020-09-08 Multi-gene locus combined disease risk analysis and evaluation platform and method

Publications (1)

Publication Number Publication Date
CN112102884A true CN112102884A (en) 2020-12-18

Family

ID=73751963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010932647.3A Pending CN112102884A (en) 2020-09-08 2020-09-08 Multi-gene locus combined disease risk analysis and evaluation platform and method

Country Status (1)

Country Link
CN (1) CN112102884A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113403380A (en) * 2021-06-11 2021-09-17 中国科学院北京基因组研究所(国家生物信息中心) Complex disease related SNP site primer composition and application
CN117542526A (en) * 2024-01-08 2024-02-09 深圳市早知道科技有限公司 Disease risk prediction method and system based on biological genetic information
CN117542526B (en) * 2024-01-08 2024-04-26 深圳市早知道科技有限公司 Disease risk prediction method and system based on biological genetic information

Similar Documents

Publication Publication Date Title
Uffelmann et al. Genome-wide association studies
US10975445B2 (en) Integrated machine-learning framework to estimate homologous recombination deficiency
Shabalin et al. Merging two gene-expression studies via cross-platform normalization
US11164655B2 (en) Systems and methods for predicting homologous recombination deficiency status of a specimen
TWI363309B (en) Genetic analysis systems, methods and on-line portal
Hou et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals
Ghosh et al. “Omics” data and levels of evidence for biomarker discovery
KR101542529B1 (en) Examination methods of the bio-marker of allele
KR101460520B1 (en) Detecting method for disease markers of NGS data
US20030224394A1 (en) Computer systems and methods for identifying genes and determining pathways associated with traits
Hellwig et al. Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes
EP2335174A1 (en) Methods and systems for incorporating multiple environmental and genetic risk factors
WO2010017520A1 (en) Methods and systems for personalized action plans
Cao et al. kTWAS: integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes
WO2022087478A1 (en) Machine learning platform for generating risk models
CN112102884A (en) Multi-gene locus combined disease risk analysis and evaluation platform and method
Wojcik et al. Opportunities and challenges for the use of common controls in sequencing studies
KR20150024232A (en) Examination methods of the origin marker of resistance from drug resistance gene about disease
Chen et al. Pruning and thresholding approach for methylation risk scores in multi-ancestry populations
CN116469552A (en) Method and system for breast cancer polygene genetic risk assessment
Carvalho Working with oligonucleotide arrays
Fritsche et al. Cancer PRSweb–an online Repository with polygenic risk scores (PRS) for major cancer traits and their Phenome-wide exploration in two independent biobanks
Li et al. A semiparametric test to detect associations between quantitative traits and candidate genes in structured populations
Calciano et al. A predictive microarray-based biomarker for early detection of Alzheimer’s disease intended for clinical diagnostic application
Schwarzerova et al. A perspective on genetic and polygenic risk scores—advances and limitations and overview of associated tools

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201218