CN112102884A

CN112102884A - Multi-gene locus combined disease risk analysis and evaluation platform and method

Info

Publication number: CN112102884A
Application number: CN202010932647.3A
Authority: CN
Inventors: 董子平; 周亚军; 梁凯
Original assignee: Suzhou Rosetta Biotechnology Co ltd
Current assignee: Suzhou Rosetta Biotechnology Co ltd
Priority date: 2020-09-08
Filing date: 2020-09-08
Publication date: 2020-12-18

Abstract

The invention provides a multi-gene locus joint disease risk analysis and evaluation platform and method. For healthy people, the risk of genetic disease is calculated and preliminarily evaluated using information from the GWAS database and the 1000 Genomes Project database, supplemented by literature data. As the volume of target-user data grows, a client feedback system is used to learn and correct the risk evaluation standard for specific populations. The invention can remind healthy people to avoid or reduce disease risk, achieving the aim of guiding an individual's healthy life through interpretation of their genes.

Description

Multi-gene locus combined disease risk analysis and evaluation platform and method

Technical Field

The invention relates to the field of genotype data processing for risk loci of multiple diseases, in particular to a polygenic risk assessment system for multiple diseases, and more particularly to a multi-gene locus joint disease risk analysis and assessment platform and method.

Background

A single nucleotide polymorphism (SNP) is a variation in DNA sequence caused by a change in a single nucleotide (A, T, C or G), and it contributes to the diversity of genomes among species, including humans. For example, the DNA fragments AAGCCTA and AAGCTTA from two different individuals differ at a single position, and the two variants are alleles. Almost all common SNP sites have only two alleles. SNP allele frequencies vary among populations, so an allele that is common in one region or ethnic group may be rare in another. In the human genome there is roughly one SNP site every 100 to 300 bases, and 2 out of every 3 SNPs are interconversions of cytosine (C) and thymine (T). About 90% of human genetic variation is attributable to SNPs.

Many SNPs are associated with disease, and some are associated with disease risk. Genes associated with disease risk are called risk genes. Their alleles may raise or lower disease risk and are collectively called risk alleles, so the number of risk alleles an individual carries at a risk locus may be 0, 1 or 2. Risk alleles are generally common in the population, and the effect of any single risk allele on disease is small. The odds ratio (OR) of a risk-increasing allele is greater than 1, and the OR of a risk-reducing allele is between 0 and 1.

A genetic risk score (GRS) measures the cumulative effect of multiple independent risk SNPs. The following are the three most common genetic risk scoring methods used to assess individual disease risk:

1) The earliest polygenic risk assessment method was the unweighted risk allele count (GRS-RAC), which simply sums the number of risk alleles across multiple independent risk SNP sites, regardless of the OR value at any particular site. The score therefore varies with the number of SNPs included.

2) Because risk loci are largely independent and their OR values differ, corresponding weights were introduced, giving the weighted risk allele count (GRS-wRAC). In this method SNPs with higher OR values are given higher weight, and the score still increases with the number of SNPs. It is also commonly referred to as the polygenic risk score (PRS), an extension of the genetic risk score (GRS) (Igo, R.P., Kinzy, T.G., & Cooke Bailey, J.N. (2019). Genetic risk scores. Current Protocols in Human Genetics, 104(1). doi:10.1002/cphg.95).

3) The population-standardized genetic risk score (GRS-PS) (Conran CA, Na R, Chen H, et al. Population-standardized genetic risk score: the SNP-based method of choice for inherited risk assessment of prostate cancer. Asian J Androl. 2016;18(4):520-524. doi:10.4103/1008-682X.179527) introduces weights and, in addition, standardizes the contribution of each SNP site against the population. This score does not increase with the number of SNPs, and it is therefore increasingly used in genetic risk calculation.
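The simplest of the three methods, the unweighted count (GRS-RAC), can be sketched in a few lines of Python. This is only an illustration: the function names, site names and alleles below are hypothetical and are not taken from the original text.

```python
from __future__ import annotations


def count_risk_alleles(genotype: str, risk_allele: str) -> int:
    """Return how many copies (0, 1 or 2) of the risk allele a genotype carries."""
    return genotype.upper().count(risk_allele.upper())


def grs_rac(genotypes: dict[str, str], risk_alleles: dict[str, str]) -> int:
    """Unweighted score: sum of risk-allele counts over all scored sites."""
    return sum(
        count_risk_alleles(genotypes[site], allele)
        for site, allele in risk_alleles.items()
        if site in genotypes  # sites without a genotype call are skipped
    )


# Hypothetical example data (site names and alleles are made up).
risk_alleles = {"snp1": "A", "snp2": "T", "snp3": "G"}
sample = {"snp1": "AG", "snp2": "TT", "snp3": "CC"}
score = grs_rac(sample, risk_alleles)  # 1 + 2 + 0 risk alleles
```

Because every site contributes between 0 and 2, the attainable range of this score grows with the number of sites, which is the limitation noted for this method.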

However, current risk assessment methods have the following problems. First, early-stage evaluation depends heavily on the curation and collection of databases; conclusions may change slightly as the databases are updated, and their credibility is subject to population bias. Second, more accurate assessment of a population still requires validated phenotypes, so a large amount of client genotype data and feedback data must be collected to correct the credibility of risk assessment for a particular population.

Disclosure of Invention

Based on the above, the invention provides a multi-gene locus joint disease risk analysis and assessment platform and method, aiming to perform rapid risk analysis on the genotypes of the independent risk loci of the various designed disease projects and to issue a risk assessment report. The invention can remind healthy people to avoid or reduce disease risk, achieving the aim of guiding an individual's healthy life through interpretation of their genes.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention provides a multi-gene locus joint disease risk analysis and evaluation platform, which comprises:

the sample preprocessing module is used for performing chip sequencing on each sample and preprocessing sequencing data to obtain genotype data of disease risk sites;

the multi-gene risk evaluation module is used for positioning disease risk sites and carrying out primary evaluation aiming at the target project, and carrying out later-stage data learning and optimized evaluation on the primary evaluation result according to the data fed back by the data sub-management processing module;

the report presenting and user feedback module is used for presenting the evaluation report, collecting the user feedback and transmitting the user feedback to the data sub-management processing module;

and the data sub-management processing module is used for managing the user data and feeding back the user data to the multi-gene risk assessment module.

Preferably, the preprocessing comprises: performing quality control on the sequencing data using the default thresholds of the chip sequencing platform, which ensure that valid data are produced; re-sequencing samples that fail quality control; passing qualified samples on to subsequent processing; and finally performing site filtering on the VCF file to obtain the disease risk sites.

Preferably, the preliminary evaluation is to calculate a disease risk value corresponding to the disease risk site according to an initial evaluation database.

More preferably, the initial assessment database comprises the GWAS database and the 1000 Genomes Project database, and the calculation methods comprise the GRS-wRAC and GRS-PS methods.

Preferably, the data sub-management processing module comprises a user registration information storage sub-module and a data classification sub-module, the user registration information storage sub-module is used for storing basic information of registered users, and the data classification sub-module is used for performing feature classification on the result data of the preliminary evaluation according to the basic information of the users and feeding back the result data to the multi-gene risk evaluation module.

More preferably, the basic information of the user includes the user's ethnicity, gender, age and surname.

Preferably, the multi-gene risk assessment module obtains a preliminary assessment report through preliminary assessment and presents the preliminary assessment report in the report presentation and user feedback module; and the multi-gene risk assessment module obtains a later optimization assessment report through later optimization assessment and presents the report in the report presentation and user feedback module.

The invention also provides a multi-gene locus combined disease risk analysis and evaluation method, which comprises the following steps:

1) chip sequencing is carried out on each sample, and sequencing data are preprocessed to obtain genotype data of disease risk sites;

2) positioning disease risk sites aiming at a target project and performing preliminary evaluation to obtain a preliminary evaluation result;

3) performing feature classification on the preliminary evaluation result data of step 2) according to the basic information filled in by the user, and optimizing the preliminary evaluation result in combination with user feedback.

Preferably, in step 2), the preliminary evaluation comprises: calculating the disease risk value corresponding to the disease risk sites using the GRS-wRAC and GRS-PS methods, according to the GWAS database and the 1000 Genomes Project database.

Preferably, the method further comprises: presenting the preliminary evaluation result obtained in the step 2) and the optimized evaluation result in the step 3) to a user in an evaluation report mode through a mobile phone APP.

It will be appreciated that the above assessment method is not suitable for diagnosis: the assessment does not tell the subject whether a disease is present, and a high or low score does not indicate whether the disease will definitely occur in the future. The method is used only for disease risk assessment, that is, for assessing an individual's risk or probability of developing a disease in the future. The assessment can show the disease risk relative to the general population, but whether the individual will actually develop the disease remains uncertain. The assessment result therefore has no necessary relationship with the individual's future outcome, although its reliability improves substantially as the number of population samples evaluated increases. The conclusion of the assessment serves only as a reminder and reference for the individual.

The invention has the following beneficial effects:

The invention is mainly directed at detection with a custom chip designed around targeted sites for healthy people, and therefore offers low detection cost, high detection speed, high detection throughput and coverage of many disease types.

Its most important feature is that healthy people can be reminded of their risk for certain diseases. In particular, as the number of analyzed samples increases, the method enables classification by population characteristics, greatly improves the discrimination of at-risk groups and provides targeted risk warnings, so that disease risk can be avoided more precisely across the population and quality of life improved.

The invention develops two server-side service systems: the first handles collection of users' basic registration data and presentation of reports in the APP; the second handles sequencing of user samples, genotype data analysis and the like. Their functions are mutually independent while the overall service forms a unified system, which ensures a clear division of work, efficient sample processing and timely release of reports, and allows reports to be looked up efficiently from the mobile phone client APP. The analysis platform integrates chip sequencing technology, bioinformatics, computer technology, wireless communication technology and so on to realize the process from sample to report. It thus optimizes and integrates multiple existing technologies and gradually extends existing knowledge, so that combinations of individual risk sites gain increasing practical value.

Drawings

FIG. 1 is a diagram of the overall technical architecture for the multi-gene locus joint disease risk analysis and assessment of the present invention.

FIG. 2 is a schematic structural diagram of the multi-gene locus joint disease risk analysis and assessment platform of the present invention.

FIG. 3 is a flow chart of data processing in the multi-gene locus joint disease risk analysis and assessment process of the present invention.

FIG. 4 is a schematic diagram of the sample pretreatment process of the present invention.

Detailed Description

In order to facilitate understanding of the present invention, the present application will be further described with reference to the accompanying drawings and examples.

With reference to fig. 1-3, the implementation of the multi-gene locus joint disease risk analysis assessment of the present invention mainly comprises the following parts:

1. Sample preprocessing, including sample collection and library construction, on-machine sequencing, data quality control and other steps, as shown in FIG. 4.

Wherein:

Individual sample collection, transport, storage, target DNA library construction and other pre-sequencing steps follow standard operating procedures (SOPs) that are routine in the art and are not described in detail here.

DNA sequencing: this fraction was used with AffymetrixGeneTian^TMA chip sequencer, wherein the sequencing chip used is a chip (Rosta _ v1) customized by cooperating with Affymetrix company, and is used for treating cardiovascular and cerebrovascular diseases, hereditary tumor diseases and other chronic diseases; the functional plate comprises:

1) cardiovascular and cerebrovascular disease risk sites, specific items: abdominal aortic aneurysm [ rs7025486 … ], hypertriglyceridemia [ rs7016880/rs1260326 … ], hypertension [ rs9810888/rs5051/rs4757391 … ], coronary heart disease [ rs7136259/rs3782889/rs3782886 … ], venous embolism [ rs146922325], cerebral aneurysm [ rs12413409/rs 70039651 64 ], cerebral apoplexy [ rs556621/rs529565 … ], migraine [ rs2078371/rs 72113/rs9349379 … ], atrial fibrillation [ rs2106261/rs6843082 … ], myocardial infarction [ rs4618210/rs3803915 … ];

2) a partial monogenic disease site;

3) tumor genetic risk loci, specific items: lung cancer [ rs753955/rs4488809/rs36600 … ], liver cancer [ rs7574865/rs455804 … ], thyroid cancer [ rs966423/rs965513 … ], chronic granulocytic leukemia [ rs4869742/rs4795519 … ], bladder cancer [ rs798766/rs401681 … ], breast cancer [ rs4951011/rs10474352/rs9485372 … ], kidney cancer [ rs7105934 … ], stomach cancer [ rs80142782/rs9841504 … ], pancreatic cancer [ rs372883/rs1547374/rs5768 5768709 … ], cervical cancer [ rs13117307/rs4282438/rs9277952 … ], and the like;

4) other disease genetic risk sites, including: diabetes mellitus type 1 [ rs1893217/rs3184504/rs3741208 … ], diabetes mellitus type 2 [ rs10229583/rs10886471/rs10906115 … ], alzheimer's disease (late onset) [ rs11218343/rs429358 … ], behcet's disease [ rs1495965/rs17810546/rs897200 … ], cleft lip and cleft palate [ rs 10512212248/rs 12543318/rs227731 ], sudden sleep [ rs10995245/rs1551570 … ], toxic diffuse thyroid tumor [ rs1024161/rs12101261/rs12658 … ], polycystic ovary syndrome [ rs 108188108188601/rs 124124124601/rs 134728 4 ], systemic lupus erythematosus [ rs10845606/rs 1097790/rs 7736 … ], non-obstructive pulmonary fibrosis [ rs 10842976397639748/rs 1086326/614626 ], lung fibrosis [ rs 1084126/4248,9748,48,48,9748 ], lung parotid 4248,9748,48,48,9748,48,9748 ], lung fibrosis [ lung 1089/429/us 1169626 ], [ lung 969/us 116969 ];

5) nutritional, sports, skin, genetic talent and other trait loci;

6) medication guidance sites;

7) ancestry typing analysis sites.

By specially designing the project sites, the efficiency of project evaluation and the relative accuracy of the report can be greatly improved. SNP sites are selected according to strict GWAS significance standards (p < 5e-8 or p < 5e-6).
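The site selection rule above amounts to a threshold filter on association p-values. A minimal Python sketch follows, assuming each candidate association is available as a record with an rsID, trait, p-value and odds ratio. The `Association` record layout, the `select_sites` helper and the example records are hypothetical.

```python
from dataclasses import dataclass

GENOME_WIDE = 5e-8  # strict genome-wide significance threshold
SUGGESTIVE = 5e-6   # relaxed threshold also mentioned in the text


@dataclass(frozen=True)
class Association:
    rsid: str
    trait: str
    p_value: float
    odds_ratio: float


def select_sites(records, trait, threshold=GENOME_WIDE):
    """Keep associations for one trait whose p-value is below the threshold."""
    return [r for r in records if r.trait == trait and r.p_value < threshold]


# Made-up records for illustration only.
records = [
    Association("rsA", "hypertension", 3e-9, 1.12),
    Association("rsB", "hypertension", 2e-6, 1.08),
    Association("rsC", "hypertension", 1e-4, 1.05),
]
strict = select_sites(records, "hypertension")
relaxed = select_sites(records, "hypertension", SUGGESTIVE)
```

With these records, only `rsA` passes the strict threshold, while `rsA` and `rsB` pass the relaxed one.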

Chip operation and DNA extraction are performed strictly according to the SOP of the instrument.

Compared with traditional high-throughput sequencing and qPCR detection: qPCR is fast, but its throughput is lower than that of a chip and it covers only a single item per assay; NGS panel detection likewise offers high throughput and large data volume, but the up-front panel design and development cost and the per-test cost are relatively high, and the cost rises sharply as the sample volume grows. Weighing these factors, this application adopts a custom chip for detection.

The following is a comparison of the currently used detection means:

TABLE 1 comparison of different detection means

The custom chip is cheaper than current commercial chips such as the Affymetrix APMRA and PMDA chips, and carries more probes per target site than the commercial chips, which improves detection accuracy. It also allows sites that other commercial chips cannot detect to be added, such as the APOE sites rs429358 and rs7412 (these are detection-only sites, not risk sites), for which relatively more probes are designed to ensure detection.

The raw data obtained from chip sequencing are preprocessed and quality-controlled to finally yield a VCF file. The quality control metrics for the off-machine chip data currently use the default thresholds of the chip sequencing platform, which ensure that valid data are produced. Samples with unqualified data must be re-sequenced, qualified samples proceed to subsequent processing, and finally the disease risk sites are filtered and extracted from the VCF file.
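The final filtering step can be sketched as follows. This is a minimal illustration, not the platform's actual pipeline: it assumes a single-sample, tab-separated VCF whose ID column holds rsIDs and whose FORMAT column contains a `GT` key. The `filter_vcf_sites` helper and the example records (IDs, positions and alleles) are made up.

```python
def filter_vcf_sites(vcf_lines, wanted_ids):
    """Yield (rsid, genotype) for wanted records that pass the FILTER column.

    Assumes one sample column; no-calls such as './.' are skipped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header and metadata lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a complete single-sample record
        _chrom, _pos, rsid, ref, alt, _qual, flt, _info, fmt, sample = fields[:10]
        if rsid not in wanted_ids or flt not in ("PASS", "."):
            continue
        keys = fmt.split(":")
        if "GT" not in keys:
            continue
        gt = sample.split(":")[keys.index("GT")]
        called = gt.replace("|", "/").split("/")
        if "." in called:
            continue  # missing genotype call
        alleles = [ref] + alt.split(",")
        yield rsid, "".join(alleles[int(a)] for a in called)


# Fictional records: IDs, coordinates and alleles are placeholders.
vcf_text = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "1\t1000\trsX1\tC\tT\t.\tPASS\t.\tGT\t0/1\n",
    "2\t2000\trsX2\tA\tG\t.\tPASS\t.\tGT\t1/1\n",
    "3\t3000\trsX3\tG\tT\t.\tPASS\t.\tGT\t./.\n",
]
genotypes = dict(filter_vcf_sites(vcf_text, {"rsX1", "rsX3"}))
```

Here `rsX2` is dropped because it is not a requested site, and `rsX3` is dropped because its genotype is a no-call, leaving only `rsX1` with genotype `CT`.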

2. Disease items are located for the extracted disease risk sites, and it is confirmed that genotypes can be obtained for the risk sites of each disease item; the genotypes are then comprehensively evaluated. The sites used for evaluation refer to the latest GWAS database and the 1000 Genomes Project database, and the specific calculation methods are GRS-wRAC and GRS-PS.

1) Core algorithm one: GRS-PS (population-standardized genetic risk score)

The OR values corresponding to the three genotypes at any site $i$ are $OR_{aa_i}$, $OR_{ab_i}$ and $OR_{bb_i}$, from which the expected OR value $E_i$ at that site can be determined.

Suppose the genotype of a sample at site $i$ is $G_i$, with OR value $OR_i$. Clearly $G_i \in \{aa_i, ab_i, bb_i\}$ and $OR_i \in \{OR_{aa_i}, OR_{ab_i}, OR_{bb_i}\}$.

$OR_i$ is normalized by the expectation $E_i$ to give the normalized OR value, and the risk for this phenotype is then calculated from the normalized OR values.
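The formulas for this algorithm are not reproduced in the text. A reconstruction consistent with the description above might read as follows. It assumes that the expectation is taken over the genotype frequencies of the reference population and that the normalized values are combined as a product. The frequency symbols $f_{aa_i}$, $f_{ab_i}$, $f_{bb_i}$ and the notation $\widetilde{OR}_i$ are introduced here and do not appear in the original.

```latex
% Expected OR at site i, weighted by assumed population genotype frequencies
E_i = f_{aa_i}\,OR_{aa_i} + f_{ab_i}\,OR_{ab_i} + f_{bb_i}\,OR_{bb_i}

% OR of the sample's genotype at site i, normalized by the expectation
\widetilde{OR}_i = \frac{OR_i}{E_i}

% Risk over the n sites of the item (product form assumed)
\mathrm{Risk} = \prod_{i=1}^{n} \widetilde{OR}_i
```

Under these assumptions the population average of $\widetilde{OR}_i$ at each site equals 1, because $\sum_g f_{g_i} OR_{g_i} / E_i = 1$. This is why a score of this form does not drift upward as more sites are added.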

Here the OR value is the odds ratio, which is an accurate estimate of relative risk for diseases with low incidence. An OR value equal to 1 indicates that the factor does not contribute to the onset of the disease; an OR value greater than 1 indicates that the factor is a risk factor; an OR value less than 1 indicates that the factor is a protective factor.

The above algorithm references: Shi Z, Yu H, Wu Y, et al. Systematic evaluation of cancer-specific genetic risk score for 11 types of cancer in The Cancer Genome Atlas and Electronic Medical Records and Genomics cohorts. Cancer Med. 2019;8(6):3196-3205. doi:10.1002/cam4.2143.
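A minimal Python sketch of the GRS-PS calculation described above. It assumes that the expected OR at a site is the genotype-frequency-weighted mean of the three genotype ORs and that the normalized ORs of all sites are multiplied together. The function names, site names, frequencies and OR values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import prod

GENOTYPES = ("aa", "ab", "bb")


def expected_or(freqs, ors):
    """Population-expected OR at one site: sum of frequency * OR over genotypes."""
    return sum(freqs[g] * ors[g] for g in GENOTYPES)


def grs_ps(sample, sites):
    """Product over sites of the sample's genotype OR divided by the expected OR."""
    return prod(
        sites[site]["or"][genotype]
        / expected_or(sites[site]["freq"], sites[site]["or"])
        for site, genotype in sample.items()
    )


# Made-up reference data: genotype frequencies and ORs per site.
sites = {
    "site1": {"freq": {"aa": 0.25, "ab": 0.50, "bb": 0.25},
              "or": {"aa": 1.0, "ab": 1.2, "bb": 1.44}},
    "site2": {"freq": {"aa": 0.64, "ab": 0.32, "bb": 0.04},
              "or": {"aa": 1.0, "ab": 1.5, "bb": 2.25}},
}
sample = {"site1": "ab", "site2": "aa"}
risk = grs_ps(sample, sites)  # (1.2 / 1.21) * (1.0 / 1.21)
```

Both example sites have an expected OR of 1.21, so a sample carrying one risk genotype and one non-risk genotype scores slightly below the population average of 1.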

2) Core algorithm two: GRS-wRAC (weighted genetic risk score, WGRS)

For the WGRS, let $C_i$ be the number of risk alleles at site $i$, $C_i \in \{0, 1, 2\}$, and take as the weight the natural logarithm of the odds ratio of the risk allele at that site (written $OR_i$ here), i.e.

$$w_i = \ln(OR_i)$$

The project genetic risk WGRS is then calculated by weighting:

$$WGRS = \sum_{i} w_i C_i$$

The above algorithm references: Conran CA, Na R, Chen H, et al. Population-standardized genetic risk score: the SNP-based method of choice for inherited risk assessment of prostate cancer. Asian J Androl. 2016;18(4):520-524. doi:10.4103/1008-682X.179527.
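The weighted score can be sketched in Python as a sum of risk-allele counts, each weighted by the natural logarithm of the risk allele's odds ratio, as the description above indicates. The `wgrs` helper, the site names and the OR values are hypothetical.

```python
from math import log


def wgrs(counts, odds_ratios):
    """Weighted genetic risk score: sum of C_i * ln(OR_i) over the scored sites."""
    total = 0.0
    for site, count in counts.items():
        if count not in (0, 1, 2):
            raise ValueError(f"risk allele count at {site} must be 0, 1 or 2")
        total += count * log(odds_ratios[site])
    return total


# Made-up per-allele odds ratios and one sample's risk allele counts.
odds_ratios = {"site1": 1.2, "site2": 1.5, "site3": 0.8}
counts = {"site1": 1, "site2": 2, "site3": 0}
score = wgrs(counts, odds_ratios)  # ln(1.2) + 2 * ln(1.5)
```

Each additional site with an OR above 1 can only add to the sum, which illustrates why this score tends to grow with the number of sites and is hard to compare across items.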

Finally, the evaluation results of each specific item are audited against the literature to confirm that all conclusions originate from the literature for the corresponding item.

3. The current site frequency data come from the 1000 Genomes Project database, in which the sample size for Asian populations is small and samples for each specific Asian population are not available. Therefore, at a later stage the data can be classified according to information on specific populations, reclassified according to various kinds of classification information, and fed back to the evaluation system of step 2, so that the population frequency data behind the evaluation results are refined, upgraded and iterated. This step involves decisions across the various projects: collecting sample information as the core data source, organizing sequencing data, gathering APP feedback, and performing classification analysis and re-decision on the collected data.

In this application, what is optimized is the population genotype data. As the sample grows, the evaluation values may change, mainly with respect to group characteristic labels in the sample information such as gender, ethnicity, age and surname. As the sample library expands, the evaluation values can therefore be upgraded and improved along with finer subdivision of the population, so that each evaluation value carries its own group characteristic labels (gender, ethnicity, age, surname and so on) instead of referring to the initial, loosely defined population. For example, the initial assessments all carry CHB/CHS (northern Han/southern Han) risk value labels. Subsequent data collection can reveal differences: for example, that the risk of a certain disease in the Han population is higher than the initial evaluation concluded, that the risk of a certain disease is higher in males than in females, or that risk levels differ across age groups and across surnames. The risk assessment is thus narrowed to more specific populations and is upgraded, optimized, improved and expanded.
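One way to picture this refinement is to re-estimate genotype frequencies per group label from the collected samples, so that they can later replace the initial reference frequencies in the evaluation. The Python sketch below is an assumption about how such a step might look; the `genotype_frequencies_by_group` helper, the sample layout and the data are hypothetical.

```python
from collections import Counter, defaultdict


def genotype_frequencies_by_group(samples, site, group_key):
    """Estimate genotype frequencies at one site separately for each group label."""
    counts = defaultdict(Counter)
    for s in samples:
        genotype = s["genotypes"].get(site)
        if genotype is not None:  # skip samples without a call at this site
            counts[s[group_key]][genotype] += 1
    return {
        group: {g: n / sum(c.values()) for g, n in c.items()}
        for group, c in counts.items()
    }


# Made-up user samples with one registration field and one genotyped site.
samples = [
    {"gender": "F", "genotypes": {"site1": "aa"}},
    {"gender": "F", "genotypes": {"site1": "ab"}},
    {"gender": "M", "genotypes": {"site1": "ab"}},
    {"gender": "M", "genotypes": {"site1": "ab"}},
    {"gender": "M", "genotypes": {"site1": "bb"}},
    {"gender": "M", "genotypes": {}},
]
freqs = genotype_frequencies_by_group(samples, "site1", "gender")
```

The same function can be called with any other registration field (for example an age band or surname) as `group_key`, provided enough samples exist in each group for the estimates to be stable.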

4. Finally, the report conclusion is obtained from the report platform, with a corresponding evaluation conclusion for each item. Initially the conclusion presents evaluation values based on the CHB and CHS frequencies of the 1000 Genomes Project; evaluation values weighted by frequency data distinguished by population group, such as ethnicity, surname and age, are then presented progressively, so that risk evaluation conclusions for specific groups can be presented.

Examples

In this application, the data analysis and evaluation carried out on the bioinformatics data server proceeds as follows. First, the evaluation database is curated and the evaluation templates are prepared and installed; the initial database is organized, and the evaluation analysis program is compiled, configured and installed. Once this is done, test conclusions are set for the system: test sites with definite, known conclusions are specified to verify the evaluation system, and system initialization is considered complete once the evaluation falls within the specified range. An initial evaluation report is then produced under both algorithms according to the basic knowledge base. Subsequently, as the sample size increases, gene frequency data for independent populations are generated, and finally risk evaluation values for specific populations are reported. That is, an initial assessment report is generated first, and a population-specific assessment report is generated as the number of samples increases. Phenotype calibration data are also collected through the feedback system to calibrate the population evaluation values.

The invention covers many diseases; the site evaluation and analysis flow designed for one disease, hypertension, is shown here as a specific embodiment:

1) This step begins where data preprocessing ends, that is, with the genotype data of the disease risk loci obtained for each sample through chip sequencing.

2) Project sites are first determined according to the item, for example the hypertension risk item. The designed hypertension risk sites and demonstration genotype test results (actual test genotypes vary from person to person) are shown in the following table.

TABLE 2 detection results of site genotypes of hypertension risk items

3) According to the risk evaluation scheme of core algorithm one (GRS-PS), an evaluation result of 0.8559 is obtained. From prior data, the evaluation value when all sites are normal is known to be 0.3364. After normalization, the evaluation conclusion is "higher than the average risk of the general population", expressed numerically as "5.65% above average risk". In the report, professional health advice is given for the risk prompt according to the relevant literature.

The risk assessment result calculated with core algorithm two (GRS-wRAC) is 3.7352. The result of this weighted formula increases as the number of independent sites increases, and genotypes with high significant risk contribute heavily, so the data are poorly comparable and inconvenient to use. The result therefore has limited significance here and, compared with method one, is reported for reference only.

The results of this evaluation may be more meaningful for some single-site items.

4) The calculation in step 3) is performed once system initialization is complete, so for early samples the results are based on genotype frequency data from the 1000 Genomes Project database; the population covered is narrow and the credibility is correspondingly biased. A user data management and report upgrading system is therefore designed on the data sub-management processing platform. Data can be classified according to the basic information filled in by users, in the standardized forms of specific ethnicity, gender, surname and age group, so that group-specific risk report conclusions are opened up step by step and the credibility of the report system's conclusions is gradually improved. (To respect client privacy, the user information involved is of course encrypted.)

5) In this embodiment, reports are presented mainly through a mobile phone APP, and computer software copyright registration has been applied for the APP part (registration number: 2020SR 0056897).

For the risk analysis and evaluation platform, the APP mainly comprises two functional modules: first, a user data management interface, including the user registration information database and background management; second, a report presentation interface and a feedback interface, through which the report is finally presented for reading in the APP. In addition, based on user feedback, projects can later be upgraded with respect to their popularity and accuracy.

Claims

1. The platform for analyzing and evaluating the risk of multi-gene locus combined diseases comprises:

2. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the preprocessing comprises performing quality control on the sequencing data using the default thresholds of chip sequencing, which ensure that valid data are produced, re-sequencing samples that are not qualified, performing subsequent processing on qualified samples, and finally performing site filtering on the VCF file to obtain the disease risk sites.

3. The multi-gene locus joint disease risk analysis and assessment platform according to claim 1, wherein the preliminary assessment is to calculate the disease risk value corresponding to the disease risk locus according to the initial assessment database.

4. The multi-gene locus combined disease risk analysis and assessment platform according to claim 3, wherein said initial assessment database comprises the GWAS database and the 1000 Genomes Project database, and said calculation methods comprise the GRS-wRAC and GRS-PS methods.

5. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the data sub-management processing module comprises a user registration information storage sub-module and a data classification sub-module, the user registration information storage sub-module is used for storing basic information of registered users, and the data classification sub-module is used for performing feature classification on result data of preliminary assessment according to the basic information of the users and feeding back the result data to the multi-gene risk assessment module.

6. The multi-gene-locus combined disease risk analysis and assessment platform according to claim 5, wherein the basic information of the user includes the user's ethnicity, gender, age and surname.

7. The multi-gene-site combined disease risk analysis and assessment platform according to claim 1, wherein the multi-gene risk assessment module obtains a preliminary assessment report through preliminary assessment and presents the preliminary assessment report in a report presentation and user feedback module; and the multi-gene risk assessment module obtains a later optimization assessment report through later optimization assessment and presents the report in the report presentation and user feedback module.

8. The multi-gene locus combined disease risk analysis and evaluation method comprises the following steps:

9. The multi-gene-site combined disease risk analysis and assessment method according to claim 8, wherein in step 2), said preliminary assessment comprises: calculating the disease risk value corresponding to the disease risk sites using the GRS-wRAC and GRS-PS methods, according to the GWAS database and the 1000 Genomes Project database.

10. The multi-gene-site combined disease risk analysis and assessment method according to claim 8, further comprising: presenting the preliminary evaluation result obtained in the step 2) and the optimized evaluation result in the step 3) to a user in an evaluation report mode through a mobile phone APP.