CN106202936A

CN106202936A - Disease risk prediction method and system

Info

Publication number: CN106202936A
Application number: CN201610552427.1A
Authority: CN
Inventors: 全雪萍; 郝占平; 吕彬彬; 任永永; 赵文涛; 吴电云
Original assignee: Shuo Medical Data Technology (beijing) Co Ltd
Current assignee: Shuo Medical Data Technology (beijing) Co Ltd
Priority date: 2016-07-13
Filing date: 2016-07-13
Publication date: 2016-12-07

Abstract

The present invention relates to a disease risk prediction method and system. The method comprises: S1, receiving input gene sequencing result information and analyzing it to obtain all variant gene information; S2, according to the gene locus corresponding to the variant gene information, looking up the disease gene variation information for that gene locus in a precision medicine knowledge base, obtaining at least one piece of disease gene variation information; S3, matching each piece of variant gene information against all of the disease gene variation information, obtaining the similarity between each piece of variant gene information and all of the disease gene variation information; S4, obtaining a risk prediction table for the gene sequencing result information according to the similarity between each piece of variant gene information and the disease gene variation information. By analyzing the associations among susceptibility genes, variant sites, variation types and diseases, the invention predicts disease risk for healthy people, provides an overall risk level prediction, and reduces future disease risk.

Description

A disease risk prediction method and system

Technical field

The present invention relates to a disease risk prediction method and system, and belongs to the technical field of bioinformatics.

Background art

According to the World Cancer Report 2014, there were 3.065 million new cancer cases in China in 2012, about one fifth of the global total, and 2.205 million cancer deaths, about one quarter of global cancer deaths. Cancer is in essence a genetic disease, and early detection is critical to the survival rate of cancer treatment. Colorectal cancer, for example, is one of the common malignant tumors in China and the fourth leading cause of cancer death among Chinese people; the average lifetime risk of colorectal cancer in the Chinese population is 6%. Of these cases, 30% arise from hereditary factors. People with a first-degree relative (sibling, parent or child) who has colorectal cancer have twice the average population risk. If those relatives developed the disease early (before age 60), or if several immediate relatives have been diagnosed with colorectal cancer, the risk is greater still. A small fraction of colorectal cancers (5%) occur in families with colon-related diseases such as colonic polyposis (the surface of the colon is covered with numerous polyps). By detecting a healthy person's genetic information through gene sequencing, that person's tumor risk can be predicted from the sequencing result.

With the rapid development of high-throughput sequencing technology and of big data analysis tools, it has become possible to perform genetic testing on healthy individuals, analyze the variations in an individual's genes, clinically annotate the individual's genetic features using a precision medicine knowledge base that is built on evidence-based medicine and integrates biomedical literature data, public biomedical omics databases, and FDA, CFDA and NCCN guidelines, and develop corresponding algorithms to predict the individual's tumor risk. How to integrate and use these omics resources has therefore become a hot topic closely tied to personal health and precision medicine. Developed countries in Europe and America began processing and using big data for precision medicine at the beginning of this century; they have established a variety of databases with different functions and different emphases and have formed a preliminary standardization system for precision medicine. China is at a world-leading level in gene sequencing technology, but how to effectively use and collect omics data for malignant tumor risk prediction is currently a top priority in China's precision medicine field.

Summary of the invention

The technical problem to be solved by the invention is that the prior art does not use and collect omics data to build a large database system and a comprehensive data analysis system based on malignant tumor risk prediction. To address this, a disease risk prediction method and system are provided.

The technical solution of the invention is as follows: a disease risk prediction method, comprising the following steps:

S1, receiving input gene sequencing result information and analyzing the gene sequencing result information to obtain all variant gene information;

S2, according to the gene locus corresponding to the variant gene information, looking up the disease gene variation information for the corresponding gene locus in a precision medicine knowledge base, obtaining at least one piece of disease gene variation information;

S3, matching each piece of variant gene information against all of the disease gene variation information, obtaining the similarity between each piece of variant gene information and all of the disease gene variation information;

S4, obtaining a risk prediction table for the gene sequencing result information according to the similarity between each piece of variant gene information and the disease gene variation information.
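Steps S2 to S4 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the description does not define how similarity is computed, so the field-by-field match ratio used here, the record fields (`gene`, `locus`, `variation_type`, `alt`) and the function names `lookup`, `similarity` and `predict` are all hypothetical. Step S1 is assumed to have already produced the list of variants.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Variant:
    """One piece of variant gene information from S1 (fields are assumed)."""
    gene: str
    locus: str
    variation_type: str
    alt: str


@dataclass(frozen=True)
class DiseaseVariant:
    """One piece of disease gene variation information in the knowledge base."""
    disease: str
    gene: str
    locus: str
    variation_type: str
    alt: str


# Fields compared when scoring; this choice is an assumption, not the patent's.
FIELDS = ("gene", "locus", "variation_type", "alt")


def lookup(kb: List[DiseaseVariant], variants: List[Variant]) -> List[DiseaseVariant]:
    """S2: keep knowledge-base entries whose locus matches any variant's locus."""
    loci = {v.locus for v in variants}
    return [d for d in kb if d.locus in loci]


def similarity(v: Variant, d: DiseaseVariant) -> float:
    """S3: hypothetical similarity, the fraction of compared fields that agree."""
    return sum(getattr(v, f) == getattr(d, f) for f in FIELDS) / len(FIELDS)


def predict(variants: List[Variant], kb: List[DiseaseVariant]) -> List[Tuple[str, str, float]]:
    """S2 to S4: score every variant against every looked-up entry, then rank."""
    candidates = lookup(kb, variants)
    rows = [(d.disease, v.gene, similarity(v, d)) for v in variants for d in candidates]
    rows.sort(key=lambda r: r[2], reverse=True)  # S4: highest similarity first
    return rows
```

Calling `predict(variants, kb)` returns rows of (disease name, gene, similarity); any real implementation would replace the match ratio with whatever similarity measure the knowledge base actually supports.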

The beneficial effects of the invention are as follows: using an algorithm based on multidimensional data covering tumor driver genes, variation types and diseases, the invention gives healthy people an understanding of the disease gene variation information that corresponds to their own genetic information. By analyzing the associations among susceptibility genes, variant sites, variation types and diseases, it predicts disease risk for healthy people, gives an overall risk level prediction, and provides the client with prevention and health care knowledge, reducing future disease risk.

On the basis of the above technical solution, the invention may also be improved as follows.

Further, the precision medicine knowledge base is used to store multiple pieces of disease name information and disease gene variation information;

There is a one-to-one correspondence between the disease name information and the disease gene variation information.

Further, S3 obtains the similarity between each piece of variant gene information and all of the disease gene variation information, and builds a separate similarity comparison table for each piece of variant gene information.

The beneficial effect of this further solution is that, by building a separate similarity comparison table for each piece of variant gene information, a risk assessment of different diseases can be obtained for the same variant gene information, which provides guidance for subsequent treatment.

Further, in S4 all the similarity comparison tables obtained in S3 are merged and sorted by similarity in descending order to generate the risk prediction table, which shows the disease risk for the gene sequencing result information.

The beneficial effect of this further solution is that, through the merged ranking, ordinary users without medical knowledge can see their disease risk at a glance from the size of the risk, without having to understand details such as the specific condition of each disease.
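The merge-and-rank step can be illustrated with a short Python sketch. It assumes each similarity comparison table is a mapping from disease name to similarity, keyed by a variant identifier; that layout, the tie-breaking rule and the function name `build_risk_table` are illustrative choices, not details given in the description.

```python
from typing import Dict, List, Tuple

# One similarity comparison table per variant: disease name -> similarity.
SimilarityTable = Dict[str, float]


def build_risk_table(tables: Dict[str, SimilarityTable]) -> List[Tuple[str, str, float]]:
    """Merge all per-variant tables and sort by similarity, largest first."""
    rows = [
        (disease, variant_id, score)
        for variant_id, table in tables.items()
        for disease, score in table.items()
    ]
    # Descending similarity; ties are broken alphabetically so output is stable.
    rows.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rows
```

Keeping the per-variant tables separate until this step preserves the per-variant view described above, while the merged list gives the single ranking shown to the user.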

Further, analyzing the gene sequencing result information in S1 specifically comprises the following steps:

Selecting an analysis flow according to the format and size of the gene sequencing result information;

Using the selected analysis flow to align all or part of the gene sequencing result information against a reference genome, obtaining the variant gene information.

The beneficial effect of this further solution is that the genetic test result information produced by different gene sequencing systems differs in volume; selecting a different analysis flow according to the information volume extracts as much useful information as possible while preserving analysis accuracy.
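A minimal sketch of selecting an analysis flow by format and size is given below. The description names fastq and bam as input formats but gives no size thresholds, so the 5 GB and 50 GB cut-offs, the flow labels and the function name `select_analysis_flow` are purely hypothetical placeholders.

```python
GB = 1024 ** 3

# Hypothetical size thresholds; the description does not specify any values.
PANEL_MAX_BYTES = 5 * GB
EXOME_MAX_BYTES = 50 * GB

# First processing step for each supported input format (labels are assumed).
ENTRY_STEP = {
    "fastq": "quality_control",   # fastq goes straight to quality control
    "bam": "convert_to_fastq",    # bam is first converted back to fastq
}


def select_analysis_flow(filename: str, size_bytes: int) -> tuple:
    """Return (entry step, analysis scope) for an uploaded sequencing file."""
    name = filename.lower()
    if name.endswith(".gz"):          # tolerate gzip-compressed input
        name = name[:-3]
    ext = name.rsplit(".", 1)[-1]
    if ext not in ENTRY_STEP:
        raise ValueError(f"unsupported sequencing file format: {filename}")

    if size_bytes <= PANEL_MAX_BYTES:
        scope = "targeted_panel"
    elif size_bytes <= EXOME_MAX_BYTES:
        scope = "whole_exome"
    else:
        scope = "whole_genome"
    return ENTRY_STEP[ext], scope
```

In practice the scope would more likely come from the ordered test panel than from file size alone; size is used here only because it is the criterion the text mentions.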

Further, the method for the invention is additionally operable to detect germline mutants in human body, and analysis process compatibility targeting captures Sequencing data, full exon group sequencing data and sequencing data of whole genome；Described analysis process is characterised by detecting in human body Germline mutants, the data structure of described analysis process is the fastq file of Illumina platform, or Ion torrent The bam file of platform.

Illumina platform the analysis process of fastq: remove low quality base, use sliding window algorithm Remove comprise more low quality base order-checking section fragment, remove joint sequence pollute, enter the comparison stage, by sequencing result with Human genome reference sequences is compared, and filters out the low-quality base sequence of comparison, obtains bam file, carries out the position that makes a variation Point extracts, it is thus achieved that genovariation information, including single nucleotide variations (SNVs), genetic insertion and disappearance (Indel)；Outside complete Aobvious son and sequencing data of whole genome also include the knots such as copy number variation (CNV), group translocation (gene translocations) Structure makes a variation, and obtains VCF file, enters accurate medical knowledge storehouse, carries out automatization's clinical meaning note by certain search logic Release, generate risk profile report.
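The sliding-window quality step can be sketched as below. The description does not give the window size, the quality threshold, or whether a failing read is trimmed or discarded; this sketch assumes Phred+33 quality encoding, a window of 4 bases, a mean-quality threshold of 20, and trimming at the first failing window. All of these values and the function names are assumptions.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def sliding_window_trim(seq: str, qual: str, window: int = 4, min_mean_q: float = 20.0):
    """Cut the read at the start of the first window whose mean quality is too low.

    Returns the (possibly shortened) sequence and quality strings. Reads shorter
    than the window are returned unchanged.
    """
    scores = phred_scores(qual)
    for start in range(0, len(scores) - window + 1):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual
```

A read whose trimmed length falls below some minimum could then be dropped entirely, which would correspond to removing reads that contain many low-quality bases.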

Analysis flow for Ion Torrent-platform bam files: the bam file is first converted back to a fastq file for quality control, after which alignment and variant calling are performed to obtain gene variation information, including single nucleotide variations (SNVs), insertions and deletions (Indels) and, for whole-exome sequencing, structural variations such as copy number variations (CNVs) and gene translocations. The result is a VCF file, which is passed to the precision medicine knowledge base, where automated clinical significance annotation is carried out by a defined search logic and a risk prediction report is generated.

The technical solution of the invention also provides a disease risk prediction system, comprising:

An analysis module, which receives input gene sequencing result information and analyzes the gene sequencing result information to obtain all variant gene information;

A search module, which, according to the gene locus corresponding to the variant gene information, looks up the disease gene variation information for the corresponding gene locus in the precision medicine knowledge base, obtaining at least one piece of disease gene variation information;

A matching module, which matches each piece of variant gene information against all of the disease gene variation information, obtaining the similarity between each piece of variant gene information and all of the disease gene variation information;

A risk prediction module, which obtains a risk prediction table for the gene sequencing result information according to the similarity between each piece of variant gene information and the disease gene variation information.

Further, the matching module obtains the similarity between each piece of variant gene information and all of the disease gene variation information, and builds a separate similarity comparison table for each piece of variant gene information.

The beneficial effect of this further solution is that, by building a separate similarity comparison table for each piece of variant gene information, a risk assessment of different diseases can be obtained for the same variant gene information, which provides guidance for subsequent prevention.

Further, the risk prediction module merges all the similarity comparison tables obtained by the matching module and sorts them by similarity in descending order to generate the risk prediction table, which shows the disease risk for the gene sequencing result information.

Further, the analysis module's analysis of the gene sequencing result information specifically comprises the following steps:

The beneficial effect of this further solution is that the genetic test result information produced by different gene sequencing systems differs in format and size; selecting a different analysis flow according to format and size improves analysis efficiency and avoids whole-genome analysis being applied regardless of the amount of information, which would seriously slow down the analysis.

Further, the system of the invention is also used to detect germline mutations in the human body, and the analysis flow is compatible with targeted capture sequencing data, whole-exome sequencing data and whole-genome sequencing data; the analysis flow is characterized by detecting germline mutations in the human body, and its input data is a fastq file from the Illumina platform or a bam file from the Ion Torrent platform.

Brief description of the drawings

Fig. 1 is a flow chart of the disease risk prediction method according to Embodiment 1 of the invention;

Fig. 2 is a structural diagram of the disease risk prediction system according to Embodiment 2 of the invention.

In the drawings, the parts represented by the reference numerals are as follows:

1, analysis module; 2, search module; 3, matching module; 4, risk prediction module.

Detailed description of the invention

The principles and features of the invention are described below with reference to the accompanying drawings. The examples serve only to explain the invention and are not intended to limit its scope.

As shown in Fig. 1, the disease risk prediction method according to Embodiment 1 of the invention comprises the following steps:

The precision medicine knowledge base is used to store multiple pieces of disease name information and disease gene variation information; there is a one-to-one correspondence between the disease name information and the disease gene variation information.

S3 obtains the similarity between each piece of variant gene information and all of the disease gene variation information, and builds a separate similarity comparison table for each piece of variant gene information.

In S4, all the similarity comparison tables obtained in S3 are merged and sorted by similarity in descending order to generate the risk prediction table, which shows the disease risk for the gene sequencing result information.

Analyzing the gene sequencing result information in step S1 specifically comprises the following steps:

An analysis flow is selected according to the format and size of the gene sequencing result information; the selected analysis flow aligns all or part of the gene sequencing result information against a reference genome, obtaining the variant gene information.

The method of the invention is also used to detect germline mutations in the human body, and the analysis flow is compatible with targeted capture sequencing data, whole-exome sequencing data and whole-genome sequencing data; the input data of the analysis flow is a fastq file from the Illumina platform or a bam file from the Ion Torrent platform.

As shown in Fig. 2, the disease risk prediction system according to Embodiment 2 of the invention comprises:

An analysis module 1, which receives input gene sequencing result information and analyzes the gene sequencing result information to obtain all variant gene information;

A search module 2, which, according to the gene locus corresponding to the variant gene information, looks up the disease gene variation information for the corresponding gene locus in the precision medicine knowledge base, obtaining at least one piece of disease gene variation information;

A matching module 3, which matches each piece of variant gene information against all of the disease gene variation information, obtaining the similarity between each piece of variant gene information and all of the disease gene variation information;

A risk prediction module 4, which obtains a risk prediction table for the gene sequencing result information according to the similarity between each piece of variant gene information and the disease gene variation information.

The precision medicine knowledge base is used to store multiple pieces of disease name information and disease gene variation information;

The matching module 3 obtains the similarity between each piece of variant gene information and all of the disease gene variation information, and builds a separate similarity comparison table for each piece of variant gene information.

The risk prediction module 4 merges all the similarity comparison tables obtained by the matching module 3 and sorts them by similarity in descending order to generate the risk prediction table, which shows the disease risk for the gene sequencing result information.

The analysis of the gene sequencing result information by the analysis module 1 specifically comprises the following steps:

The system of the invention is also used to detect germline mutations in the human body, and the analysis flow is compatible with targeted capture sequencing data, whole-exome sequencing data and whole-genome sequencing data; the input data of the analysis flow is a fastq file from the Illumina platform or a bam file from the Ion Torrent platform.

In a concrete example, the disease risk prediction method of the invention includes the following process:

The gene sequencing result of a healthy person is input into the sequencing data analysis platform, which finds all gene variation information. Through the precision medicine knowledge base management platform, the clinical annotation information in the precision medicine knowledge base and the case database is matched automatically. Based on the annotation information found, the disease risk is predicted and the client's overall risk is assessed; prevention and health care knowledge is provided, reducing future disease risk.

Precision medicine knowledge base: the precision medicine knowledge base is a storage system for biological omics data and clinical medicine data. It mainly holds more than a thousand drugs, nearly 400 diseases, thousands of molecular markers and more than 600 cancer immunotherapies. The main function of the knowledge base management platform is to perform basic maintenance operations such as adding, modifying and deleting the collected biological omics data and clinical medicine data; it can also perform targeted crawling of certain data. Its modules mainly include disease, biomarker, variant site and variation classification, clinical annotation, and treatment scheme.

Disease module: the disease type is matched automatically in the precision medicine knowledge base according to the patient information. The main solid tumors are: non-small cell lung cancer, colorectal cancer, melanoma, prostate cancer, breast cancer, gastrointestinal stromal tumor, glioma, liver cancer, gastric cancer, kidney cancer, head and neck cancer, thyroid cancer, soft tissue sarcoma, cervical cancer, ovarian cancer and esophageal cancer;

Hematological diseases: acute myeloid leukemia, myelodysplastic syndrome, myeloproliferative neoplasms, lymphoma and hemophagocytic syndrome.

Biomarker module: biomarker sites, susceptibility genes and the like are matched automatically in the precision medicine knowledge base according to the test results.

Variant site and variation type: the variant site and variation type information in the precision medicine knowledge base is matched according to the test results, for example: single nucleotide variation (SNV), insertion (Insertion), deletion (Deletion), insertion/deletion (Indel), methylation (Methylation), microsatellite instability (MSI), differential expression (Differential Expression), copy number variation (CNV), fusion (Fusion), tandem repeats (Tandem Repeats), translocation (Translocation), region-based variation (Region-Based Variation), combination mutation (Combination Mutation), etc.

Clinical annotation: the clinical annotations in the precision medicine knowledge base are matched according to the test results, for example the correlation between a variant site and disease occurrence, and the detection rate of the variation in patients.
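One way to picture how the disease, biomarker, variant and clinical annotation modules fit together is as a single record type. The Python sketch below is an assumed data layout, not the patent's schema: the class `KnowledgeBaseEntry`, its field names and the helper `annotate` are hypothetical, and only the variation type names are taken from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class VariationType(Enum):
    """Variation types listed in the description."""
    SNV = "single nucleotide variation"
    INSERTION = "insertion"
    DELETION = "deletion"
    INDEL = "insertion/deletion"
    METHYLATION = "methylation"
    MSI = "microsatellite instability"
    DIFFERENTIAL_EXPRESSION = "differential expression"
    CNV = "copy number variation"
    FUSION = "fusion"
    TANDEM_REPEATS = "tandem repeats"
    TRANSLOCATION = "translocation"
    REGION_BASED = "region-based variation"
    COMBINATION = "combination mutation"


@dataclass(frozen=True)
class KnowledgeBaseEntry:
    """Hypothetical record joining the knowledge base modules."""
    disease: str                            # disease module
    biomarker: str                          # biomarker module (gene or marker)
    variant_site: str                       # variant site
    variation_type: VariationType           # variation type
    clinical_annotation: str                # clinical annotation text
    detection_rate: Optional[float] = None  # fraction of patients, if recorded


def annotate(entries: List[KnowledgeBaseEntry], biomarker: str,
             variant_site: str, variation_type: VariationType) -> List[KnowledgeBaseEntry]:
    """Return the entries whose biomarker, site and type all match a test result."""
    return [
        e for e in entries
        if e.biomarker == biomarker
        and e.variant_site == variant_site
        and e.variation_type is variation_type
    ]
```

An exact three-field match is the simplest possible search logic; the description leaves the actual logic unspecified.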

1. Case management system:

The case management system provides several query modes so that cases can be searched and managed conveniently. When a case is created, the individual's basic information is filled in.

Medical record management is divided into four steps: patient information entry, gene and test scheme selection, gene and test scheme confirmation, and submission to the laboratory. After the patient's basic information has been completed, click [Next] to go to gene and test scheme selection.

In gene and test scheme selection, choose a suitable panel according to the user's needs and click [Next] to confirm the test scheme; once it is confirmed to be correct, click [Next] to submit to the laboratory.

2. Sequencing file upload and analysis module:

After genetic testing is finished, click [Upload/Analyze experimental results] to open the upload/analysis interface and enter the patient's medical record number or name; the status of the patient's electronic medical record is "result to be uploaded". Click [Upload]. Once all genetic test results have been uploaded successfully, the [Analyze] button at the lower right of the screen changes from dimmed to highlighted; click [Analyze] and the platform analyzes automatically, finds the variant sites and marks the clinically valid sites. Test data results are uploaded in vcf, fastqa or bam format; several samples can be analyzed at once; do not cut the power or shut down during analysis. Precision medicine knowledge base: the precision medicine knowledge base stores biological omics data and clinical medicine data, including diseases, biomarkers, variation types, variant sites, clinical annotations of sites, treatment schemes and other data.

The above are only preferred embodiments of the invention and are not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A disease risk prediction method, characterized by comprising the following steps:

S1, receiving input gene sequencing result information and analyzing the gene sequencing result information to obtain all variant gene information;

S2, according to the gene locus corresponding to the variant gene information, looking up the disease gene variation information for the corresponding gene locus in a precision medicine knowledge base, obtaining at least one piece of disease gene variation information;

S3, matching each piece of variant gene information against all of the disease gene variation information, obtaining the similarity between each piece of variant gene information and all of the disease gene variation information;

S4, obtaining a risk prediction table for the gene sequencing result information according to the similarity between each piece of variant gene information and the disease gene variation information.

2. The disease risk prediction method according to claim 1, characterized in that the precision medicine knowledge base is used to store multiple pieces of disease name information and disease gene variation information;

3. The disease risk prediction method according to claim 1, characterized in that S3 obtains the similarity between each piece of variant gene information and all of the disease gene variation information, and builds a separate similarity comparison table for each piece of variant gene information.

4. The disease risk prediction method according to claim 3, characterized in that in S4 all the similarity comparison tables obtained in S3 are merged and sorted by similarity in descending order to generate the risk prediction table, which shows the disease risk for the gene sequencing result information.

5. The disease risk prediction method according to any one of claims 1-4, characterized in that analyzing the gene sequencing result information in S1 specifically comprises the following steps:

Using the selected analysis flow to align all or part of the gene sequencing result information against a reference genome, obtaining the variant gene information.

6. The disease risk prediction method according to claim 5, characterized in that germline mutations in the human body are detected, and the analysis flow is compatible with targeted capture sequencing data, whole-exome sequencing data and whole-genome sequencing data; the input data of the analysis flow is a fastq file from the Illumina platform or a bam file from the Ion Torrent platform.

7. A disease risk prediction system, characterized by comprising:

An analysis module, which receives input gene sequencing result information and analyzes the gene sequencing data to obtain gene variation information;

A search module, which, according to the gene locus corresponding to the variant gene information, looks up the disease gene variation information for the corresponding gene locus in a precision medicine knowledge base, obtaining at least one piece of disease gene variation information;

A matching module, which matches each piece of variant gene information against all of the disease gene variation information, obtaining the similarity between each piece of variant gene information and all of the disease gene variation information;

A risk prediction module, which obtains a risk prediction table for the gene sequencing result information according to the similarity between each piece of variant gene information and the disease gene variation information.

8. The disease risk prediction system according to claim 7, characterized in that the matching module obtains the similarity between each piece of variant gene information and all of the disease gene variation information, and builds a separate similarity comparison table for each piece of variant gene information.

9. The disease risk prediction system according to claim 8, characterized in that the risk prediction module merges all the similarity comparison tables obtained by the matching module and sorts them by similarity in descending order to generate the risk prediction table, which shows the disease risk for the gene sequencing result information.

10. The disease risk prediction system according to any one of claims 7-9, characterized in that the analysis module's analysis of the gene sequencing result information specifically comprises the following steps: