CN106778072B - Pipeline calibration method for second-generation tumor genome high-throughput sequencing data - Google Patents

Pipeline calibration method for second-generation tumor genome high-throughput sequencing data

Info

Publication number
CN106778072B
Authority
CN
China
Prior art keywords
variation
data
sim
subclone
generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611264937.5A
Other languages
Chinese (zh)
Other versions
CN106778072A (en)
Inventor
赵仲孟
王嘉寅
耿彧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ginga Medical Laboratory Co., Ltd.
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201611264937.5A
Publication of CN106778072A
Application granted
Publication of CN106778072B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a pipeline calibration method for second-generation tumor genome high-throughput sequencing data. The method uses a series of 32-bit unsigned integers as identifiers, each recording one germline or somatic variant, and generates read data that reflect tumor purity and the proportions of the different subclones. Following the rules that a child subclone inherits the variants of its parent and that sibling subclones carry mutually exclusive variants, it produces somatic variant calibration data for child subclones and their sibling subclones. These data are used to calibrate processing pipelines for second-generation tumor genome high-throughput sequencing data.

Description

Pipeline calibration method for second-generation tumor genome high-throughput sequencing data
Technical field
The invention belongs to the field of data science with precision medicine as its application background, and provides an auxiliary calibration system for decision support systems used in precision tumor diagnosis and treatment.
Background
Over the past decade, thanks to the rapid development of high-throughput genome and transcriptome sequencing, cancer genomics and precision tumor diagnosis and treatment have made remarkable progress, both in the depth of medical research and in the breadth of clinical application. Both rely on tumor high-throughput sequencing data. The genome and transcriptome sequencing data output by a sequencer are called read data. Because reads are short and contain sequencing errors, tumor researchers and clinicians cannot use them directly. The reads must first pass through a data processing pipeline which, put simply, uses genome informatics algorithms and tools to extract the genetic variant signals they contain; the extracted data are called variant data. Variant data are what researchers and physicians can interpret. Like medical test results, variant data are an important reference for clinical diagnosis, and they are a key basis for decisions in several critical steps such as treatment design and drug screening or recalibration.
Owing to the limitations of current high-throughput sequencing technology and tumor genome informatics, false positives and false negatives in variant data are unavoidable. The goal is to reduce them as far as possible and so improve the precision of the variant data, lowering the rates of misdiagnosis and ineffective treatment and improving the efficiency of medication and care. To this end, (1) every processing pipeline for tumor genome high-throughput sequencing data must be calibrated before it is put into use; and (2) a pipeline already in use must be recalibrated individually when it encounters a special case. The main purpose of pipeline calibration is to adjust the pipeline's parameter settings so that they match, as closely as possible, the characteristics of the reads coming off the sequencer, the tumor purity of the case, and its tumor heterogeneity, thereby yielding high-precision variant data.
Several pipeline calibration methods exist for second-generation genome high-throughput sequencing data, but none of them specifically addresses the heterogeneity of tumor tissue, so none is suitable for tumor genome data processing pipelines.
Existing calibration methods share one design: first generate a set of variant data, then generate read data from those variants, and finally perform the check. The methods differ in how they model the generation of variant data and/or read data, but their core idea is essentially the same. By their differences they fall into three classes. The first class focuses on simulating sequencer errors when generating reads and includes Wgsim, ART, ArtificialFastqGenerator, PIRS and Wessim. The second class focuses on different types of variants and includes SInC and RSVSim [7], which can simulate genomic structural variants. The third class focuses on population genomic features and includes GENOME, FREGene and FIGG.
Existing methods are unsuitable for tumor genome data processing pipelines for the following reasons:
First, tumor tissue is a mixture of normal cells and tumor cells from different subclones. Modern tumor theory holds that, because tumor cells evolve from normal cells, they normally inherit the variants of normal cells while also carrying mutations that normal cells lack. Variants therefore fall into two kinds: inherited variants shared by tumor and normal cells, called germline variants, and variants exclusive to tumor cells, called somatic variants. Different tumor cells carry different somatic variants. Tumor cells that carry the same somatic variants and show a selective advantage during tumor evolution are regarded as one subclone of the tumor tissue. Tumor tissue generally consists of several to more than ten subclones. A pipeline calibration method for second-generation genome high-throughput sequencing data must therefore generate the germline variant data and the somatic variant data of each subclone in order, following the inheritance and evolutionary relationships. Existing methods cannot do this.
Second, the fraction of tumor cells in the tumor tissue is called tumor purity, and the fraction of tumor cells belonging to each subclone is called the tumor heterogeneity proportion. Both vary widely across tumor types. As noted above, purity and heterogeneity proportion are important calibration targets. Tumor genome informatics research holds that the coverage of tumor sequencing reads in the presence of purity and heterogeneity proportions follows a multiple Beta distribution or a multiple Dirichlet distribution. With existing methods, however, one can only superimpose read data generated from single sets of variant data. The resulting reads follow a multiple uniform distribution, which can at best approximate the multiple Beta or multiple Dirichlet distribution and cannot fit the true situation.
In summary, because of the particular nature of tumor tissue, none of the existing pipeline calibration methods for second-generation genome high-throughput sequencing data is suitable for tumor genome data processing pipelines. Precision tumor diagnosis and treatment needs a pipeline calibration method designed for second-generation tumor genome high-throughput sequencing data.
Summary of the invention
In view of the above deficiencies in the prior art, the technical problem to be solved by the invention is to provide a pipeline calibration method for second-generation tumor genome high-throughput sequencing data that specifically accounts for the heterogeneity of tumor tissue.
The invention adopts the following technical solution:
A pipeline calibration method for second-generation tumor genome high-throughput sequencing data uses a series of 32-bit unsigned integers as identifiers, each recording one germline or somatic variant; generates read data that reflect purity and the proportions of different subclones; and, according to the variant relationships of parent-child subclone inheritance and sibling subclone mutual exclusion, obtains somatic variant calibration data for child subclones and their sibling subclones, which are used to calibrate the processing pipeline for the second-generation tumor genome high-throughput sequencing data.
Preferably, generating the data for the variant relationships of parent-child subclone inheritance and sibling subclone mutual exclusion comprises the following steps:
S1, read the reference genome sequence and determine a set of variant relationship data for parent-child subclone inheritance and sibling subclone mutual exclusion;
S2, generate the somatic variant data of the first-generation clone from the variant data of step S1;
S3, generate the somatic variant data of the child subclones of the first-generation clone from the somatic variant data of step S2;
S4, generate the somatic variant data of the sibling subclones of the child subclones from the somatic variant data of step S3.
Preferably, step S1 specifically comprises the following steps:
S11, read the reference genome sequence;
S12, select variant sites on the reference genome sequence according to preset conditions;
S13, determine the variant type, genotype and other attributes of each variant site;
S14, store the variant data of the sites determined in step S13 in the nor.sim file in the corresponding 32-bit unsigned integer format, and also store the heterozygous variants among them in the nor_AB.idx file.
Preferably, step S2 specifically comprises the following steps:
S21, read three files: the reference genome sequence, nor.sim and nor_AB.idx;
S22, select the variant sites of the first-generation clone on the reference genome sequence according to preset conditions;
S23, check each variant site: if the site appears in nor.sim but not in nor_AB.idx, it is a homozygous variant, so return to step S22; otherwise determine the variant type and other attributes;
S24, store the variant data of the sites determined in step S23 in the founding_clone.sim file in the corresponding 32-bit unsigned integer format.
Preferably, step S3 specifically comprises the following steps:
S31, read the reference genome sequence, nor.sim, nor_AB.idx and founding_clone.sim;
S32, select variant sites on the reference genome sequence according to preset conditions;
S33, check each variant site: if the site appears in nor.sim and founding_clone.sim but not in nor_AB.idx, it is a homozygous variant, so return to step S32; otherwise determine the variant type and other attributes;
S34, store the variant data of the sites determined in step S33 in the subcloneX.sim file in the corresponding 32-bit unsigned integer format, where X is the file number.
S35, store the variants specific to the subcloneX.sim file in the subcloneX_uniq.idx file, where X is the file number. If somatic variant data for several generations of child subclones are required, repeat steps S31-S35.
Preferably, step S4 specifically comprises the following steps:
S41, read the reference genome sequence, nor.sim, nor_AB.idx, subcloneX.sim and subcloneX_uniq.idx;
S42, select variant sites on the reference genome sequence according to preset conditions;
S43, check each variant site: if the site appears in nor.sim and subcloneX.sim but not in nor_AB.idx, it is a homozygous variant, so return to step S42; otherwise check whether it already exists in the subclone*_uniq.idx files of the same-generation subclones, where * denotes all subclones of the same generation; if it does, return to step S42; otherwise determine the variant type and other attributes;
S44, store the variant data of the sites determined in step S43 in the subcloneY.sim file in the corresponding 32-bit unsigned integer format, where Y is the file number.
S45, store the variants specific to the subcloneY.sim file in the subcloneY_uniq.idx file, where Y is the file number. If somatic variant data for several sibling subclones are required, repeat steps S41-S45.
Preferably, the variant types generated by the method include single-point variants, insertion-type structural variants, deletion-type structural variants and inversion-type structural variants.
Preferably, the insertion-type structural variants include short insertions, long insertions, deletion-insertion complex structural variants and tandem-repeat structural variants.
Preferably, each 32-bit unsigned integer identifier consists of five parts, as follows:
$s_i = \langle inl_i, ins_i, v_i, g_i, c_i \rangle$
where $s_i$ denotes the i-th record, $c_i$ the base symbol, $g_i$ the genotype, $v_i$ the variant type, $ins_i$ the bases of the inserted sequence, and $inl_i$ the length of the short inserted sequence ($inl_i < 10$ bp, where bp denotes base pairs).
Further, generating the read data that reflect purity and the proportions of different subclones comprises the following steps:
Step 1, read all .sim files;
Step 2, set the proportion of each .sim file, such that the proportions of all .sim files sum to 1;
Step 3, select a .sim file with probability equal to its proportion, then randomly sample paired-end or single-end reads from the sequence updated with that file's variants. Paired-end reads are stored in the left.fq and right.fq files respectively, with the insert size following a normal distribution; single-end reads are stored in the single.fq file.
Compared with the prior art, the invention has at least the following advantages:
The calibration method of the invention uses 32-bit unsigned integers as identifiers to record each variant, generates read data that reflect purity and the proportions of different subclones, and generates in order, following inheritance and evolutionary relationships, somatic variant calibration data for parent and child subclones related by inheritance and for sibling subclones related by mutual exclusion. The calibration results are close to the true values: peak shift is essentially eliminated and false positive peaks are few.
Further, by generating the germline variant data and the somatic variant data of each subclone in order according to inheritance and evolutionary relationships, the invention simulates the evolution of tumor tissue with a mathematical model. Tumor heterogeneity is the direct result of tumor evolution, and its presence is one of the reasons existing methods are unsuitable for tumor genome data processing pipelines. Only by simulating tumor evolution can the calibration data carry the signal a child subclone inherits from its parent together with the signal of the somatic variants exclusive to that child subclone.
Further, the method can generate six types of variant data, such as short deletion-insertion complex variants, covering all six currently known tumor-related genomic variant types, which improves the precision of the calibration data.
Further, the invention uses a 32-bit unsigned integer as the identifier. At the data-structure level it is a single integer value, yet all the information needed to describe a variant can be parsed from it, including site, variant type, genotype and other attributes, which saves storage space and improves computational efficiency.
Further, the invention can generate data that reflect purity and the proportions of different subclones, which removes the other reason existing methods are unsuitable for tumor genome data processing pipelines and allows the data to fit tumor heterogeneity correctly. On the one hand, tumor purity is unavoidable under current second-generation sequencing technology. On the other hand, tumor purity and subclone proportions are precisely among the key targets when calibrating a tumor genome data processing pipeline.
The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.
Description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 shows the variant allele frequency distribution from the simulation experiment of the invention;
Fig. 3 compares the method of the invention with two other methods on real data.
Detailed description of the embodiments
Referring to Fig. 1, the invention discloses a pipeline calibration method for second-generation tumor genome high-throughput sequencing data. It uses 32-bit unsigned integers as identifiers to record each germline variant, generates read data that reflect purity and the proportions of different subclones, and, following the parent-child inheritance and sibling mutual-exclusion relationships among variants, obtains somatic variant calibration data for child subclones and their sibling subclones, which are used to calibrate the processing pipeline for the second-generation tumor genome high-throughput sequencing data.
The algorithm for generating variant data under the parent-child inheritance and sibling mutual-exclusion relationships is as follows:
S1, as noted above, tumor cells inherit germline variants, so a set of germline variant data is generated first.
The specific generation method is as follows:
S11, read the reference genome sequence;
S12, select variant sites on the reference genome sequence according to preset conditions;
S13, determine the variant type, genotype and other attributes of each variant site;
S14, store the variant data of the sites determined in step S13 in the nor.sim file in the corresponding 32-bit unsigned integer format, and also store the heterozygous variants among them in the nor_AB.idx file.
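Steps S11-S14 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `generate_germline`, the variant rate, the heterozygosity probability, and the in-memory dict and set standing in for nor.sim and nor_AB.idx are all hypothetical, and only single-point variants are generated.

```python
import random

BASES = "ACGT"

def generate_germline(reference, rate=0.01, het_prob=0.6, seed=0):
    """Pick germline variant sites on `reference` (hypothetical helper).

    Returns (nor_sim, nor_ab_idx): a dict site -> (alt_base, genotype)
    standing in for nor.sim, and a set of heterozygous sites standing
    in for nor_AB.idx.
    """
    rng = random.Random(seed)
    nor_sim, nor_ab_idx = {}, set()
    for pos, ref_base in enumerate(reference):
        if rng.random() >= rate:            # S12: assumed preset condition
            continue
        # S13: choose an alternative base and a genotype for the site
        alt = rng.choice([b for b in BASES if b != ref_base])
        genotype = "AB" if rng.random() < het_prob else "BB"
        nor_sim[pos] = (alt, genotype)      # S14: record every variant
        if genotype == "AB":
            nor_ab_idx.add(pos)             # S14: index heterozygous ones
    return nor_sim, nor_ab_idx

reference = "ACGT" * 250                    # toy reference sequence
nor_sim, nor_ab_idx = generate_germline(reference, rate=0.05)
```

Keeping the heterozygous sites in a separate index lets later steps test "in nor.sim but not in nor_AB.idx" with two cheap membership checks.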
S2, next, generate the somatic variant data of the first-generation clone.
The specific generation method is as follows:
S21, read three files: the reference genome sequence, nor.sim and nor_AB.idx;
S22, select the variant sites of the first-generation clone on the reference genome sequence according to preset conditions;
S23, check each variant site: if the site appears in nor.sim but not in nor_AB.idx, it is a homozygous variant, so return to step S22; otherwise determine the variant type and other attributes;
S24, store the variant data of the sites determined in step S23 in the founding_clone.sim file in the corresponding 32-bit unsigned integer format.
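The select-and-reject loop of steps S22-S23 might look like the sketch below. The function name `pick_somatic_sites`, the uniform choice of candidate sites, and the toy stand-ins for nor.sim and nor_AB.idx are assumptions made for illustration; the text only says that sites are chosen "according to preset conditions".

```python
import random

def pick_somatic_sites(ref_len, n_sites, nor_sim, nor_ab_idx, seed=0):
    """Choose somatic sites, rejecting homozygous germline sites (S22-S23)."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_sites:
        pos = rng.randrange(ref_len)                  # S22: candidate site
        homozygous = pos in nor_sim and pos not in nor_ab_idx
        if homozygous or pos in chosen:               # S23: back to S22
            continue
        chosen.add(pos)
    return sorted(chosen)

nor_sim = {3: "BB", 7: "AB", 12: "BB"}    # toy stand-in for nor.sim
nor_ab_idx = {7}                           # toy stand-in for nor_AB.idx
founding_sites = pick_somatic_sites(20, 5, nor_sim, nor_ab_idx)
```

Site 7 is heterozygous in the germline and so remains eligible, while sites 3 and 12 are always rejected.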
S3, next, generate the somatic variant data of the child subclones of the first-generation clone.
The specific generation method is as follows:
S31, read the reference genome sequence, nor.sim, nor_AB.idx and founding_clone.sim;
S32, select variant sites on the reference genome sequence according to preset conditions;
S33, check each variant site: if the site appears in nor.sim and founding_clone.sim but not in nor_AB.idx, it is a homozygous variant, so return to step S32; otherwise determine the variant type and other attributes;
S34, store the variant data of the sites determined in step S33 in the subcloneX.sim file in the corresponding 32-bit unsigned integer format, where X is the file number.
S35, store the variants specific to the subcloneX.sim file in the subcloneX_uniq.idx file, where X is the file number. If somatic variant data for several generations of child subclones are required, repeat steps S31-S35.
S4, finally, generate the somatic variant data of the sibling subclones of the child subclones.
The specific generation method is as follows:
S41, read the reference genome sequence, nor.sim, nor_AB.idx, subcloneX.sim and subcloneX_uniq.idx;
S42, select variant sites on the reference genome sequence according to preset conditions;
S43, check each variant site: if the site appears in nor.sim and subcloneX.sim but not in nor_AB.idx, it is a homozygous variant, so return to step S42; otherwise check whether it already exists in the subclone*_uniq.idx files of the same-generation subclones, where * denotes all subclones of the same generation; if it does, return to step S42; otherwise determine the variant type and other attributes;
S44, store the variant data of the sites determined in step S43 in the subcloneY.sim file in the corresponding 32-bit unsigned integer format, where Y is the file number.
S45, store the variants specific to the subcloneY.sim file in the subcloneY_uniq.idx file, where Y is the file number. If somatic variant data for several sibling subclones are required, repeat steps S41-S45.
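The sibling mutual-exclusion check in S43 can be illustrated with plain sets. In this sketch, `pick_sibling_sites`, the precomputed set of homozygous sites, and the dict of per-subclone unique-site sets (standing in for the subclone*_uniq.idx files) are hypothetical names and data structures, not part of the original text.

```python
import random

def pick_sibling_sites(ref_len, n_sites, homozygous_sites, sibling_uniq, seed=0):
    """Choose sites for a new sibling subclone (S42-S43).

    sibling_uniq maps each existing same-generation subclone to the set of
    sites that are specific to it (its subclone*_uniq.idx content).
    """
    rng = random.Random(seed)
    taken = set().union(*sibling_uniq.values())   # all sibling-specific sites
    chosen = set()
    while len(chosen) < n_sites:
        pos = rng.randrange(ref_len)              # S42: candidate site
        if pos in homozygous_sites:               # S43: homozygous, retry
            continue
        if pos in taken or pos in chosen:         # S43: mutual exclusion
            continue
        chosen.add(pos)
    return chosen

sibling_uniq = {"subclone1": {2, 5, 9}}
subclone2_uniq = pick_sibling_sites(30, 4, {0, 1}, sibling_uniq)
sibling_uniq["subclone2"] = subclone2_uniq        # S45: register new sibling
```

Registering each new sibling's specific sites before generating the next one is what keeps every pair of siblings disjoint when S41-S45 are repeated.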
The method uses 32-bit unsigned integers as identifiers to record each variant.
Each 32-bit identifier consists of five parts. The i-th record is $s_i = \langle inl_i, ins_i, v_i, g_i, c_i \rangle$, where $s_i$ denotes the i-th record, $c_i$ the base symbol, $g_i$ the genotype, $v_i$ the variant type, $ins_i$ the bases of the inserted sequence, and $inl_i$ the length of the short inserted sequence ($inl_i < 10$ bp, where bp denotes base pairs).
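One way to pack the five fields into a single 32-bit unsigned integer is shown below. The text does not give field widths or numeric codes, so everything here is an assumption: 3 bits for the base symbol, 2 for the genotype, 4 for the variant type, 18 for up to nine inserted bases at 2 bits each, and 4 for the insertion length, for a total of 31 bits. The function names `pack` and `unpack` are likewise hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

# Assumed field widths in bits, least significant first: c, g, v, ins, inl.
C_BITS, G_BITS, V_BITS, INS_BITS, INL_BITS = 3, 2, 4, 18, 4

def pack(c, g, v, ins=""):
    """Encode <inl, ins, v, g, c> as one integer below 2**32."""
    if len(ins) >= 10:
        raise ValueError("short insertions must be shorter than 10 bp")
    if not (0 <= g < 1 << G_BITS and 0 <= v < 1 << V_BITS):
        raise ValueError("genotype or variant type code out of range")
    ins_code = 0
    for base in ins:                       # 2 bits per inserted base
        ins_code = (ins_code << 2) | BASE_CODE[base]
    word = len(ins)                        # inl occupies the top field
    word = (word << INS_BITS) | ins_code
    word = (word << V_BITS) | v
    word = (word << G_BITS) | g
    word = (word << C_BITS) | BASE_CODE[c]
    return word

def unpack(word):
    """Decode an integer produced by pack() back into (c, g, v, ins)."""
    fields = []
    for bits in (C_BITS, G_BITS, V_BITS, INS_BITS, INL_BITS):
        fields.append(word & ((1 << bits) - 1))
        word >>= bits
    c_code, g, v, ins_code, inl = fields
    ins = "".join(CODE_BASE[(ins_code >> (2 * (inl - 1 - k))) & 3]
                  for k in range(inl))
    return CODE_BASE[c_code], g, v, ins

record = pack("G", g=1, v=2, ins="GAT")
```

Storing the length separately is necessary because leading `A` bases encode as zero bits and would otherwise be lost when the inserted sequence is decoded.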
The invention generates read data that reflect purity and the proportions of different subclones in the following steps:
Step 1, read all .sim files;
Step 2, set the proportion of each .sim file, such that the proportions of all .sim files sum to 1;
Step 3, select a .sim file with probability equal to its proportion, then randomly sample paired-end or single-end reads from the sequence updated with that file's variants. Paired-end reads are stored in the left.fq and right.fq files respectively, with the insert size following a normal distribution. Single-end reads are stored in the single.fq file.
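The three sampling steps can be sketched for the paired-end case as follows. The function name `sample_paired_reads`, the read length, the insert-size mean and standard deviation, and the random toy sequences are assumptions; the two returned lists stand in for left.fq and right.fq, and base qualities and sequencing errors are omitted.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def sample_paired_reads(clone_seqs, proportions, n_pairs, read_len=50,
                        insert_mean=200.0, insert_sd=20.0, seed=0):
    """clone_seqs: name -> sequence already updated with that .sim file."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:      # step 2
        raise ValueError("proportions must sum to 1")
    rng = random.Random(seed)
    names = list(clone_seqs)
    weights = [proportions[name] for name in names]
    left, right = [], []
    while len(left) < n_pairs:
        name = rng.choices(names, weights=weights)[0]    # step 3: pick a file
        seq = clone_seqs[name]
        insert = round(rng.gauss(insert_mean, insert_sd))  # normal insert size
        if not read_len <= insert <= len(seq):
            continue
        start = rng.randrange(len(seq) - insert + 1)
        fragment = seq[start:start + insert]
        left.append((name, fragment[:read_len]))
        # The mate is read from the opposite strand of the fragment's far end.
        right.append((name, fragment[-read_len:].translate(COMPLEMENT)[::-1]))
    return left, right

rng = random.Random(1)
clone_seqs = {name: "".join(rng.choice("ACGT") for _ in range(1000))
              for name in ("nor", "founding_clone", "subclone1")}
# Example: 90% purity with a 3:7 split between the two tumor clones.
proportions = {"nor": 0.10, "founding_clone": 0.27, "subclone1": 0.63}
left, right = sample_paired_reads(clone_seqs, proportions, n_pairs=200)
```

Giving the normal-cell file a proportion of one minus the purity is how purity enters the mixture in this sketch.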
The variant types the method can generate are: single-point variants, short insertion-type structural variants, short deletion-type structural variants, long insertion-type structural variants, long deletion-type structural variants, tandem-repeat structural variants, inversion-type structural variants, and deletion-insertion complex structural variants.
Each type of variant is generated as follows:
Single-point variants: for a selected site, read the base of the reference genome sequence at that site. Using the single-point variant rate set by the user as the mutation probability, randomly generate a base that differs from the reference base and substitute it for the original base, producing a single-point variant.
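A single-point variant at one selected site could be produced as in this sketch; the function name `maybe_snv` and its signature are illustrative assumptions.

```python
import random

def maybe_snv(seq, pos, snv_rate, rng):
    """With probability snv_rate, replace seq[pos] by a different base."""
    if rng.random() >= snv_rate:
        return seq                              # site left unchanged
    ref_base = seq[pos]
    alt = rng.choice([b for b in "ACGT" if b != ref_base])
    return seq[:pos] + alt + seq[pos + 1:]

rng = random.Random(0)
mutated = maybe_snv("ACGTACGT", 3, snv_rate=1.0, rng=rng)
untouched = maybe_snv("ACGTACGT", 3, snv_rate=0.0, rng=rng)
```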
Short insertion-type structural variants: for a selected site, insert a segment of bases between that site and the next site. The length and bases of the insertion can be generated randomly or specified by the user.
Short deletion-type structural variants: for a selected site, delete a segment of bases starting at that site. The length can be generated randomly or specified by the user.
Long insertion/deletion-type structural variants: for a selected insertion/deletion position, read the insertion/deletion condition file specified by the user, and generate the long insertion or deletion according to the position, sequence content, sequence length and inserted genotype set in that file.
Tandem-repeat structural variants: for a selected site, read a segment of bases starting at the site; its length can be generated randomly or specified by the user. Then generate an insertion-type structural variant at the site, where the inserted bases are either the segment just read or, with a preset probability, that segment after a single-point variant has been applied. Repeat the previous step the number of times specified by the user to produce a tandem-repeat structural variant.
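The tandem-repeat procedure can be sketched as repeated insertion of a possibly mutated copy of the repeat unit. The function name `tandem_repeat` and the choice to place the copies directly after the original unit are assumptions.

```python
import random

def tandem_repeat(seq, pos, unit_len, copies, snv_prob, rng):
    """Insert `copies` extra copies of seq[pos:pos+unit_len] after the unit."""
    unit = seq[pos:pos + unit_len]
    inserted = []
    for _ in range(copies):
        copy = list(unit)
        if rng.random() < snv_prob:          # occasionally mutate one base
            k = rng.randrange(len(copy))
            copy[k] = rng.choice([b for b in "ACGT" if b != copy[k]])
        inserted.append("".join(copy))
    end = pos + unit_len
    return seq[:end] + "".join(inserted) + seq[end:]

rng = random.Random(0)
exact = tandem_repeat("AACGTT", 1, 3, copies=2, snv_prob=0.0, rng=rng)
noisy = tandem_repeat("AACGTT", 1, 3, copies=2, snv_prob=1.0, rng=rng)
```

With `snv_prob=0.0` the unit `ACG` is copied exactly, giving `AACGACGACGTT`; with a positive probability the copies diverge slightly from the unit.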
Inversion-type structural variants: for a selected site, read a segment of bases starting at the site; its length can be generated randomly or specified by the user. Then take the reverse complement of the segment and replace the original segment with it, producing an inversion-type structural variant. Small deviations in the inverted length and single-point variants within the inverted region are allowed.
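The core of the inversion step is a reverse complement applied in place, as in this sketch. The function name `invert` is hypothetical, and the optional length deviation and single-point variants mentioned above are left out.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def invert(seq, pos, length):
    """Replace seq[pos:pos+length] with its reverse complement."""
    segment = seq[pos:pos + length]
    flipped = segment.translate(COMPLEMENT)[::-1]
    return seq[:pos] + flipped + seq[pos + length:]

inverted = invert("AACCGGTT", 2, 3)      # segment CCG becomes CGG
restored = invert(inverted, 2, 3)        # inverting twice undoes the change
```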
Deletion-insertion complex structural variants: for a selected site, first generate a short deletion-type structural variant, then generate a short insertion-type structural variant at the same site. The inserted short sequence may differ slightly in length from the deleted sequence.
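A deletion followed by an insertion at the same site can be sketched as below. The function name `deletion_insertion`, the random choice of inserted bases, and the `max_len_diff` parameter bounding the length difference are assumptions for illustration.

```python
import random

def deletion_insertion(seq, pos, del_len, rng, max_len_diff=1):
    """Delete del_len bases at pos, then insert a similar-length sequence."""
    ins_len = max(1, del_len + rng.randint(-max_len_diff, max_len_diff))
    inserted = "".join(rng.choice("ACGT") for _ in range(ins_len))
    return seq[:pos] + inserted + seq[pos + del_len:], inserted

rng = random.Random(0)
source = "AAAACCCCGGGGTTTT"
result, inserted = deletion_insertion(source, 4, 4, rng)
```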
The practicality and effectiveness of the method are demonstrated below with a simulation experiment and with results on real data:
(1) Simulation experiment
One chromosome is selected at random from the human reference genome sequence as a simplified reference sequence. Two subclones are considered, where $S_1$ is the first-generation clone and $S_2$ is a child subclone of the first-generation clone; their proportions are 3:7 and the tumor purity is 90%. The resulting variant allele frequency distribution is shown in Fig. 2. The distribution in the figure is consistent with the theoretically calculated results.
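The theoretical values for this setup can be approximated with a simple calculation. The sketch assumes diploid sites, heterozygous variants with one mutated copy, no copy-number changes, and that the 3:7 ratio is the split of tumor cells between $S_1$ and $S_2$; none of these assumptions is stated explicitly in the text, and the function name `expected_vaf` is hypothetical.

```python
def expected_vaf(cell_fraction, alt_copies=1, ploidy=2):
    """Expected variant allele frequency for a variant carried by the given
    fraction of all cells in the sample."""
    return cell_fraction * alt_copies / ploidy

purity = 0.90
sample_fraction = {
    "normal": 1.0 - purity,
    "S1": purity * 0.3,
    "S2": purity * 0.7,
}

# Germline heterozygous variants are present in every cell.
vaf_germline_het = expected_vaf(1.0)
# Variants of the first-generation clone are inherited by S2 as well.
vaf_founding = expected_vaf(sample_fraction["S1"] + sample_fraction["S2"])
# Variants specific to S2 are carried only by S2 cells.
vaf_s2_only = expected_vaf(sample_fraction["S2"])
```

Under these assumptions the three classes of variants would be expected to cluster near 0.5, 0.45 and 0.315 respectively.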
(2) Results on real data
Three tumor samples from The Cancer Genome Atlas (TCGA) project were selected, and three methods were compared: Wgsim, SInC, and TNSim, the software package implementing the present method. The results are shown in Fig. 3. The three columns correspond to the three samples, whose identifiers in the TCGA database are AB-2968, BH-A18P and B5-A0JV. The three panels in the first row show the true subclone structure obtained by analyses published in Nature. Rows two to four show the results for Wgsim, SInC and TNSim respectively, each produced with the same data analysis pipeline from the known variant data and parameters. The closer a result is to the true result, the more accurately the calibration data can calibrate a pipeline. As shown in Fig. 3, SInC and Wgsim suffer from peak shift (for example on sample AB-2968) and false positive peaks (for example Wgsim on BH-A18P and B5-A0JV), both of which lead to calibration errors. By contrast, the present method shows essentially no peak shift and the fewest false positive peaks.
The above merely illustrates the technical idea of the invention and does not limit its scope of protection. Any change made on the basis of the technical solution in accordance with the technical idea proposed by the invention falls within the scope of protection of the claims.

Claims (7)

1. A process correction method for second-generation tumor genome high-throughput sequencing data, characterized in that: a series of 32-bit unsigned numbers is used as flags to record each germline or somatic variant; read data reflecting purity and the proportions of different subclones are generated; and, according to the variation relationships of parent-child subclone inheritance and sibling subclone mutual exclusion, somatic variation calibration data are obtained for child subclones and their sibling subclones, for correcting the processing pipeline of the second-generation tumor genome high-throughput sequencing data;
The variation data types generated by the method include: single-point variation, insertion-type structural variation, deletion-type structural variation, and inversion-type structural variation;
Generating the data for the variation relationships of parent-child subclone inheritance and sibling subclone mutual exclusion comprises the following steps:
S1, reading the reference genome sequence and determining a set of variation relationship data for parent-child subclone inheritance and sibling subclone mutual exclusion, specifically comprising the following steps:
S11, reading the reference genome sequence;
S12, selecting variant sites on the reference genome sequence according to preset conditions;
S13, determining the variation type, genotype, and other attributes of each variant site;
S14, storing the variants at the sites determined in step S13 in the nor.sim file in the corresponding 32-bit unsigned number format, and also storing the heterozygous variants among them in the nor_AB.idx file;
S2, generating the somatic variation relationship data of the first-generation clone according to the variation relationship data of step S1;
S3, generating the somatic variation relationship data of the child subclones of the first-generation clone according to the somatic variation relationship data of step S2;
S4, generating the somatic variation relationship data of the sibling subclones of the child subclones according to the somatic variation relationship data of step S3.
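Steps S11-S14 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: uniform random sampling stands in for the unspecified "preset conditions", the labels in `VARIATION_TYPES` and `GENOTYPES` are hypothetical, and an in-memory dict and set stand in for the nor.sim and nor_AB.idx files, whose on-disk layout the claims do not give.

```python
import random

# Hypothetical labels: the claims name the variation types but not how they are encoded.
VARIATION_TYPES = ["single_point", "insertion", "deletion", "inversion"]
GENOTYPES = ["AB", "BB"]  # assumed: AB = heterozygous, BB = homozygous variant


def generate_germline(reference, n_sites, rng):
    """Sketch of S12-S14: choose sites, assign attributes, index heterozygous ones."""
    # S12: uniform sampling of distinct positions stands in for the "preset conditions".
    positions = sorted(rng.sample(range(len(reference)), n_sites))
    nor_sim = {}        # stands in for nor.sim: position -> variant record
    nor_ab_idx = set()  # stands in for nor_AB.idx: positions of heterozygous variants
    for pos in positions:
        # S13: decide the variation type and genotype of this site.
        record = {"type": rng.choice(VARIATION_TYPES), "genotype": rng.choice(GENOTYPES)}
        # S14: store every variant, and index the heterozygous ones separately.
        nor_sim[pos] = record
        if record["genotype"] == "AB":
            nor_ab_idx.add(pos)
    return nor_sim, nor_ab_idx


# S11: a random toy sequence stands in for the reference genome.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(1000))
nor_sim, nor_ab_idx = generate_germline(reference, 20, rng)
```

Keeping the heterozygous positions in a separate index is what lets the later steps test a site for homozygosity with two membership checks.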
2. The process correction method for second-generation tumor genome high-throughput sequencing data according to claim 1, characterized in that step S2 specifically comprises the following steps:
S21, reading three files: the reference genome sequence, nor.sim, and nor_AB.idx;
S22, selecting the variant sites of the first-generation clone on the reference genome sequence according to preset conditions;
S23, checking each variant site: if the site appears in nor.sim but not in nor_AB.idx, i.e., it is a homozygous variant, returning to step S22; otherwise determining the variation type and other attributes;
S24, storing the variants at the sites determined in step S23 in the founding_clone.sim file in the corresponding 32-bit unsigned number format.
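The S23 check reduces to two membership tests. The sketch below is a hedged illustration: the function names are hypothetical, a dict and a set stand in for nor.sim and nor_AB.idx, and rejecting a candidate corresponds to "returning to step S22" to choose another site.

```python
def is_homozygous_site(pos, nor_sim, nor_ab_idx):
    """S23 test: recorded in nor.sim but absent from nor_AB.idx means homozygous."""
    return pos in nor_sim and pos not in nor_ab_idx


def pick_founding_sites(candidates, nor_sim, nor_ab_idx):
    """Keep candidates that pass S23; rejected ones would be redrawn in S22."""
    return [pos for pos in candidates if not is_homozygous_site(pos, nor_sim, nor_ab_idx)]


# Toy data: positions 10 and 20 carry germline variants; only 20 is heterozygous.
nor_sim = {10: "BB", 20: "AB"}
nor_ab_idx = {20}
founding_sites = pick_founding_sites([5, 10, 20, 30], nor_sim, nor_ab_idx)
print(founding_sites)  # [5, 20, 30]
```

Position 10 is dropped because both alleles there already carry a germline variant, while the heterozygous site 20 and the untouched sites 5 and 30 remain eligible for a somatic variant.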
3. The process correction method for second-generation tumor genome high-throughput sequencing data according to claim 2, characterized in that step S3 specifically comprises the following steps:
S31, reading the reference genome sequence, nor.sim, nor_AB.idx, and founding_clone.sim;
S32, selecting variant sites on the reference genome sequence according to preset conditions;
S33, checking each variant site: if the site appears in nor.sim and founding_clone.sim but not in nor_AB.idx, i.e., it is a homozygous variant, returning to step S32; otherwise determining the variation type and other attributes;
S34, storing the variants at the sites determined in step S33 in the subcloneX.sim file in the corresponding 32-bit unsigned number format, where X is the file number;
S35, storing the specific variants in the subcloneX.sim file in the subcloneX_uniq.idx file, where X is the file number; if somatic variation data for multiple generations of child subclones need to be generated, steps S31-S35 are repeated.
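A minimal sketch of how a child subclone could inherit from its parent while recording its own specific variants is given below. Several points are assumptions rather than statements of the claim: a site is treated as already recorded if it appears in either nor.sim or founding_clone.sim, the child is assumed to carry every parent variant, and dicts and sets stand in for the .sim and .idx files. The function name is hypothetical.

```python
def make_child_subclone(candidates, nor_sim, nor_ab_idx, parent_sim):
    """Sketch of S32-S35 for one child subclone of the founding clone."""
    # Assumed reading of S33: "recorded" means present in either file.
    occupied = set(nor_sim) | set(parent_sim)
    new_sites = []
    for pos in candidates:
        # S33: a recorded site that is not heterozygous is treated as homozygous and skipped.
        if pos in occupied and pos not in nor_ab_idx:
            continue
        new_sites.append(pos)
    # S34 (assumed inheritance rule): the child keeps every parent variant and adds its own.
    child_sim = dict(parent_sim)
    for pos in new_sites:
        child_sim[pos] = "somatic"
    # S35: index of the variants specific to this child.
    child_uniq_idx = set(new_sites) - set(parent_sim)
    return child_sim, child_uniq_idx


# Toy data: germline variants at 10 (homozygous) and 20 (heterozygous); founding clone variant at 30.
nor_sim = {10: "BB", 20: "AB"}
nor_ab_idx = {20}
founding_clone_sim = {30: "somatic"}
child_sim, child_uniq_idx = make_child_subclone([10, 20, 30, 40], nor_sim, nor_ab_idx, founding_clone_sim)
print(sorted(child_sim))       # [20, 30, 40]
print(sorted(child_uniq_idx))  # [20, 40]
```

Repeating the call with the returned `child_sim` as the new `parent_sim` would give a further generation, which matches the repetition of steps S31-S35.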
4. The process correction method for second-generation tumor genome high-throughput sequencing data according to claim 3, characterized in that step S4 specifically comprises the following steps:
S41, reading the reference genome sequence, nor.sim, nor_AB.idx, subcloneX.sim, and subcloneX_uniq.idx;
S42, selecting variant sites on the reference genome sequence according to preset conditions;
S43, checking each variant site: if the site appears in nor.sim and subcloneX.sim but not in nor_AB.idx, i.e., it is a homozygous variant, returning to step S42; otherwise checking whether it is already present in the subclone*_uniq.idx files of the same-generation subclones, where * denotes all subclones of the same generation; if it is present, returning to step S42; otherwise determining the variation type and other attributes;
S44, storing the variants at the sites determined in step S43 in the subcloneY.sim file in the corresponding 32-bit unsigned number format, where Y is the file number;
S45, storing the specific variants in the subcloneY.sim file in the subcloneY_uniq.idx file, where Y is the file number; if somatic variation data for multiple sibling subclones need to be generated, steps S41-S45 are repeated.
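The sibling mutual-exclusion rule of S43 adds one check on top of the homozygosity test: a candidate site is rejected if any same-generation subclone already lists it as a specific variant. The sketch below illustrates this under assumptions: a site counts as recorded if it appears in either nor.sim or subcloneX.sim, sets stand in for the subclone*_uniq.idx files, and the function name is hypothetical.

```python
def pick_sibling_sites(candidates, nor_sim, nor_ab_idx, subclone_x_sim, uniq_indexes):
    """Sketch of the S43 filter for a sibling subclone Y."""
    accepted = []
    for pos in candidates:
        # Assumed reading: "recorded" means present in either nor.sim or subcloneX.sim.
        recorded = pos in nor_sim or pos in subclone_x_sim
        if recorded and pos not in nor_ab_idx:
            continue  # homozygous site: go back to S42
        if any(pos in idx for idx in uniq_indexes):
            continue  # already specific to a same-generation sibling: mutual exclusion
        accepted.append(pos)
    return accepted


# Toy data: heterozygous germline sites at 20 and 25; sibling X has specific variants at 20 and 40.
nor_sim = {10: "BB", 20: "AB", 25: "AB"}
nor_ab_idx = {20, 25}
subclone_x_sim = {30: "somatic", 20: "somatic", 40: "somatic"}
subclone_x_uniq_idx = {20, 40}
sibling_sites = pick_sibling_sites([10, 20, 25, 40, 50], nor_sim, nor_ab_idx,
                                   subclone_x_sim, [subclone_x_uniq_idx])
print(sibling_sites)  # [25, 50]
```

Site 20 passes the homozygosity test but is excluded because sibling X already claims it, which is the mutual exclusion between sibling subclones that the claim describes.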
5. The process correction method for second-generation tumor genome high-throughput sequencing data according to claim 1, characterized in that the insertion-type structural variation includes: short insertion-type structural variation, long insertion-type structural variation, deletion-insertion complex structural variation, and tandem-duplication-type structural variation.
6. The process correction method for second-generation tumor genome high-throughput sequencing data according to claim 1, characterized in that the 32-bit unsigned number used as a flag consists of 5 parts, as follows:
$s_i = \langle inl_i, ins_i, v_i, g_i, c_i \rangle$
where $s_i$ denotes the i-th record, $c_i$ the base symbol, $g_i$ the genotype, $v_i$ the variation type, $ins_i$ the base symbols of the inserted sequence, and $inl_i$ the length of the short inserted sequence, with $inl_i < 10$ bp, where bp denotes the number of base pairs.
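To make the five-part record concrete, the following Python sketch packs and unpacks one such record in a single unsigned integer below 2^32. The claim does not state the bit widths or the field order inside the word, so the layout here is purely an assumption: 3 bits for the base symbol, 2 for the genotype, 4 for the variation type, 18 for up to nine inserted bases at 2 bits each, and 4 for the insertion length, 31 bits in total. The function names and code tables are hypothetical.

```python
INS_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}           # 2 bits per inserted base
C_CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}     # 3 bits for the base symbol
C_SYMBOL = {code: base for base, code in C_CODE.items()}

# Assumed bit widths (not given in the claim); they add up to 31 bits.
C_BITS, G_BITS, V_BITS, INS_BITS = 3, 2, 4, 18


def pack_record(inserted, v, g, c):
    """Pack <inl, ins, v, g, c> into one integer; inl is the length of `inserted`."""
    inl = len(inserted)
    if inl >= 10:
        raise ValueError("short insertions must be shorter than 10 bp")
    if not (0 <= v < (1 << V_BITS) and 0 <= g < (1 << G_BITS)):
        raise ValueError("variation type or genotype code out of range")
    ins = 0
    for base in inserted:
        ins = (ins << 2) | INS_CODE[base]
    value = inl                          # highest field
    value = (value << INS_BITS) | ins
    value = (value << V_BITS) | v
    value = (value << G_BITS) | g
    value = (value << C_BITS) | C_CODE[c]  # lowest field
    return value


def unpack_record(value):
    """Inverse of pack_record; returns (inserted, v, g, c)."""
    c = C_SYMBOL[value & ((1 << C_BITS) - 1)]
    value >>= C_BITS
    g = value & ((1 << G_BITS) - 1)
    value >>= G_BITS
    v = value & ((1 << V_BITS) - 1)
    value >>= V_BITS
    ins = value & ((1 << INS_BITS) - 1)
    inl = value >> INS_BITS
    # The last inserted base sits in the lowest two bits, so read backwards and reverse.
    bases = ["ACGT"[(ins >> (2 * k)) & 3] for k in range(inl)]
    return "".join(reversed(bases)), v, g, c


packed = pack_record("ACGTT", 2, 1, "G")
print(unpack_record(packed))  # ('ACGTT', 2, 1, 'G')
```

Storing the length explicitly is what allows an insertion such as "AA" to be told apart from "A", since both would otherwise decode from all-zero bits.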
7. The process correction method for second-generation tumor genome high-throughput sequencing data according to claim 1, characterized in that generating the read data reflecting purity and the proportions of different subclones comprises the following steps:
First step, reading all .sim files;
Second step, setting the proportion of each .sim file, such that the proportions of all .sim files sum to 1;
Third step, selecting one .sim file with its proportion as the probability, then randomly sampling paired-end reads or single-end reads from the specific sequence updated according to that file; paired-end read data are stored in the left.fq and right.fq files respectively, with the insert size following a normal distribution; single-end read data are stored in the single.fq file.
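The three steps can be sketched as a weighted mixture sampler. This is a rough illustration under assumptions, not the patented implementation: random toy strings stand in for the variant-updated sequence of each .sim file, the normally distributed quantity is taken to be the fragment (insert) length, Python lists stand in for left.fq and right.fq, and the function and parameter names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sample_paired_reads(sequences, proportions, n_pairs, read_len, insert_mean, insert_sd, rng):
    """Draw paired-end reads from a mixture of per-clone sequences."""
    # Second step: the proportions of all sources must sum to 1.
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    names = list(sequences)
    weights = [proportions[name] for name in names]
    left, right = [], []
    while len(left) < n_pairs:
        # Third step: pick one source with its proportion as the probability.
        name = rng.choices(names, weights=weights, k=1)[0]
        seq = sequences[name]
        # Assumed reading: the fragment (insert) length is normally distributed.
        insert = round(rng.gauss(insert_mean, insert_sd))
        if insert < read_len or insert > len(seq):
            continue  # redraw fragments that cannot hold a read or exceed the sequence
        start = rng.randrange(len(seq) - insert + 1)
        fragment = seq[start:start + insert]
        left.append((name, fragment[:read_len]))
        # The right read is the reverse complement of the fragment's far end.
        right.append((name, fragment[-read_len:].translate(COMPLEMENT)[::-1]))
    return left, right


# First step: toy sequences stand in for the sequences derived from each .sim file.
rng = random.Random(1)
sequences = {
    name: "".join(rng.choice("ACGT") for _ in range(2000))
    for name in ("nor.sim", "founding_clone.sim", "subclone1.sim")
}
proportions = {"nor.sim": 0.4, "founding_clone.sim": 0.35, "subclone1.sim": 0.25}
left, right = sample_paired_reads(sequences, proportions, 200, 50, 300, 30, rng)
```

In this toy setting the share given to nor.sim plays the role of the normal-cell fraction, so a proportion of 0.4 corresponds to a tumor purity of 0.6, and the remaining shares set the relative sizes of the clones.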
CN201611264937.5A 2016-12-30 2016-12-30 For the process bearing calibration of second generation Oncogenome high-flux sequence data Active CN106778072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611264937.5A CN106778072B (en) 2016-12-30 2016-12-30 For the process bearing calibration of second generation Oncogenome high-flux sequence data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611264937.5A CN106778072B (en) 2016-12-30 2016-12-30 For the process bearing calibration of second generation Oncogenome high-flux sequence data

Publications (2)

Publication Number Publication Date
CN106778072A CN106778072A (en) 2017-05-31
CN106778072B true CN106778072B (en) 2019-05-21

Family

ID=58951554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611264937.5A Active CN106778072B (en) 2016-12-30 2016-12-30 For the process bearing calibration of second generation Oncogenome high-flux sequence data

Country Status (1)

Country Link
CN (1) CN106778072B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033749B (en) * 2018-06-29 2020-01-14 裕策医疗器械江苏有限公司 Tumor mutation load detection method, device and storage medium
CN109504751B (en) * 2018-11-28 2022-03-11 锦州医科大学 Deletion variation identification and clone counting method for tumor complex clone structure
CN110491441B (en) * 2019-05-06 2022-04-22 西安交通大学 Gene sequencing data simulation system and method for simulating crowd background information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103374759A (en) * 2012-04-26 2013-10-30 中国科学院上海生命科学研究院 Method for detecting symbolic SNP (Single Nucleotide Polymorphism) of lung cancer metastasis and application thereof
CN103853937A (en) * 2013-11-27 2014-06-11 上海丰核信息科技有限公司 Post processing method for high-throughput sequencing data
CN104133914A (en) * 2014-08-12 2014-11-05 厦门万基生物科技有限公司 Method for removing GC deviations introduced by high throughout sequencing and detecting chromosome copy number variation
CN105320850A (en) * 2014-08-03 2016-02-10 晶能生物技术(上海)有限公司 High-throughput sequencing data matching method
CN105760712A (en) * 2016-03-01 2016-07-13 西安电子科技大学 Copy number variation detection method based on next generation sequencing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3140422A1 (en) * 2014-05-03 2017-03-15 The Regents of The University of California Methods of identifying biomarkers associated with or causative of the progression of disease, in particular for use in prognosticating primary open angle glaucoma

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A gradient-boosting approach for filtering de novo mutations in parent-offspring trios; Yongzhuang Liu, et al.; Bioinformatics; 2014-07-01; Vol. 30, No. 13; pp. 1830-1836
Objective review of de novo stand‐alone error correction methods for NGS data; Andy S. Alic, et al.; Wiley Interdisciplinary Reviews: Computational Molecular Science; 2016-01-11; full text

Also Published As

Publication number Publication date
CN106778072A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
Fan et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data
Rannala et al. Efficient Bayesian species tree inference under the multispecies coalescent
Habier et al. Extension of the Bayesian alphabet for genomic selection
Pritchard et al. Inference of population structure using multilocus genotype data
CN106778072B (en) For the process bearing calibration of second generation Oncogenome high-flux sequence data
US9449143B2 (en) Ancestral-specific reference genomes and uses thereof
Blair et al. Molecular clocks do not support the Cambrian explosion
Holland et al. Estimating effect sizes and expected replication probabilities from GWAS summary statistics
Xiao et al. Modeling three-dimensional chromosome structures using gene expression data
Wu et al. Network biology bridges the gaps between quantitative genetics and multi-omics to map complex diseases
Jin et al. Integrating multi-omics summary data using a Mendelian randomization framework
Yu et al. rcCAE: a convolutional autoencoder method for detecting intra-tumor heterogeneity and single-cell copy number alterations
MacRae Closing the ‘phenotype gap’ in precision medicine: improving what we measure to understand complex disease mechanisms
Pastor et al. A conceptual modeling approach to improve human genome understanding
León Palacio SILE: a method for the efficient management of smart genomic information
CN113035275A (en) Feature extraction method for tumor gene point mutation by combining contour coefficient and RJMCMC algorithm
Wu et al. Two novel models and a parthenogenetic algorithm for detecting common driver pathways from pan-cancer data
Marcolin et al. Efficient k-mer Indexing with Application to Mapping-free SNP Genotyping.
Reyes Román et al. How to deal with Haplotype data: An Extension to the Conceptual Schema of the Human Genome
Llinares-López Significant Pattern Mining for Biomarker Discovery
Wei et al. Genealogical search using whole-genome genotype profiles
Fang et al. Rapid and accurate multi-phenotype imputation for millions of individuals
Markello Improving Sequence Alignment and Variant Calling through the Process of Population and Pedigree-Based Graph Alignment
Trajkovski Functional interpretation of gene expression data
Olosunde Some Statistical Analysis of Poultry Feeds Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191204

Address after: Room 502, 5/F, Building 2, Yard 1, No. 8 Life Garden Road, Changping District, Beijing 102206

Patentee after: Beijing Ginga Medical Laboratory Co., Ltd.

Address before: No. 28 Xianning West Road, Beilin District, Xi'an, Shaanxi Province 710049

Patentee before: Xi'an Jiaotong University
