CN106715711A - Method for determining the sequence of a probe and method for detecting genomic structural variation - Google Patents

Publication number
CN106715711A
Authority
CN
China
Prior art keywords
region
probe
sample
candidate
target sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201480080426.0A
Other languages
Chinese (zh)
Other versions
CN106715711B (en)
Inventor
李剑
王煜
李尉
李金良
赵霞
陈仕平
张现东
刘赛军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
BGI Genomics Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd
Publication of CN106715711A
Application granted
Publication of CN106715711B
Legal status: Active
Anticipated expiration

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a method for determining a probe sequence based on a reference sequence and a method for detecting genomic structural variation. The method for determining a probe sequence comprises: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, the first candidate probe set consisting of a plurality of candidate probes, each of which contains at least one discrete high-frequency SNP; aligning the candidate probes of the first candidate probe set with the reference sequence to obtain an alignment result; performing a first screening of the first candidate probe set based on the alignment result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of predetermined length and assigning each candidate probe of the second candidate probe set to its matching window, thereby determining the positional information of each candidate probe; and performing a second screening of the second candidate probe set, based on the positional information and the allele frequencies of the discrete high-frequency SNPs, to determine the probe sequence.

Description

Method for determining the sequence of a probe and method for detecting genomic structural variation
Priority information
None
Technical field
The present invention relates to the fields of genomics and bioinformatics, and in particular to a method for determining a probe sequence and a method for detecting genomic structural variation.
Background art
DNA copy number variation (CNV) and loss of heterozygosity (LOH) are different types of genomic variation. CNV is a common genomic structural variation that mainly manifests as deletions and duplications at the submicroscopic level, with fragments ranging from 1 kb to several Mb. LOH means that, within a pair of homologous chromosomes, a gene is lost on one chromosome while it is still present on the paired chromosome, so that only homozygous SNPs appear over a long stretch of DNA. When LOH occurs without a change in copy number, i.e. both copies are inherited from a single parent, it is called uniparental disomy (UPD). CNV, LOH and UPD are associated with many common genetic diseases, cancers and other complex diseases. Establishing an accurate, comprehensive, efficient, fast, simple and economical method for detecting CNV, LOH and UPD is therefore of great value for studying chromosomal variation events, clarifying the causes of related diseases and choosing appropriate treatment.
Several detection technologies exist at present. PCR-based technologies include real-time fluorescent quantitative PCR and multiplex ligation-dependent probe amplification (MLPA): real-time fluorescent PCR analyzes one or a few targets per run, whereas MLPA can analyze more than 40 sequences in one run with high sensitivity, but its detection range is limited to the chromosomes and regions targeted by the probes. FISH is generally used to detect a few specific chromosomes and cannot detect unknown regions. Chip-based technologies include array-based comparative genomic hybridization (aCGH) and SNP-array technology; aCGH can detect CNV across the whole genome, but it cannot detect polyploidy and often misses losses of small fragments. Sequencing technologies detect structural variation across the whole genome by whole genome sequencing (WGS), or variation in target regions by target-region sequencing; there are mainly four methods for analyzing CNV: paired-end mapping, read-depth analysis, split-read strategies and sequence assembly comparisons.
With the development of sequencing technology, there is a need for means of finding genomic structural abnormalities from sequencing results, in particular from sequencing results of local regions, including means of finding chromosomal aneuploidy, CNV, insertions and deletions (insertion-deletion, indel), LOH, UPD and SNPs.
Summary of the invention
One aspect of the present invention provides a method for determining a probe sequence based on a reference sequence, comprising the following steps: constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, the first candidate probe set consisting of a plurality of candidate probes, each of which contains at least one discrete high-frequency SNP; aligning the candidate probes of the first candidate probe set with the reference sequence to obtain an alignment result; performing a first screening of the first candidate probe set based on the alignment result to obtain a second candidate probe set; dividing the reference sequence into a plurality of windows of predetermined length and assigning each candidate probe of the second candidate probe set to its matching window, to determine the positional information of each candidate probe; and performing a second screening of the second candidate probe set, based on the positional information and the allele frequencies of the discrete high-frequency SNPs, to determine the probe sequence. Here, a discrete high-frequency SNP site is a site whose allele frequency is greater than 10%, preferably not greater than 90%, and whose physical distance in the reference genome to any other discrete high-frequency SNP site is not less than the candidate probe length; the candidate probe length is 50-250 mer.
Probes obtained with the method of the present invention for determining a probe sequence are used to capture, by hybridization, a plurality of local regions of the genome. The captured local regions can represent the whole genome and reflect whole-genome variation information, and are used to find structural variation across the whole genome.
Another aspect provides a method for detecting genomic structural variation, suitable for detecting chromosomal aneuploidy, copy number variation and insertions and deletions, comprising the following steps. The genomic nucleic acid of a target sample is sequenced to obtain a genome sequencing result consisting of a plurality of reads; optionally, the sequencing includes screening with probes, the probes being obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the present invention. The genome sequencing result can be obtained by extracting genomic DNA, constructing a library according to the operating guide of an existing high-throughput platform and sequencing on the machine; it can also be obtained by capturing the genome of the target sample with probes and sequencing, where the probes can be obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the present invention. The reference genome is divided into m regions, and the coverage depth $TD_i$ of target-sample genome region i is calculated from the reads of the genome sequencing result that fall into region i, where m and i are natural numbers, $1 \le i \le m$ and $m > 10$. Based on the degree of difference between the coverage depth of target-sample genome region i and the coverage depths of region i of k reference samples, it is judged whether a structural variation has occurred in target-sample region i, where k is a natural number and $k \ge 2$; the coverage depth of region i of each reference sample can be obtained in the same way as the coverage depth of target-sample region i. By merging neighboring regions in which structural variation has occurred and testing the merged region, it can further be detected whether a large structural variation has occurred, in other words whether the structural variation in region i spans several regions.
Another aspect of the present invention provides a method suitable for detecting another type of genomic structural variation, namely loss of heterozygosity, comprising the following steps. A genome sequencing result of a target sample is obtained; optionally, the genome sequencing result is obtained by capturing the genome of the target sample with probes and sequencing, the probes being obtained by the method for determining a probe sequence based on a reference sequence provided by one aspect of the present invention. The genome is divided into m' regions. Based on the reads of the genome sequencing result that fall in region i and on the population data for region i, the set of SNPs shared by target-sample genome region i and population region i is obtained. The heterozygosity of the fragment in which each SNP site of the shared SNP set lies is calculated for the target sample and for the population respectively, giving the heterozygosity set of target-sample genome region i and the heterozygosity set of population region i, and the two heterozygosity sets are compared to determine whether loss of heterozygosity has occurred in target-sample region i. The allele frequency of every SNP in the shared SNP set is greater than 0.1; the fragment in which a SNP site of the shared SNP set lies is bounded by the two SNPs adjacent to that SNP upstream and downstream; m' and i are natural numbers, $m' \ge i \ge 1$ and $m' \ge 6$. How many samples must be drawn to reflect the population faithfully can be determined according to the accuracy required for detection, the statistical method, the distribution of the sample data and so on. The population data consist of data from multiple samples of the same species and can be obtained by whole genome sequencing, by the method used to obtain the target sample data, or from completed and published databases or websites, such as the 1000 Genomes Project data.
Another aspect of the present invention provides a computer-readable storage medium for storing a program to be executed by a computer. A person of ordinary skill in the art will understand that, when the program is executed, all or part of the steps of the above methods for detecting genomic structural variation can be completed by instructing the relevant hardware. The storage medium may include read-only memory, random access memory, a magnetic disk, an optical disc and the like.
A last aspect of the present invention provides a device for detecting genomic structural variation, including: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor, in data connection with the data input unit, the data output unit and the storage unit, for executing the executable program stored in the storage unit, where execution of the program includes completing all or part of the steps of the above methods for detecting genomic structural variation.
With probes obtained by the method of the present invention for determining a probe sequence based on a reference sequence, target-region capture sequencing is performed using the probes or a solid-phase or liquid-phase chip containing them. This makes it possible to detect structural variation across the whole genome at low sequencing cost, including detection of CNV, LOH and UPD covering the 23 pairs of human chromosomes, and the detection resolution can be adjusted as required by adjusting the average spacing of the probes, i.e. by adding or removing SNP sites. Combining the target-region capture sequencing of the present invention with bioinformatic analysis achieves high-resolution, high-accuracy, high-throughput and low-cost detection of CNV, LOH and UPD across the whole genome. The method of the present invention for detecting genomic structural variation is also suitable for detecting chromosomal aneuploidy, SNPs and indels, and for structural variation analysis based on whole-genome sequencing data.
Additional aspects and advantages of the present invention will be set forth in part in the description below, and in part will become apparent from that description or be learned by practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments in conjunction with the following drawings, in which:
Fig. 1 is a schematic diagram of the characteristics of the SeTR probes across the whole genome in one embodiment of the present invention: (A) bar chart of the SeTR probe sequences; (B) distribution of the physical distance between pairs of probes among the SeTR probes.
Fig. 2 shows test results for the SeTR probes in one embodiment of the present invention: (A) coverage depth distribution of the target regions; (B) distribution of reads supporting the ref base type and the non-ref base type.
Fig. 3 is a schematic flow chart of the detection of CNV, LOH and UPD in one embodiment of the present invention.
Fig. 4 is the baseline nomograph in one embodiment of the present invention.
Fig. 5 is a schematic diagram of the genomic structural variation detected in a sample (GM50275) in one embodiment of the present invention; the rings, from outside to inside, show I) chromosome information, II) the change of the […] values (wave), III) the change of the P values corresponding to Rhet, and IV) the change of the Rhet values (points).
Detailed description of the invention
Embodiments of the present invention are described in detail below. The embodiments described with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. Further, in the description of the present invention, "a plurality of" means two or more unless otherwise stated.
According to one embodiment of the present invention, a method for determining a probe sequence based on a reference sequence is provided, comprising the following steps:
Step 1: Construct the first candidate probe set
The first candidate probe set is constructed from discrete high-frequency SNP sites distributed across the genome. Each candidate probe in the first candidate probe set contains at least one discrete high-frequency SNP site. A discrete high-frequency SNP site is a site whose allele frequency is greater than 10% and whose physical distance in the reference genome to any other discrete high-frequency SNP site is not less than the candidate probe length; the candidate probe length is 50-250 mer.
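The site definition above can be sketched in Python. This is only an illustration under stated assumptions: the `Snp` record, the function name and the strict reading of "discrete" (a high-frequency SNP is dropped if any other high-frequency SNP on the same chromosome lies closer than one probe length) are hypothetical choices, and the optional 90% upper bound is applied as "not greater than".

```python
from typing import List, NamedTuple


class Snp(NamedTuple):
    chrom: str   # chromosome name
    pos: int     # coordinate on the reference
    af: float    # allele frequency in the range 0..1


def select_discrete_high_freq_snps(snps: List[Snp], probe_len: int = 100,
                                   min_af: float = 0.10,
                                   max_af: float = 0.90) -> List[Snp]:
    """Keep SNPs with min_af < af <= max_af that have no other such SNP
    within probe_len bases on the same chromosome (assumed reading)."""
    high = sorted((s for s in snps if min_af < s.af <= max_af),
                  key=lambda s: (s.chrom, s.pos))
    kept = []
    for idx, s in enumerate(high):
        prev_ok = (idx == 0 or high[idx - 1].chrom != s.chrom
                   or s.pos - high[idx - 1].pos >= probe_len)
        next_ok = (idx == len(high) - 1 or high[idx + 1].chrom != s.chrom
                   or high[idx + 1].pos - s.pos >= probe_len)
        if prev_ok and next_ok:
            kept.append(s)
    return kept
```

A greedy variant that keeps the first SNP of each close pair would retain more sites; the text does not say which reading is intended.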
In one specific embodiment of the present invention, the discrete high-frequency SNPs are obtained from the 1000 Genomes Project data; they can also be obtained from other published genomic data, or further restricted to discrete high-frequency SNP sites with an allele frequency of less than 90%. The candidate probe length is set to 100 mer.
In one specific embodiment of the present invention, each candidate probe contains one discrete high-frequency SNP site, and the discrete high-frequency SNP site is located in the middle section of the candidate probe. Thus each candidate probe contains only one high-frequency SNP site, and adjacent candidate probes may or may not overlap. "Middle section" here is relative to "front section" and "rear section" and can be understood in the usual way: for a sequence, the upstream third and the downstream third are the "front section" and "rear section" respectively, and the middle third is the "middle section". Further, the discrete high-frequency SNP site is located at the midpoint of the candidate probe. For a sequence of 2n+1 nucleotides the midpoint is the position of the (n+1)-th nucleotide; for a sequence of 2n nucleotides the midpoint is the position of the n-th or the (n+1)-th nucleotide. Placing the site at the midpoint improves the capture efficiency of the probe for the target discrete high-frequency SNP site.
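The midpoint rule can be illustrated with a minimal sketch. The helper names and the 1-based inclusive coordinates are assumptions made for the example; for an even probe length it places the SNP at the n-th nucleotide, one of the two positions the text allows.

```python
from typing import Optional, Tuple


def probe_interval(snp_pos: int, probe_len: int = 100) -> Tuple[int, int]:
    """Return the 1-based inclusive (start, end) of a probe whose midpoint
    is snp_pos: nucleotide n+1 for length 2n+1, nucleotide n for length 2n."""
    upstream = (probe_len - 1) // 2        # bases placed before the SNP
    start = snp_pos - upstream
    return start, start + probe_len - 1


def extract_probe(reference: str, snp_pos: int,
                  probe_len: int = 100) -> Optional[str]:
    """Cut the probe sequence out of a reference string, or return None
    when the probe would run off either end of the reference."""
    start, end = probe_interval(snp_pos, probe_len)
    if start < 1 or end > len(reference):
        return None
    return reference[start - 1:end]
```

For a 100-mer around position 1000 this gives the interval 951 to 1050, with the SNP as the 50th base of the probe.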
In one specific embodiment of the present invention, the first candidate probe set is prescreened on the GC content and/or single-base repeats of its candidate probe sequences: candidate probes whose GC content is 35%-65% and/or whose single-base repeat count is less than 7 are retained. The single-base repeat count is the number of times one base type occurs consecutively in a sequence; in TGAAAAAAAAGC, for example, A occurs 8 times in a row, so the A repeat count of the sequence is 8. Sequences with high or low GC content, or with long single-base repeats, tend to disturb PCR or hybridization capture of the sequence, introduce GC bias and the like, and reduce capture specificity. The first candidate probe set retained by this prescreening avoids such sequences, and thereby avoids the effect of GC bias or low-specificity capture on the results.
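A minimal sketch of this prescreening follows. The function names are illustrative, and the sketch assumes that both criteria are applied together, although the text allows "and/or".

```python
from itertools import groupby


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_single_base_run(seq: str) -> int:
    """Longest run of one identical base, e.g. 8 for TGAAAAAAAAGC."""
    return max(len(list(group)) for _, group in groupby(seq.upper()))


def passes_prescreen(seq: str, gc_min: float = 0.35, gc_max: float = 0.65,
                     max_run: int = 7) -> bool:
    """True when GC content lies in [gc_min, gc_max] and the longest
    single-base run is shorter than max_run (both criteria assumed)."""
    return (gc_min <= gc_content(seq) <= gc_max
            and max_single_base_run(seq) < max_run)
```

The example sequence from the text, TGAAAAAAAAGC, fails on both counts: its GC content is 25% and its A run has length 8.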
Step 2: Align the first candidate probe set with the reference sequence to obtain an alignment result
The first candidate probe set is aligned with the reference sequence to obtain an alignment result, which gives the positional information of the first candidate probe set on the reference sequence. The reference sequence used is a known sequence and can be any reference template, obtained in advance, for the category to which the target sample belongs. For example, if the target sample is human, the reference sequence may be HG18 or HG19 provided by the US National Center for Biotechnology Information (NCBI). A resource library containing more reference sequences can also be configured in advance; before sequence alignment, a closer reference sequence is then selected according to factors such as the sex, ethnicity and geographic region of the target sample, which helps obtain more targeted probe sequences.
Step 3: Perform a first screening of the first candidate probe set to obtain the second candidate probe set
In one specific embodiment of the present invention, a candidate probe retained by the first screening must meet either of the following two conditions: 1) the candidate probe of the first candidate probe set aligns to a unique position in the reference genome; 2) the candidate probe of the first candidate probe set aligns to multiple positions of the reference sequence and its mismatch ratio with at least two of those positions is less than 10%. For example, for a candidate probe of 100 mer, 10 mismatched bases correspond to a mismatch ratio of 10%. A low mismatch rate means that the probe is close to fully complementary to the target region during hybridization, which gives good capture and high specificity.
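The two retention conditions can be written as a small predicate. This is a sketch that assumes the aligner output has already been reduced to one mismatch count per aligned position; the function name and that input format are hypothetical.

```python
from typing import Sequence


def passes_first_screen(hit_mismatches: Sequence[int], probe_len: int = 100,
                        max_ratio: float = 0.10) -> bool:
    """hit_mismatches holds one mismatch count per reference position the
    probe aligned to. Condition 1: exactly one aligned position.
    Condition 2: several positions, at least two of which have a mismatch
    ratio below max_ratio."""
    if len(hit_mismatches) == 1:
        return True
    close_hits = sum(1 for mm in hit_mismatches if mm / probe_len < max_ratio)
    return len(hit_mismatches) > 1 and close_hits >= 2
```

With a 100-mer, a hit with exactly 10 mismatches has a ratio of 10% and therefore does not count toward condition 2.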
Step 4: Divide the reference sequence into a plurality of windows and assign the second candidate probe set to the matching windows
The reference sequence is divided into a plurality of windows of predetermined length. Using the alignment, the candidate probes of the second candidate probe set are assigned to their matching windows, giving the positional information of each candidate probe within its window.
The windows of predetermined length may be equal or unequal in length and may or may not overlap. In one specific embodiment of the present invention, the reference sequence is a reference genome, which is divided into a plurality of windows of equal length; the window length is 10 kb, and adjacent windows are contiguous but do not overlap.
Step 5: Perform a second screening of the second candidate probe set, based on the positional information and the allele frequencies of the discrete high-frequency SNPs, to determine the probe sequence
In one specific embodiment of the present invention, the second screening includes two steps: (a) if several candidate probes lie in the same window, the candidate probe whose discrete high-frequency SNP has the highest allele frequency is determined; (b) if there is only one candidate probe with the highest allele frequency, it is selected as the probe; if there are several candidate probes with the highest allele frequency, the one among them closest to the window center is selected as the probe. The distance between a candidate probe and the window center can be the distance between the midpoint of the candidate probe and the window center. Keeping the target position as close as possible to the center of the probe sequence helps improve capture efficiency.
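Steps 4 and 5 can be sketched together for a single chromosome. The `Candidate` record, the 0-based midpoint coordinate and the equal-length, non-overlapping 10 kb windows are assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Candidate(NamedTuple):
    name: str
    midpoint: int   # 0-based coordinate of the probe midpoint
    af: float       # allele frequency of the SNP carried by the probe


def select_probes(cands: Iterable[Candidate],
                  window_len: int = 10_000) -> Dict[int, Candidate]:
    """Assign each candidate to a fixed-length window, then keep one probe
    per window: highest allele frequency first, and among equal
    frequencies the midpoint nearest the window center."""
    windows = defaultdict(list)
    for cand in cands:
        windows[cand.midpoint // window_len].append(cand)
    chosen = {}
    for index, members in windows.items():
        center = index * window_len + window_len / 2
        chosen[index] = min(
            members, key=lambda c: (-c.af, abs(c.midpoint - center)))
    return chosen
```

Windows that receive no candidate simply do not appear in the result, which is where the optional gap filling described next would apply.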
In one specific embodiment of the present invention, after the second screening of the second candidate probe set, when two adjacent candidate probes of the second candidate probe set that fall in two adjacent windows are separated in the reference genome by more than the length of either of the two windows, STRs used for typing, or a part of an STR, located between the two adjacent candidate probes in the reference genome are optionally added to the second candidate probe set after the second screening, and together they constitute the probe sequences. When the probes designed in this way are used to capture the whole genome, the spacing of the captured regions is relatively uniform, so that the captured regions taken together better reflect the information of the whole genome.
According to another embodiment of the present invention, a method for detecting genomic structural variation is provided, the genomic structural variation including at least one of chromosomal aneuploidy, copy number variation and insertion and deletion, comprising the following steps:
(1) The genomic nucleic acid of a target sample is sequenced to obtain a genome sequencing result consisting of a plurality of reads. The genome sequencing result can be obtained by whole-genome sequencing, for example by extracting genomic DNA and, following the operating guide of an existing high-throughput platform such as Illumina HiSeq 2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform, constructing a library and sequencing on the machine to obtain reads. It can also be obtained by capturing the genome of the target sample with probes and sequencing; the probes can be designed by the probe determination method provided by one aspect of the present invention and then synthesized or prepared by existing methods.
(2) The reference genome is divided into m regions, and the coverage depth $TD_i$ of target-sample genome region i is calculated from the reads of the sequencing result that fall into region i, where m and i are natural numbers, i is the region number, $1 \le i \le m$ and $m > 10$.
In one specific embodiment of the present invention, the coverage depth of region i is calculated as $TD_i = (\text{number of reads falling into region } i \times \text{read length}) / \text{length of region } i = \text{total number of bases in the reads falling into region } i \,/\, \text{length of region } i$, where i denotes the region number. The position of a read on the genome can be determined by sequence alignment, for which various alignment software can be used, such as SOAP (Short Oligonucleotide Analysis Package), bwa (Burrows-Wheeler Aligner), samtools, GATK (Genome Analysis Toolkit) and the like.
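A minimal sketch of the per-region coverage depth, taken here as the total number of read bases in a region divided by the region length. Equal-length regions, 0-based read starts and assigning each read to the region containing its start are simplifying assumptions for the example, and the function name is illustrative.

```python
from typing import Iterable, List, Tuple


def region_depths(reads: Iterable[Tuple[int, int]], region_len: int,
                  m: int) -> List[float]:
    """reads: (start, read_len) pairs with 0-based starts on one reference.
    Returns the coverage depths of m equal-length regions, computed as
    total read bases assigned to the region divided by the region length."""
    bases = [0] * m
    for start, read_len in reads:
        index = start // region_len        # region containing the read start
        if 0 <= index < m:
            bases[index] += read_len
    return [total / region_len for total in bases]
```

In practice the read positions would come from an aligner's output; parsing that output is outside the scope of this sketch.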
(3) Based on the degree of difference between the coverage depth of target-sample genome region i and the coverage depths of region i of k reference samples, it is judged whether a structural variation has occurred in target-sample region i, where k is a natural number and $k \ge 2$.
In one specific embodiment of the present invention, the coverage depth of target-sample genome region i is compared with the coverage depths of region i of the k reference samples by comparing the coverage depth coefficients of genome region i of the target sample and of the reference samples. Determining the coverage depth coefficient of target-sample genome region i comprises the following steps. (a) A first correction is applied to $TD_i$ to obtain the first-corrected coverage depth $TD_{ai}$; the first correction is achieved by linear regression over the coverage depth values of 2n continuous regions including region i, where n is a natural number and $10 < n \le m/2$. In one specific embodiment of the present invention, the first correction gives $TD_{ai} = \left(\sum_{j=1}^{n} TD_j\right)/n$, where $TD_j$ is the coverage depth of the j-th of n continuous regions, j is a natural number and $1 \le j \le n$. (b) After the first-corrected coverage depth $TD_{ai}$ of region i has been obtained, $TD_{ai}$ is further homogenized to obtain $\overline{TD_a}$, from which the coverage depth coefficient $R_{ai}$ is obtained; in one specific embodiment of the present invention, the homogenization uses $\overline{TD_a} = \left(\sum TD_{ai}\right)/n$.
In one embodiment of the present invention, after $R_{ai}$ of the target sample has been obtained, a second correction is further applied to $R_{ai}$ to obtain $R'_{ai}$, using the coverage depth coefficients of the reference samples, where y is a natural number denoting the reference sample number and $R_y$ is the coverage depth coefficient of genome region i of reference sample y.
In another embodiment of the present invention, after $R_{ai}$ of the target sample has been obtained, a second correction is further applied to $R_{ai}$ to obtain $R'_{ai}$, where $\bar{R}$ is the mean of the coverage depth coefficients of genome region i of the k reference samples and the one target sample, $\bar{R} = \left(R_{ai} + \sum_{y=1}^{k} R_y\right)/(k+1)$, y is a natural number denoting the reference sample number, and $R_y$ is the coverage depth coefficient of genome region i of reference sample y.
In the above calculation of the coverage depth coefficient of target-sample genome region i, correcting and homogenizing the intermediate values reduces the errors caused by fluctuations in experimental conditions and by differences between samples, so that the final $R'_{ai}$ reflects the true situation and fluctuates around 1 with a small amplitude, and the $R'$ values of multiple samples follow a normal distribution. In the above embodiment, $TD_i$ first undergoes the first correction and the first-corrected value is then homogenized, which amounts to averaging twice: before the coverage depth of region i is represented by the mean coverage depth of n continuous regions including region i, the coverage depth of each of those n regions is itself represented by the mean coverage depth of the n continuous regions that start with that region. This is equivalent to correcting $TD_i$ with the coverage depth values of 2n continuous regions including target region i, which keeps the coverage depth of continuous regions stable. It should be noted that a person skilled in the art can keep the coverage depth values of several adjacent regions stable with other corrections or averaging procedures, for example by correcting the coverage depth of the target region with the mean coverage depth of several regions spaced a given number of regions from the target region; these all fall within the concept of the present invention. The coverage depth coefficient of genome region i of a reference sample can be calculated in the same way as that of the target sample; the reference sample data can be calculated in advance and kept ready, or can be obtained in parallel with the calculation for the target sample.
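The correction chain can be sketched as follows, with several assumptions that go beyond what the text fixes: the first correction averages the n continuous regions starting at region i and truncates the window at the end of the genome; homogenization divides each first-corrected depth by the mean over all regions passed in; and the second correction divides by the mean coefficient of the reference samples for the same region. The function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def first_correction(td: Sequence[float], n: int) -> List[float]:
    """Moving mean over the n continuous regions starting at each region;
    the window is truncated near the end (assumption)."""
    return [mean(td[i:i + n]) for i in range(len(td))]


def homogenize(td_a: Sequence[float]) -> List[float]:
    """Divide by the overall mean so coefficients centre on 1 (assumed
    form of the coverage depth coefficient)."""
    overall = mean(td_a)
    return [value / overall for value in td_a]


def second_correction(r_target: Sequence[float],
                      r_refs: Sequence[Sequence[float]]) -> List[float]:
    """Divide each target coefficient by the mean coefficient of the k
    reference samples for the same region (assumed form)."""
    corrected = []
    for i, r in enumerate(r_target):
        ref_mean = mean(ref[i] for ref in r_refs)
        corrected.append(r / ref_mean)
    return corrected
```

The text requires n greater than 10; any smaller value is suitable only for toy checks of the arithmetic.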
In a specific embodiment of the present invention, the degree of difference between the coverage depth of target sample genomic region i and the coverage depth of region i of the k reference samples is judged by a t test of whether the coverage depth coefficients of the two differ significantly. In a specific embodiment of the present invention, the t statistic of target sample genomic region i is calculated from the second-corrected coverage depth coefficient of region i of the target sample, the mean of the second-corrected coverage depth coefficients of region i over the k reference samples (the coefficient of reference sample y being one term of that mean), and S, the standard deviation of those coefficients over the k reference samples. Based on the t value of target sample genomic region i, a significance level P_i is obtained; when P_i ≤ 0.05, region i is judged to have undergone a structural variation, and otherwise region i is judged not to have undergone one. In another embodiment of the present invention, a theoretical value t_i0 is obtained from a predetermined significance level P_i0, with P_i0 ≤ 0.05; when the t value of target sample genomic region i is greater than or equal to t_i0, region i is judged to have undergone a structural variation, and otherwise not. Once P_i0 has been set, the corresponding t_i0 can be looked up in a t-distribution table.
In an embodiment of the present invention, to detect larger CNVs or insertions/deletions, after step (3), W contiguous regions of the same direction are merged to obtain a first-level merged region; two first-level merged regions are merged when they have the same direction and the span between them is no more than L regions, giving a second-level merged region, and the structural variation of the second-level merged region is then detected. Here, regions of the same direction are regions whose coverage depth t statistics are all greater than 0 or all less than 0; W and L are natural numbers, W ≥ 2, L - W ≤ 1. To detect still larger structural variations, the process can be repeated, for example by further merging qualifying second-level merged regions; the merging condition can similarly be that two second-level merged regions have the same direction and are separated on the reference genome by no more than L regions or L second-level merged regions.
In a specific embodiment of the present invention, the structural variation of a second-level merged region is detected on the basis of the degree of difference between the coverage depth of the second-level merged region on the target sample genome and the coverage depth of the corresponding region on the genomes of multiple reference samples, so as to judge whether the second-level merged region has undergone a structural variation, in other words whether the structural variation occurring in region i spans W regions. The coverage depth of the corresponding second-level merged region on the reference sample genomes, the t statistic of the coverage depth of the second-level merged region on the target sample genome, and the structural variation judgment can all be obtained as described above for the relatively small region i.
According to yet another embodiment of the present invention, there is provided a method suitable for detecting loss of heterozygosity (LOH), a type of genomic structural variation, comprising the following steps:
(1) Sequencing the genomic nucleic acid of the target sample to obtain a genome sequencing result consisting of multiple reads. The genome sequencing result can be obtained by whole-genome sequencing, for example by extracting genomic DNA and carrying out library construction and on-machine sequencing according to the operating manual of an existing high-throughput platform, such as Illumina HiSeq 2000/2500, Roche 454, Life Technologies Ion Torrent, or a single-molecule or nanopore sequencing platform; or by capturing the genome of the target sample with probes and sequencing the captured material. The probes can be designed by the probe determination method provided in one aspect of the present invention and then synthesized or prepared by existing methods.
(2) Dividing the reference genome into m' regions; based on the reads of the sequencing result that fall in reference genome region i and on the data of population region i, obtaining the shared SNP set of target sample genomic region i and population region i; calculating, for the target sample and for the population respectively, the heterozygosity of the fragment in which each SNP site of the shared SNP set lies, to obtain the heterozygosity set U_i of target sample genomic region i and the heterozygosity set U_i0 of population region i; and comparing U_i with U_i0 to determine whether loss of heterozygosity has occurred in target sample region i. The allele frequency of each SNP in the shared SNP set is greater than 0.1; the fragment in which a SNP site of the shared SNP set lies is bounded by the two SNPs adjacent to that SNP upstream and downstream; m' and i are natural numbers, 1 ≤ i ≤ m', m' ≥ 6.
In a specific embodiment of the present invention, the heterozygosity of the fragment in which a SNP site lies is represented by the minor allele frequency coefficient of that SNP site, R_het = MAF / (1 - MAF), where MAF is the minor allele frequency of the high-frequency SNP.
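The coefficient R_het = MAF / (1 - MAF) can be illustrated with a minimal Python sketch. The helper `maf_from_counts`, which estimates the minor allele frequency of a site from read support counts, is a hypothetical addition for illustration only; the text does not state how MAF is estimated in the target sample.

```python
def rhet(maf):
    """Minor allele frequency coefficient R_het = MAF / (1 - MAF)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency lies between 0 and 0.5")
    return maf / (1.0 - maf)


def maf_from_counts(ref_reads, alt_reads):
    """Hypothetical helper: estimate MAF at one site from read support counts."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return min(ref_reads, alt_reads) / total


# A site supported by 30 reference reads and 10 alternative reads.
example = rhet(maf_from_counts(30, 10))
```

With this definition a perfectly balanced heterozygous site (MAF = 0.5) gives R_het = 1 and a homozygous site (MAF = 0) gives R_het = 0.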
In a specific embodiment of the present invention, U_i and U_i0 are compared to determine whether loss of heterozygosity has occurred in target sample region i by using an F test to judge whether the variance of U_i and the variance of U_i0 differ significantly: if they do, target sample region i is judged to have loss of heterozygosity; otherwise, it is judged not to have loss of heterozygosity.
In a specific embodiment of the present invention, the F test comprises calculating the variances of U_i and U_i0 respectively; using the variance of the target sample set U_i and the variance of the population set U_i0 to calculate two mutually reciprocal statistics F_upper and F_under; obtaining a significance level p_F from these reciprocal statistics; and comparing p_F with a predetermined significance level p_F0, where p_F ≤ p_F0 indicates that the two variances differ significantly. In the calculation, v is the index of a SNP in the shared SNP set of target sample genomic region i and population region i; q is the number of SNPs in that shared SNP set; R_het,i,v is the minor allele frequency coefficient of the v-th SNP in the shared SNP set of target sample genomic region i, and the mean of the minor allele frequency coefficients of the q SNPs of that set is used alongside it; R_het,i0,v is the minor allele frequency coefficient of the v-th SNP in the shared SNP set of population genomic region i, and the mean of the minor allele frequency coefficients of the q SNPs of that set is used likewise; p_upper and p_under are obtained from F_upper and F_under respectively; and p_F0 ≤ 0.05. p_F0 can be set to a commonly used value or adjusted according to known information and the required detection accuracy.
In an embodiment of the present invention, to detect larger LOH, after step (2), W' contiguous regions in which loss of heterozygosity has occurred are merged to obtain a third-level merged region; two third-level merged regions are merged when the span between them is no more than L' regions, giving a fourth-level merged region; the heterozygosity set of the target sample's fourth-level merged region and the heterozygosity set of the same region in the population are obtained, and the two heterozygosity sets are compared to determine whether loss of heterozygosity has occurred in the target sample's fourth-level merged region. Here W' and L' are natural numbers, W' > 2, W'/2 ≥ L'. In a specific embodiment of the present invention, W' ≥ 4. To detect LOH over still larger regions, the process can be repeated, for example by further merging qualifying fourth-level merged regions; the merging condition can similarly be that two fourth-level merged regions are separated on the reference genome by no more than L' regions or L' third-level merged regions.
According to yet another embodiment of the present invention, there is provided a method for detecting uniparental disomy (UPD): when loss of heterozygosity is present in a genomic region of the target sample, the copy number of that region is calculated, and when it equals the copy number of the region in a normal genome of the same species, the region of the target sample genome is judged to have UPD. Whether a genomic region has LOH can be determined by the LOH detection method of the above aspect of the present disclosure.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, and the like.
According to a final embodiment of the present invention, there is also provided a device for detecting genomic structural variation, comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including an executable program; and a processor in data connection with the data input unit, the data output unit and the storage unit, for executing the executable program stored in the storage unit, execution of the program comprising carrying out all or part of the steps of the methods in the above embodiments.
The probe design process and the results of running the structural variation detection method according to the present invention are described in detail below with reference to specific individuals. The naming conventions and parameter settings used in the following procedures are:
1. The designed probes are referred to as selected target region probes (Selected Target Region Primers, SeTR).
2. Hereinafter, "coverage depth", "sequencing depth" and "depth" are used interchangeably, as are "region" and "target region".
3. Library construction and sequencing follow the small-fragment library construction protocol and the sequencing protocol provided for the HiSeq 2000 platform; the library size is 300-350 bp, sequencing is paired-end, and the read length is 91 bp (sequencing type PE91+8+91).
4. The reference genome or reference sequence used for alignment is the human reference genome (hg19, Build 37).
Where a specific technique or condition is not indicated in the embodiments, the techniques or conditions described in the literature of the art (for example J. Sambrook et al., Molecular Cloning: A Laboratory Manual, 3rd edition, translated by Huang Peitang et al., Science Press) or the product instructions are followed. Reagents and instruments whose manufacturer is not indicated are conventional, commercially available products, obtainable for example from Illumina.
Embodiment 1: Chip design, preparation and testing
In general, a high (>60%) or low (<35%) GC content and high heterozygosity tend to adversely affect a DNA fragment during PCR or probe capture. To avoid this, we designed special probes, which we call SeTR. SeTR probe design followed these principles: a) the probe sequences should be highly unique and stable, which requires low heterozygosity and a moderate GC content (35% to 60%); b) the probes should contain discretely distributed high-frequency SNP markers, each with an allele frequency (AF) of 0.9 > AF > 0.1, so that LOH can be detected better across the whole genome; c) the final target regions should be relatively evenly distributed.
The selection workflow for the SeTR probes, that is, for the design target regions, is as follows:
1) Based on the 1000 Genomes database (ftp://ftp.ncbi.nih.gov/1000genomes/ftp/release), a candidate SNP set with allele frequency (AF) of 10% to 90% was selected; SNPs lying less than 100 bp from another SNP were then removed from this set, giving the SNP marker1 set.
2) Taking each SNP of the SNP marker1 set as the midpoint, 50 bp of reference genome sequence was extracted on each of its upstream and downstream sides to form a set of 100 bp theoretical probe sequences, and this probe sequence set was aligned back to the reference sequence. If the best alignment of a probe sequence had no mismatch and the mismatch of its second-best alignment was less than 5%, the corresponding SNP was retained, giving the SNP marker2 set.
3) From the SNP marker2 set, we selected the SNP markers that are physically evenly spaced on the reference genome as the final SNP marker set. In this study, we selected a SNP marker set with a physical spacing of about 10 kbp.
4) If two neighbouring SNPs in the final SNP marker set were more than 10 kbp apart, short tandem repeats (STRs) lying between them were used to fill the gap and even out the distribution.
After the SeTR probes had been designed, we commissioned Roche to produce the SeTR liquid-phase chip. The SeTR liquid-phase chip contains 278,800 probes with a total size of 41,795,106 bp, covering 1.45% of the effective whole genome (2.89 G). The average SeTR probe length is 149 bp, and the average physical distance between adjacent probes is 10.6 kbp, as shown in Table 1 and Fig. 1.
Table 1 Distribution of SeTR probes on each chromosome
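Steps 1) and 3) of the selection workflow, together with the 100 bp probe window of step 2), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the input is a hypothetical list of (position, allele frequency) pairs on one chromosome, close SNPs are thinned with a greedy left-to-right pass (the text does not say which SNP of a close pair is dropped), and the uniqueness check by realignment in step 2) and the STR filling of step 4) are not modelled.

```python
def filter_by_af(snps, low=0.10, high=0.90):
    """Keep SNPs whose allele frequency satisfies low < AF < high."""
    return [(pos, af) for pos, af in snps if low < af < high]


def drop_close_snps(snps, min_dist=100):
    """Greedy pass (assumption): keep a SNP only if it lies at least
    min_dist bp after the previously kept SNP."""
    kept = []
    for pos, af in sorted(snps):
        if not kept or pos - kept[-1][0] >= min_dist:
            kept.append((pos, af))
    return kept


def probe_window(pos, flank=50):
    """Theoretical 100 bp probe: 50 bp on each side of the SNP."""
    return (pos - flank, pos + flank)


def pick_evenly(snps, spacing=10_000):
    """Greedy pass (assumption): keep markers roughly `spacing` bp apart."""
    chosen = []
    for pos, af in sorted(snps):
        if not chosen or pos - chosen[-1][0] >= spacing:
            chosen.append((pos, af))
    return chosen


# Hypothetical toy input: (position, allele frequency) on one chromosome.
candidates = [(1000, 0.5), (1050, 0.4), (1200, 0.05),
              (5000, 0.3), (12000, 0.8), (25000, 0.95)]
marker1 = drop_close_snps(filter_by_af(candidates))
final_markers = pick_evenly(marker1)
windows = [probe_window(pos) for pos, _ in final_markers]
```

In practice the spacing step would be run per chromosome after the uniqueness filter, since removing non-unique probes changes which markers are available at each position.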
Three DNA samples that had passed quality control, namely YH (the Yanhuang sample, genomic DNA of a Chinese individual), HG00537 (a sample from the 1000 Genomes Project) and GM50275 (a human fibroblast sample obtained from the Coriell Institute for Medical Research), were used to test the usability of the SeTR chip and to ensure that this probe chip could be used in subsequent detection studies. Libraries were built and sequenced for all three samples using SeTR capture, giving sequencing reads. We first removed reads contaminated by adapters and low-quality reads, such as those with a mean quality value below 20; the remaining reads are called clean reads. The clean reads were aligned to the reference sequence hg19: 98.13% to 99.29% of the reads aligned to the reference genome, and 67.43% to 67.87% aligned to the target regions. In addition, 99.73% to 99.95% of the target regions were covered by at least one read, and more than 99% of the regions were covered at least 10 times, as shown in Table 2. This performance is better than that of exome capture chips of the same type, such as the exome liquid-phase chip produced by Roche NimbleGen. Furthermore, the depth distribution of the target regions, shown in Fig. 4A, is close to a Poisson distribution. Fig. 4B shows that, for most highly heterozygous sites in the target regions, the number of reads supporting the non-reference allele is almost equal to the number of reads supporting the reference allele, that is, the two alleles of a highly heterozygous site receive similar read support on alignment (the reads of the two alleles originate from the two homologous chromosomes respectively). All of this shows that the probes have no obvious bias towards capturing one haplotype (commonly the reference allele, i.e. the ref type), and that capture of the target regions is highly uniform.
Table 2 Alignment results of the three samples
target covered >=30X, %)
Proportion of the target region with sequencing depth >=40X (Fraction of target covered >=40X, %): 80.03, 79.01, 77.81
Embodiment 2: Target region library construction and sequencing
1. Test materials, reagents and instruments
Samples: 15 target gDNA samples (human genomic DNA; sample IDs are given in Table 3 below, and those beginning with "GM" are all human fibroblast samples) and 24 reference DNA samples.
Main reagents and instruments: PCR instrument, pipettes, centrifuge, thermostatic mixer (comfort type), DNA shearing instrument, vortex mixer, magnetic rack, electrophoresis apparatus, HiSeq 2000 sequencer, NanoDrop UV spectrophotometer, etc. Reagents and instruments whose manufacturer is not indicated are conventional, commercially available products.
Probe design and synthesis: as obtained in Embodiment 1, target regions totalling about 41 M were selected across the human genome, and the probes were custom-ordered from Roche (Roche NimbleGen SeqCap EZ liquid-phase probes); this probe set can capture the corresponding designed target regions.
2. Library construction
1) Genomic DNA extraction
Using the QIAGEN DNA extraction kit (DNA Mini Kit) and following the kit instructions, about 3-5 μg of genomic DNA was extracted from each target sample for subsequent experiments. An aliquot of the extracted DNA (30-100 ng) was checked by electrophoresis to judge its integrity and degree of degradation.
2) Genomic DNA fragmentation and purification
Genomic DNA was sheared with a Covaris E210 instrument (operated according to the instrument instructions) into fragments of 200-250 bp. The sheared DNA fragments were purified with the QIAquick PCR Purification Kit (250) according to the kit instructions, and electrophoresis was used to check whether the main band met the requirement, i.e. whether it was 200-250 bp.
3) End repair, A-tailing, adapter ligation and pre-amplification
Following the library construction requirements and the steps, reagents and reaction conditions listed in the paired-end index library construction protocol, the sheared and purified DNA fragments were end-repaired and purified; a base A was added to both ends of the end-repaired, purified DNA fragments, and the A-tailed product was purified; sequencing adapters were ligated to both ends of the A-tailed product, and the adapter-ligated DNA fragments were purified with magnetic beads capable of complementary binding to the sequencing adapters. A PCR reaction system was prepared and the adapter-ligated DNA fragments were amplified; the PCR product was purified with magnetic beads, and electrophoresis was used to check whether the main band of the amplified product was 300-350 bp. The amount of DNA was measured with a NanoDrop UV spectrophotometer; the total amount must exceed 1.0 μg.
4) SeTR probe hybridization, elution and amplification
This step followed the instructions of the commercially available NimbleGen SeqCap EZ hybridization and elution kit; the hybridization- and elution-related reagents were purchased or prepared as described in the kit instructions. A 1.5 mL centrifuge tube was prepared, and Cot-1 DNA, the universal blocking sequence (Block), the index blocking sequence (Index N Block) and the DNA sample from step 3) were added. The tube was centrifuged for 1 min and the contents were concentrated to dryness under vacuum at 60 °C; hybridization buffer and the other reagents were then added, and the tube was vortexed, centrifuged, denatured for 10 min in a metal dry bath at 95 °C, vortexed and centrifuged at high speed. 4.5 μL of probes was added to the tube, and hybridization was carried out in a PCR instrument (47 °C, 64-72 hours). Elution was carried out after hybridization was complete. PCR was then performed according to the final amplification step of the library construction protocol: the PCR reaction system was prepared as required by mixing the DNA obtained from hybridization and elution, polymerase, substrates, PCR reaction buffer, flowcell primers (primers designed from the fixed sequences carried on the sequencer's flowcell) and the other reactants. The PCR programme was: pre-denaturation at 94 °C for 2 min; 15 cycles of denaturation at 94 °C for 15 s, annealing at 58 °C for 30 s and extension at 72 °C for 30 s; then extension at 72 °C for 5 min. After PCR, the product was removed, centrifuged and purified with magnetic beads to obtain the target region library. The library concentration was measured with a NanoDrop UV spectrophotometer in preparation for on-machine sequencing.
3. HiSeq 2000 high-throughput sequencing
DNA libraries that passed quality control were sequenced according to the HiSeq 2000 operating instructions. The data volume for each sample was about 4.5 G and the average sequencing depth reached 100X; however, because the efficiency of the capture chip can hardly reach 100%, the effective sequencing depth of the final target regions was found on analysis to be 30X to 45X.
Embodiment 3: CNV, LOH and UPD detection
The overall workflow is shown in Fig. 3. After sequencing, the off-machine data are in fastq format. The high-quality reads obtained after filtering are aligned to the reference genome (hg19, Build 37) with BWA to give an alignment file in SAM format; the SAM alignment file is then converted into a binary BAM file with samtools, and the alignments are deduplicated and sorted; samtools is then used again to convert the BAM format into PILEUP format. For details, see the bioinformatics analysis section below.
I. Sequencing data filtering and alignment
The off-machine Illumina HiSeq 2000 sequencing data of the above embodiment first undergo simple data filtering: reads contaminated by adapters, reads with an N content above 5%, and reads with a mean quality value below Q20 are removed. The filtered data are then aligned to the human reference genome (hg19, Build 37) with the BWA alignment software, and the alignment result is output as a file in SAM (sequence alignment/map) format (SAM file for short). The SAM file is converted into a binary BAM file with samtools, duplicates caused by PCR (PCR duplicates) are removed and the file is sorted, and GATK is used to realign and recalibrate the alignments.
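The read filter described above (adapter contamination, N content above 5%, mean quality below Q20) can be sketched as a small Python predicate. This is an illustrative sketch only: the function names are hypothetical, the adapter check is a plain substring match, the adapter sequence is supplied by the caller, and Phred+33 quality encoding is assumed because the text does not state the encoding.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in qual_string]


def keep_read(seq, quals, adapter=None, max_n_ratio=0.05, min_mean_q=20):
    """Return True if the read passes all three filters."""
    if not seq or not quals:
        return False
    seq = seq.upper()
    # 1. Adapter contamination (simplified to an exact substring match).
    if adapter and adapter.upper() in seq:
        return False
    # 2. Proportion of undetermined bases (N) must not exceed 5%.
    if seq.count("N") / len(seq) > max_n_ratio:
        return False
    # 3. Mean base quality must be at least Q20.
    return sum(quals) / len(quals) >= min_mean_q


# 'I' decodes to Q40 and '#' to Q2 under Phred+33.
good = keep_read("ACGTACGTAC", phred33("IIIIIIIIII"))
too_many_n = keep_read("ANNTACGTAC", phred33("IIIIIIIIII"))
low_quality = keep_read("ACGTACGTAC", phred33("##########"))
```

For paired-end data a pair would typically be dropped when either mate fails, but that policy is not specified in the text.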
II. Calculating the second-corrected coverage depth coefficient r_i of the target regions and the fragment heterozygosity R_het
From the information in the files of the probe regions obtained after the above filtering and alignment, the r_i value and the fragment heterozygosity R_het value of each target region are calculated. CNVs are predicted from the r_i values with a t test, and LOH and UPD are predicted from the R_het values with an F test.
III. Detection and analysis of CNV, LOH and UPD
1. CNV detection
1.1 Calculating the depth coefficient of each target region
The depth of each target region is calculated and denoted TD_i (Formula 1). To keep the TD of several contiguous target regions stable, Formula 2 is used to correct TD_i with the depths of the n' regions following region i, giving TDa_i; TDa_i is then homogenized with Formulas 3 and 4, which gives the depth coefficient of each target region.
Formula 1: $TD_i = T_i^{base} / T_i^{len}$
Formula 2: $TDa_i = \left(\sum_{j=i}^{i+n'} TD_j\right) / (n' + 1)$, $n' \ge 9$
Formula 3: TDa, = (∑:+" ίλ») I (n '+ 1)
Formula 4: R=m1/m1
where $T_i^{base}$ is the number of bases aligned to target region i and $T_i^{len}$ is the length of target region i.
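The per-region depth (Formula 1) and its correction with the following n' regions (Formula 2) can be sketched in Python as below. This is a minimal sketch under assumptions: the function names are hypothetical, and near the end of a chromosome, where fewer than n' following regions exist, the window is simply truncated, a boundary rule the text does not specify. Applying `forward_mean` a second time corresponds to the "averaging twice" described earlier in the document; the exact homogenization formulas are not reproduced here.

```python
def region_depth(aligned_bases, region_length):
    """Formula 1: depth of a target region = aligned bases / region length."""
    return aligned_bases / region_length


def forward_mean(values, n_prime=9):
    """Mean of values[i] and the n_prime values that follow it.
    The window is truncated at the end of the list (assumption)."""
    out = []
    for i in range(len(values)):
        window = values[i:i + n_prime + 1]
        out.append(sum(window) / len(window))
    return out


# Toy example with a short window (n' = 1) so the arithmetic is easy to follow.
td = [1.0, 2.0, 3.0, 4.0]
tda = forward_mean(td, n_prime=1)           # first correction
tda_twice = forward_mean(tda, n_prime=1)    # averaging a second time
```

The embodiment uses n' ≥ 9, so each corrected value pools at least ten consecutive regions.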
1.2 Creating a baseline from the data of multiple reference samples (k = 24) and correcting R_i to obtain r_i
Fluctuations in the conditions of each experiment and differences between samples cause the efficiency of each capture to fluctuate as well, and the resulting fluctuation in R_i easily produces false CNV signals. A unified baseline created from the fluctuation observed across multiple samples therefore benefits detection. Fig. 4 illustrates the effect of creating the baseline: the distribution of the precursor (preR_i) fluctuates strongly, R_i fluctuates relatively little, and r_i, obtained after correction with the baseline, fluctuates even less, which makes CNV detection more sensitive. In theory, in different samples in which no CNV has occurred, the R_i values of the same target region follow a Poisson distribution and fluctuate fairly stably around a value characteristic of that region. To keep each characteristic value stable, the R_i values of the same region in multiple samples are examined and their mean (mean R_i) is used in place of the characteristic value, so that each target region has its own robust baseline. Based on the assumption that the R_i of each target region fluctuates around mean R_i, we divide R_i by mean R_i to convert it into r_i, so that r_i follows a normal distribution fluctuating around 1.
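The baseline correction, dividing each region's R_i by the mean R_i of the same region across the reference samples, can be sketched as follows. The data layout (one list of R_i values per reference sample) and the function names are assumptions made for illustration.

```python
def baseline(reference_R):
    """Per-region baseline: mean R_i over the k reference samples.
    reference_R[y][i] is R_i of reference sample y (assumed layout)."""
    k = len(reference_R)
    n_regions = len(reference_R[0])
    return [sum(sample[i] for sample in reference_R) / k
            for i in range(n_regions)]


def normalise(sample_R, base):
    """r_i = R_i / mean R_i; values near 1 indicate no copy number change."""
    return [r / b for r, b in zip(sample_R, base)]


# Two toy reference samples and one target sample, two regions each.
refs = [[1.0, 2.0], [3.0, 2.0]]
base = baseline(refs)                 # [2.0, 2.0]
r = normalise([2.0, 1.0], base)       # [1.0, 0.5]
```

In this toy example the second region of the target sample has r_i = 0.5, the pattern that the t test of the next section would examine as a candidate loss.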
1.3 Detecting CNVs in the target regions
In theory, the r_i values of the same target region in multiple samples should all follow a normal distribution. Therefore, when target region i of a given sample is examined, the r_i values of this region can be compared across multiple samples with a t test. In the t statistic, subscript 1 on a parameter denotes the target sample and subscript 2 denotes the multiple reference samples: the two sample means are the mean r_i of the samples under test and the mean r_i of the reference samples; μ_1 is the theoretical mean of r_i over all samples under test and μ_2 the theoretical mean of r_i over all reference samples; S_1 and S_2 are the standard deviations of the samples under test and of the reference samples respectively; and df is the degrees of freedom, determined by the sample numbers n_1 and n_2.
When there is a single sample under test, i.e. n_1 = 1, and the theoretical means of the sample under test and of the reference samples are equal, the formula simplifies.
With the simplified formula, each target region is given a t value for CNV detection, from which a P value (confidence level) is obtained; when P < 0.05 for a region, that region is a region in which a CNV has occurred.
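One way to read the simplified test is as the standard pooled two-sample t statistic with a single observation in the first group. Under that assumption (the exact formula is not legible in this text), the pooled variance reduces to the reference variance, t = (r_i - mean_ref) / (S_ref * sqrt(1 + 1/k)), with k - 1 degrees of freedom. The sketch below computes this statistic and compares it with a critical value taken from a t table, as the description suggests; the two-sided comparison and the function names are assumptions.

```python
import math
import statistics


def single_sample_t(target_r, reference_r):
    """t statistic for one target value against k reference values
    (pooled two-sample t test with n1 = 1, an assumed reading)."""
    k = len(reference_r)
    mean_ref = statistics.mean(reference_r)
    sd_ref = statistics.stdev(reference_r)   # sample standard deviation
    t = (target_r - mean_ref) / (sd_ref * math.sqrt(1.0 + 1.0 / k))
    return t, k - 1                          # statistic, degrees of freedom


def is_cnv(target_r, reference_r, t_critical):
    """Two-sided decision (assumption): |t| at or above the table value."""
    t, _ = single_sample_t(target_r, reference_r)
    return abs(t) >= t_critical


# Toy example with k = 4 references; 3.182 is the two-sided 0.05
# critical value of the t distribution with 3 degrees of freedom.
refs = [0.9, 1.0, 1.1, 1.0]
t_value, df = single_sample_t(0.5, refs)
```

With the k = 24 reference samples of this embodiment there are 23 degrees of freedom, for which the two-sided 0.05 critical value is about 2.069.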
1.4 Detecting large CNVs
Based on the P value of the single-region t test, each region is given a pseudo-signal value that indicates whether it is considered in the subsequent joining of CNV regions; target regions that may share the same CNV are then joined into blocks along the chromosome, which determines the final size and copy number of the CNV.
The pseudo-signal values are assigned as follows. When the measured values of at least 4 consecutive target regions deviate from the corresponding regions of the reference samples in the same direction (t values all greater than 0 or all less than 0), if the P values of 3 of the regions are below a first threshold (for example 0.05, the conventional significance threshold) and that of the fourth does not exceed a second threshold (0.2, four times the first threshold), the four regions are marked (for example + if higher and - if lower) and merged into one block. The number of consecutive same-direction regions and the first and second thresholds are all adjustable. If the span between one block and another is no more than 5 regions, the two blocks are merged into a larger block, and so on until the final blocks are obtained. Using the formula of section 1.3 above, the r value of a block is represented by the mean r_i of all the regions it contains; a t test is performed on the block r values of the sample under test and of the reference samples, and the P value of the block is calculated. When P < 0.05 for the block, a CNV has occurred in this block; the boundaries and size of the block are thereby determined, giving the boundaries and size of the large CNV.
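The marking and merging rule can be sketched in Python as a sliding window followed by a gap-based merge. This is a simplified interpretation, not the patented procedure: every window of 4 consecutive same-direction regions in which at least 3 P values are below the first threshold and none exceeds the second threshold is marked; overlapping marked windows form a block; and two blocks are merged when they have the same direction and at most 5 regions lie between them (the same-direction requirement for merging is taken from the general description). Function names are hypothetical.

```python
def sign(x):
    return (x > 0) - (x < 0)


def seed_blocks(t, p, window=4, p_first=0.05, p_second=0.2):
    """Mark qualifying windows and return blocks as (start, end, direction)."""
    marked = [False] * len(t)
    for s in range(len(t) - window + 1):
        idx = range(s, s + window)
        d = sign(t[s])
        same_dir = d != 0 and all(sign(t[k]) == d for k in idx)
        strong = sum(p[k] < p_first for k in idx)
        within = all(p[k] <= p_second for k in idx)
        if same_dir and strong >= window - 1 and within:
            for k in idx:
                marked[k] = True
    blocks, k = [], 0
    while k < len(t):
        if marked[k]:
            start, d = k, sign(t[k])
            while k + 1 < len(t) and marked[k + 1] and sign(t[k + 1]) == d:
                k += 1
            blocks.append((start, k, d))
        k += 1
    return blocks


def merge_blocks(blocks, max_gap=5):
    """Merge neighbouring same-direction blocks separated by <= max_gap regions."""
    merged = []
    for start, end, d in blocks:
        if merged and merged[-1][2] == d and start - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], end, d)
        else:
            merged.append((start, end, d))
    return merged


t_vals = [3, 3, 3, 3, 0.1, -0.2, 3, 3, 3, 3, -3]
p_vals = [0.01, 0.01, 0.01, 0.1, 0.9, 0.8, 0.01, 0.02, 0.03, 0.04, 0.01]
seeds = seed_blocks(t_vals, p_vals)
large = merge_blocks(seeds)
```

Each merged block would then be tested as a whole, using the mean r_i of its regions, before being reported as a large CNV.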
Analysis of the 15 target samples gave CNV results that were highly consistent with the known results (SNP-array results), with no false positives or false negatives, as shown in Table 3. In addition, we simulated eight 30X whole-genome data sets, comprising 5 normal samples and 3 samples containing CNVs, ran CNV detection on these 8 simulated data sets, and compared the results with CONTRA, a published CNV prediction software for exome regions (Li J, Lupat R, et al., CONTRA: copy number analysis for targeted resequencing, Bioinformatics. 2012 May 15;28(10):1307-13). The results show that the sensitivity and specificity of our method both reached 100% and that the copy numbers were also detected accurately; CNVs can be detected at a resolution of 500 kb and located precisely. CONTRA, in contrast, had a sensitivity of 88.9% and a specificity of only 66.7% and did not give copy numbers, as shown in Table 4.
Table 3
Table 4 (CONTRA rows)
Method: CONTRA; sensitivity 88.9%; specificity 66.7%
Sample   Chromosome   Start      End        Size (Mb)   Direction   Copy number
1-5      Normal
6        chr20        15007645   15492763   0.49        -           NA
6        chr19        45003283   45496699   0.49        +           NA
7        chr20        16009467   17992034   1.98        -           NA
7        chr19        15009149   15993267   0.98        +           NA
7        chr19        50000342   50998777   1           -           NA
8        chr20        63704      9990568    9.93        -           NA
8        chr20        10007121   12995181   2.99        -           NA
8        chr20        19869958   35830028                           NA
8        chr20        42008751   42442770   0.43        +           NA
8        chr19        3430053    31917721   28.49       +           NA
8        chr19        35004532   35595304   0.59        -           NA
Result: 8 true positive CNVs, 3 false positive CNVs
2. LOH detection
2.1 Detecting the heterozygous state of each region of the whole genome
Within a genomic region of the sample under test, the SNP sites whose allele frequency (AF) in the 1000 Genomes data is between 0.1 and 0.9 are identified, and the R_het values of the regions of these SNP sites are calculated for the 1000 Genomes data and for the sample under test as follows. When region i is in a completely heterozygous state in the sample under test, R_het = 1; conversely, when it is completely homozygous, R_het = 0.
$R_{het} = MAF / (1 - MAF)$, where MAF (minor allele frequency) is the minor allele frequency.
In the sample under test, taking any SNP site m in a region as the starting point, n consecutive SNP sites are taken downstream to form the heterozygosity set S_m of the region, i.e. $S_m = \{R_{het,m}, R_{het,(m+1)}, \dots, R_{het,(m+n)}\}$. In the same way, the SNP sites at the same positions are taken from the 1000 Genomes database to form the heterozygosity set P_m, i.e. $P_m = \{R_{het,Pm}, R_{het,P(m+1)}, \dots, R_{het,P(m+n)}\}$. An F test is used to test whether the variances of the two heterozygosity sets are equal. Specifically, the variance $S_s^2$ of the heterozygosity set of the region in the sample under test and the variance $S_p^2$ of the heterozygosity set of the same region in the 1000 Genomes samples are calculated, and the p value of the heterozygosity set S_m is obtained as follows:
$S_{max}^2 = \max\{S_s^2, S_p^2\}$, $S_{min}^2 = \min\{S_s^2, S_p^2\}$
$H_0: \sigma_s = \sigma_p$
$F_{upper} = S_{max}^2 / S_{min}^2$, $df_s = df_p = n - 1$
$F_{under} = S_{min}^2 / S_{max}^2$, $df_s = df_p = n - 1$
$p = p_{upper} + (1 - p_{under})$
When p ≤ 0.01, we accept H_A and judge that the heterozygosity set S_m has lost the heterozygosity present in the population, i.e. that LOH has occurred in the region of set S_m.
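The two-sided F test on the variances of the two heterozygosity sets can be sketched with the Python standard library only. The sketch rests on several assumptions that the text does not confirm: variances are sample variances, both sets have the same number n of R_het values so that both degrees of freedom are n - 1, p_upper and p_under are read as the upper-tail probabilities of F_upper and F_under, and a zero variance on one side only is treated as a significant difference. The upper tail of the F distribution is computed from the regularized incomplete beta function, evaluated with the usual continued fraction.

```python
import math
import statistics


def _betacf(a, b, x, max_iter=200, eps=3e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def f_upper_tail(f, d1, d2):
    """P(F >= f) for an F distribution with (d1, d2) degrees of freedom."""
    return betai(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))


def loh_p_value(sample_rhet, population_rhet):
    """p = p_upper + (1 - p_under) for the two heterozygosity sets."""
    df = len(sample_rhet) - 1
    var_s = statistics.variance(sample_rhet)
    var_p = statistics.variance(population_rhet)
    s_max, s_min = max(var_s, var_p), min(var_s, var_p)
    if s_min == 0.0:                      # convention chosen for this sketch
        return 1.0 if s_max == 0.0 else 0.0
    p_upper = f_upper_tail(s_max / s_min, df, df)
    p_under = f_upper_tail(s_min / s_max, df, df)
    return p_upper + (1.0 - p_under)


flat = [0.0, 0.01] * 5                    # target region with almost no heterozygosity
varied = [0.2, 0.9, 0.5, 1.0, 0.3, 0.8, 0.4, 0.7, 0.6, 0.25]
p_loh = loh_p_value(flat, varied)
```

Under these assumptions a region whose R_het values have collapsed towards 0 while the population values remain spread out yields a p value far below the 0.01 threshold of the embodiment.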
2.2 Detecting large LOH
Building on the results of 2.1, and following the approach used in step 1.4 for detecting large CNVs, four consecutive subsets that have lost heterozygosity are recorded as one minimum unit. If two units are separated by no more than 2 subsets, they are merged into a larger unit, and so on, until the units are finally connected into blocks. An F test is then performed on the $R_{het}$ values of the sample under test and the 1000 Genomes reference set to calculate the p value of each block. When p ≤ 0.01, the block is considered to have undergone LOH; otherwise it is a non-LOH block.
Alternatively, for more accurate detection, stricter conditions can be set. For example, to avoid false positives caused by random errors, a region may be required to be larger than 5 Mb before it can be considered a real LOH. On this basis, with a block fault tolerance of 1 (i.e. the p value of one subset within the block is allowed to exceed 0.01), the subsets from 2.1 with p ≤ 0.01 that lie near a run of consecutive subsets with p ≤ 0.01 are merged with that run. Finally, an F test is performed again on the $R_{het}$ values of the merged region; if its p value is less than 0.01, the block is considered a real LOH.
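The unit and block construction described in 2.2 can be sketched as follows. This is an illustrative reading, assuming each subset has already been given a boolean flag (True when its p value from 2.1 is ≤ 0.01); the function names are hypothetical. The final F test on each merged block is not repeated here.

```python
def find_units(lost, min_run=4):
    """Return [start, end] index pairs (inclusive) of runs of at least
    `min_run` consecutive subsets flagged as having lost heterozygosity."""
    units = []
    start = None
    # A trailing False closes a run that reaches the end of the list.
    for idx, flag in enumerate(list(lost) + [False]):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            if idx - start >= min_run:
                units.append([start, idx - 1])
            start = None
    return units


def merge_units(units, max_gap=2):
    """Merge neighbouring units separated by at most `max_gap` subsets."""
    blocks = []
    for start, end in units:
        if blocks and start - blocks[-1][1] - 1 <= max_gap:
            blocks[-1][1] = end          # extend the previous block
        else:
            blocks.append([start, end])  # open a new block
    return blocks


flags = [True] * 4 + [False] * 2 + [True] * 4 + [False] * 3 + [True] * 5
print(merge_units(find_units(flags)))
```

Each block returned this way would then be tested once more with the F test before being reported as LOH.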
3. Detecting UPD
UPD detection is carried out by combining the whole-genome CNV and LOH results above, in accordance with Mendelian inheritance. If a DNA region is heterozygous in the 1000 Genomes data, i.e. $R_{het} = 1$, but its heterozygosity has disappeared in the actual measurement, i.e. $R_{het}$ approaches 0, the region is judged to have undergone LOH. If, at the same time, the CNV result for this region shows two copies (CN = 2), i.e. the copy number is unchanged (the samples of this embodiment are diploid, and every region of a normal diploid genome has two copies), the region is judged to have undergone uniparental disomy (UPD).
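The decision rule above reduces to a small function. The sketch below is only an illustration; the function name and the returned labels are hypothetical, and the normal copy number defaults to 2 because the samples of this embodiment are diploid.

```python
def classify_region(has_loh, copy_number, normal_copy_number=2):
    """Combine the LOH call and the CNV call for one genomic region."""
    if not has_loh:
        return "no LOH"
    if copy_number == normal_copy_number:
        # Heterozygosity is lost although the copy number is unchanged.
        return "UPD"
    return "LOH with copy-number change"


print(classify_region(has_loh=True, copy_number=2))
```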
In 13 of the 15 samples, 10 LOH regions larger than 5 Mb and 4 UPD regions larger than 5 Mb were detected; the results are shown in Table 5. LOH and UPD were detected without paired samples. (Conventionally, a subject's lesion tissue is compared with the same subject's normal tissue; these are paired samples and are related to each other. In this embodiment, LOH and UPD are detected by comparing the target sample with a set of multiple reference samples; the target sample and the reference set are unrelated, so they are not paired samples.) The LOH results larger than 5 Mb are consistent with the CNV results with CN = 1 (the CNV results were used to verify the accuracy of the LOH results). The scheme of the present invention detects LOH and UPD with high accuracy and can reach a resolution at the 5 Mb level.
The Circos plot (Fig. 5) gives an overview of the CNV, LOH and UPD detection results for sample GM50275.
Table 5
Industrial Applicability
The method of the present invention for determining probe sequences based on a reference sequence can be used effectively to determine probe sequences. The probes obtained are used for hybridization capture of the genome to obtain multiple local genomic regions; the captured local regions can represent the whole genome and reflect whole-genome variation information, and are used to discover structural variation across the whole genome. Although specific embodiments of the present invention have been described in detail, those skilled in the art will understand that, in light of all the teachings disclosed, various modifications and substitutions can be made to those details, and that these changes all fall within the scope of protection of the present invention. The full scope of the present invention is given by the appended claims and any equivalents thereof.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "illustrative example", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Claims (26)

    1. A method for determining a probe sequence based on a reference sequence, characterized by comprising:
    (1) constructing a first candidate probe set based on a plurality of discrete high-frequency SNP sites, wherein the first candidate probe set consists of a plurality of candidate probes, and each of the plurality of candidate probes contains at least one of the discrete high-frequency SNP sites;
    (2) aligning the plurality of candidate probes in the first candidate probe set against the reference sequence to obtain an alignment result;
    (3) performing a first screening of the first candidate probe set based on the alignment result, to obtain a second candidate probe set consisting of a plurality of candidate probes;
    (4) dividing the reference sequence into a plurality of windows each having a predetermined length, and assigning each of the plurality of candidate probes in the second candidate probe set to its matching window, to determine the position information of each of the plurality of candidate probes;
    (5) performing a second screening of the second candidate probe set based on the position information and the allele frequencies of the discrete high-frequency SNPs, to determine the probe sequence.
    2. The method according to claim 1, characterized in that the allele frequency of each of the plurality of discrete high-frequency SNP sites is at least 10%, preferably more than 90%.
    3. The method according to claim 1, characterized in that the physical distance on the reference sequence between any two adjacent discrete high-frequency SNP sites among the plurality of discrete high-frequency SNP sites is not less than the length of the candidate probe.
    4. The method according to claim 1, characterized in that the length of the candidate probe is 50 to 250 mer, preferably 100 mer.
    5. The method according to claim 1, characterized in that the candidate probe contains one discrete high-frequency SNP site, and the discrete high-frequency SNP site is located in the middle section of the candidate sequence.
    6. The method according to claim 5, characterized in that the discrete high-frequency SNP site is located at the midpoint of the candidate probe.
    7. The method according to claim 1, characterized in that the candidate probe is excised from the reference sequence.
    8. The method according to claim 1, characterized in that, before the alignment is carried out, the first candidate probe set is pre-screened based on at least one of the GC content and the number of single-base repeats of the candidate probes.
    9. The method according to claim 8, characterized in that the pre-screening comprises retaining candidate probes that meet at least one of the following:
    the GC content is 35% to 65%; and
    the number of single-base repeats is less than 7.
    10. The method according to claim 1, characterized in that the first screening comprises retaining candidate probes that meet at least one of the following conditions:
    the candidate probe aligns uniquely to the reference sequence;
    the candidate probe aligns to multiple positions of the reference sequence, and the mismatch ratio at each of at least two of the multiple positions is less than 10%.
    11. The method according to claim 1, characterized in that, in step (4), the reference sequence is divided into a plurality of windows each having the same predetermined length.
    12. The method according to claim 11, characterized in that the reference sequence is divided into a plurality of windows each 10 kb in length.
    13. The method according to claim 1, characterized in that, in step (5), the probe is determined according to the following steps:
    (a) if multiple candidate probes are located in the same window, determining the candidate probe whose discrete high-frequency SNP has the highest allele frequency;
    (b) if there is only one candidate probe whose discrete high-frequency SNP has the highest allele frequency, selecting that candidate probe as the probe; if there are multiple candidate probes whose discrete high-frequency SNPs have the highest allele frequency, selecting from among them the candidate probe nearest to the center of the window as the probe.
    14. The method according to claim 1, characterized in that, after the probes are determined, the method further comprises: determining, on the reference sequence, the distance between each two adjacent probes;
    if the distance between two adjacent probes is greater than the larger of the lengths of the two windows in which the adjacent probes are located, further selecting an STR, or part of an STR, located between the two windows as a probe.
    15. The method according to claim 1, characterized in that the reference sequence is a reference genome or a part thereof.
    16. A method for detecting genomic structural variation, the genomic structural variation comprising at least one of chromosomal aneuploidy, copy number variation, and insertion/deletion, characterized in that the method comprises:
    (1) sequencing the genomic nucleic acid of a target sample to obtain a genome sequencing result consisting of multiple reads, wherein, optionally, the sequencing includes screening with probes, the probes being obtained by the method according to any one of claims 1 to 15;
    (2) dividing the reference genome into m regions, and calculating the coverage depth $TD_i$ of region i from the number of reads falling into region i, wherein m and i are natural numbers, i denotes the region number, 1 ≤ i ≤ m, and 10 < m;
    (3) determining whether structural variation exists in region i based on the degree of difference between the coverage depth of region i and the coverage depth of region i in k reference samples, wherein k is a natural number and k ≥ 2.
    17. The method according to claim 16, characterized in that the coverage depth of region i is determined using the following equation:
    $TD_i = \dfrac{\text{number of reads falling into region } i}{\text{length of region } i}$, or
    $TD_i = \dfrac{\text{total number of bases contained in the reads falling into region } i}{\text{length of region } i}$,
    wherein i denotes the region number.
    18. The method according to claim 16, characterized in that the degree of difference between the coverage depth of target sample genome region i and the coverage depth of region i in the k reference samples is tested by means of a t test.
    19. The method according to claim 16, characterized in that the comparison of the degree of difference between the coverage depth of region i and the coverage depth of region i in the k reference samples is carried out by comparing the coverage depth coefficients of genome region i of the target sample and of the reference samples, wherein determining the coverage depth coefficient of region i comprises the following steps:
    (a) performing a first correction on $TD_i$ to obtain a first corrected coverage depth $TD_{ai}$, the first correction being implemented by linear regression on the coverage depth values of n consecutive regions that include region i, wherein n is a natural number and 10 < n ≤ m/2;
    (b) normalizing $TD_{ai}$ to obtain $\overline{TD}$, and from it the coverage depth coefficient $R_i$ of region i.
    20. The method according to claim 19, characterized in that, in step (a), the first corrected coverage depth $TD_{ai}$ is determined based on the following equation: $TD_{ai} = \left(\sum_{j} TD_j\right)/n$, wherein $TD_j$ denotes the coverage depth of the j-th region among the n consecutive regions, j is a natural number, and 1 ≤ j ≤ n.
    21. The method according to claim 20, characterized in that, in step (b), $TD_{ai}$ is normalized based on the following equation to obtain
    22. The method according to any one of claims 18 to 21, characterized by further comprising, after $R_i$ of the target sample is obtained, performing a second correction on $R_i$ using $\overline{R}_{ai}$, wherein $\overline{R}_{ai}$ is the average of the coverage depth coefficients of genome region i of the k reference samples,
    $\overline{R}_{ai} = \dfrac{\sum_{y=1}^{k} R_y}{k}$,
    1 ≤ y ≤ k, y is a natural number denoting the reference sample number, and $R_y$ denotes the coverage depth coefficient of genome region i of reference sample y.
    23. The method according to any one of claims 18 to 21, characterized by further comprising, after $R_i$ of the target sample is obtained, performing a second correction on $R_i$ using $\overline{R}_{ai}$, wherein $\overline{R}_{ai}$ is the average of the coverage depth coefficients of genome region i of the k reference samples and the target sample,
    $\overline{R}_{ai} = \dfrac{\sum_{y=1}^{k+1} R_y}{k+1}$,
    1 ≤ y ≤ k+1, y is a natural number denoting the sample number, and $R_y$ denotes the coverage depth coefficient of genome region i of sample y.
    24. The method according to claim 22 or 23, characterized in that, in carrying out the t test, the t value of target sample genome region i is calculated from the second-corrected coverage depth coefficient of target sample genome region i, the mean of the second-corrected coverage depth coefficients of genome region i of the reference samples y, and the standard deviation of the k reference samples.
    25. The method according to claim 24, characterized in that a significance level $P_i$ is obtained based on the t value of target sample genome region i; when $P_i$ ≤ 0.05, region i is judged to have structural variation; otherwise, region i is judged not to have structural variation.
    26. The method according to claim 24, characterized in that a theoretical value $t_{i0}$ is obtained based on a predetermined significance level $P_{i0}$; when the t value of target sample genome region i is, in absolute value, not less than $t_{i0}$, region i is judged to have structural variation; otherwise, region i is judged not to have structural variation; the predetermined $P_{i0}$ ≤ 0.05.
    27. The method according to any one of claims 16 to 21, characterized in that, after step (3) is carried out, W same-direction, consecutive regions are merged to obtain a first-level merged region; two first-level merged regions are merged, when they are of the same direction and separated by a span of no more than L regions, to obtain a second-level merged region; and the structural variation of the second-level merged region is detected based on the degree of difference between the coverage depth of the second-level merged region of the target sample genome and the coverage depth of the corresponding region on multiple reference sample genomes; wherein same-direction regions are regions whose t statistics are all greater than 0 or all less than 0, W and L are natural numbers, W ≥ 2, L-W^ l o
    28. A method for detecting loss of heterozygosity, characterized by comprising:
    (1) sequencing the genomic nucleic acid of a target sample to obtain a genome sequencing result consisting of multiple reads, wherein, optionally, the sequencing includes screening with probes, the probes being obtained by the method according to any one of claims 1 to 15;
    (2) dividing the reference genome into m' regions; based on the information of the reads of the genome sequencing result that fall into region i and on the population data for region i, obtaining the SNP sites shared by target sample genome region i and population region i to form a shared SNP set; calculating, for the target sample and for the population respectively, the heterozygosity of the fragment in which each SNP site of the shared SNP set lies, to obtain the heterozygosity set $U_i$ of target sample genome region i and the heterozygosity set $U_{0i}$ of population region i; and comparing $U_i$ of the target sample with $U_{0i}$ of the population to determine whether loss of heterozygosity exists in target sample region i; wherein the fragment in which an SNP site lies is bounded by the two SNPs adjacent to that SNP upstream and downstream, m' and i are natural numbers, m' ≥ i ≥ 1, and m' ≥ 6.
    29. The method according to claim 28, characterized in that the allele frequency of each SNP in the shared SNP set is greater than 0.1.
    30. The method according to claim 28, characterized in that the heterozygosity of the fragment in which an SNP site lies is represented by the minor allele frequency coefficient of that SNP site, the minor allele frequency coefficient of the SNP site being $R_{het} = MAF/(1 - MAF)$, where MAF is the minor allele frequency of the SNP.
    31. The method according to claim 30, characterized in that, in comparing $U_i$ of the target sample with $U_{0i}$ of the population to determine whether loss of heterozygosity has occurred in target sample region i, an F test is used to judge whether the variance of $U_i$ and the variance of $U_{0i}$ differ significantly; if the variances of $U_i$ and $U_{0i}$ differ significantly, target sample region i is judged to have loss of heterozygosity; otherwise, target sample region i is judged not to have loss of heterozygosity.
    32. The method according to claim 31, characterized in that the F test comprises calculating the variances of $U_i$ and $U_{0i}$ respectively, using the resulting variance of $U_i$ of the target sample and variance of $U_{0i}$ of the population to calculate two mutually reciprocal statistics $F_{upper}$ and $F_{under}$, obtaining a significance level $p_F$ from the two mutually reciprocal statistics, and comparing $p_F$ with a predetermined significance level $p_{F0}$; wherein v is the number of an SNP in the high-frequency SNP set shared by target sample genome region i and population region i; q is the number of SNPs in that shared high-frequency SNP set; $R_{het,v}$ is the minor allele frequency coefficient of the v-th SNP in the shared high-frequency SNP set of target sample genome region i; $\overline{R}_{het,i}$ is the average of the minor allele frequency coefficients of the q SNPs in the shared high-frequency SNP set of target sample genome region i; $R^{0}_{het,v}$ is the minor allele frequency coefficient of the v-th SNP in the shared high-frequency SNP set of population genome region i; $\overline{R}^{0}_{het,i}$ is the average of the minor allele frequency coefficients of the q SNPs in the shared high-frequency SNP set of population genome region i; $p_{upper}$ and $p_{under}$ are obtained from $F_{upper}$ and $F_{under}$ respectively; and $p_{F0}$ ≤ 0.05.
    33. The method according to any one of claims 28 to 32, characterized in that, after step (2), W' consecutive regions in which loss of heterozygosity has occurred are merged to obtain a third-level merged region; two third-level merged regions are merged, when the span between them is no more than L' regions, to obtain a fourth-level merged region; the heterozygosity set of the fourth-level merged region of the target sample and the heterozygosity set of the same region of the population are obtained respectively; and the two heterozygosity sets are compared to determine whether loss of heterozygosity has occurred in the fourth-level merged region of the target sample; wherein W' and L' are natural numbers, W' ≥ 2, and W'/2 ≥ L'.
    34. A method for detecting uniparental disomy, characterized in that, when loss of heterozygosity is detected in a target sample genome region, the copy number of this genome region is calculated, and when the copy number of this genome region is the same as the copy number of the same region in a normal genome of the same species, the target sample genome region is judged to be uniparental disomy; the loss of heterozygosity of the target sample genome region is determined by the method according to any one of claims 27 to 32.
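The probe pre-screening of claims 8 and 9 and the per-window selection of claims 11 to 13 can be illustrated with a short Python sketch. It is an assumption-laden example rather than the claimed method itself: all function names are hypothetical, each candidate is reduced to the coordinate of its SNP and that SNP's allele frequency, and the pre-screen requires both conditions by default although the claim only requires at least one of them (set `require_all=False` for that reading).

```python
from collections import defaultdict


def gc_content(seq):
    """Fraction of G and C bases in a non-empty probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def passes_prescreen(seq, require_all=True):
    """GC content of 35% to 65% and fewer than 7 single-base repeats."""
    checks = (0.35 <= gc_content(seq) <= 0.65, longest_homopolymer(seq) < 7)
    return all(checks) if require_all else any(checks)


def select_probes(candidates, window=10_000):
    """Pick one candidate per window: highest allele frequency first,
    then the smallest distance to the window center on ties.

    `candidates` is an iterable of (position, allele_frequency) pairs.
    """
    by_window = defaultdict(list)
    for pos, af in candidates:
        by_window[pos // window].append((pos, af))
    chosen = {}
    for w, items in by_window.items():
        center = w * window + window / 2
        chosen[w] = min(items, key=lambda c: (-c[1], abs(c[0] - center)))
    return chosen


candidates = [(1000, 0.3), (4900, 0.5), (9000, 0.5), (12000, 0.2)]
print(select_probes(candidates))
```

The 10 kb default window follows claim 12; any other window length can be passed through the `window` parameter.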
CN201480080426.0A 2014-07-04 2014-07-04 Method for determining probe sequence and method for detecting genome structure variation Active CN106715711B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/081686 WO2016000267A1 (en) 2014-07-04 2014-07-04 Method for determining the sequence of a probe and method for detecting genomic structural variation

Publications (2)

Publication Number Publication Date
CN106715711A true CN106715711A (en) 2017-05-24
CN106715711B CN106715711B (en) 2021-09-17

Family

ID=55018343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480080426.0A Active CN106715711B (en) 2014-07-04 2014-07-04 Method for determining probe sequence and method for detecting genome structure variation

Country Status (2)

Country Link
CN (1) CN106715711B (en)
WO (1) WO2016000267A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018214010A1 (en) * 2017-05-23 2018-11-29 深圳华大基因研究院 Method, device, and storage medium for detecting mutation on the basis of sequencing data
CN110872618B (en) * 2018-09-04 2022-04-19 北京果壳生物科技有限公司 Method for judging sex of detected sample based on Illumina human whole genome SNP chip data and application
CN111383714B (en) * 2018-12-29 2023-07-28 安诺优达基因科技(北京)有限公司 Method for simulating target disease simulation sequencing library and application thereof
CN110079589A (en) * 2019-05-21 2019-08-02 中国农业科学院农业基因组研究所 A kind of accurate method for obtaining structure variation within the scope of full-length genome
CN110600078B (en) * 2019-08-23 2022-03-18 北京百迈客生物科技有限公司 Method for detecting genome structure variation based on nanopore sequencing
CN110592208B (en) * 2019-10-08 2022-05-03 北京诺禾致源科技股份有限公司 Capture probe composition of three subtypes of thalassemia as well as application method and application device thereof
CN112662767B (en) * 2020-11-25 2021-08-06 深圳华大基因股份有限公司 Kit and probe for measuring genomic instability and application of kit and probe
CN112522382B (en) * 2020-12-22 2024-03-22 广州深晓基因科技有限公司 Y chromosome sequencing method based on liquid phase probe capture
CN113971986B (en) * 2021-10-12 2023-03-21 江苏先声医疗器械有限公司 Method for checking cross contamination of sequencing sample through sequence similarity
CN114220481B (en) * 2021-11-25 2023-09-08 深圳思勤医疗科技有限公司 Method, system and computer readable medium for completing karyotyping of a sample to be tested based on whole genome sequencing
CN114582427B (en) * 2022-03-22 2023-04-07 成都基因汇科技有限公司 Method for identifying introgression section and computer readable storage medium
CN115101128B (en) * 2022-06-29 2023-09-15 纳昂达(南京)生物科技有限公司 Method for evaluating off-target risk of hybridization capture probe
CN115713971B (en) * 2022-09-28 2024-01-23 上海睿璟生物科技有限公司 Target sequence capture probe design strategy selection method, system and terminal

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1370242A (en) * 1999-06-15 2002-09-18 基因描绘系统有限公司 Genomic profiling: rapid method for testing complex biological sample for presence of many types of organisms
WO2005001091A1 (en) * 2003-06-27 2005-01-06 Olympus Corporation Probe set for detecting mutation and polymorphism in nucleic acid, dna array having the same immobilized thereon and method of detecting mutation and polymorphism in nucleic acid using the same
US20050042654A1 (en) * 2003-06-27 2005-02-24 Affymetrix, Inc. Genotyping methods
US20050136417A1 (en) * 2003-12-19 2005-06-23 Affymetrix, Inc. Amplification of nucleic acids
CN101213312A (en) * 2005-06-30 2008-07-02 先正达参股股份有限公司 Methods for screening for gene specific hybridization polymorphisms (GSHPs) and their use in genetic mapping and marker development
CN101360834A (en) * 2005-11-21 2009-02-04 西蒙斯单倍体有限公司 Method and probe for identifying nucleotide sequence
CN101395280A (en) * 2006-03-01 2009-03-25 凯津公司 High throughput sequence-based detection of snps using ligation assays
US20070243546A1 (en) * 2006-03-31 2007-10-18 Affymetrix, Inc Analysis of methylation using nucleic acid arrays
US7901882B2 (en) * 2006-03-31 2011-03-08 Affymetrix, Inc. Analysis of methylation using nucleic acid arrays
CN101790731A (en) * 2007-03-16 2010-07-28 吉恩安全网络公司 Be used to remove the system and method that genetic data disturbed and determined the chromosome copies number
CN101712959A (en) * 2008-10-08 2010-05-26 中国人民解放军军事医学科学院放射与辐射医学研究所 Novel human cell growth inhibiting gene THAP11 and application thereof
WO2011146788A2 (en) * 2010-05-19 2011-11-24 The Translational Genomics Research Institute Methods of assessing a risk of developing necrotizing meningoencephalitis
CN103080333A (en) * 2010-09-14 2013-05-01 深圳华大基因科技有限公司 Methods and systems for detecting genomic structure variations
CN102127819A (en) * 2010-11-22 2011-07-20 深圳华大基因科技有限公司 Constructing method and application of nucleic acid library in MHC (Major Histocompatibility Complex) region
WO2014099979A2 (en) * 2012-12-17 2014-06-26 Virginia Tech Intellectual Properties, Inc. Methods and compositions for identifying global microsatellite instability and for characterizing informative microsatellite loci

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
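JOHN C. TAN et al.: "Optimizing comparative genomic hybridization probes for genotyping and SNP detection in Plasmodium falciparum", GENOMICS *
WENDY A. KELLNER et al.: "Uprobe: A genome-wide universal probe resource for comparative physical mapping in vertebrates", GENOME RESEARCH *
HU Jiali et al.: "A method for enhancing the specificity of MLPA detection of SNP sites", Journal of Guiyang Medical College *
LU Zuhong et al.: "Screening of disease-related SNPs and low-cost, rapid whole-genome DNA sequencing technology", Abstracts of the 11th Conference on Colloid and Interface Chemistry of the Chinese Chemical Society *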

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112739828A (en) * 2018-06-11 2021-04-30 深圳华大生命科学研究院 Method and system for determining type of sample to be tested
CN112739828B (en) * 2018-06-11 2024-04-09 深圳华大生命科学研究院 Method and system for determining type of sample to be detected
CN109584963A (en) * 2018-09-30 2019-04-05 南京派森诺基因科技有限公司 A kind of diversified abstracting method of high-flux sequence data
CN112885410A (en) * 2021-01-28 2021-06-01 陈晓熠 Genotyping chip for CNV structural variation detection
CN113593644A (en) * 2021-06-29 2021-11-02 广东博奥医学检验所有限公司 Method for detecting chromosome uniparental disomy by low-depth sequencing based on family
CN113593644B (en) * 2021-06-29 2024-03-26 广东博奥医学检验所有限公司 Method for detecting chromosome single parent dimer based on family low depth sequencing
WO2023030233A1 (en) * 2021-08-30 2023-03-09 广州燃石医学检验所有限公司 Copy number variation detection method and application thereof
CN114678067B (en) * 2022-03-21 2023-03-14 纳昂达(南京)生物科技有限公司 Method and device for constructing multi-population non-exon region SNP probe set
CN114678067A (en) * 2022-03-21 2022-06-28 纳昂达(南京)生物科技有限公司 Method and device for constructing multi-population non-exon region SNP probe set
CN115631789B (en) * 2022-10-25 2023-08-15 哈尔滨工业大学 Group joint variation detection method based on pan genome
CN115631789A (en) * 2022-10-25 2023-01-20 哈尔滨工业大学 Pangenome-based group joint variation detection method
CN115713967A (en) * 2022-11-17 2023-02-24 纳昂达(南京)生物科技有限公司 Design method of probe pool and related device
CN115713967B (en) * 2022-11-17 2023-08-29 纳昂达(南京)生物科技有限公司 Method for designing probe pool and related device
CN116144794A (en) * 2023-03-09 2023-05-23 华中农业大学 Bovine 12K SV liquid phase chip and design method and application thereof
CN116144794B (en) * 2023-03-09 2023-12-19 华中农业大学 Bovine 12K SV liquid phase chip and design method and application thereof
CN118460706A (en) * 2024-07-10 2024-08-09 中国科学院心理研究所 Methods, devices, media and program products for detecting mitochondrial genes

Also Published As

Publication number Publication date
WO2016000267A1 (en) 2016-01-07
CN106715711B (en) 2021-09-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant