CN110189796A - Sheep whole-genome resequencing analysis method - Google Patents
- Publication number
- CN110189796A (CN201910448101.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- sheep
- sequence
- genome
- snp
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a sheep whole-genome resequencing analysis method and relates to the field of gene technology. The method comprises the following steps: (1) obtaining sheep DNA, testing its purity, concentration and volume, performing library preparation and library quality inspection on the samples that pass the test, and sequencing the libraries that pass quality inspection to obtain raw sheep sequencing data; (2) filtering the raw sheep sequencing data and assessing sequencing quality, and obtaining the target analysis sequence data once data quality control is passed; (3) aligning the target analysis sequence data to the sheep reference genome, and obtaining the aligned data once alignment-metric quality control is passed; (4) detecting and annotating single-nucleotide variants (SNPs), small insertion/deletion variants (InDels) and chromosomal structural variants (SVs) in the aligned data, to obtain the SNP, InDel and SV data of the sheep whole-genome sequencing data.
Description
Technical field
The present invention relates to the field of gene technology, and in particular to a sheep whole-genome resequencing analysis method.
Background art
DNA is an important substance in living organisms. It carries hereditary information in the form of genes and serves as the template for gene replication and transcription, and it plays an important role in cell growth, differentiation and development and in the metabolism and disease processes of the individual organism. The entire hereditary information carried by a cell is collectively called the genome.
Whole-genome resequencing uses the Illumina sequencing platform to sequence the whole genome of individuals or populations of a species that has a reference genome sequence. Using high-performance computing platforms and bioinformatics methods, it detects polymorphism information such as single nucleotide polymorphism (SNP) sites and insertions/deletions (InDels) and obtains the genetic characteristics of the organism, so that subsequent genetic evolution analysis and prediction of candidate genes related to important traits can be carried out. This is of important guiding significance for research such as molecular breeding of the species. However, no more detailed and specific sequencing method, together with an effective analysis of its results, has yet been available for the sheep whole genome.
Summary of the invention
In view of this, embodiments of the invention provide a sheep whole-genome resequencing analysis method.
To achieve the above purpose, the invention mainly provides the following technical solution:
In one aspect, embodiments of the invention provide a sheep whole-genome resequencing analysis method, the method comprising the following steps:
(1) obtaining sheep DNA, testing the purity, concentration and volume of the DNA, performing library preparation and library quality inspection on the samples that pass the test, and sequencing the libraries that pass quality inspection to obtain raw sheep sequencing data;
(2) filtering the raw sheep sequencing data and assessing sequencing quality, and obtaining the target analysis sequence data once data quality control is passed;
(3) aligning the target analysis sequence data to the sheep reference genome, and obtaining the aligned data once alignment-metric quality control is passed;
(4) detecting and annotating the single-nucleotide variants (SNPs), small insertion/deletion variants (InDels) and chromosomal structural variants (SVs) in the aligned data, to obtain the SNP data, InDel data and SV data of the sheep whole-genome sequencing data.
Preferably, the data filtering proceeds as follows:
(1) removing adapter-contaminated sequences, i.e. sequences in which more than 5 bp are adapter contamination; for paired-end sequencing, if one end is contaminated by adapter, the sequences at both ends are removed;
(2) removing low-quality sequences, i.e. sequences in which bases with quality value Q ≤ 19 make up 50% or more of all bases; for paired-end sequencing, if one end is a low-quality sequence, the sequences at both ends are removed;
(3) removing reads with an N content greater than 5%; for paired-end sequencing, if the N content of one end is greater than 5%, the sequences at both ends are removed.
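The three filtering criteria above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the patent does not say how adapter contamination is detected, so the naive exact-prefix search in `adapter_overlap`, the adapter string and all function names are assumptions made for the example.

```python
def adapter_overlap(seq, adapter):
    """Length of the longest adapter prefix found in the read (naive exact match, an assumption)."""
    for k in range(len(adapter), 0, -1):
        if adapter[:k] in seq:
            return k
    return 0


def read_fails(seq, quals, adapter):
    """Return True if a single read violates any of the three filtering criteria."""
    n = len(seq)
    if n == 0:
        return True
    # (1) adapter contamination: more than 5 bp of adapter in the read
    if adapter_overlap(seq, adapter) > 5:
        return True
    # (2) low quality: bases with Q <= 19 make up 50% or more of the read
    if sum(1 for q in quals if q <= 19) / n >= 0.5:
        return True
    # (3) N content greater than 5%
    if seq.upper().count("N") / n > 0.05:
        return True
    return False


def keep_pair(read1, read2, adapter):
    """Paired-end rule: if either end fails, both ends are removed."""
    return not (read_fails(*read1, adapter) or read_fails(*read2, adapter))


# Hypothetical reads as (sequence, list of Phred scores)
ADAPTER = "AGATCGGAAGAGC"
good = ("ACGTACGTAC", [30] * 10)
with_n = ("ACGTNCGTAC", [30] * 10)
low_q = ("ACGTACGTAC", [10] * 5 + [30] * 5)
with_adapter = ("ACGTAGATCGGA", [30] * 12)
```

Evaluating the pair as a unit, rather than each read separately, keeps the two FASTQ files of a paired-end run synchronized after filtering.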
Preferably, assessing sequencing quality comprises assessing the quality distribution of the data and the base distribution of the data, wherein the quality distribution information includes statistics on the base-calling error rate and the correct base-call rate.
Preferably, the target analysis sequence data are aligned to the sheep reference genome with the software BWA, and the mapping rate is 90.2%-99.1%.
Preferably, the alignment-metric quality control includes sequencing depth distribution information, wherein the sequencing depth distribution information includes single-base depth distribution information and cumulative depth distribution information.
Preferably, the SNP data are obtained as follows: on the basis of the aligned data, the variant analysis software GATK extracts all potential SNP sites in the whole genome; these are further filtered by quality value, depth and reproducibility to obtain a high-confidence SNP data set, which is then annotated. The following are then computed: the distribution of the SNPs in the data set across the regions of the genome, the proportion of heterozygous SNPs in the genome, the distribution of SNP mutation patterns, and the functional classification of SNP mutations in coding regions.
Preferably, the InDel data are obtained as follows: on the basis of the aligned data, the variant analysis software GATK extracts all potential InDel sites in the whole genome; these are further filtered by quality value, depth and reproducibility to obtain a high-confidence InDel data set, which is then annotated. The following are then computed: the distribution of the InDels in the data set across the regions of the genome, the distribution of InDel mutation patterns, and the functional classification of InDel mutations in coding regions.
Preferably, the SV data are obtained as follows: on the basis of the aligned data, the chromosomal structural variation analysis software DELLY extracts all potential SV sites in the whole genome; these are further filtered by quality value, depth and reproducibility to obtain a high-confidence SV data set, which is then annotated. The following are then computed: the variant types of the SVs in the data set and the distribution of each type across the regions of the genome, the positional distribution of SVs in the genome, and the length distribution of SVs.
Preferably, the method further comprises step (5): performing functional annotation on all genes.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides a detailed sequencing procedure for the sheep whole genome and an effective analysis of the sequencing results, finding a large number of single nucleotide polymorphism (SNP) sites, insertion/deletion sites, structural variation sites and copy number variation sites. By bioinformatics means, the invention analyses the structural differences between the genomes of individual sheep, and the method can provide a scientific basis for sheep sequence differences and structural variation.
Brief description of the drawings
Fig. 1 is the experimental workflow diagram provided by an embodiment of the invention;
Fig. 2 is the resequencing information analysis workflow diagram provided by an embodiment of the invention;
Fig. 3 is an example of the FASTQ file format provided by an embodiment of the invention;
Fig. 4 is the sample quality value distribution plot provided by an embodiment of the invention;
Fig. 5 is the sample single-base depth distribution plot provided by an embodiment of the invention;
Fig. 6 is the sample cumulative depth distribution plot provided by an embodiment of the invention.
Detailed description of the embodiments
To further explain the technical means adopted by the invention to achieve its intended purpose and their effects, the specific embodiments, technical solutions, features and effects of the invention are described in detail below with reference to preferred embodiments. The particular features, structures or characteristics of the several embodiments described below may be combined in any suitable form.
Embodiment 1 (sheep whole-genome resequencing method and analysis)
The experimental workflow is shown in Fig. 1, and the resequencing information analysis workflow is shown in Fig. 2.
1. Sheep sample information: to make clear the relationship between the sample names used in the analysis and the original sample information, the sample information is summarized in Table 1, as follows:
Table 1. Summary of sample information
(1) Sample ID: the original sample name;
(2) Sample Name: the sample name used in the analysis results;
(3) Sample Description: the description of the original sample.
2. Data filtering: after a sample is received, first, the DNA provided, or the DNA extracted from the sample provided, is tested for purity, concentration, volume and so on; second, library preparation and library quality inspection are carried out on the samples that pass the test. For library preparation, genomic DNA is extracted from the sample and randomly fragmented, DNA fragments of the required length are recovered by electrophoresis, and adapter primers are added to prepare the required library. Finally, the libraries that pass quality inspection are sequenced on the instrument. The experimental workflow is shown in Fig. 1.
(1) Raw sequencing data: the raw image data files obtained by high-throughput sequencing (Illumina) are converted into sequenced reads by CASAVA base calling, and the results are stored in FASTQ (fq) file format and referred to as Raw Reads. A FASTQ file contains the name of each read, its base sequence and the corresponding sequencing quality information. In a FASTQ file, each base corresponds to one base-quality character, and the ASCII code of each quality character minus 33 is the sequencing quality score (Phred Quality Score) of that base. Different Phred Quality Scores represent different base-calling error rates; for example, Phred Quality Scores of 20 and 30 indicate base-calling error rates of 1% and 0.1%, respectively. An example of the FASTQ format is shown in Fig. 3.
In Fig. 3: (1) the first line begins with "@", followed by the Illumina sequence identifier (Sequence Identifiers) and descriptive text (optional); (2) the second line is the base sequence; (3) the third line begins with "+", followed by the Illumina sequence identifier (optional); (4) the fourth line gives the sequencing quality of the corresponding bases: the ASCII value of each character in this line minus 33 is the sequencing quality value of the corresponding base in the second line.
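The quality encoding described above can be illustrated with a short Python sketch. The ASCII-minus-33 rule comes from the text, and the error probability uses the standard Phred relation e = 10^(-Q/10), which reproduces the 1% and 0.1% figures quoted for Q20 and Q30. The function names and the example record are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality line into integer Phred scores (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Base-calling error probability implied by Phred score q: 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (identifier, sequence, scores)."""
    header, sequence, plus, quality = [ln.rstrip("\n") for ln in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, phred_scores(quality)


# Hypothetical record: '5' -> Q20, '?' -> Q30, 'I' -> Q40
record = ["@read1 example", "ACGT", "+", "5?II"]
name, seq, scores = parse_fastq_record(record)
```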
Some of the raw reads obtained by sequencing contain adapters or are of low quality. To guarantee the quality of the information analysis, the raw reads must be filtered to obtain Clean Reads, and all subsequent analysis is based on the Clean Reads. The data processing steps are as follows:
(1) remove adapter-contaminated reads (reads in which more than 5 bp are adapter contamination; for paired-end sequencing, if one end is contaminated by adapter, the reads at both ends are removed);
(2) remove low-quality reads (reads in which bases with quality value Q ≤ 19 make up 50% or more of all bases; for paired-end sequencing, if one end is a low-quality read, the reads at both ends are removed);
(3) remove reads with an N content greater than 5% (for paired-end sequencing, if the N content of one end is greater than 5%, the reads at both ends are removed).
The data filtering statistics are shown in Table 2:
Table 2. Data filtering statistics
In Table 2:
(1) Raw Reads Number: the number of raw, unfiltered reads;
(2) Raw Bases Number: the number of raw, unfiltered bases;
(3) Clean Reads Number: the number of reads remaining after filtering;
(4) Clean Reads Rate (%): the proportion of reads remaining after filtering; the larger this value, the better the sequencing quality or library quality;
(5) Clean Bases Number: the number of bases remaining after filtering;
(6) Low-quality Reads Number: the number of reads removed by the low-quality criterion;
(7) Low-quality Reads Rate (%): the proportion of reads removed by the low-quality criterion;
(8) Adapter polluted Reads Number: the number of reads removed because of adapter contamination;
(9) Adapter polluted Reads Rate (%): the proportion of reads removed because of adapter contamination;
(10) Raw Q30 Bases Rate (%): before filtering, the proportion of all bases in the raw reads that have a quality value greater than 30 (error rate below 0.1%);
(11) Clean Q30 Bases Rate (%): after filtering, the proportion of all bases that have a quality value greater than 30 (error rate below 0.1%);
(12) Ns Reads Number: the number of reads with an N content greater than 5%;
(13) Ns Reads Rate (%): the proportion of Raw Reads that have an N content greater than 5%.
(2) Quality distribution of the data: the sequencing error rate is related to base quality and is affected jointly by multiple factors such as the sequencer itself, the sequencing reagents and the sample. The sequencing error rate of each base is obtained by converting the sequencing Phred value (Phred score, Qphred) with a formula, and the Phred value is calculated during base calling by a model that predicts the probability of a base-calling error. The correspondence is shown in Table 3.
Table 3. Concise correspondence between Illumina CASAVA base calling and Phred scores
To give a rough picture of the stability of sequencing quality over the course of the run, the base position within the Clean Reads is taken as the abscissa and the average sequencing quality value at each position as the ordinate, giving the sequencing quality distribution plot of each sample shown in Fig. 4. In Fig. 4 most positions are at Q30 or above, so the data quality is relatively high. Quality at the start of the reads is often somewhat lower because of sequencing lag, and quality toward the end of the reads is often also somewhat lower because of depletion of the sequencing reagents.
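The per-position average that underlies a plot like Fig. 4 could be computed as in the sketch below. It assumes Phred+33 quality strings as described for the FASTQ format; the function name and sample input are hypothetical, and plotting is left out.

```python
def mean_quality_by_position(quality_lines, offset=33):
    """Average Phred score at each read position across all reads.

    Reads may differ in length; each position is averaged over the reads that reach it.
    """
    totals, counts = [], []
    for line in quality_lines:
        for i, ch in enumerate(line):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Hypothetical quality lines: 'I' -> Q40, '5' -> Q20, '?' -> Q30
profile = mean_quality_by_position(["II", "5I?"])
```

The resulting list can be plotted directly with read position on the x-axis and mean quality on the y-axis.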
(3) Base distribution of the data: the Clean Data base content distribution reflects the proportion in which each base occurs at each position of the Clean Reads. It is used to check for AT or GC separation, which may be introduced by sequencing or library construction and can affect subsequent quantitative analysis. Except for highly specific libraries, such as PCR libraries for particular species, the distribution should be broadly consistent with the GC distribution of the reference genome.
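A per-position base composition of this kind could be tallied as in the following sketch. The function names and the `at_gc_separation` summary (the A/T and G/C gaps at one position) are illustrative assumptions rather than anything specified in the patent.

```python
from collections import Counter


def base_fractions_by_position(reads):
    """Fraction of A, C, G, T and N observed at each read position."""
    counters = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counters):
                counters.append(Counter())
            counters[i][base] += 1
    fractions = []
    for counter in counters:
        total = sum(counter.values())
        fractions.append({b: counter[b] / total for b in "ACGTN"})
    return fractions


def at_gc_separation(position_fractions):
    """Absolute A-vs-T and G-vs-C gaps at one position; large gaps suggest separation."""
    return (abs(position_fractions["A"] - position_fractions["T"]),
            abs(position_fractions["G"] - position_fractions["C"]))


# Hypothetical clean reads
fractions = base_fractions_by_position(["ACGT", "ACGA", "TCGT", "AGGT"])
```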
3. Alignment and quality control
(1) Alignment statistics: the filtered Clean Reads are aligned to the reference genome with the genome alignment software BWA (Li, et al., 2009); the details are shown in Table 4:
Table 4. Alignment statistics
In Table 4:
(1) Genome Length (bp): the size of the reference genome;
(2) Clean Reads: the number of reads after filtering;
(3) Clean Bases: the number of bases after filtering;
(4) Mapped Reads: the number of reads aligned to the reference genome;
(5) Mapping Rate (%): the percentage of reads aligned to the reference genome;
(6) Mean Depth: the average sequencing depth after keeping uniquely aligned reads, i.e. the data depth used for subsequent analysis;
(7) Coverage Rate (%): coverage, i.e. the proportion of the reference sequence that is sequenced at least once.
(2) Sequencing depth distribution: the distribution of sequencing depth reflects the uniformity of library construction and sequencing and the detail of genome coverage. The single-base depth distribution plot takes sequencing depth as the abscissa and the fraction of base positions at that depth as the ordinate, where Fraction of Bases = (number of base positions at a given depth / genome length); it reflects the genome coverage at a particular depth. Fig. 5 is the single-base depth distribution plot. The cumulative depth distribution plot takes sequencing depth as the abscissa and the fraction of base positions with depth greater than that depth as the ordinate, where Fraction of Bases = (number of base positions with depth greater than this depth / genome length); it reflects the genome coverage above a given depth. Fig. 6 is the cumulative depth distribution plot.
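The two Fraction of Bases definitions can be sketched as follows. The example assumes a list holding one depth value per reference position, so that the list length equals the genome length, and it follows the text literally by using "strictly greater than" for the cumulative curve. The function name and the depth values are hypothetical.

```python
from collections import Counter


def depth_distributions(depths):
    """Single-base and cumulative depth distributions from per-position depths.

    single[d]     = positions with depth exactly d / genome length
    cumulative[d] = positions with depth greater than d / genome length
    """
    if not depths:
        raise ValueError("depths must contain one value per reference position")
    length = len(depths)
    counts = Counter(depths)
    max_depth = max(depths)
    single = {d: counts.get(d, 0) / length for d in range(max_depth + 1)}
    cumulative = {}
    above = 0  # number of positions with depth strictly greater than d
    for d in range(max_depth, -1, -1):
        cumulative[d] = above / length
        above += counts.get(d, 0)
    return single, cumulative


# Hypothetical depths for an 8 bp reference
depths = [0, 1, 1, 2, 3, 3, 3, 5]
single, cumulative = depth_distributions(depths)
mean_depth = sum(depths) / len(depths)
coverage_rate = cumulative[0]  # fraction of positions sequenced at least once
```

Under this definition `cumulative[0]` coincides with the Coverage Rate of Table 4, since a depth greater than 0 means the position was sequenced at least once.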
4. Variant detection
(1) SNP detection and annotation: on the basis of the alignment to the reference genome sequence, the variant analysis software GATK (McKenna, et al., 2010) extracts all potential SNP sites in the whole genome; these are further filtered by factors such as quality value, depth and reproducibility to obtain a high-confidence SNP data set, which is then annotated.
SNP detection and annotation: after the SNPs have been detected and filtered with GATK, the software ANNOVAR (Wang, et al., 2010) and the existing genome annotation file (gff/gtf) are used to annotate the detected SNPs. The annotation results are stored in an Excel file; for an explanation of the annotation content, refer to the format description document.
SNP position distribution statistics: for the annotated SNPs, the distribution across the regions of the genome is counted, taking one sample as an example:
Table 5. SNP distribution statistics
In Table 5:
(1) Total: the total number of SNPs in the genome;
(2) UTR5: the number of SNPs located in the 5' UTR of a gene;
(3) UTR3: the number of SNPs located in the 3' UTR of a gene;
(4) UTR5;UTR3: the number of SNPs located in a region shared by the 5' UTR of one gene and the 3' UTR of another gene; other combined categories are read in the same way;
(5) exonic: the number of SNPs located in exon regions;
(6) splicing: the number of SNPs located in gene splicing regions (within 2 bp upstream of a splice site, i.e. non-exonic);
(7) exonic;splicing: the number of SNPs located in the exonic region of a gene within 2 bp of a splice site (downstream of the splice site);
(8) upstream: the number of SNPs located upstream of a gene (1000 bp);
(9) downstream: the number of SNPs located downstream of a gene (1000 bp);
(10) upstream;downstream: the number of SNPs located upstream or downstream of a gene (1000 bp);
(11) intronic: the number of SNPs located in intron regions;
(12) intergenic: the number of SNPs located in intergenic regions;
(13) ncRNA: RNA without a coding annotation, i.e. RNA that is not translated; see the gene annotation documentation of ANNOVAR; sub-region annotations are read as above;
(14) other: the number of SNPs located elsewhere.
SNP heterozygosity analysis: the SNPs detected and filtered with GATK (McKenna, et al., 2010) are divided into heterozygous and homozygous SNPs. Analysing the proportion of heterozygous SNPs in the genome helps to understand the species better and supports subsequent analysis. The proportions of homozygous and heterozygous SNPs in a sample are shown in Table 6, taking one sample as an example:
Table 6. Proportions of homozygous and heterozygous SNPs
In Table 6:
(1) Hom_genome: homozygous SNPs in the genome;
(2) Het_genome: heterozygous SNPs in the genome;
(3) Hom_exonic: homozygous SNPs in exons;
(4) Het_exonic: heterozygous SNPs in exons.
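As an illustration of how the homozygous and heterozygous counts might be tallied, the sketch below classifies VCF-style genotype strings such as `0/1` or `1/1`. The patent does not state the input format or the exact definition of the heterozygous proportion, so the genotype encoding and the ratio het / (het + hom) are assumptions, as are the function names.

```python
from collections import Counter


def classify_genotype(gt):
    """Classify a diploid VCF-style genotype string as het, hom_alt, hom_ref or missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


def het_hom_summary(genotypes):
    """Return (heterozygous count, homozygous-variant count, heterozygous proportion)."""
    counts = Counter(classify_genotype(gt) for gt in genotypes)
    het, hom = counts["het"], counts["hom_alt"]
    ratio = het / (het + hom) if het + hom else None
    return het, hom, ratio


# Hypothetical genotypes for one sample
summary = het_hom_summary(["0/1", "1/1", "0|1", "1/1", "./.", "0/0"])
```

Running the same function on the subset of SNPs annotated as exonic would give the Hom_exonic and Het_exonic counts.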
SNP mutation pattern distribution statistics: factors such as species and environment lead to differences in SNP mutation patterns. Counting the distribution of SNP mutation patterns reveals the mutation patterns characteristic of a particular species or type of sample, giving a more complete understanding and analysis of that species or sample. Table 7 gives the SNP mutation pattern statistics, taking one sample as an example:
Table 7. SNP mutation pattern distribution statistics
In Table 7:
(1) T-A: mutation of T to A (including A to T on the opposite strand);
(2) T-C: mutation of T to C (including A to G on the opposite strand);
(3) T-G: mutation of T to G (including A to C on the opposite strand);
(4) C-A: mutation of C to A (including G to T on the opposite strand);
(5) C-T: mutation of C to T (including G to A on the opposite strand);
(6) C-G: mutation of C to G (including G to C on the opposite strand).
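The folding of the twelve possible substitutions into the six strand-independent classes of Table 7 can be sketched as below. When the reference base is a purine (A or G), both alleles are complemented so that every class is reported from the pyrimidine (T or C) side. The function names and input tuples are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_class(ref, alt):
    """Map a substitution to one of the six classes T-A, T-C, T-G, C-A, C-T, C-G."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError("expected two different bases from A, C, G, T")
    if ref in "AG":
        # Report the change as seen on the opposite strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}-{alt}"


def mutation_spectrum(snps):
    """Count SNPs per class; snps is an iterable of (ref, alt) pairs."""
    return Counter(mutation_class(ref, alt) for ref, alt in snps)


# Hypothetical SNP calls
spectrum = mutation_spectrum([("A", "G"), ("T", "C"), ("G", "A"), ("C", "G")])
```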
Coding-region SNP functional annotation and statistics: SNP mutations in coding regions may affect the encoded amino acids and hence gene function. Mutations located in coding regions are classified according to whether they change the amino acid, for example as nonsynonymous or synonymous mutations. Nonsynonymous mutations usually change the corresponding amino acid and so alter gene function, while stopgain and stoploss mutations cause the premature appearance or the loss of a stop codon and are therefore also harmful mutations. Table 8 gives the functional distribution statistics of the SNPs, taking one sample as an example:
Table 8. SNP functional distribution statistics
In Table 8:
(1) Total: the sum of all mutations;
(2) nonsynonymous SNV: nonsynonymous mutation, in which the change of codon changes the encoded amino acid (SNV here is the same as SNP);
(3) synonymous SNV: synonymous mutation, in which the codon mutates to another codon encoding the same amino acid, so that the nucleotide change does not change the amino acid and therefore does not alter the gene product;
(4) stopgain: the change of codon creates a stop codon;
(5) stoploss: the change of codon removes a stop codon;
(6) unknown: unknown type.
The SNP functional statistics for the coding regions of the samples are plotted.
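The four functional classes of Table 8 can be reproduced for a single codon with the standard genetic code, as in the sketch below. In the workflow described here this annotation is done by ANNOVAR; the code only illustrates the classification logic. It assumes the reference and mutant codons are already known and in frame, and the function name is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snv(ref_codon, alt_codon):
    """Classify a single-base codon change using the category names of Table 8."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon not in CODON_TABLE or alt_codon not in CODON_TABLE:
        return "unknown"
    if sum(x != y for x, y in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("codons must differ at exactly one position")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous SNV"
    if alt_aa == "*":
        return "stopgain"
    if ref_aa == "*":
        return "stoploss"
    return "nonsynonymous SNV"
```

For example, GAA to GAG keeps glutamate and is synonymous, whereas TAC to TAA turns tyrosine into a stop codon and is a stopgain.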
(2) InDel detection and annotation: on the basis of the alignment to the reference genome sequence, the variant analysis software GATK (McKenna, et al., 2010) extracts all potential polymorphic InDel (Insertion and Deletion) sites in the whole genome; these are further filtered by factors such as quality value, depth and reproducibility to obtain a high-confidence InDel data set, which is then annotated.
InDel detection and annotation: after the InDels have been detected and filtered with GATK (McKenna, et al., 2010), the software ANNOVAR (Wang, et al., 2010) and the existing genome annotation file (gff/gtf) are used to annotate the detected InDels. The annotation results are stored in an Excel file; for the annotation content, refer to the format description document.
InDel position distribution statistics: after the InDels have been detected and filtered with GATK (McKenna, et al., 2010) and annotated with ANNOVAR (Wang, et al., 2010) and the existing genome annotation file (gff/gtf), their distribution across the regions of the genome is counted, taking one sample as an example, as shown in Table 9:
Table 9. InDel distribution statistics
In Table 9:
(1) Total: the total number of InDels in the genome;
(2) UTR5: the number of InDels located in the 5' UTR of a gene;
(3) UTR3: the number of InDels located in the 3' UTR of a gene;
(4) UTR5;UTR3: the number of InDels located in a region shared by the 5' UTR of one gene and the 3' UTR of another gene; other combined categories are read in the same way;
(5) exonic: the number of InDels located in exon regions;
(6) splicing: the number of InDels located in gene splicing regions (within 2 bp upstream of a splice site, i.e. non-exonic);
(7) exonic;splicing: the number of InDels located in the exonic region of a gene within 2 bp of a splice site (downstream of the splice site);
(8) upstream: the number of InDels located upstream of a gene (1000 bp);
(9) downstream: the number of InDels located downstream of a gene (1000 bp);
(10) upstream;downstream: the number of InDels located upstream or downstream of a gene (1000 bp);
(11) intronic: the number of InDels located in intron regions;
(12) intergenic: the number of InDels located in intergenic regions;
(13) ncRNA: RNA without a coding annotation, i.e. RNA that is not translated; see the gene annotation documentation of ANNOVAR; sub-region annotations are read as above;
(14) other: the number of InDels located elsewhere.
InDel mutation pattern distribution statistics: InDels of different lengths affect the genome to different degrees, and the distribution of InDels of different lengths differs markedly between the whole genome and the coding regions. Because of the conservation that coding regions require, 3-base InDels make up a larger proportion there than 2-base or 4-base InDels (3-base InDels are less likely to cause a frameshift). Table 10 gives the InDel mutation pattern statistics, taking one sample as an example:
Table 10. InDel mutation pattern statistics
In Table 10, the first column gives the length of the InDel, i.e. the length of the insertion or deletion; (1) Genome: the number of InDels of length n in the whole genome; (2) Exonic: the number of InDels of length n in the coding regions.
Coding-region InDel functional annotation and statistics: InDel mutations in coding regions may affect the encoded amino acids and hence gene function. Mutations located in coding regions are classified according to whether they change the amino acids, for example as frameshift or non-frameshift mutations. Frameshift mutations are usually more harmful than non-frameshift mutations, and stopgain and stoploss mutations are also harmful because they cause the premature appearance or the loss of a stop codon. Table 11 gives the functional annotation statistics of the InDels, taking one sample as an example:
Table 11. InDel functional statistics
In Table 11:
(1) Total: the sum of all mutations;
(2) frameshift: frameshift mutation; the number of bases deleted or inserted is not a multiple of 3, which shifts the reading frame and changes the coding of everything after this position;
(3) nonframeshift: non-frameshift mutation; the number of bases deleted or inserted is a multiple of 3;
(4) stopgain: the change of codon creates a stop codon;
(5) stoploss: the change of codon removes a stop codon;
(6) unknown: unknown type.
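The multiple-of-3 rule that separates frameshift from non-frameshift InDels, and the length tally behind Table 10, can be sketched as follows. The example assumes VCF-style reference and alternate allele strings and deliberately ignores the stopgain and stoploss categories, which require the codon context. The function names and example alleles are hypothetical.

```python
from collections import Counter


def indel_length(ref, alt):
    """Number of inserted or deleted bases, from VCF-style REF and ALT alleles."""
    return abs(len(alt) - len(ref))


def classify_coding_indel(ref, alt):
    """Frameshift if the length change is not a multiple of 3, otherwise nonframeshift."""
    length = indel_length(ref, alt)
    if length == 0:
        raise ValueError("alleles have equal length; not an InDel")
    return "nonframeshift" if length % 3 == 0 else "frameshift"


def length_distribution(indels):
    """Count InDels by length; indels is an iterable of (ref, alt) pairs."""
    return Counter(indel_length(ref, alt) for ref, alt in indels)


# Hypothetical InDel calls: a 3 bp insertion, a 2 bp deletion and a 1 bp insertion
calls = [("A", "ACGT"), ("ACG", "A"), ("A", "AT")]
lengths = length_distribution(calls)
```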
(3) SV detection and annotation: chromosomal structural variation (SV) is an important component of genome variation; the main variant types are insertion, deletion, inversion and so on. On the basis of the alignment to the reference genome sequence, the chromosomal structural variation analysis software DELLY (Tobias, et al., 2012) detects all potential SV sites in the whole genome; these are further filtered by factors such as quality value and number of supporting reads to obtain a high-confidence SV data set, which is then annotated.
SV detection and annotation: after the SVs have been detected and filtered with DELLY (Tobias, et al., 2012), the existing gene annotation file (gff/gtf) is used to annotate the detected SVs, and the annotation results are stored in an Excel file.
SV type statistics: the numbers of detected SVs of each type, such as chromosomal insertions, deletions and inversions, are counted. The results are shown in Table 12, taking one sample as an example:
Table 12. SV type statistics
In Table 12: (1) DEL: chromosomal deletion; (2) TRA: chromosomal translocation; (3) DUP: chromosomal duplication; (4) INV: chromosomal inversion; (5) INS: chromosomal insertion.
SV position distribution statistics: after detecting using DELLY (Tobias, et al., 2012) and SV is obtained by filtration, using
Some genome annotation files (gff/gtf) annotate the SV detected accordingly, the various elements of statistics SV covering
Distribution situation, by taking one of sample as an example, as shown in table 13.
13 SV position distribution table of table
In Table 13:
(1) Total: total number of SVs in the genome;
(2) UTR5: number of SVs located in the 5' UTR of a gene;
(3) UTR3: number of SVs located in the 3' UTR of a gene;
(4) UTR5;UTR3: number of SVs located in a region shared by the 5' UTR of one gene and the 3' UTR of another gene; other combined labels are read in the same way;
(5) exonic: number of SVs located in exonic regions;
(6) splicing: number of SVs located in gene splicing regions (the 2 bp region upstream of a splice site, i.e. not exonic);
(7) exonic;splicing: number of SVs located in a gene exon within 2 bp of a splice site (downstream of the splice site);
(8) upstream: number of SVs located upstream of a gene (1000 bp);
(9) downstream: number of SVs located downstream of a gene (1000 bp);
(10) upstream;downstream: number of SVs located upstream or downstream of a gene (1000 bp);
(11) intronic: number of SVs located in intronic regions;
(12) intergenic: number of SVs located in intergenic regions;
(13) ncRNA: RNA without a coding annotation, i.e. RNA that is not translated; see the gene annotation documentation of ANNOVAR; sub-region labels are read as above;
(14) other: number of SVs located at other positions.
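To make the region categories concrete, the following sketch classifies a single genomic position relative to one gene. It is a deliberately simplified illustration and not ANNOVAR's algorithm: it covers only the exonic, intronic, upstream, downstream, and intergenic labels, uses the 1000 bp flank stated above, and ignores UTR, splicing, and ncRNA labels. The function name and its parameters are hypothetical.

```python
from typing import List, Tuple


def classify_position(pos: int,
                      gene_start: int,
                      gene_end: int,
                      exons: List[Tuple[int, int]],
                      strand: str = "+",
                      flank: int = 1000) -> str:
    """Label a 1-based position relative to one gene (simplified categories)."""
    if gene_start <= pos <= gene_end:
        for exon_start, exon_end in exons:
            if exon_start <= pos <= exon_end:
                return "exonic"
        return "intronic"  # inside the gene but outside every exon
    if gene_start - flank <= pos < gene_start:
        # Lower coordinates are upstream only for genes on the + strand.
        return "upstream" if strand == "+" else "downstream"
    if gene_end < pos <= gene_end + flank:
        return "downstream" if strand == "+" else "upstream"
    return "intergenic"
```

A real annotation run would evaluate every gene near the variant and combine labels, which is how composite categories such as `upstream;downstream` arise.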
SV length distribution statistics: studying the length characteristics of SVs helps in understanding the complexity and
degree of variation of the species' genome.
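A length distribution of this kind can be tabulated by binning SV lengths. The sketch below is only an illustration: the bin edges are hypothetical (the text specifies none), and lengths are assumed to be given in base pairs, with negative values (as sometimes used for deletions) treated by absolute value.

```python
from bisect import bisect_right
from typing import Dict, Iterable, Sequence


def sv_length_histogram(lengths: Iterable[int],
                        edges: Sequence[int] = (100, 1000, 10000, 100000)
                        ) -> Dict[str, int]:
    """Count SVs per length bin; `edges` are hypothetical, ascending bin boundaries."""
    labels = ([f"<{edges[0]}"]
              + [f"{lo}-{hi - 1}" for lo, hi in zip(edges, edges[1:])]
              + [f">={edges[-1]}"])
    counts = [0] * len(labels)
    for length in lengths:
        # bisect_right gives the index of the bin that contains the length.
        counts[bisect_right(edges, abs(length))] += 1
    return dict(zip(labels, counts))
```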
5. Gene function annotation: all genes are functionally annotated; the annotation file format is described in Table 14.
Table 14. Gene function annotation file format description
For matters not described exhaustively in the embodiments of the present invention, those skilled in the art may select from the prior art.
The above discloses only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto.
Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall
fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be defined by the scope of protection of the
claims.
Claims (9)
1. A sheep whole-genome resequencing analysis method, characterized in that the method comprises the following steps:
(1) obtaining sheep DNA, testing the purity, concentration, and volume of the DNA, performing library preparation and library quality inspection on samples that pass the testing, and sequencing the libraries that pass quality inspection to obtain raw sheep sequencing data;
(2) performing data filtering on the raw sheep sequencing data and assessing sequencing quality, and obtaining target analysis sequence data after the data pass quality control;
(3) aligning the target analysis sequence data to the sheep reference genome, and obtaining aligned data after the alignment metrics pass quality control;
(4) detecting single-nucleotide polymorphism (SNP) variants, small insertion/deletion (InDel) variants, and chromosomal structural variants (SV) in the aligned data, and annotating them, to obtain SNP data information, InDel data information, and SV data information for the sheep whole-genome sequencing sequences.
2. The sheep whole-genome resequencing analysis method according to claim 1, characterized in that the detailed process of the data filtering is:
(1) removing adapter-contaminated sequences; wherein a sequence is adapter-contaminated when more than 5 bp of its bases are adapter contamination, and, for paired-end sequencing, if one end is adapter-contaminated, the sequences at both ends are removed;
(2) removing low-quality sequences; wherein a sequence is low-quality when bases with quality value Q≤19 account for 50% or more of its total bases, and, for paired-end sequencing, if one end is a low-quality sequence, the sequences at both ends are removed;
(3) removing reads with an N content greater than 5%; wherein, for paired-end sequencing, if the N content of one end is greater than 5%, the sequences at both ends are removed.
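The three filtering criteria above can be sketched as a paired-end read filter. This is a minimal illustration under stated assumptions, not the claimed implementation: quality values are assumed to be already decoded to integer Phred scores, the adapter sequence `ADAPTER` is a commonly used Illumina adapter prefix chosen here only as a placeholder, and adapter contamination is measured with a simple, hypothetical rule (a full adapter match anywhere, or the longest adapter prefix at the 3' end of the read).

```python
from typing import List, Tuple

ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter; the text does not name one

Read = Tuple[str, List[int]]  # (bases, Phred quality scores)


def adapter_bases(seq: str, adapter: str) -> int:
    """Number of adapter bases in the read under the simplified rule above."""
    if adapter in seq:
        return len(adapter)
    for k in range(min(len(seq), len(adapter)), 0, -1):
        if seq.endswith(adapter[:k]):
            return k
    return 0


def read_fails(seq: str, quals: List[int], adapter: str = ADAPTER) -> bool:
    """Apply the three criteria: adapter > 5 bp, Q<=19 in >= 50% of bases, N > 5%."""
    n = len(seq)
    if n == 0:
        return True
    if adapter_bases(seq, adapter) > 5:
        return True
    if sum(q <= 19 for q in quals) / n >= 0.5:
        return True
    if seq.upper().count("N") / n > 0.05:
        return True
    return False


def keep_pair(read1: Read, read2: Read, adapter: str = ADAPTER) -> bool:
    """A pair is kept only if neither mate fails any criterion."""
    return not (read_fails(*read1, adapter) or read_fails(*read2, adapter))
```

Dropping both mates when either one fails keeps the two FASTQ files of a paired-end run in step with each other.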
3. The sheep whole-genome resequencing analysis method according to claim 1, characterized in that assessing the sequencing
quality comprises assessing the quality distribution information of the data and the base distribution information of the data; wherein the quality distribution information comprises
statistics of the base sequencing error rate and the base correct recognition rate.
4. The sheep whole-genome resequencing analysis method according to claim 1, characterized in that the target
analysis sequence data is aligned to the sheep reference genome using the software BWA; the alignment rate is 90.2%-99.1%.
5. The sheep whole-genome resequencing analysis method according to claim 1, characterized in that the alignment metric
quality control comprises sequencing depth distribution information; wherein the sequencing depth distribution information comprises per-base depth distribution information and
cumulative depth distribution information.
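The two depth summaries named above can be computed from a list of per-base depths, as in the following sketch. It assumes the depths have already been extracted from the alignment (one integer per reference position), and it interprets the cumulative distribution as the fraction of bases covered at depth d or higher; that interpretation is an assumption, since the text does not define it.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def depth_distributions(depths: Sequence[int]
                        ) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Return (per-base depth fractions, fraction of bases with depth >= d)."""
    total = len(depths)
    if total == 0:
        return {}, {}
    hist = Counter(depths)
    single = {d: hist[d] / total for d in sorted(hist)}
    cumulative: Dict[int, float] = {}
    running = 0
    for d in sorted(hist, reverse=True):
        running += hist[d]  # bases at this depth or deeper
        cumulative[d] = running / total
    return single, cumulative
```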
6. The sheep whole-genome resequencing analysis method according to claim 1, characterized in that the SNP data
information is obtained by using the variant analysis software GATK to extract, from the aligned data, all potential
SNP sites in the whole genome, then filtering and screening them further according to quality value, depth, and reproducibility to obtain a high-confidence SNP
data set, and annotating it; then counting the distribution of the SNPs in the data set across the regions of the genome, analyzing the
heterozygosity ratio of SNPs in the genome, counting the distribution of SNP mutation patterns, and counting the functional classification
of SNP mutations in coding regions.
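Two of the SNP statistics named above, the heterozygosity ratio and the mutation-pattern distribution, can be sketched as follows. The definitions used here are assumptions made for illustration: the heterozygosity ratio is taken as heterozygous SNPs divided by all SNPs, and mutation patterns are summarized as transitions versus transversions. Each input record is a hypothetical `(ref, alt, genotype)` tuple for a biallelic variant site.

```python
from typing import Dict, Iterable, Tuple

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def snp_summary(snps: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Summarize heterozygosity and transition/transversion counts for SNPs."""
    het = total = ti = tv = 0
    for ref, alt, genotype in snps:
        total += 1
        alleles = genotype.replace("|", "/").split("/")
        if len(set(alleles)) > 1:
            het += 1  # e.g. 0/1 carries two different alleles
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return {
        "total": total,
        "het_ratio": het / total if total else 0.0,
        "transitions": ti,
        "transversions": tv,
    }
```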
7. The sheep whole-genome resequencing analysis method according to claim 1, characterized in that the InDel data
information is obtained by using the variant analysis software GATK to extract, from the aligned data, all potential InDel sites in the whole genome,
then filtering and screening them further according to quality value, depth, and reproducibility to obtain a high-confidence
InDel data set, and annotating it; then counting the distribution of the InDels in the data set across the regions of the genome,
counting the distribution of InDel mutation patterns, and counting the functional classification of InDel mutations in coding regions.
8. The sheep whole-genome resequencing analysis method according to claim 1, characterized in that the SV data
information is obtained by using the chromosomal structural variation analysis software DELLY to extract, from the aligned data,
all potential SV sites in the whole genome, then filtering and screening them further according to quality value, depth, and reproducibility to obtain a high-confidence
SV data set, and annotating it; then counting the variation types of the SVs in the data set and the distribution of each type
across the regions of the genome, counting the positional distribution of SVs in the genome, and counting the length distribution of SVs.
9. The sheep whole-genome resequencing analysis method according to claim 1, characterized in that the method comprises
step (5): performing functional annotation on all genes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910448101.8A CN110189796A (en) | 2019-05-27 | 2019-05-27 | A kind of sheep full-length genome resurveys sequence analysis method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110189796A (en) | 2019-08-30 |
Family
ID=67718171
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910448101.8A Pending CN110189796A (en) | 2019-05-27 | 2019-05-27 | A kind of sheep full-length genome resurveys sequence analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110189796A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110093406A (en) * | 2019-05-27 | 2019-08-06 | 新疆农业大学 | A kind of argali and its filial generation gene research method |
CN111676270A (en) * | 2020-07-09 | 2020-09-18 | 四川省自然资源科学研究院 | Method for screening polymorphic SNP molecular marker, polymorphic SNP molecular marker and primer pair |
CN111755068A (en) * | 2020-06-19 | 2020-10-09 | 深圳吉因加医学检验实验室 | Method and device for identifying tumor purity and absolute copy number based on sequencing data |
CN113005189A (en) * | 2021-04-16 | 2021-06-22 | 中国农业科学院兰州畜牧与兽药研究所 | Method for assembling and annotating Guide black fur sheep genome based on third-generation PacBio and Hi-C technology |
CN113122642A (en) * | 2021-04-16 | 2021-07-16 | 中国农业科学院兰州畜牧与兽药研究所 | Method for assembling and annotating Hu sheep genome based on third-generation PacBio and Hi-C technology |
CN113555062A (en) * | 2021-07-23 | 2021-10-26 | 哈尔滨因极科技有限公司 | Data analysis system and analysis method for genome base variation detection |
CN113628685A (en) * | 2021-07-27 | 2021-11-09 | 广东省农业科学院水稻研究所 | Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data |
CN114974416A (en) * | 2022-07-15 | 2022-08-30 | 深圳雅济科技有限公司 | Method and device for detecting adjacent polynucleotide variation |
CN116434837A (en) * | 2023-06-12 | 2023-07-14 | 广州盛安医学检验有限公司 | Chromosome balance translocation detection analysis system based on NGS |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103184275A (en) * | 2011-12-29 | 2013-07-03 | 天津农学院 | Novel method for gene identification of rice genome |
CN105653893A (en) * | 2015-12-25 | 2016-06-08 | 北京百迈客生物科技有限公司 | Genome re-sequencing analysis system and method |
CN106021979A (en) * | 2016-05-12 | 2016-10-12 | 北京百迈客云科技有限公司 | Analysis system and method for human genome re-sequencing data |
CN107194204A (en) * | 2017-05-22 | 2017-09-22 | 人和未来生物科技(长沙)有限公司 | A kind of sequencing data of whole genome calculates deciphering method |
- 2019-05-27: CN application CN201910448101.8A filed, published as CN110189796A (en), status Pending
Non-Patent Citations (2)
Title |
---|
兰蓉等: "云南黑山羊全基因组重测序", 《草食家畜》 * |
李晓凯等: "全基因组测序在重要家畜上的研究进展", 《生物技术通报》 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110093406A (en) * | 2019-05-27 | 2019-08-06 | 新疆农业大学 | A kind of argali and its filial generation gene research method |
CN111755068A (en) * | 2020-06-19 | 2020-10-09 | 深圳吉因加医学检验实验室 | Method and device for identifying tumor purity and absolute copy number based on sequencing data |
CN111755068B (en) * | 2020-06-19 | 2021-02-19 | 深圳吉因加医学检验实验室 | Method and device for identifying tumor purity and absolute copy number based on sequencing data |
CN111676270B (en) * | 2020-07-09 | 2023-07-25 | 四川省自然资源科学研究院 | Screening method of polymorphic SNP molecular markers, polymorphic SNP molecular markers and primer pair |
CN111676270A (en) * | 2020-07-09 | 2020-09-18 | 四川省自然资源科学研究院 | Method for screening polymorphic SNP molecular marker, polymorphic SNP molecular marker and primer pair |
CN113005189A (en) * | 2021-04-16 | 2021-06-22 | 中国农业科学院兰州畜牧与兽药研究所 | Method for assembling and annotating Guide black fur sheep genome based on third-generation PacBio and Hi-C technology |
CN113122642A (en) * | 2021-04-16 | 2021-07-16 | 中国农业科学院兰州畜牧与兽药研究所 | Method for assembling and annotating Hu sheep genome based on third-generation PacBio and Hi-C technology |
CN113555062A (en) * | 2021-07-23 | 2021-10-26 | 哈尔滨因极科技有限公司 | Data analysis system and analysis method for genome base variation detection |
CN113628685B (en) * | 2021-07-27 | 2022-03-15 | 广东省农业科学院水稻研究所 | Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data |
CN113628685A (en) * | 2021-07-27 | 2021-11-09 | 广东省农业科学院水稻研究所 | Whole genome correlation analysis method based on multiple genome comparisons and second-generation sequencing data |
CN114974416A (en) * | 2022-07-15 | 2022-08-30 | 深圳雅济科技有限公司 | Method and device for detecting adjacent polynucleotide variation |
CN114974416B (en) * | 2022-07-15 | 2023-04-07 | 深圳雅济科技有限公司 | Method and device for detecting adjacent polynucleotide variation |
CN116434837A (en) * | 2023-06-12 | 2023-07-14 | 广州盛安医学检验有限公司 | Chromosome balance translocation detection analysis system based on NGS |
CN116434837B (en) * | 2023-06-12 | 2023-08-29 | 广州盛安医学检验有限公司 | Chromosome balance translocation detection analysis system based on NGS |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110189796A (en) | A kind of sheep full-length genome resurveys sequence analysis method | |
CN113744807B (en) | Macrogenomics-based pathogenic microorganism detection method and device | |
CN102770558B (en) | The analysis of Fetal genome is carried out by maternal biological sample | |
EP2926288B1 (en) | Accurate and fast mapping of targeted sequencing reads | |
CN109994154B (en) | Screening device for candidate pathogenic genes of monogenic recessive genetic disease | |
US20130166221A1 (en) | Method and system for sequence correlation | |
CN111755072B (en) | Method and device for simultaneously detecting methylation level, genome variation and insertion fragment | |
CN108319813A (en) | Circulating tumor DNA copies the detection method and device of number variation | |
CN110093406A (en) | A kind of argali and its filial generation gene research method | |
CN108304694B (en) | Method for analyzing gene mutation based on second-generation sequencing data | |
CN112349346A (en) | Method for detecting structural variations in genomic regions | |
CN112233722B (en) | Variety identification method, and method and device for constructing prediction model thereof | |
CN113362889A (en) | Genome structure variation annotation method | |
CN115083521A (en) | Method and system for identifying tumor cell group in single cell transcriptome sequencing data | |
CN115083529A (en) | Method and device for detecting sample pollution rate | |
CN117095746A (en) | GBS whole genome association analysis method for buffalo | |
CN109461473B (en) | Method and device for acquiring concentration of free DNA of fetus | |
CN111243665A (en) | Analysis method and system for ribosome imprinting sequencing data | |
CN107217091A (en) | A kind of detection method of milch goat Fecundity Trait related gene SNP | |
CN108728515A (en) | A kind of analysis method of library construction and sequencing data using the detection ctDNA low frequencies mutation of duplex methods | |
CN117275577A (en) | Algorithm for detecting human mitochondrial genetic mutation sites based on second-generation sequencing technology | |
CN114530200B (en) | Mixed sample identification method based on calculation of SNP entropy | |
CN104232649A (en) | Genetic mutant and application of genetic mutant | |
Roy et al. | NGS-μsat: bioinformatics framework supporting high throughput microsatellite genotyping from next generation sequencing platforms | |
CN116469462A (en) | Ultra-low frequency DNA mutation identification method and device based on double sequencing |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190830 |