CN110189796A

CN110189796A - Sheep whole-genome resequencing analysis method

Info

Publication number: CN110189796A
Application number: CN201910448101.8A
Authority: CN
Inventors: 依明·苏来曼; 阿布来提·苏来曼; 决肯·阿尼瓦什; 刘武军; 黄锡霞; 黄李勇; 赵雄
Original assignee: Xinjiang Agricultural University
Current assignee: Xinjiang Agricultural University
Priority date: 2019-05-27
Filing date: 2019-05-27
Publication date: 2019-08-30

Abstract

The invention discloses a sheep whole-genome resequencing analysis method and relates to the field of gene technology. The method comprises the following steps: (1) obtaining sheep DNA, testing its purity, concentration and volume, performing library preparation and library quality inspection on samples that pass testing, and sequencing the libraries that pass quality inspection to obtain raw sheep sequencing data; (2) filtering the raw sheep sequencing data and assessing sequencing quality, and obtaining the target analysis sequence data once the data pass quality control; (3) aligning the target analysis sequence data to the sheep reference genome, and obtaining the aligned data once the alignment metrics pass quality control; (4) detecting and annotating single-nucleotide polymorphism (SNP) variants, small insertion/deletion (InDel) variants and chromosomal structural variants (SV) in the aligned data, to obtain the SNP, InDel and SV data information of the sheep whole-genome sequencing sequence.

Description

Sheep whole-genome resequencing analysis method

Technical field

The present invention relates to the field of gene technology, and in particular to a sheep whole-genome resequencing analysis method.

Background technique

DNA is an important substance in living organisms. It carries genetic information in the form of genes and serves as the template for gene replication and transcription, and it plays an important role in the growth, differentiation and development of cells and in the metabolism and disease occurrence of individual organisms. All of the genetic information carried by a cell is collectively called the genome.

Whole-genome resequencing is based on the Illumina sequencing platform. The whole genome of an individual or population of a species that has a reference genome sequence is sequenced, and high-performance computing platforms and bioinformatics methods are used to detect polymorphism information such as single-nucleotide polymorphism (SNP) sites and insertions/deletions (InDel). This yields the genetic characteristics of the organism, which supports subsequent genetic evolution analysis and the prediction of candidate genes related to important traits, and provides important guidance for research such as molecular breeding of the species. However, no more detailed and specific sequencing method, together with an effective analysis, has yet been available for the sheep whole genome.

Summary of the invention

In view of this, embodiments of the present invention provide a sheep whole-genome resequencing analysis method.

To achieve the above object, the present invention mainly provides the following technical solution:

In one aspect, an embodiment of the present invention provides a sheep whole-genome resequencing analysis method, the method comprising the steps of:

(1) obtaining sheep DNA, testing the purity, concentration and volume of the DNA, performing library preparation and library quality inspection on samples that pass testing, and sequencing the libraries that pass quality inspection to obtain raw sheep sequencing data;

(2) filtering the raw sheep sequencing data and assessing sequencing quality, and obtaining the target analysis sequence data once the data pass quality control;

(3) aligning the target analysis sequence data to the sheep reference genome, and obtaining the aligned data once the alignment metrics pass quality control;

(4) detecting and annotating the single-nucleotide polymorphism (SNP) variants, small insertion/deletion (InDel) variants and chromosomal structural variants (SV) in the aligned data, to obtain the SNP data information, InDel data information and SV data information of the sheep whole-genome sequencing sequence.

Preferably, the data filtering proceeds as follows:

(1) removing adapter-contaminated sequences; a sequence is treated as contaminated when more than 5 bp of it are adapter bases, and for paired-end sequencing, if one end is adapter-contaminated, the sequences at both ends are removed;

(2) removing low-quality sequences; a sequence is treated as low quality when bases with quality value Q ≤ 19 account for 50% or more of its total bases, and for paired-end sequencing, if one end is a low-quality sequence, the sequences at both ends are removed;

(3) removing reads whose N content is greater than 5%; for paired-end sequencing, if the N content of one end is greater than 5%, the sequences at both ends are removed.
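The three filtering rules above can be sketched in Python as follows. This is only an illustrative sketch, not the tool used in the embodiment: the `Read` class, its `adapter_bases` field (assumed to be filled in by a separate adapter scan) and the function names are hypothetical, and quality strings are assumed to use the Phred+33 encoding.

```python
from dataclasses import dataclass


@dataclass
class Read:
    seq: str            # base sequence
    qual: str           # quality string, assumed Phred+33 encoded
    adapter_bases: int  # hypothetical: adapter bases reported by an upstream scan


def fails_filter(read, max_adapter=5, low_q=19, low_q_frac=0.5, max_n_frac=0.05):
    """Return True if a single read violates any of the three rules."""
    n = len(read.seq)
    if n == 0:
        return True
    # Rule 1: more than 5 bp of adapter contamination.
    if read.adapter_bases > max_adapter:
        return True
    # Rule 2: bases with Q <= 19 make up 50% or more of the read.
    low = sum(1 for ch in read.qual if ord(ch) - 33 <= low_q)
    if low / n >= low_q_frac:
        return True
    # Rule 3: N content greater than 5%.
    if read.seq.upper().count("N") / n > max_n_frac:
        return True
    return False


def keep_pair(read1, read2):
    """Paired-end rule: if either end fails, both ends are discarded."""
    return not (fails_filter(read1) or fails_filter(read2))
```

Because the pair is dropped as a unit, the two output files of a paired-end run stay synchronized, which downstream aligners generally expect.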

Preferably, assessing sequencing quality comprises assessing the quality distribution information and the base distribution information of the data, wherein the quality distribution information comprises statistics on the base sequencing error rate and the correct base-calling rate.

Preferably, the target analysis sequence data are aligned to the sheep reference genome using the software BWA; the alignment rate is 90.2%-99.1%.

Preferably, the alignment metric quality control comprises sequencing depth distribution information, wherein the sequencing depth distribution information comprises single-base depth distribution information and cumulative depth distribution information.

Preferably, the SNP data information is obtained as follows: on the basis of the aligned data, all potential SNP sites in the whole genome are extracted with the variant analysis software GATK and further filtered according to quality value, depth and reproducibility, giving a high-confidence SNP data set, which is annotated; statistics are then compiled on the distribution of the SNPs in the data set across the regions of the genome, the heterozygosity ratio of SNPs in the genome, the distribution of SNP mutation patterns, and the functional classification of SNP mutations in coding regions.

Preferably, the InDel data information is obtained as follows: on the basis of the aligned data, all potential InDel sites in the whole genome are extracted with the variant analysis software GATK and further filtered according to quality value, depth and reproducibility, giving a high-confidence InDel data set, which is annotated; statistics are then compiled on the distribution of the InDels in the data set across the regions of the genome, the distribution of InDel mutation patterns, and the functional classification of InDel mutations in coding regions.

Preferably, the SV data information is obtained as follows: on the basis of the aligned data, all potential SV sites in the whole genome are extracted with the chromosomal structural variation analysis software DELLY and further filtered according to quality value, depth and reproducibility, giving a high-confidence SV data set, which is annotated; statistics are then compiled on the variant types of the SVs in the data set and the distribution of each type across the regions of the genome, the positional distribution of SVs in the genome, and the length distribution of SVs.

Preferably, the method further comprises step (5): performing functional annotation on all genes.

Compared with prior art, the beneficial effects of the present invention are:

The present invention carries out a detailed sequencing procedure for the sheep whole genome and an effective analysis of the sequencing results, identifying a large number of single-nucleotide polymorphism (SNP) sites, insertion/deletion sites, structural variation sites and copy-number variation sites. Using bioinformatics, the invention analyzes the structural differences between the genomes of individual sheep, and the method can provide a scientific basis for the study of sequence differences and structural variation in sheep.

Detailed description of the invention

Fig. 1 is the experimental flow chart provided by an embodiment of the present invention;

Fig. 2 is the resequencing information analysis flow chart provided by an embodiment of the present invention;

Fig. 3 is an example of the FASTQ file format provided by an embodiment of the present invention;

Fig. 4 is the sample quality value distribution provided by an embodiment of the present invention;

Fig. 5 is the sample single-base depth distribution provided by an embodiment of the present invention;

Fig. 6 is the sample cumulative depth distribution provided by an embodiment of the present invention.

Specific embodiment

To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, the specific embodiments, technical solutions, features and effects of the invention are described in detail below with reference to preferred embodiments. Particular features, structures or characteristics of the embodiments described below may be combined in any suitable form.

Embodiment 1 (sheep whole-genome resequencing method and analysis)

The experimental flow is shown in Fig. 1, and the resequencing information analysis flow is shown in Fig. 2.

1. Sheep sample information: to make clear the relationship between the sample names used and the original sample information, the sample information is listed in Table 1, as follows:

Table 1 Sample information summary

(1) Sample ID: original sample name;

(2) Sample Name: sample name used in the analysis results;

(3) Sample Description: original sample description.

2. Data filtering: after samples are received, first, the DNA provided, or the DNA extracted from the samples provided, is tested for purity, concentration, volume and so on; second, library preparation and library quality inspection are performed on samples that pass testing. For library preparation, the genomic DNA of the sample is extracted and randomly fragmented, DNA fragments of the required length are recovered by electrophoresis, and adapters and primers are added to prepare the required library; finally, libraries that pass quality inspection are sequenced on the instrument. The experimental flow is shown in Fig. 1.

(1) Raw sequencing data: the raw image data files obtained by high-throughput sequencing (Illumina) are converted into sequenced reads (Sequenced Reads) by CASAVA base calling (Base Calling). The results are stored in FASTQ (fq for short) file format and are called Raw Reads. A FASTQ file contains the name of each sequenced read (Read), its base sequence and the corresponding sequencing quality information. In a FASTQ file each base corresponds to one base-quality character, and the ASCII code of that character minus 33 is the sequencing quality score (Phred Quality Score) of the base. Different Phred Quality Scores represent different base sequencing error rates; for example, Phred Quality Scores of 20 and 30 indicate base sequencing error rates of 1% and 0.1% respectively. An example of the FASTQ format is shown in Fig. 3.

In Fig. 3: (1) the first line begins with "@", followed by the Illumina sequence identifier (Sequence Identifiers) and descriptive text (optional); (2) the second line is the base sequence; (3) the third line begins with "+", followed by the Illumina sequence identifier (optional); (4) the fourth line is the sequencing quality of the corresponding bases: the ASCII value of each character in this line minus 33 is the sequencing quality value of the corresponding base in the second line.
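The quality encoding described above can be illustrated with a short Python sketch. The function names are hypothetical; the conversion from score to error probability uses the standard Phred relation Q = -10·log10(e), which agrees with the Q20 = 1% and Q30 = 0.1% values stated in the text.

```python
def phred_scores(quality_line, offset=33):
    """Decode the fourth FASTQ line: ASCII value of each character minus 33."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Convert a Phred quality score into a base-calling error probability."""
    return 10 ** (-q / 10)


# Example: the characters '5', '?' and 'I' have ASCII codes 53, 63 and 73.
scores = phred_scores("5?I")
errors = [error_probability(q) for q in scores]
```

For the example string the decoded scores are 20, 30 and 40, i.e. error rates of 1%, 0.1% and 0.01%.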

Some of the raw sequencing reads obtained contain adapters or are of low quality. To guarantee the quality of the information analysis, the raw reads must be filtered to obtain Clean Reads; all subsequent analysis is based on the Clean Reads.

The data processing steps are as follows:

(1) removing adapter-contaminated Reads (more than 5 bp of adapter bases in the Read; for paired-end sequencing, if one end is adapter-contaminated, the Reads at both ends are removed);

(2) removing low-quality Reads (bases with quality value Q ≤ 19 account for 50% or more of the total bases of the Read; for paired-end sequencing, if one end is a low-quality Read, the Reads at both ends are removed);

(3) removing Reads with an N content greater than 5% (for paired-end sequencing, if the N content of one end is greater than 5%, the Reads at both ends are removed). The data filtering statistics are shown in Table 2:

Table 2 Data filtering statistics

In Table 2: (1) Raw Reads Number: the number of raw, unfiltered sequencing reads; (2) Raw Bases Number: the number of raw, unfiltered bases; (3) Clean Reads Number: the number of reads remaining after filtering; (4) Clean Reads Rate (%): the proportion of reads remaining after filtering; the larger this value, the better the sequencing quality or library quality; (5) Clean Bases Number: the number of bases remaining after filtering; (6) Low-quality Reads Number: the number of reads removed by the low-quality filter; (7) Low-quality Reads Rate (%): the proportion of reads removed by the low-quality filter; (8) Adapter polluted Reads Number: the number of reads removed for adapter contamination; (9) Adapter polluted Reads Rate (%): the proportion of reads removed for adapter contamination; (10) Raw Q30 Bases Rate (%): before filtering, the proportion of all bases in the raw reads with a quality value greater than 30 (error rate less than 0.1%); (11) Clean Q30 Bases Rate (%): after filtering, the proportion of all bases with a quality value greater than 30 (error rate less than 0.1%); (12) Ns Reads Number: the number of Reads with an N-base content greater than 5%; (13) Ns Reads Rate (%): the proportion of Raw Reads that have an N-base content greater than 5%.

(2) Quality distribution of the data: the sequencing error rate is related to base quality and is jointly affected by multiple factors such as the sequencer itself, the sequencing reagents and the sample. The sequencing error rate of each base is obtained by converting the sequencing Phred value (Phred score, Qphred) with a formula, and the Phred value is calculated during base calling (Base Calling) by a model that predicts the probability of a base-calling error; the correspondence is shown in Table 3.

Table 3 Concise mapping between Illumina CASAVA base calling and Phred score

To give a rough picture of the stability of sequencing quality over the course of the run, the base position in the Clean Reads is taken as the abscissa and the average sequencing quality at each position as the ordinate, giving the sequencing quality distribution of each sample shown in Fig. 4. In Fig. 4 most positions are above Q30, so the data quality is relatively high. The quality at the start of a Read is often somewhat low because of sequencing lag, and the quality toward the end of a Read is often also somewhat low because the sequencing reagents are depleted.

(3) Base distribution of the data: the Clean Data base content distribution reflects the proportion of each base at each position of the Clean Reads and is used to check for AT or GC separation. Such separation may be introduced by sequencing or library construction and affects subsequent quantitative analysis. Except for libraries with strong specificity, such as PCR libraries, the distribution for a given species should be basically consistent with the GC distribution of the reference genome.
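As a rough illustration of the per-position base content check described above, the following Python sketch tabulates the fraction of each base at every read position. The function name and the in-memory list of read strings are assumptions made for the example; the embodiment does not state which tool produces this distribution.

```python
from collections import Counter


def base_content_by_position(reads):
    """Return, for each read position, the fraction of A, C, G, T and N."""
    length = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read.upper()):
            counts[i][base] += 1
    fractions = []
    for position_counts in counts:
        total = sum(position_counts.values())
        fractions.append({b: position_counts[b] / total for b in "ACGTN"})
    return fractions


# Hypothetical toy input: three 4-base reads.
profile = base_content_by_position(["ACGT", "ACGA", "TCGN"])
```

In a well-balanced library the A and T curves, and the G and C curves, would be expected to track each other closely along the read; a sustained gap between them is the separation the text refers to.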

3. Alignment and quality control

(1) Alignment statistics: using the genome alignment software BWA (Li, et al., 2009), the filtered Clean Reads are aligned to the reference genome; details are shown in Table 4:

Table 4 Alignment statistics

In Table 4: (1) Genome Length (bp): size of the reference genome; (2) Clean Reads: the number of Reads after filtering; (3) Clean Bases: the number of bases after filtering; (4) Mapped Reads: the number of Reads aligned to the reference genome; (5) Mapping Rate (%): the percentage of Reads aligned to the reference genome; (6) Mean Depth: the average sequencing depth after keeping uniquely aligned Reads, i.e. the data depth used for subsequent analysis; (7) Coverage Rate (%): coverage, i.e. the proportion of the reference sequence that is sequenced at least once.

(2) Sequencing depth distribution: the distribution of sequencing depth reflects the uniformity of library construction and sequencing and the details of genome coverage. The single-base depth distribution takes sequencing depth as the abscissa and the fraction of base positions at that depth as the ordinate, where Fraction of Bases = (number of base positions at a given depth / genome length); it reflects the genome coverage at a particular depth. Fig. 5 is the single-base depth distribution. The cumulative depth distribution takes sequencing depth as the abscissa and the fraction of base positions above that depth as the ordinate, where Fraction of Bases = (number of base positions with depth greater than this depth / genome length); it reflects the genome coverage above a particular depth. Fig. 6 is the cumulative depth distribution.
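The two Fraction of Bases definitions above can be sketched in Python as follows. The function name and the input format (a list of per-position depths, with unlisted positions treated as depth 0) are assumptions for illustration; in practice the per-position depths would come from the alignment files.

```python
from collections import Counter


def depth_distributions(depths, genome_length):
    """Return (single_base, cumulative) depth distributions.

    single_base[d] = positions with depth exactly d / genome length
    cumulative[d]  = positions with depth greater than d / genome length
    """
    hist = Counter(depths)
    # Positions not listed in `depths` are assumed to be uncovered (depth 0).
    hist[0] += genome_length - len(depths)
    max_depth = max(hist)
    single_base = {d: hist[d] / genome_length for d in range(max_depth + 1)}
    cumulative = {}
    running = 0  # number of positions with depth strictly above the current d
    for d in range(max_depth, -1, -1):
        cumulative[d] = running / genome_length
        running += hist[d]
    return single_base, cumulative


# Hypothetical toy genome of 10 positions, 8 of which have a recorded depth.
single_base, cumulative = depth_distributions([0, 1, 1, 2, 3, 3, 3, 5], 10)
```

With this definition, `cumulative[0]` is the fraction of positions sequenced at least once, which matches the Coverage Rate described for Table 4.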

4. Variant detection

(1) SNP variant detection and annotation: on the basis of alignment to the reference genome sequence, the variant analysis software GATK (McKenna, et al., 2010) is used to extract all potential SNP sites in the whole genome, which are then further filtered according to factors such as quality value, depth and reproducibility to obtain a high-confidence SNP data set, which is annotated.

SNP detection and annotation: after SNPs are detected and filtered with GATK, the ANNOVAR software (Wang, et al., 2010) and the existing genome annotation file (gff/gtf) are used to annotate the detected SNPs. The annotation results are stored in an Excel file; for the specific annotation content, refer to the format description document.

SNP position distribution statistics: the distribution of the annotated SNPs across the regions of the genome is counted, taking one sample as an example:

Table 5 SNP distribution statistics

In Table 5: (1) Total: the number of all SNPs in the genome; (2) UTR5: the number of SNPs in the 5' UTR of a gene; (3) UTR3: the number of SNPs in the 3' UTR of a gene; (4) UTR5;UTR3: the number of SNPs in a region shared by the 5' UTR of one gene and the 3' UTR of another gene; other combined categories are interpreted similarly; (5) exonic: the number of SNPs in exon regions; (6) splicing: the number of SNPs in the splicing region of a gene (the 2 bp region upstream of the splice site, i.e. non-exonic); (7) exonic;splicing: the number of SNPs in the exonic part of a gene within 2 bp of the splice site (downstream of the splice site); (8) upstream: the number of SNPs upstream of a gene (1000 bp); (9) downstream: the number of SNPs downstream of a gene (1000 bp); (10) upstream;downstream: the number of SNPs that are both upstream of one gene and downstream of another (1000 bp); (11) intronic: the number of SNPs in intron regions; (12) intergenic: the number of SNPs in intergenic regions; (13) ncRNA: RNA without a coding annotation, i.e. RNA that is not translated; see the gene annotation documentation of ANNOVAR; sub-region annotations are as above; (14) other: the number of SNPs at other positions.

SNP heterozygosity ratio analysis: the SNPs detected and filtered with GATK (McKenna, et al., 2010) are divided into heterozygous and homozygous SNPs. Analyzing the heterozygosity ratio of SNPs in the genome gives a better understanding of the species and supports subsequent analysis. The ratio of homozygous to heterozygous SNPs is shown in Table 6, taking one sample as an example:

Table 6 Homozygous and heterozygous SNP ratios

In Table 6: (1) Hom_genome: homozygous SNPs in the genome; (2) Het_genome: heterozygous SNPs in the genome; (3) Hom_exonic: homozygous SNPs in exons; (4) Het_exonic: heterozygous SNPs in exons.

SNP mutation pattern distribution statistics: factors such as different species and different environments lead to differences in SNP mutation patterns. By counting the distribution of SNP mutation patterns, the mutation patterns characteristic of a particular species or a particular type of sample can be obtained, giving a more comprehensive understanding of the species or sample. Table 7 gives the statistics of SNP mutation patterns, taking one sample as an example:

Table 7 SNP mutation pattern distribution statistics

In Table 7: (1) T-A: mutation of T to A (including A to T on the complementary strand); (2) T-C: mutation of T to C (including A to G on the complementary strand); (3) T-G: mutation of T to G (including A to C on the complementary strand); (4) C-A: mutation of C to A (including G to T on the complementary strand); (5) C-T: mutation of C to T (including G to A on the complementary strand); (6) C-G: mutation of C to G (including G to C on the complementary strand).
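The folding of the twelve possible substitutions into the six categories of Table 7 can be sketched in Python as follows. The function names and the (ref, alt) tuple input are hypothetical; the sketch assumes single-base alleles with ref different from alt.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
CATEGORIES = ("T-A", "T-C", "T-G", "C-A", "C-T", "C-G")


def mutation_category(ref, alt):
    """Map a substitution onto one of the six strand-collapsed categories."""
    ref, alt = ref.upper(), alt.upper()
    # Substitutions whose reference base is a purine (A or G) are reported
    # as the equivalent change on the complementary strand.
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}-{alt}"


def mutation_spectrum(snps):
    """Count SNPs per category; `snps` is an iterable of (ref, alt) pairs."""
    counts = Counter({category: 0 for category in CATEGORIES})
    for ref, alt in snps:
        counts[mutation_category(ref, alt)] += 1
    return dict(counts)


# Hypothetical toy input: A>G and T>C fall in the same category.
spectrum = mutation_spectrum([("A", "G"), ("T", "C"), ("G", "T")])
```

Collapsing by strand is what makes the counts independent of which strand the reference genome happens to report.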

Coding-region SNP functional annotation and statistics: an SNP mutation in a coding region may affect the encoded amino acid and thereby affect gene function. Mutations in coding regions are classified and annotated according to whether they change the amino acid, for example nonsynonymous and synonymous mutations. A nonsynonymous mutation usually changes the corresponding amino acid and so alters gene function, while Stopgain and Stoploss cause a premature stop codon or the loss of a stop codon and are therefore also deleterious mutations. Table 8 gives the functional distribution statistics of SNPs, taking one sample as an example:

Table 8 SNP function distribution statistics

In Table 8: (1) Total: the sum of all mutations; (2) nonsynonymous SNV: nonsynonymous mutation, in which the change of codon changes the encoded amino acid (SNV here is equivalent to SNP); (3) synonymous SNV: synonymous mutation, in which the codon mutates into another codon for the same amino acid, so that the nucleotide change does not change the amino acid, i.e. does not change the gene product; (4) stopgain: the change of codon creates a stop codon; (5) stoploss: the change of codon removes a stop codon; (6) unknown: unknown type.

The SNP function statistics for the coding regions of the sample are plotted.

(2) InDel variant detection and annotation: on the basis of alignment to the reference genome sequence, the variant analysis software GATK (McKenna, et al., 2010) is used to extract all potential polymorphic InDel (Insertion and Deletion) sites in the whole genome, which are then further filtered according to factors such as quality value, depth and reproducibility to obtain a high-confidence InDel data set, which is annotated.

InDel detection and annotation: after InDels are detected and filtered with GATK (McKenna, et al., 2010), the ANNOVAR software (Wang, et al., 2010) and the existing genome annotation file (gff/gtf) are used to annotate the detected InDels. The annotation results are stored in an Excel file; for the specific annotation content, refer to the format description document.

InDel position distribution statistics: after InDels are detected and filtered with GATK (McKenna, et al., 2010) and annotated with the ANNOVAR software (Wang, et al., 2010) and the existing genome annotation file (gff/gtf), their distribution across the regions of the genome is counted, taking one sample as an example, as shown in Table 9:

Table 9 InDel distribution statistics

In Table 9: (1) Total: the number of all InDels in the genome; (2) UTR5: the number of InDels in the 5' UTR of a gene; (3) UTR3: the number of InDels in the 3' UTR of a gene; (4) UTR5;UTR3: the number of InDels in a region shared by the 5' UTR of one gene and the 3' UTR of another gene; other combined categories are interpreted similarly; (5) exonic: the number of InDels in exon regions; (6) splicing: the number of InDels in the splicing region of a gene (the 2 bp region upstream of the splice site, i.e. non-exonic); (7) exonic;splicing: the number of InDels in the exonic part of a gene within 2 bp of the splice site (downstream of the splice site); (8) upstream: the number of InDels upstream of a gene (1000 bp); (9) downstream: the number of InDels downstream of a gene (1000 bp); (10) upstream;downstream: the number of InDels that are both upstream of one gene and downstream of another (1000 bp); (11) intronic: the number of InDels in intron regions; (12) intergenic: the number of InDels in intergenic regions; (13) ncRNA: RNA without a coding annotation, i.e. RNA that is not translated; see the gene annotation documentation of ANNOVAR; sub-region annotations are as above; (14) other: the number of InDels at other positions.

InDel mutation pattern distribution statistics: InDels of different lengths affect the genome to different degrees, and the length distribution of InDels differs markedly between the whole genome and the coding regions. Because coding regions need to be conserved, InDels of 3 bases are proportionally more common there than InDels of 2 or 4 bases (a 3-base InDel is less likely to cause a frameshift). Table 10 gives the statistics of InDel mutation patterns, taking one sample as an example:

Table 10 InDel mutation pattern statistics

In Table 10, the first column gives the length of the InDel, i.e. the length of the Insertion or Deletion; (1) Genome: the number of InDels of length n in the whole genome; (2) Exonic: the number of InDels of length n in coding regions.

Coding-region InDel functional annotation and statistics: an InDel mutation in a coding region may affect the encoded amino acids and thereby affect gene function. Mutations in coding regions are classified and annotated according to whether they change the amino acids, for example frameshift and non-frameshift mutations. A frameshift mutation is usually more harmful than a non-frameshift mutation, and Stopgain and Stoploss can also be deleterious because they cause a premature stop codon or the loss of a stop codon. Table 11 gives the functional annotation statistics of InDels, taking one sample as an example:

Table 11 InDel function statistics

In Table 11: (1) Total: the sum of all mutations; (2) frameshift: frameshift mutation, in which the number of bases deleted or inserted is not a multiple of 3, so that all codons after this position are shifted and misread; (3) nonframeshift: non-frameshift mutation, in which the number of bases deleted or inserted is a multiple of 3; (4) stopgain: the change of codon creates a stop codon; (5) stoploss: the change of codon removes a stop codon; (6) unknown: unknown type.
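The length-based part of this classification can be illustrated with a short Python sketch. It assumes InDels are given as VCF-style reference and alternate allele strings, and the function names are hypothetical. It deliberately covers only the frameshift / non-frameshift split and the length histogram of Table 10; detecting stopgain or stoploss needs the coding sequence and is left to an annotation tool such as ANNOVAR.

```python
from collections import Counter


def indel_length(ref, alt):
    """Length of the inserted or deleted segment, in bases."""
    return abs(len(alt) - len(ref))


def frame_effect(ref, alt):
    """Classify a coding-region InDel by whether its length is a multiple of 3."""
    n = indel_length(ref, alt)
    if n == 0:
        raise ValueError("ref and alt have the same length; not an InDel")
    return "nonframeshift" if n % 3 == 0 else "frameshift"


def length_histogram(indels):
    """Count InDels by length; `indels` is an iterable of (ref, alt) pairs."""
    return Counter(indel_length(ref, alt) for ref, alt in indels)


# Hypothetical toy input: one 3-base insertion, one 2-base deletion,
# one 1-base insertion.
toy_indels = [("A", "ACGT"), ("ACG", "A"), ("T", "TA")]
histogram = length_histogram(toy_indels)
```

Comparing such a histogram for exonic InDels against the genome-wide one is a simple way to see the enrichment of lengths divisible by 3 described in the text.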

(3) SV detection and annotation: chromosomal structural variation (SV) is an important component of genomic variation; the main variant types are insertion, deletion, inversion and so on. On the basis of alignment to the reference genome sequence, the chromosomal structural variation analysis software DELLY (Tobias, et al., 2012) is used to detect all potential SV sites in the whole genome, which are then further filtered according to factors such as quality value and number of supporting Reads to obtain a high-confidence SV data set, which is annotated.

SV variant detection and annotation: after SVs are detected and filtered with DELLY (Tobias, et al., 2012), the existing gene annotation file (gff/gtf) is used to annotate the detected SVs, and the annotation results are stored in an Excel file.

SV variant type statistics: the numbers of the different types of SV detected, such as chromosomal insertions, deletions and inversions, are counted. The results are shown in Table 12, taking one sample as an example:

Table 12 SV variant type statistics

In Table 12: (1) DEL: chromosomal deletion; (2) TRA: chromosomal translocation; (3) DUP: chromosomal duplication; (4) INV: chromosomal inversion; (5) INS: chromosomal insertion.

SV position distribution statistics: after SVs are detected and filtered with DELLY (Tobias, et al., 2012) and annotated with the existing genome annotation file (gff/gtf), the distribution of the genomic elements covered by SVs is counted, taking one sample as an example, as shown in Table 13.

Table 13 SV position distribution

In Table 13: (1) Total: the number of all SVs in the genome; (2) UTR5: the number of SVs in the 5' UTR of a gene; (3) UTR3: the number of SVs in the 3' UTR of a gene; (4) UTR5;UTR3: the number of SVs in a region shared by the 5' UTR of one gene and the 3' UTR of another gene; other combined categories are interpreted similarly; (5) exonic: the number of SVs in exon regions; (6) splicing: the number of SVs in the splicing region of a gene (the 2 bp region upstream of the splice site, i.e. non-exonic); (7) exonic;splicing: the number of SVs in the exonic part of a gene within 2 bp of the splice site (downstream of the splice site); (8) upstream: the number of SVs upstream of a gene (1000 bp); (9) downstream: the number of SVs downstream of a gene (1000 bp); (10) upstream;downstream: the number of SVs that are both upstream of one gene and downstream of another (1000 bp); (11) intronic: the number of SVs in intron regions; (12) intergenic: the number of SVs in intergenic regions; (13) ncRNA: RNA without a coding annotation, i.e. RNA that is not translated; see the gene annotation documentation of ANNOVAR; sub-region annotations are as above; (14) other: the number of SVs at other positions.

SV length distribution statistics: studying the length characteristics of SVs helps in understanding the complexity and the degree of variation of the genome of the species.

5. Gene function annotation: all genes are functionally annotated; the annotation file format is described in Table 14.

Table 14 Gene function annotation file format description

Matters not described in detail in the embodiments of the present invention can be chosen by those skilled in the art from the prior art.

The above discloses only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A whole-genome resequencing analysis method for sheep, characterized in that the method comprises the following steps:

(1) obtaining sheep DNA, testing the purity, concentration and volume of the DNA, performing library preparation and library quality inspection on the samples that pass testing, and sequencing the libraries that pass quality inspection to obtain raw sheep sequencing data;

(2) performing data filtering on the raw sheep sequencing data and assessing the sequencing quality, and obtaining target analysis sequence data after the data pass quality control;

(3) aligning the target analysis sequence data to the sheep reference genome, and obtaining the aligned data after the alignment metrics pass quality control;

(4) detecting single-nucleotide variants (SNPs), small insertion/deletion variants (InDels) and chromosomal structural variants (SVs) in the aligned data, and annotating them, to obtain the SNP data information, InDel data information and SV data information of the sheep whole-genome sequencing sequence.

2. The whole-genome resequencing analysis method for sheep according to claim 1, characterized in that the data filtering proceeds as follows:

(1) removing adapter-contaminated reads, wherein a read is adapter-contaminated if it contains more than 5 bp of adapter sequence; for paired-end sequencing, if one end is contaminated by adapter, the reads at both ends are removed;

(2) removing low-quality reads, wherein a read is low-quality if bases with quality value Q ≤ 19 account for 50% or more of its total bases; for paired-end sequencing, if one end is a low-quality read, the reads at both ends are removed;

(3) removing reads with an N content greater than 5%; for paired-end sequencing, if the N content of one end is greater than 5%, the reads at both ends are removed.
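The three filtering criteria above can be sketched as a small predicate on read pairs. This is a minimal illustration, not the filtering software used in the source: the function names, the tuple layout, and the `adapter_bp` input (the number of adapter bases reported by some upstream adapter detector) are all assumptions.

```python
ADAPTER_MAX_BP = 5     # more than 5 bp of adapter -> remove
LOW_Q_THRESHOLD = 19   # bases with Q <= 19 count as low quality
LOW_Q_FRACTION = 0.5   # 50% or more low-quality bases -> remove
N_FRACTION = 0.05      # more than 5% N bases -> remove


def read_fails(seq, quals, adapter_bp):
    """Return True if a single read violates any of the three criteria.

    seq: base string; quals: list of per-base quality values;
    adapter_bp: adapter bases found by a hypothetical upstream detector.
    An empty read is treated as failing.
    """
    if adapter_bp > ADAPTER_MAX_BP:
        return True
    low_q = sum(1 for q in quals if q <= LOW_Q_THRESHOLD)
    if low_q >= LOW_Q_FRACTION * len(quals):
        return True
    if seq.upper().count("N") > N_FRACTION * len(seq):
        return True
    return False


def keep_pair(read1, read2):
    """Keep a pair only if both mates pass; each read is (seq, quals, adapter_bp)."""
    return not (read_fails(*read1) or read_fails(*read2))
```

Because `keep_pair` rejects the pair when either mate fails, it mirrors the rule that both ends are removed when one end is contaminated, low-quality or N-rich.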

3. The whole-genome resequencing analysis method for sheep according to claim 1, characterized in that assessing the sequencing quality comprises assessing the quality distribution information of the data and the base distribution information of the data, wherein the quality distribution information comprises statistics on the base sequencing error rate and the correct base-call rate.
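For illustration, the error-rate and base-call statistics could be derived from per-base quality values as sketched here. The sketch assumes that Q is a Phred-scaled quality, so that the error probability is 10^(-Q/10); the source does not state the scale explicitly, and the function names and the choice of Q20/Q30 thresholds are likewise assumptions.

```python
def phred_to_error(q):
    """Error probability for a Phred-scaled quality value (assumed scale)."""
    return 10 ** (-q / 10)


def quality_summary(quals):
    """Summarise a non-empty list of per-base quality values."""
    n = len(quals)
    return {
        # Mean of the per-base error probabilities.
        "mean_error_rate": sum(phred_to_error(q) for q in quals) / n,
        # Fraction of bases called with at least 99% / 99.9% accuracy.
        "q20_fraction": sum(q >= 20 for q in quals) / n,
        "q30_fraction": sum(q >= 30 for q in quals) / n,
    }
```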

4. The whole-genome resequencing analysis method for sheep according to claim 1, characterized in that the target analysis sequence data are aligned to the sheep reference genome using the software BWA, and the mapping rate is 90.2%-99.1%.

5. The whole-genome resequencing analysis method for sheep according to claim 1, characterized in that the alignment metric quality control comprises sequencing depth distribution information, wherein the sequencing depth distribution information comprises per-base depth distribution information and cumulative depth distribution information.
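A minimal sketch of the two depth distributions, given a list of per-base depths, might look as follows. The function name is hypothetical, and the cumulative distribution is defined here as the fraction of bases with depth greater than or equal to a given value, which is a common convention but an assumption with respect to the source.

```python
from collections import Counter


def depth_distributions(depths):
    """Return (per_base, cumulative) depth distributions as dicts.

    per_base[d]   = fraction of bases covered at exactly depth d
    cumulative[d] = fraction of bases covered at depth >= d (assumed definition)
    """
    depths = list(depths)
    total = len(depths)
    counts = Counter(depths)
    per_base = {d: counts[d] / total for d in sorted(counts)}
    cumulative = {}
    running = 0
    # Walk from the highest depth down so the running sum is "depth >= d".
    for d in sorted(counts, reverse=True):
        running += counts[d]
        cumulative[d] = running / total
    return per_base, cumulative
```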

6. The whole-genome resequencing analysis method for sheep according to claim 1, characterized in that the SNP data information is obtained by extracting, on the basis of the aligned data, all potential SNP sites in the whole genome with the variant analysis software GATK, then further filtering and screening them according to quality value, depth and repeatability to obtain a high-confidence SNP data set, and annotating it; and then counting the distribution of the SNPs in the data set across the regions of the genome, analysing the proportion of heterozygous SNPs in the genome, counting the distribution of SNP mutation patterns, and counting the functional classification of SNP mutations in coding regions.
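The heterozygosity and mutation-pattern statistics named in this claim can be illustrated with a short sketch. The input layout, the function name, and the definition of the heterozygous ratio as heterozygous SNPs divided by all SNPs are assumptions; the source does not give the exact formula, and the transition/transversion split is added only as one common way to summarise mutation patterns.

```python
from collections import Counter

# Purine<->purine and pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def snp_summary(snps):
    """Summarise a non-empty list of (ref, alt, genotype) tuples.

    genotype is a VCF-style string such as '0/1' or '1|1'.
    """
    patterns = Counter()
    het = 0
    for ref, alt, gt in snps:
        patterns[f"{ref}>{alt}"] += 1
        alleles = gt.replace("|", "/").split("/")
        if len(set(alleles)) > 1:  # two different alleles -> heterozygous
            het += 1
    ts = sum(n for p, n in patterns.items() if tuple(p.split(">")) in TRANSITIONS)
    tv = sum(patterns.values()) - ts
    return {
        "het_ratio": het / len(snps),  # assumed definition: het SNPs / all SNPs
        "patterns": dict(patterns),
        "transitions": ts,
        "transversions": tv,
    }
```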

7. The whole-genome resequencing analysis method for sheep according to claim 1, characterized in that the InDel data information is obtained by extracting, on the basis of the aligned data, all potential InDel sites in the whole genome with the variant analysis software GATK, then further filtering and screening them according to quality value, depth and repeatability to obtain a high-confidence InDel data set, and annotating it; and then counting the distribution of the InDels in the data set across the regions of the genome, counting the distribution of InDel mutation patterns, and counting the functional classification of InDel mutations in coding regions.
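As a hedged illustration of one kind of coding-region InDel classification, the sketch here labels an InDel as frameshift or non-frameshift from the length difference between its reference and alternative alleles. The function name and the label strings are assumptions, and a real annotation tool also considers the transcript context, which this sketch ignores.

```python
def indel_class(ref, alt):
    """Classify an InDel by type and reading-frame effect from allele lengths."""
    diff = len(alt) - len(ref)
    if diff == 0:
        raise ValueError("ref and alt have equal length; not an InDel")
    kind = "insertion" if diff > 0 else "deletion"
    # A length change that is a multiple of 3 preserves the reading frame.
    frame = "non-frameshift" if diff % 3 == 0 else "frameshift"
    return f"{frame} {kind}"
```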

8. The whole-genome resequencing analysis method for sheep according to claim 1, characterized in that the SV data information is obtained by extracting, on the basis of the aligned data, all potential SV sites in the whole genome with the chromosomal structural variation analysis software DELLY, then further filtering and screening them according to quality value, depth and repeatability to obtain a high-confidence SV data set, and annotating it; and then counting the variation types of the SVs in the data set and the distribution of each type across the regions of the genome, counting the positional distribution of the SVs in the genome, and counting the length distribution of the SVs.

9. The whole-genome resequencing analysis method for sheep according to claim 1, characterized in that the method further comprises step (5): performing functional annotation on all genes.