CN110310699A

CN110310699A - Analysis tool for mining target gene sequences based on whole-genome sequences, and application thereof

Info

Publication number: CN110310699A
Application number: CN201910586422.4A
Authority: CN
Inventors: 肖宁; 李爱宏; 戴正元; 周长海; 刘广青; 潘存红; 李育红; 吴云雨; 余玲; 王志平; 蔡跃; 黄年生; 季红娟; 张小祥
Original assignee: Jiangsu Lixiahe Prefecture Institute Of Agricultural Science
Current assignee: Jiangsu Lixiahe Prefecture Institute Of Agricultural Science
Priority date: 2019-07-01
Filing date: 2019-07-01
Publication date: 2019-10-08

Abstract

The present invention relates to an analysis tool, written in Perl and run under a Linux environment, for mining target gene sequences based on whole-genome sequences, and to a detection and analysis method using the tool and its application. Working at the whole-genome level, the tool uses the complete genome sequences of multiple parental materials to analyze the variant sites and variation types of a target gene and to obtain the homologous sequences of the target gene in the parental materials. The analysis tool and analysis method automatically complete the target-interval search, the sequence alignment, and the analysis of functional variation types. They do not require the genome annotation of any other species as a reference, are therefore highly general, and can support the analysis of 2,000 parental genomes. They are widely applicable to target gene sequence analysis in crop genomes and provide a simple and efficient sequence polymorphism analysis tool and strategy for molecular breeding.

Description

Analysis tool for mining target gene sequences based on whole-genome sequences, and application thereof

Technical field

The present invention relates to the creation of an analysis tool for mining target gene sequences based on whole-genome sequences, and to a method that uses the tool to mine and analyze target gene sequences in whole-genome sequences. The method, and the analysis tool EXGE1.0 created for it, are mainly applied to target gene sequence analysis in crop genomes.

Background art

In recent years, with the continuous progress of sequencing technology, sequencing throughput has risen steadily while sequencing cost has fallen. Obtaining the genome sequence of a material by genome sequencing and then finding the variation types of a target gene in that sequence has become a basic strategy of molecular breeding in animals and plants. However, as sample numbers increase sharply and large amounts of genome sequencing data accumulate, how to quickly find the functional genotypes and variant-site information of a target gene in massive data has become a key factor limiting the progress of genomic breeding. When handling large numbers of genome sequences, traditional analysis tools suffer from complicated operating steps, high work intensity, and heavy workload. An automated target gene analysis tool based on whole-genome sequences is therefore an effective solution.

Summary of the invention

The technical problem solved by the present invention is to provide an analysis tool for mining target gene sequences based on whole-genome sequences. The tool automatically analyzes the variation types of a target gene sequence at the whole-genome level, does not require the genome annotation of any other species as a reference, and has good generality.

The technical solution for achieving the object of the invention is as follows:

An analysis tool for mining target gene sequences based on whole-genome sequences, comprising:

Parameters: -i: file name of the target gene sequence; -g: file name of the path text file of the target genome set; -e: filtering threshold; -d: genomic interval to be detected; -o: file name of the output. Command line 1: the -g file lists one genome path per line. Command line 2: -d specifies the chromosome number and the physical position. Command line 3: -i is a fasta-format file and must be stored in the same folder as the genomes to be detected. The script EXGE.pl is executed with the perl command, carrying the parameters -i, -g, -e, -d, and -o.

A detection and analysis method using the above analysis tool for mining target gene sequences based on whole-genome sequences, comprising the following steps:

Step 1: install the BioPerl package and the BLAST+ sequence alignment package on a computer running the Linux operating system;

Step 2: extract sample genomic DNA, construct a library and sequence it to obtain the sample genome sequence, and convert the sequence into a fasta-format file to obtain the sample genome sequence file;

Step 3: write the names of the sample genome sequences, in order, into the path text file g of the target genome set, whose format is one sample genome path per line; set the genomic interval d to be detected, expressed as chromosome number:physical position range; set the filtering threshold e;

Step 4: place the sample genome sequence files and the target gene sequence file i to be detected in the same destination folder, where the target gene sequence file i is a fasta-format file; also place the script package of the analysis tool described in claim 1 and the path text file g of the target genome set in the same destination folder;

Step 5: run the analysis tool for mining target gene sequences based on whole-genome sequences, and output the insertion or deletion variant-site information of the target gene sequence in the sample genomes, the SNP variation information, and the BLAST alignment results of the target gene sequence against the sample genome set. The insertion or deletion variant-site information includes the physical position of the insertion or deletion in the target gene sequence, the name of the sample genome sequence concerned, the physical position of the insertion or deletion in the sample genome sequence, the variation type in the target gene sequence, and the variation type in the sample genome sequence. The SNP variation information includes the physical position of the SNP in the target gene sequence, the name of the sample genome sequence concerned, the physical position of the SNP in the sample genome sequence, the SNP base type in the target gene sequence, the nucleotide variation type in the sample genome sequence, and whether the mutation is synonymous or nonsynonymous. The BLAST alignment results include the homologous sequences of the target gene sequence in the sample genomes.

Application of the above detection and analysis method in sequence detection and analysis after genome sequencing of rice and other crops.

Compared with the prior art, the present invention, by adopting the above technical solution, has the following technical effects:

1. The present invention automatically analyzes the variation types of a target gene sequence at the whole-genome level, does not require the genome annotation of any other species as a reference, has good generality, and is suitable for use on an ordinary personal computer.

2. The present invention automatically completes the target-interval search, the sequence alignment, and the analysis of functional variation types without any manual intervention throughout; the summary tables of variation types and aligned sequences that it finally generates are convenient for the user's subsequent analysis.

3. The present invention supports the analysis of up to 2,000 complete genome sequences (430 Mb per genome) and provides a standardized output data format, so that users can call third-party tools to reprocess the analysis data.

Brief description of the drawings

Fig. 1 shows the fasta-format file of a sample genome sequence;

Fig. 2 shows the fasta-format file of the target gene sequence;

Fig. 3 shows the insertion or deletion variation output;

Fig. 4 explains the insertion or deletion variation output;

Fig. 5 shows the SNP variation output;

Fig. 6 explains the SNP variation output;

Fig. 7 shows the BLAST alignment results of the target gene sequence against the genome set;

Fig. 8 shows the CDS sequence of the rice blast resistance gene Piz-t;

Fig. 9 shows the variation types of the Piz-t resistance gene in the sequenced parental materials, analyzed with the analysis tool for mining target gene sequences based on whole-genome sequences, together with the resistance phenotypes; A shows the haplotypes and variant sites of the Piz-t resistance gene, and B shows the relationship between haplotype and resistant or susceptible phenotype.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference labels denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting the claims.

Parameters: -i: file name of the target gene sequence; -g: file name of the path text file of the target genome set; -e: filtering threshold; -d: genomic interval to be detected; -o: file name of the output;

Command line 1: the -g file lists one genome path per line;

Command line 2: -d specifies the chromosome number and the physical position;

Command line 3: -i is a fasta-format file and must be stored in the same folder as the genomes to be detected;

The script EXGE.pl is executed with the perl command, carrying the parameters -i, -g, -e, -d, and -o.
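The parameter list above can be illustrated with a minimal sketch that assembles the argument list for one run of the script. The helper name build_exge_command, the example file names, and the way the threshold is written are assumptions made for illustration; the source specifies only the five flags and the perl invocation, and the sketch does not execute the script.

```python
import shlex

def build_exge_command(target_fasta, genome_list, region, output,
                       evalue=None, script="EXGE.pl"):
    """Assemble (but do not run) the argument list for one run.

    Only the flags -i, -g, -e, -d and -o come from the description;
    the function name and defaults are hypothetical.
    """
    cmd = ["perl", script, "-i", target_fasta, "-g", genome_list]
    if evalue is not None:
        # The accepted value format for -e is not specified in the
        # description, so the value is passed through unchanged.
        cmd += ["-e", str(evalue)]
    cmd += ["-d", region, "-o", output]
    return cmd

cmd = build_exge_command("target.fa", "genomes.txt", "Chr01:1-1000",
                         "result", evalue="1e-10")
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if a path contains spaces.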

Step 2: extract sample genomic DNA, construct a library and sequence it to obtain the sample genome sequence; use the analysis tool BWA to align the reads obtained for each sample to the reference genome and generate a BAM-format file, then use the samtools software to convert the BAM-format file into a fasta-format file to obtain the sample genome sequence file;

Step 3: write the names of the sample genome sequences, in order, into the path text file g of the target genome set, whose format is one sample genome path per line, for example ~/Msuv7.fa; 2,000 genomes therefore occupy 2,000 lines;
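As an illustration of the one-path-per-line format, the following sketch writes such a list file. The function name write_genome_list and the sample file names are hypothetical, and the duplicate check is an added safeguard that the description does not require.

```python
import tempfile
from pathlib import Path

def write_genome_list(genome_paths, list_file):
    """Write one genome path per line and return the number of lines."""
    lines = [str(p) for p in genome_paths]
    if len(set(lines)) != len(lines):
        # Assumption: listing the same genome twice is a mistake.
        raise ValueError("duplicate genome path in list")
    Path(list_file).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)

# Demonstration in a temporary folder with three placeholder paths.
with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "genomes.txt"
    n_written = write_genome_list(
        [f"~/sample_{i}.fa" for i in range(1, 4)], list_path)
    list_content = list_path.read_text(encoding="utf-8")

print(n_written)
print(list_content, end="")
```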

Set the genomic interval d to be detected, expressed as chromosome number:physical position range. For example, with chromosome number Chr01 plus the physical position range, the interval is written Chr01:1-1000. Note that the chromosome number on the command line must match the number used in the sample genome sequences to be detected;
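A minimal sketch of how an interval string of this form could be parsed and checked against the chromosome names in a genome file is given below. The function names, the tolerance for thousands separators, and the validation rules are assumptions; the actual parsing performed by the script is not described in the source.

```python
import re

# chromosome name, a colon, then start-end; commas in numbers are tolerated
REGION_RE = re.compile(r"^([^:\s]+):([\d,]+)-([\d,]+)$")

def parse_region(region):
    """Split 'Chr01:1-1000' into ('Chr01', 1, 1000)."""
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"cannot parse interval: {region!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in interval: {region!r}")
    return chrom, start, end

def chromosome_present(chrom, fasta_headers):
    """Check that chrom equals the first field of some '>' header line."""
    names = {h.lstrip(">").split()[0]
             for h in fasta_headers if h.lstrip(">").split()}
    return chrom in names

print(parse_region("Chr01:1-1000"))
print(chromosome_present("Chr01", [">Chr01 example", ">Chr02"]))
```

Checking the chromosome name before running reflects the note above that the number on the command line must match the number in the sample genome sequences.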

Set the filtering threshold e. e defaults to 10^-10, and the threshold can be adjusted as needed;
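The effect of such a threshold can be sketched as follows, assuming that e is applied as an upper bound on the E-value of each BLAST hit. That interpretation, the dictionary layout of a hit, and the inclusive comparison are all assumptions, since the source does not state how the script applies the value.

```python
def filter_hits(hits, max_evalue=1e-10):
    """Keep hits whose E-value does not exceed the threshold.

    Each hit is a dict with an 'evalue' key (hypothetical layout).
    """
    return [hit for hit in hits if hit["evalue"] <= max_evalue]

example_hits = [
    {"subject": "sample_1", "evalue": 1e-50},
    {"subject": "sample_2", "evalue": 1e-5},
    {"subject": "sample_3", "evalue": 1e-10},
]
kept = filter_hits(example_hits)
print([hit["subject"] for hit in kept])
```

A smaller threshold keeps only stronger matches; a larger one admits more divergent homologous sequences.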

Step 4: place the sample genome sequence files and the target gene sequence file i to be detected in the same destination folder, where the target gene sequence file i is a fasta-format file,

and also place the script package of the analysis tool described in claim 1 for mining target gene sequences based on whole-genome sequences, together with the path text file g of the target genome set, in the same destination folder;

Step 5: run the analysis tool for mining target gene sequences based on whole-genome sequences, and output the insertion or deletion variant-site information of the target gene sequence in the sample genomes, the SNP variation information, and the BLAST alignment results of the target gene sequence against the sample genome set;

The insertion or deletion variant-site information includes the physical position of the insertion or deletion in the target gene sequence, the name of the sample genome sequence concerned, the physical position of the insertion or deletion in the sample genome sequence, the variation type in the target gene sequence, and the variation type in the sample genome sequence;

The SNP variation information includes the physical position of the SNP in the target gene sequence, the name of the sample genome sequence concerned, the physical position of the SNP in the sample genome sequence, the SNP base type in the target gene sequence, the nucleotide variation type in the sample genome sequence, and whether the mutation is synonymous or nonsynonymous;

The BLAST alignment results of the target gene sequence against the genome set include the homologous sequences of the target gene sequence in the sample genomes.
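The synonymous or nonsynonymous label in the SNP output can be illustrated with a short sketch that compares the amino acid encoded before and after a substitution. It assumes the target sequence is a coding sequence read in frame from its first base and uses the standard genetic code; the function name classify_snp is hypothetical, and the source does not describe how the script itself makes this call.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snp(cds, position, alt_base):
    """Classify a substitution at a 1-based CDS position.

    Returns (label, reference amino acid, altered amino acid).
    """
    index = position - 1
    offset = index % 3
    start = index - offset
    ref_codon = cds[start:start + 3].upper()
    if len(ref_codon) < 3:
        raise ValueError("position falls in an incomplete final codon")
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return label, ref_aa, alt_aa

print(classify_snp("ATGGCT", 6, "C"))  # GCT -> GCC
print(classify_snp("ATGGCT", 4, "A"))  # GCT -> ACT
```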

The above detection and analysis method can be used for sequence detection and analysis after genome sequencing of rice and other crops.

Embodiment 1

(1) Operating environment requirements

Hardware requirements: a CPU with 4 or more cores, 16 GB or more of memory, and 1000 GB or more of hard disk space. Software requirements: Linux operating system (with Perl version 5.10 or later installed).
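As a rough plausibility check on the disk figure, the following arithmetic estimates the space taken by the largest supported input. It assumes that 430 Mb means 430 million bases per genome and that each base occupies one byte in an uncompressed fasta file; neither assumption, nor any link between the genome limit and the disk requirement, is stated in the source.

```python
GENOMES = 2000                    # upper limit stated in the description
BASES_PER_GENOME = 430_000_000    # assumption: 430 Mb = 430 million bases

# One byte per base; line breaks and header lines are ignored here.
total_bytes = GENOMES * BASES_PER_GENOME
total_gb = total_bytes / 1e9
print(f"{total_gb:.0f} GB")
```

Under these assumptions the input occupies about 860 GB, which fits within the stated 1000 GB, although line breaks and intermediate files would add to it.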

(2) Mining of disease-resistance genes in parental materials

1. Test materials

199 advanced-generation, genetically stable rice samples.

2. DNA extraction: following the DNA extraction method of Temnykh et al. (2000), genomic DNA was extracted from each individual plant. After extraction, libraries were constructed and sequenced at a depth of 20x. Reads in the raw data in which more than 50% of the bases had a quality value below 5, or which showed adapter contamination, were filtered out. Based on the genomic DNA sequencing data, the free analysis tool BWA was used to align the reads of each sample to the reference genome (IRGSP-1.0) and generate BAM-format files, and the samtools software was used to convert the BAM files into fasta-format files. To improve the reliability of sequence extraction, the quality-control parameters were set as follows: the mapping quality of each site greater than 20, the variant quality greater than 50, and each base supported by at least 3 reads.
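The two filtering rules in this step can be written down as a minimal sketch. The function names and argument layout are hypothetical, and the thresholds are taken directly from the text above; in practice these filters would be applied by the sequencing and variant-calling tools rather than by hand.

```python
def keep_read(base_qualities, has_adapter=False,
              low_quality=5, max_low_fraction=0.5):
    """Read-level filter: drop reads with adapter contamination or with
    more than 50% of bases below quality 5."""
    if has_adapter or not base_qualities:
        return False
    low = sum(1 for q in base_qualities if q < low_quality)
    return low / len(base_qualities) <= max_low_fraction

def keep_site(mapping_quality, variant_quality, supporting_reads):
    """Site-level filter: mapping quality > 20, variant quality > 50,
    and at least 3 supporting reads."""
    return (mapping_quality > 20
            and variant_quality > 50
            and supporting_reads >= 3)

print(keep_read([2, 2, 2, 30]))    # 75% low-quality bases
print(keep_read([2, 2, 30, 30]))   # exactly 50%, not more than 50%
print(keep_site(30, 60, 3))
```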

3. The parental genome sequences obtained above (sequence content as in Fig. 1) and the target gene sequence to be detected (as in Fig. 2) are stored in the same folder. All data files used by this script are in fasta format: the sequence description line begins with '>' and occupies exactly one line, and the first field after the '>' must not be repeated within a file. The sequence content follows the description line and may be stored across several consecutive lines.

In this embodiment, the gene sequence of the rice blast resistance gene Piz-t is used as the target gene sequence; its content is shown in Fig. 8.
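The fasta conventions described here (a single '>' description line, a unique first field, sequence content over one or more following lines) can be checked with a small reader such as the sketch below. The function name read_fasta and the error messages are illustrative assumptions, not part of the tool.

```python
def read_fasta(lines):
    """Parse fasta text lines into {first_field: sequence}.

    Raises ValueError if a first field repeats or if sequence
    content appears before any description line.
    """
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty description line")
            name = fields[0]
            if name in records:
                raise ValueError(f"repeated sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence content before any description line")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

example = [">Chr01 example description", "ACGT", "ACG", ">Chr02", "TT"]
print(read_fasta(example))
```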

4. Create the text file direct.txt in the above folder and enter the names of the sample genomes in it, for example ~/199_1.fa through ~/199_199.fa.

5. The target gene sequence file is Piz-t.fasta; its sequence content is shown in Fig. 8.

6. The script EXGE.pl is run with the command line: perl EXGE.pl -i Piz-t.fa -g direct.txt -e -10 -d Chr06_consensus:10,000,00-12,000,000 -o Piz-t_result. For the script package, see the source code.

7. After the script finishes, three result files are output: the SNP variation of the target gene sequence in the corresponding genomes (Fig. 5), the indel (insertion/deletion) variant-site information (Fig. 3), and the BLAST alignment results (Fig. 7). The SNP results are explained in Fig. 6 and the indel results in Fig. 4. In the BLAST alignment results of Fig. 7, '>199_17_Chr11_consensus_27982787-27983057' indicates that the sequence in the interval 27982787 to 27983057 of chromosome 11 of sample '199_17' is highly homologous to the target gene; 'POS:' indicates that positions 94 to 364 of the target gene sequence are 100% identical to the sample genome sequence.
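For downstream processing, a header of the form shown in the BLAST result can be split into its parts with a sketch like the following. The regular expression assumes that chromosome names start with 'Chr' followed by digits and an optional '_consensus' suffix, as in the examples in this embodiment; the function name parse_hit_header is hypothetical, and other naming schemes would need a different pattern.

```python
import re

HEADER_RE = re.compile(
    r"^>?(?P<sample>.+?)_(?P<chrom>Chr\d+(?:_consensus)?)"
    r"_(?P<start>\d+)-(?P<end>\d+)$"
)

def parse_hit_header(header):
    """Split a hit header into sample, chromosome, start and end."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    return {
        "sample": match.group("sample"),
        "chrom": match.group("chrom"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
    }

hit = parse_hit_header(">199_17_Chr11_consensus_27982787-27983057")
print(hit)
print(hit["end"] - hit["start"] + 1)  # length of the matched interval
```

The interval length computed here, 271 bases, equals the length of positions 94 to 364 of the target gene sequence reported in the same result.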

8. Based on the insertions, deletions, and nonsynonymous substitutions of the resistance gene Piz-t in the parental materials, the variation types were divided into 13 haplotypes (named Hap1 to Hap13), of which the Piz-t sequence of Hap1 is 100% identical to the resistant type. As shown in Fig. 9A, 'NO.' gives the number of parental materials carrying the haplotype, '-' indicates a 1-bp deletion at the site, and '--' indicates a 2-bp deletion at the site. The rice blast resistance of parental materials of the 13 haplotypes was evaluated with the rice blast isolate '83-14'. As shown in Fig. 9B (R: resistant phenotype; S: susceptible phenotype), Hap1 materials were resistant to rice blast, whereas the other types, Hap2 to Hap13, were susceptible. The resistance genotypes identified from the parental materials with the EXGE1.0 script are therefore consistent with the inoculation phenotypes.
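Grouping materials into haplotypes by their shared set of variants can be sketched as follows. The input layout (a list of (position, reference, alternative) tuples per material), the function name group_haplotypes, and the numbering order are assumptions made for illustration; the source does not say how the 13 haplotypes were numbered beyond Hap1 matching the resistant sequence, and the example data are invented placeholders rather than results from the embodiment.

```python
from collections import defaultdict

def group_haplotypes(variants_by_material):
    """Group materials that carry exactly the same set of variants.

    The variant-free group (identical to the target sequence) is
    numbered first, then larger groups before smaller ones.
    """
    groups = defaultdict(list)
    for material, variants in variants_by_material.items():
        groups[tuple(sorted(variants))].append(material)
    ordered = sorted(
        groups.items(),
        key=lambda item: (len(item[0]) > 0, -len(item[1]), item[0]),
    )
    return {
        f"Hap{i}": {"variants": key, "materials": sorted(members)}
        for i, (key, members) in enumerate(ordered, start=1)
    }

# Invented placeholder data, not taken from the embodiment.
example = {
    "m1": [],
    "m2": [(10, "A", "G")],
    "m3": [],
    "m4": [(10, "A", "G"), (25, "T", "-")],
}
haplotypes = group_haplotypes(example)
for name, info in haplotypes.items():
    print(name, len(info["materials"]), info["variants"])
```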

The source code of the analysis tool for mining target gene sequences based on whole-genome sequences is as follows:

The above are only some embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements without departing from the principle of the invention, and such improvements shall also be regarded as falling within the scope of protection of the invention.

Claims

1. An analysis tool for mining target gene sequences based on whole-genome sequences, characterized by comprising:

Parameters: -i: file name of the target gene sequence; -g: file name of the path text file of the target genome set; -e: filtering threshold; -d: genomic interval to be detected; -o: file name of the output;

Command line 1: the -g file lists one genome path per line;

Command line 2: -d specifies the chromosome number and the physical position;

2. A detection and analysis method using the analysis tool of claim 1 for mining target gene sequences based on whole-genome sequences, characterized by comprising the following steps:

Step 1: install the BioPerl package and the BLAST+ sequence alignment package on a computer running the Linux operating system;

Step 3: write the names of the sample genome sequences, in order, into the path text file g of the target genome set, whose format is one sample genome path per line;

Set the genomic interval d to be detected, expressed as chromosome number:physical position range;

Set the filtering threshold e;

Step 4: place the sample genome sequence files and the target gene sequence file i to be detected in the same destination folder, where the target gene sequence file i is a fasta-format file,

and also place the script package of the analysis tool of claim 1 for mining target gene sequences based on whole-genome sequences, together with the path text file g of the target genome set, in the same destination folder;

Step 5: run the analysis tool for mining target gene sequences based on whole-genome sequences, and output the insertion or deletion variant-site information of the target gene sequence in the sample genomes, the SNP variation information, and the BLAST alignment results of the target gene sequence against the sample genome set;

The insertion or deletion variant-site information includes the physical position of the insertion or deletion in the target gene sequence, the name of the sample genome sequence concerned, the physical position of the insertion or deletion in the sample genome sequence, the variation type in the target gene sequence, and the variation type in the sample genome sequence;

The SNP variation information includes the physical position of the SNP in the target gene sequence, the name of the sample genome sequence concerned, the physical position of the SNP in the sample genome sequence, the SNP base type in the target gene sequence, the nucleotide variation type in the sample genome sequence, and whether the mutation is synonymous or nonsynonymous;

The BLAST alignment results of the target gene sequence against the genome set include the homologous sequences of the target gene sequence in the sample genomes.

3. The detection and analysis method of claim 2 using the analysis tool for mining target gene sequences based on whole-genome sequences, characterized in that converting the sample genome sequence into a fasta-format file in step 2 specifically comprises: using the analysis tool BWA to align the reads obtained for each sample to the reference genome and generate a BAM-format file, then using the samtools software to convert the BAM-format file into a fasta-format file.

4. The detection and analysis method of claim 2 using the analysis tool for mining target gene sequences based on whole-genome sequences, characterized in that the filtering threshold e is 10^-10.

5. Application of the detection and analysis method using the analysis tool for mining target gene sequences based on whole-genome sequences in sequence detection and analysis after genome sequencing of rice and other crops.