CN109411014A

CN109411014A - A method for assembling and circularizing plant chloroplast whole genomes based on second-generation sequencing

Info

Publication number: CN109411014A
Application number: CN201811174710.0A
Authority: CN
Inventors: 赵磊; 李洪涛; 李德铢
Original assignee: Kunming Institute of Botany of CAS
Current assignee: Kunming Institute of Botany of CAS
Priority date: 2018-10-09
Filing date: 2018-10-09
Publication date: 2019-03-01
Anticipated expiration: 2038-10-09
Also published as: CN109411014B

Abstract

The present invention provides a method for assembling and circularizing plant chloroplast whole-genome data based on second-generation sequencing data (e.g., from the Illumina platform). The method first uses the data-filtering software Trimmomatic, the assembly software SPAdes, and the MASHMAP software, then uses the ChloroplastCircle toolkit, and, combined with Perl scripting, implements the complete data workflow for assembling and circularizing plant chloroplast whole genomes. Extensive experiments show that the method can complete the assembly and circularization of plant chloroplast whole genomes quickly, accurately, in batches, and automatically. The invention has already been applied in the 10,000 plant chloroplast whole genomes project, with good results.

Description

A method for assembling and circularizing plant chloroplast whole genomes based on second-generation sequencing

Technical field

The invention belongs to the technical field of bioinformatics, and specifically relates to a method for assembling and circularizing plant chloroplast whole-genome sequences based on second-generation sequencing.

Background art

Chloroplasts are the organelles in which green plants on Earth convert light energy into chemical energy, and they are the main site of photosynthesis. In the 1960s, researchers discovered chloroplast DNA (cpDNA). Studies have shown that chloroplast genomes range from 120 kb to 217 kb in size and have a relatively conserved circular structure. Complete sequence analysis of plant chloroplasts has revealed the following characteristics of the cpDNA genome: 1. the genome consists of two inverted repeats (IR), one short single-copy sequence (SSC), and one long single-copy sequence (LSC); 2. IRA and IRB are each 10-24 kb long, encode the same content, and lie in opposite orientations; 3. although cpDNA size differs among plants, gene composition is similar, and the number of genes is almost identical.

Compared with traditional Sanger sequencing, second-generation sequencing technologies are fast, highly accurate (99.99%), high-throughput, and low-cost; the sequencing platforms of Illumina are the best known. In recent years, high-throughput sequencing has been used to decode a large number of plant chloroplast genomes, which are now widely applied across research fields, for example in chloroplast-based phylogenomics and biodiversity research; the best-known application is DNA barcoding. The concept of the DNA barcode was proposed in 2003 by Paul Hebert, a fellow of the Royal Society of Canada: biological species are identified on the basis of gene fragments. The significance of DNA barcoding is that it allows species to be identified effectively and rapidly, helps discover cryptic species and thereby promotes the discovery of biodiversity, and supports research on other scientific questions such as phylogeny. DNA barcoding is not only a powerful supplement to traditional species identification; because it is digital, it also allows specimen identification to be automated, standardized, and globalized, overcoming excessive reliance on experience. Because it can identify organisms quickly and effectively from their remains, an easy-to-use application system can be built within a relatively short time. The greatest advantage of DNA barcoding is that it can use biological traces (bloodstains, feather traces, tissue fragments, etc.) to identify known or unknown species that cannot be identified morphologically. DNA barcoding therefore has broad applications in species identification, national health quarantine, identification of nationally important strategic biological resources, discovery of new species resources, and related areas.

In 2011 the present inventors proposed ITS as a new standard core barcode for seed plants, and the results were published in PNAS (Li et al., 2011). Plant species arise in different ways, and hybridization is frequent in some species. In practical plant species identification, relying on only a few gene fragments still gives unsatisfactory results for some species. In recent years, the rapid development of high-throughput sequencing and the continuing fall in sequencing costs have brought new opportunities. The inventors have therefore further proposed using the plant chloroplast whole-genome sequence as the barcode for species identification. The goal is to sequence the chloroplast whole genomes of all of China's more than 30,000 higher plant species and to build the largest barcoding database in China, and indeed in the world, so as to provide technical support services to researchers and relevant government departments. To date, the inventors have completed sequencing of more than 10,000 species. With data at this scale, assembling and circularizing the genomes entirely by hand would entail an almost unimaginable workload. A fast, efficient, accurate, and automated method is therefore urgently needed.

Summary of the invention

In view of the shortcomings of the prior art, the present invention aims to provide a fast, efficient, accurate, and automated method for assembling and circularizing plant chloroplast whole-genome sequences. Its reliability has been verified in the 10,000 plant chloroplast genomes project.

To achieve the above object and to overcome the shortcomings of the prior art, the invention is implemented through the following technical solution:

A method for assembling and circularizing plant chloroplast whole genomes based on second-generation sequencing, comprising the following steps:

Total DNA samples (including nuclear DNA, chloroplast DNA, and mitochondrial DNA) are sequenced on a second-generation Illumina sequencing platform, with 2 G of data sequenced per sample;

The raw data are processed with the data-filtering software Trimmomatic to remove adapters and low-quality reads, yielding clean reads;

The clean reads are assembled de novo with SPAdes using multiple k-mer sizes to construct scaffolds;

A 4 kb sequence from the Arabidopsis chloroplast genome is used as the database, and the assembled scaffold sequences are used as the query in a BLASTN search;

The BLASTN results are parsed with the parse_blastnToScaffold.pl script in the ChloroplastCircle toolkit to obtain the scaffold sequence with the highest bitscore;

The scaffold sequence with the highest bitscore is then searched with BLAST against a database of complete chloroplast genomes; these chloroplast genome sequences can be downloaded from NCBI or supplied by the user;

The result is parsed with the parse_blastnToReference_Genome.pl script in the ChloroplastCircle toolkit to obtain the chloroplast whole-genome reference sequence of the species most closely related to the target scaffold;

All assembled scaffold sequences are mapped to the chloroplast reference genome sequence with the MASHMAP software, which locates them on the reference genome;

Finally, the parse_MashMap.pl script in the ChloroplastCircle toolkit is used to connect the scaffolds that mapped to the reference genome and to fill the gaps with scaffolds that did not map; the sequence is then circularized and meets the submission standard of the NCBI database.

Specifically, the method comprises the following steps:

(1) plant genomic DNA samples (including nuclear DNA, chloroplast DNA, and mitochondrial DNA) are sequenced on a second-generation Illumina sequencing platform to obtain raw data, with about 2 G sequenced per sample;

(2) the raw data obtained in step (1) are processed with the data-filtering software Trimmomatic to remove adapters and low-quality reads, yielding clean reads;

(3) the clean reads obtained in step (2) are assembled de novo with SPAdes using multiple k-mer sizes to construct scaffolds;

(4) a 4 kb sequence from the Arabidopsis chloroplast genome is used as the database, and the scaffold sequences assembled in step (3) are used as the query in a BLASTN search;

(5) the BLASTN results obtained in step (4) are parsed with the parse_blastnToScaffold.pl script in the ChloroplastCircle toolkit to obtain the scaffold sequence with the highest bitscore, wherein the main code of the parse_blastnToScaffold.pl script is as follows: [code listing not reproduced in this text]
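As an illustration only (this is not the listing from the ChloroplastCircle toolkit), the selection described in step (5) can be sketched in Python. The sketch assumes tab-separated BLASTN output with the column order given in the -outfmt string of Embodiment 1; the function name best_scaffold and the example rows are hypothetical.

```python
import csv
import io

# Column order assumed from the -outfmt string used in Embodiment 1.
COLUMNS = ["qseqid", "sseqid", "length", "sstart", "send", "qstart",
           "qend", "bitscore", "qcovs", "pident", "evalue", "mismatch"]


def best_scaffold(blast_tsv):
    """Return (scaffold_id, bitscore) for the hit with the highest bitscore."""
    best_id, best_score = None, float("-inf")
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        score = float(hit["bitscore"])
        if score > best_score:
            best_id, best_score = hit["qseqid"], score
    return best_id, best_score


# Made-up example rows, for illustration only.
example = io.StringIO(
    "NODE_1\tAT_4K\t3900\t1\t3900\t500\t4399\t6500\t12\t96.5\t0.0\t120\n"
    "NODE_7\tAT_4K\t800\t100\t899\t1\t800\t1100\t40\t91.0\t1e-150\t70\n"
)
print(best_scaffold(example))
```

The scaffold identified this way would then serve as the query for the second BLAST search in step (6).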

(6) the scaffold sequence with the highest bitscore obtained in step (5) is then searched with BLAST against a database of complete chloroplast genomes; these chloroplast genome sequences can be downloaded from NCBI (https://www.ncbi.nlm.nih.gov) or supplied by the user;

(7) the file obtained in step (6) is parsed with the parse_blastnToReference_Genome.pl script in the ChloroplastCircle toolkit to obtain the chloroplast whole-genome reference sequence of the species most closely related to the target scaffold, wherein the main code of the parse_blastnToReference_Genome.pl script is as follows: [code listing not reproduced in this text]
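The text does not state how the script decides which reference is the closest relative. One plausible criterion, shown here purely as an assumption, is to sum the bitscores of all hits per reference genome and take the reference with the largest total. The function name closest_reference and the reference identifiers are hypothetical.

```python
from collections import defaultdict


def closest_reference(hits):
    """hits: iterable of (sseqid, bitscore) pairs.

    Returns the subject ID with the largest summed bitscore, or None
    when there are no hits. The ranking rule is an assumption, not the
    rule used by parse_blastnToReference_Genome.pl.
    """
    totals = defaultdict(float)
    for sseqid, bitscore in hits:
        totals[sseqid] += bitscore
    if not totals:
        return None
    return max(totals, key=totals.get)


# Made-up hits: ref_B wins on total score even though ref_A has the best single hit.
print(closest_reference([("ref_A", 100.0), ("ref_B", 80.0), ("ref_B", 50.0)]))
```

Summing over hits favors references that match the scaffold along its whole length rather than in one short, highly conserved region; other criteria, such as the single best hit, would be equally easy to substitute.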

(8) all scaffold sequences assembled in step (3) are mapped with the MASHMAP software to the chloroplast reference genome sequence obtained in step (7), which locates them on the reference genome;

(9) finally, the parse_MashMap.pl script in the ChloroplastCircle toolkit is used to connect the scaffolds that mapped to the reference genome in step (8) and to fill the gaps with scaffolds that did not map, after which the sequence is circularized, wherein the main pseudocode of the parse_blastnToReference_Genome.pl script algorithm is as follows: [pseudocode listing not reproduced in this text]
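The linking idea in step (9) can be illustrated with a minimal Python sketch. It is not the algorithm of parse_MashMap.pl: it assumes each mapped scaffold is described by a reference start position, a strand, and its sequence, joins neighbours by exact suffix/prefix overlap, and marks an unresolved junction with a single N, whereas the method described above fills such gaps with unmapped scaffolds. All function names are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def overlap_len(left, right, min_len):
    """Longest suffix of `left` equal to a prefix of `right` (at least min_len), else 0."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def link_scaffolds(mapped, min_overlap=3):
    """mapped: list of (ref_start, strand, sequence).

    Orders scaffolds by reference position, orients minus-strand scaffolds,
    and merges neighbours on their overlap. Returns (joined sequence, gap count).
    """
    joined, gaps = "", 0
    for _, strand, seq in sorted(mapped, key=lambda m: m[0]):
        seq = revcomp(seq) if strand == "-" else seq
        if not joined:
            joined = seq
            continue
        n = overlap_len(joined, seq, min_overlap)
        if n:
            joined += seq[n:]      # drop the duplicated overlap
        else:
            joined += "N" + seq    # placeholder for an unfilled gap (assumption)
            gaps += 1
    return joined, gaps


# Toy scaffolds, for illustration only.
print(link_scaffolds([(10, "+", "GGGTTTAAA"), (0, "+", "ACGTACGGG"), (20, "-", "CCCTTT")]))
```

A gap count of zero means every junction was closed by overlap alone; a non-zero count indicates junctions where an unmapped scaffold would have to be tried.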

Brief description of the drawings

The following drawings illustrate specific embodiments of the invention and are not intended to limit its scope;

Fig. 1 shows a flow chart of one technical solution of the invention;

Fig. 2 shows a schematic diagram of the algorithmic idea of parse_MashMap.pl in the invention;

Fig. 3 shows a schematic view, in Geneious 8.0, of a plant chloroplast whole genome after assembly and circularization by the invention.

Detailed description of the embodiments

The substance of the invention is further described below with reference to the drawings and an embodiment, but the invention is not limited thereto. Any person skilled in the art may use the technical content described above to derive modified, equivalent embodiments. Any simple modification of the following embodiment, or any change of programming language, made in accordance with the technical essence of the invention and without departing from its content falls within the scope of protection of the invention.

The invention assembles and circularizes chloroplast data, but it is equally applicable to the assembly and circularization of mitochondrial data. The assembly and circularization of mitochondrial data therefore also fall within the scope of protection of the invention.

Embodiment 1

The LHT120202 total DNA sample was sequenced on the Illumina HiSeq 2500 platform using a paired-end library, with a read length of 150 bp and about 2.2 G of raw data.

The raw data were filtered with the data-filtering software Trimmomatic to remove adapters and low-quality reads, yielding clean reads. The command is:
nohup java -jar ~/Trimmomatic-0.32/trimmomatic-0.32.jar PE -threads 2 -phred33 LHT120202_R1.fastq LHT120202_R2.fastq LHT120202_R1.trim LHT120202_R1.unpaired LHT120202_R2.trim LHT120202_R2.unpaired ILLUMINACLIP:~/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:80 &
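To make the quality parameters in this command concrete, the following Python sketch applies LEADING:5, TRAILING:5, SLIDINGWINDOW:4:15, and MINLEN:80 to a single read, in the order in which they appear on the command line. It is a simplified illustration of the idea, not Trimmomatic's implementation: adapter clipping (ILLUMINACLIP) is omitted, and the function names are hypothetical.

```python
def sliding_window_cut(quals, window=4, min_avg=15):
    """Number of leading bases to keep: cut at the first window whose mean quality < min_avg."""
    for start in range(0, len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return len(quals)


def trim_read(seq, quals, leading=5, trailing=5, window=4, min_avg=15, min_len=80):
    """Apply LEADING, TRAILING, SLIDINGWINDOW, then MINLEN; return None if the read is dropped."""
    begin = 0
    while begin < len(quals) and quals[begin] < leading:
        begin += 1                      # strip low-quality bases from the 5' end
    end = len(quals)
    while end > begin and quals[end - 1] < trailing:
        end -= 1                        # strip low-quality bases from the 3' end
    seq, quals = seq[begin:end], quals[begin:end]
    keep = sliding_window_cut(quals, window, min_avg)
    seq, quals = seq[:keep], quals[:keep]
    return (seq, quals) if len(seq) >= min_len else None


# Toy read with made-up Phred scores; min_len is lowered so the example survives.
toy_quals = [2, 2] + [30] * 10 + [10, 10, 10, 10] + [3]
print(trim_read("A" * len(toy_quals), toy_quals, min_len=5))
```

With the default min_len of 80 the same toy read would be discarded, which is the behaviour MINLEN:80 requests for reads that become too short after trimming.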

The clean reads were assembled de novo with SPAdes using multiple k-mer sizes to construct scaffolds. The command is:
nohup python spades.py --careful -k 21,33,55,77,99,127 -o ~/outsamplename --pe1-1 ~/cpgenome/samplename --pe1-2 ~/cpgenome/samplename &

A 4 kb sequence from the Arabidopsis chloroplast genome was used as the database, and the scaffold sequences assembled above were used as the query in a BLASTN search. The commands are:
nohup makeblastdb -in ~/Arabidopsis_thaliana_4K.fasta -out ~/Arabidopsis_thaliana_4K -dbtype nucl &
nohup blastn -query ~/LHT120202_scaffolds.fasta -db ~/Arabidopsis_thaliana_4K -out ~/scaffolds_AT_4K.xls -num_threads 4 -outfmt "6 qseqid sseqid length sstart send qstart qend bitscore qcovs pident evalue mismatch" &

The BLASTN results obtained above were parsed with the parse_blastnToScaffold.pl script in the ChloroplastCircle toolkit to obtain the scaffold sequence with the highest bitscore. The command is:
nohup perl parse_blastnToScaffold.pl scaffolds_BLASTN_AT_4K.xls LHT120202_scaffolds.fasta scaffolds_Filtered.fa bitscore_max_scaffold.fa &

(6) the scaffold sequence with the highest bitscore obtained in step (5) was then searched with BLAST against a database of complete chloroplast genomes; these chloroplast genome sequences can be downloaded from NCBI (https://www.ncbi.nlm.nih.gov) or supplied by the user. The commands are:
nohup makeblastdb -in ~/plastid_genomic.fna -out ~/plastid_genomic -dbtype nucl &
nohup blastn -query ~/bitscore_max_scaffold.fa -db ~/plastid_genomic -out ~/bitscore_max_scaffold_Ref.xls -num_threads 4 -outfmt "6 qseqid sseqid length sstart send qstart qend bitscore qcovs pident evalue mismatch" &

(7) the file obtained in step (6) was parsed with the parse_blastnToReference_Genome.pl script in the ChloroplastCircle toolkit to obtain the chloroplast whole-genome reference sequence of the species most closely related to the target scaffold. The command is:
nohup perl parse_blastnToReference_Genome.pl bitscore_max_scaffold_Ref.xls plastid_genomic.fna reference_genome.fa &

(8) all scaffold sequences assembled in step (3) were mapped with the MASHMAP software to the chloroplast reference genome sequence obtained in step (7), which locates them on the reference genome. The command is:
nohup mashmap -r reference_genome.fa -q scaffolds.fa -s 300 -f none -o test -t 10 &
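Before the scaffolds can be connected, the mapping output has to be read and ordered along the reference. The sketch below assumes the whitespace-delimited layout described in the MashMap README for its classic output (query name, query length, query start, query end, strand, reference name, reference length, reference start, reference end, identity); newer MashMap releases may use a different layout. The Mapping type, the parse_mashmap function, and the example lines are hypothetical.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    query: str
    q_len: int
    q_start: int
    q_end: int
    strand: str
    ref: str
    ref_len: int
    r_start: int
    r_end: int
    identity: float


def parse_mashmap(lines):
    """Parse mapping lines and return them sorted by start position on the reference."""
    records = []
    for line in lines:
        f = line.split()
        if len(f) < 10:
            continue  # skip blank or truncated lines
        records.append(Mapping(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                               f[5], int(f[6]), int(f[7]), int(f[8]), float(f[9])))
    return sorted(records, key=lambda m: m.r_start)


# Made-up mapping lines, for illustration only.
example = [
    "NODE_2 500 0 499 - ref 150000 600 1099 98.5",
    "NODE_1 600 0 599 + ref 150000 0 599 99.1",
]
for m in parse_mashmap(example):
    print(m.query, m.strand, m.r_start, m.r_end)
```

Sorting by reference start gives the order in which scaffolds would be joined, and the strand column indicates which scaffolds need to be reverse-complemented first.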

(9) finally, the parse_MashMap.pl script in the ChloroplastCircle toolkit was used to connect the scaffolds that mapped to the reference genome in step (8) and to fill the gaps with scaffolds that did not map, after which the sequence was circularized. The command is:
nohup perl parse_MashMap.pl -m mashmap_test -s scaffolds.fa -k 127 -d bitscore_max_scaffold.fa -t MashMap_site_sort.xls -a scaffolds_all.fasta -p mapped_seq.fa -o linked_each_contigs.fa &
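The final circularization can be pictured as follows: when a linear assembly of a circular molecule is complete, its last bases repeat its first bases, and removing the duplicated tail yields one copy of the circle. The Python sketch below illustrates that check under stated assumptions. The default minimum overlap of 127 simply mirrors the -k 127 option in the command above; what that option means inside parse_MashMap.pl is not stated in the text, and the function name circularize is hypothetical.

```python
def circularize(seq, min_overlap=127):
    """If the end of `seq` repeats its start, remove the duplicated tail.

    Returns (sequence, True) when a terminal overlap of at least
    min_overlap bases is found, otherwise (seq, False).
    """
    for n in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[:n] == seq[-n:]:
            return seq[:-n], True
    return seq, False


# Toy example: a 13-base "circle" whose first 5 bases are repeated at the end.
core = "ACGTTGCAAGGCT"
print(circularize(core + core[:5], min_overlap=4))
```

A result flagged False would indicate that the linked sequence still has an open junction and is not yet ready to be treated as a closed circle.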

Table 1. Main software used in the present invention

Claims

1. A method for assembling and circularizing plant chloroplast whole genomes based on second-generation sequencing, characterized in that the method comprises the following steps:

Total DNA samples (including nuclear DNA, chloroplast DNA, and mitochondrial DNA) are sequenced on a second-generation Illumina sequencing platform, with 2 G of data sequenced per sample;

The raw data are processed with the data-filtering software Trimmomatic to remove adapters and low-quality reads, yielding clean reads;

The BLASTN results are parsed with the parse_blastnToScaffold.pl script in the ChloroplastCircle toolkit to obtain the scaffold sequence with the highest bitscore;

The scaffold sequence with the highest bitscore is then searched with BLAST against a database of complete chloroplast genomes; these chloroplast genome sequences can be downloaded from NCBI or supplied by the user;

The result is parsed with the parse_blastnToReference_Genome.pl script in the ChloroplastCircle toolkit to obtain the chloroplast whole-genome reference sequence of the species most closely related to the target scaffold;

All assembled scaffold sequences are mapped to the chloroplast reference genome sequence with the MASHMAP software, which locates them on the reference genome;

Finally, the parse_MashMap.pl script in the ChloroplastCircle toolkit is used to connect the scaffolds that mapped to the reference genome and to fill the gaps with scaffolds that did not map; the sequence is then circularized and meets the submission standard of the NCBI database.

2. A method for assembling and circularizing plant chloroplast whole genomes based on second-generation sequencing, characterized in that the method comprises the following steps:

(1) plant genomic DNA samples (including nuclear DNA, chloroplast DNA, and mitochondrial DNA) are sequenced on a second-generation Illumina sequencing platform to obtain raw data, with about 2 G sequenced per sample;

(2) the raw data obtained in step (1) are processed with the data-filtering software Trimmomatic to remove adapters and low-quality reads, yielding clean reads;

(3) the clean reads obtained in step (2) are assembled de novo with SPAdes using multiple k-mer sizes to construct scaffolds;

(4) a 4 kb sequence from the Arabidopsis chloroplast genome is used as the database, and the scaffold sequences assembled in step (3) are used as the query in a BLASTN search;

(5) the BLASTN results obtained in step (4) are parsed with the parse_blastnToScaffold.pl script in the ChloroplastCircle toolkit to obtain the scaffold sequence with the highest bitscore, wherein the main code of the parse_blastnToScaffold.pl script is as follows: [code listing not reproduced in this text]

(6) the scaffold sequence with the highest bitscore obtained in step (5) is then searched with BLAST against a database of complete chloroplast genomes; these chloroplast genome sequences are downloaded from NCBI (https://www.ncbi.nlm.nih.gov) or supplied by the user;

(7) the file obtained in step (6) is parsed with the parse_blastnToReference_Genome.pl script in the ChloroplastCircle toolkit to obtain the chloroplast whole-genome reference sequence of the species most closely related to the target scaffold, wherein the main code of the parse_blastnToReference_Genome.pl script is as follows: [code listing not reproduced in this text]

(8) all scaffold sequences assembled in step (3) are mapped with the MASHMAP software to the chloroplast reference genome sequence obtained in step (7), which locates them on the reference genome;

(9) finally, the parse_MashMap.pl script in the ChloroplastCircle toolkit is used to connect the scaffolds that mapped to the reference genome in step (8) and to fill the gaps with scaffolds that did not map, after which the sequence is circularized, wherein the main pseudocode of the parse_blastnToReference_Genome.pl script algorithm is as follows: [pseudocode listing not reproduced in this text]