CN112349350B

CN112349350B - Method for strain identification based on Dunaliella core genome sequence

Info

Publication number: CN112349350B
Application number: CN202011238521.2A
Authority: CN
Inventors: 高帆; 宋韡; 南芳茹; 冯佳; 谢树莲
Original assignee: Shanxi University
Current assignee: Qingdao Aixin Biotechnology Co.,Ltd.
Priority date: 2020-11-09
Filing date: 2020-11-09
Publication date: 2022-07-19
Anticipated expiration: 2040-11-09
Also published as: CN112349350A

Abstract

The invention belongs to the technical field of plant molecular identification, and particularly relates to a method for strain identification based on a Dunaliella core genome sequence. The method mainly comprises the following steps: collecting, purifying and culturing a sample; extracting whole genome DNA; constructing a DNA sequencing library; obtaining whole genome sequencing data of the algal strain to be tested and of Dunaliella quartolecta (D. quartolecta); screening and de novo assembling the core genome sequencing fragments of D. quartolecta, and performing gene component analysis, protein function annotation and genome contig collinearity analysis on the assembled core genome sequence; and constructing a phylogenetic tree from single nucleotide polymorphisms. When the algal strain to be tested clusters with D. quartolecta with a branch support of 0.99-1.00 and a genetic similarity of at least 99%, the algal strain to be tested is identified as D. quartolecta.

Description

Method for strain identification based on Dunaliella core genome sequence

Technical Field

The invention belongs to the technical field of plant molecular identification, and particularly relates to a method for strain identification based on a Dunaliella core genome sequence.

Background

Dunaliella quartolecta (D. quartolecta) is a eukaryotic unicellular microalga that lives in oceans, salt lakes and other extreme environments. It belongs to Chlorophyta, Chlorophyceae, Volvocales, Dunaliella; it has strong stress resistance, lacks a cell wall, contains a chromatophore and a pyrenoid, and bears flagella at the apex of the cell. D. quartolecta is rich in bioactive substances such as glycerol, beta-carotene and algal polysaccharides, and is a characteristic economic microalga. Using characteristic strains of D. quartolecta as bioreactors to extract active substances for industrial production has important application prospects in food processing, medical care, biodiesel and other fields. However, the 23 Dunaliella species identified so far at home and abroad have similar morphology and broad-spectrum salt tolerance, so D. quartolecta is difficult to identify morphologically. Although DNA markers, gene markers and protein markers have improved the efficiency of strain identification, accuracy is still limited by the marker technique, the conservation of the fragments, and the non-universality of amplification or experimental procedures. Conventional molecular identification of closely related algal strains typically suffers from few candidate amplification fragments, poor specificity of universal markers, long development cycles for new markers and specific primers, and the need to optimize PCR (polymerase chain reaction) amplification procedures, and the identification results often contain false positives. Because D. quartolecta is an important characteristic species with high added value in the genus Dunaliella, molecular identification of D. quartolecta resources is critical. Therefore, there is a need to develop a more accurate, rapid and universal method for the molecular identification of D. quartolecta in Dunaliella.

The rapid development of next-generation DNA sequencing has made molecular identification at the whole-genome level possible. Compared with traditional molecular identification, whole-genome identification draws on more genetic information, covers a wider detection range, discriminates related species more effectively, and yields richer genetic variation information. Whole genome sequencing data for many model species have now been published. Although reference genome sequencing data for Dunaliella salina (D. salina) were published in 2017 (Dunsal1 v.2), no whole genome sequencing of another typical Dunaliella species, D. quartolecta, has been reported. The currently popular combination of second- and third-generation sequencing can recover the complete genetic information of a species, but still has the following drawbacks: (1) all sequencing reads must be fully aligned, so the run time is long and the data output is huge, consuming large amounts of computer time and resources, which hinders timely molecular identification; (2) genome assembly and bioinformatic analysis depend heavily on the second- and third-generation high-throughput sequencing platforms of domestic and foreign sequencing companies, such as Illumina, Nanopore and PacBio, and are also limited by genome size and platform computing capacity, so results take longer and cost more than ordinary laboratories can often bear; (3) molecular identification of related species depends heavily on the quality of their whole genome resequencing, which in turn is closely tied to the genome quality of the reference species; if the reference genome has insufficient sequencing depth or low assembly quality, the resequencing results for the species to be tested are affected, which in turn biases species identification.

Therefore, how to provide an accurate, efficient and economical method for identifying D. quartolecta among strains to be tested is an urgent technical problem to be solved in the field.

Disclosure of Invention

The invention provides a method for strain identification based on a Dunaliella core genome sequence.

In order to achieve the purpose, the invention adopts the following technical scheme:

The method for strain identification based on the Dunaliella core genome sequence comprises the following steps:

(1) collecting, purifying and culturing samples: collecting the algal strain to be tested and Dunaliella quartolecta (D. quartolecta), purifying the algal strain to be tested, and then carrying out indoor expanded culture;

(2) extracting whole genome DNA: separately extracting the whole genome DNA of the algal strain to be tested and of D. quartolecta by an improved CTAB method, and storing it frozen;

(3) fragmenting and purifying the whole genome DNA of the algal strain to be tested and of D. quartolecta from step (2), and constructing a DNA sequencing library for each;

(4) sequencing each DNA sequencing library from step (3) by a high-throughput sequencing method to obtain second-generation whole genome sequencing data of the algal strain to be tested and of D. quartolecta;

(5) taking the Dunaliella salina whole genome data published by NCBI as the reference, aligning the D. quartolecta whole genome sequencing data obtained in step (4) to it, and obtaining the D. quartolecta core genome sequence through screening, de novo assembly and quality evaluation, wherein the core genome sequence is 6592916 bp in size, the number of contigs is 3000, the maximum contig length is 1133322 bp, the average contig length is 2197.64 bp, the contig N50 is 15270 bp, the proportion of complete genes is 23.65%, the proportion of single-copy genes is 15.18%, the proportion of multi-copy genes is 13.76%, the proportion of gaps/deletions is 1.89%, and the proportion of incomplete fragments is 17.45%; constructing a circular map of the de novo assembled D. quartolecta core genome; and then performing gene component analysis, protein function annotation and genome contig collinearity analysis on the D. quartolecta core genome sequence;

(6) taking the D. quartolecta core genome sequence constructed in step (5) as the reference, aligning to it the whole genome sequencing data of the algal strain to be tested obtained in step (4) together with published genome sequencing data of representative algae, detecting single nucleotide polymorphisms and insertion/deletion sites among the species, and constructing a phylogenetic tree from the single nucleotide polymorphisms, wherein when the algal strain to be tested clusters with D. quartolecta with a branch support of 0.99-1.00 and a genetic similarity of at least 99%, the algal strain to be tested is D. quartolecta.
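The identification criterion in step (6) can be written as a small decision function. The following is a minimal sketch only: the function name and its parameters are hypothetical, and it assumes the clustering result, branch support and genetic similarity have already been read off the phylogenetic tree by whatever tree-building software was used.

```python
def is_d_quartolecta(clusters_with_reference: bool,
                     branch_support: float,
                     similarity_pct: float) -> bool:
    """Apply the thresholds stated in step (6).

    clusters_with_reference: True if the test strain and the D. quartolecta
        reference fall into one cluster of the SNP-based tree.
    branch_support: support value of that branch, on a 0-1 scale.
    similarity_pct: genetic similarity percentage, on a 0-100 scale.
    """
    return (clusters_with_reference
            and 0.99 <= branch_support <= 1.00
            and similarity_pct >= 99.0)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(is_d_quartolecta(True, 1.00, 99.6))
    print(is_d_quartolecta(True, 0.95, 99.6))
```

All three conditions must hold together; a high similarity on a poorly supported branch does not satisfy the criterion.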

Further, the specific steps of the indoor expanded culture in step (1) are as follows: picking single clones of algal cells of the algal strain to be tested under aseptic conditions and, after they pass microscopic examination, carrying out indoor expanded culture under aseptic conditions, wherein the indoor expanded culture conditions are: photoperiod 18 h : 6 h, light intensity 19000 lx, temperature 23 ± 3 ℃, with a sterile ventilated environment maintained; shaking the culture dish every 5 days to prevent the algal cells from adhering to the walls, and taking 0.5-1 mL of algal solution for microscopic examination; the following culture medium is prepared for the indoor expanded culture of the algal strain to be tested, with the formula:

30 g/L NaCl, 1.5 g/L NaNO₃, 1.4 g/L K₂HPO₄, 1.75 g/L MgSO₄·7H₂O, 1.36 g/L CaCl₂·7H₂O, 1.2 g/L Na₂CO₃, 0.006 g/L FeC₆H₅O₇, 0.005 g/L NaH₂PO₄·2H₂O, 0.5 g/L Co(NO₃)₂·6H₂O, 0.8 g/L CuSO₄·5H₂O, 2.3 g/L ZnSO₄·7H₂O, 0.03 g/L H₃BO₃, 4.0 g/L Na₂MoO₄·2H₂O, 0.02 g/L MnCl₂·4H₂O, 0.5 g/L VB₁, 0.5 g/L VB₁₂, 0.5 g/L VH, made up to 1 L with ultrapure water.

Further, the specific steps of the improved CTAB method in step (2) are as follows: taking 600-800 mg of the algae to be tested, washing 2-3 times with ultrapure water, centrifuging at 8000 r/min and 4 ℃ for 1.5 min, adding liquid nitrogen and grinding for 15 sec, adding 800 μL of 2% (w/v) CTAB solution preheated at 20 ℃ and 1 μL of 1% (v/v) beta-mercaptoethanol, mixing well, incubating in a 60 ℃ water bath for 1.5 h with shaking once every 20 min, adding 800 μL of Tris-saturated phenol, centrifuging at 12000 r/min and 4 ℃ for 2.5 min, taking the supernatant, adding a mixture of Tris-saturated phenol, chloroform and isoamyl alcohol in a volume ratio of 25:24:2, vortexing, standing at 4 ℃ for 10 min, mixing 2-3 times, adding 800 μL of ddH₂O treated with 0.1% (v/v) DEPC, incubating in a 60 ℃ water bath for 30 min, centrifuging at 12000 r/min and 4 ℃ for 4 min, taking the supernatant, adding 150 mL of 3 mol/L sodium acetate and 250 mL of absolute ethanol precooled to 4-5 ℃, precipitating at -20 ℃ for 50 min, centrifuging at 10000 r/min and 4 ℃ for 3 min, discarding the supernatant, adding 1 mL of 70% (v/v) ethanol solution precooled to 4-5 ℃, vortexing for 20 sec, discarding the supernatant, evaporating the remaining liquid in a nucleic acid vacuum drying system, and adding 100× TE buffer to dissolve the precipitate such that the DNA concentration is at least 150 ng/μL; the genomic DNA is checked by 1% (w/v) agarose gel electrophoresis combined with a fluorometer, ensuring that the electrophoresis band is bright with no degradation, that OD₂₆₀/OD₂₈₀ is 1.8 to 1.9, and that there is no contamination.
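The two numeric acceptance criteria for the extracted DNA (concentration of at least 150 ng/μL and an OD₂₆₀/OD₂₈₀ ratio of 1.8 to 1.9) can be checked as follows. This is an illustrative sketch with a hypothetical function name; the gel-based checks (bright band, no degradation, no contamination) are visual and are not modeled here.

```python
def dna_passes_qc(concentration_ng_per_ul: float,
                  od260: float,
                  od280: float) -> bool:
    """Check the numeric DNA quality criteria given for the improved CTAB method.

    Returns True when the concentration is >= 150 ng/uL and the
    OD260/OD280 ratio lies within 1.8-1.9 (both bounds inclusive).
    """
    if od280 <= 0:
        raise ValueError("OD280 must be positive")
    ratio = od260 / od280
    return concentration_ng_per_ul >= 150.0 and 1.8 <= ratio <= 1.9


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(dna_passes_qc(180.0, 1.85, 1.00))
    print(dna_passes_qc(120.0, 1.85, 1.00))
```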

Further, the specific steps of constructing the DNA sequencing library in step (3) are as follows: fragmenting the whole genome DNA by ultrasonication at the high power setting (80-100 W), 6 sec per pulse, repeated once every 3 sec, for 5 pulses in total, with the fragmentation target set to 300-400 bp; running the fragments on an agarose gel and recovering the 300-400 bp target fragments from the gel; adsorbing and recovering the target fragments with silica-based magnetic beads, and checking the quality of the recovered fragments with a fluorometer; repairing the DNA ends and adding an A at the 3' end; adding adapters for the ligation reaction, then purifying, transforming and PCR-verifying the ligation product; and denaturing the positive product at 95 ℃ for 20 sec, carrying out a single-stranded DNA circularization reaction, and purifying the product to obtain a whole genome DNA sequencing library ready for loading onto the sequencer.

Further, the specific steps of obtaining the D. quartolecta core genome sequence through screening, assembly and quality evaluation in step (5) are as follows: screening high-quality sequences from the sequencing platform output, taking fragments with a sequencing depth of 50-80×, an average length of 12-15 K and an N50 greater than 18 K as query sequences, mapping the query sequences back to the reported Dunaliella salina reference genome (Dunsal1 v.2) with SOAPaligner or BWA software, and further selecting sequencing fragments with a sequence identity of at least 90% and an alignment E value of less than 1E-10 as candidate data for the D. quartolecta core genome sequence; aligning all remaining sequencing fragments against the candidate data set to obtain the overlapping regions between the aligned data; error-correcting the alignment results with Falcon or Pilon software, and assembling contigs with SOAPdenovo 2.04, Mecat, HERA or Canu software; determining the order of the contigs with BySS 2.2.3, Velvet 1.2.10 or ABySS 2.2.3 software; calculating whole genome coverage with BAMStats or GATK DepthOfCoverage software, and selecting core sequences with a reference genome coverage of not less than 50% and not fewer than 2000 consecutively arranged contigs; evaluating the assembly quality of the selected contigs with BUSCO 2.0 or Quast software, and selecting as the D. quartolecta core genome sequence an assembly with a complete gene ratio of at least 20%, a single-copy gene ratio of 15%, a multi-copy gene ratio of at least 12% and a deletion/gap ratio of at most 3%; and constructing the circular map of the core genome of this species with Circos software.
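The assembly statistics and acceptance thresholds above can be illustrated with a short sketch. The function names are hypothetical, and the sketch does not reproduce what BUSCO or Quast compute internally; it only shows the standard definition of contig N50, the average contig length, and the threshold check. One assumption is made explicit in the code: the text gives the single-copy gene ratio as "15%" without a comparator, and the sketch treats it as a lower bound.

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the contig N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


def mean_contig_length(total_bp: int, n_contigs: int) -> float:
    """Average contig length in bp."""
    return total_bp / n_contigs


def passes_core_thresholds(complete_pct: float, single_copy_pct: float,
                           multi_copy_pct: float, missing_pct: float) -> bool:
    """Check the assembly-quality thresholds listed in the text.

    ASSUMPTION: the single-copy threshold is read as '>= 15%'; the source
    states '15%' without a comparator.
    """
    return (complete_pct >= 20.0 and single_copy_pct >= 15.0
            and multi_copy_pct >= 12.0 and missing_pct <= 3.0)


if __name__ == "__main__":
    # Figures reported for the assembled core genome in step (5).
    print(round(mean_contig_length(6592916, 3000), 2))
    print(passes_core_thresholds(23.65, 15.18, 13.76, 1.89))
    # Toy contig set, for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

With the figures reported in step (5), the average contig length reproduces the stated 2197.64 bp and the assembly meets all four thresholds under the stated assumption.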

Further, the specific steps of the gene component analysis, protein function annotation and genome contig collinearity analysis of the D. quartolecta core genome sequence in step (5) are as follows: predicting CDS in the assembly data with Augustus 3.3.3, ESTScan 3.0.1, TransDecoder 2.0.1 or Prodigal 2.6.1 software; analyzing repeat sequences in the assembly data with RepeatMasker 4.0.9, RepeatProteinMask 3.2.2, LTR-FINDER, Piler 1.0.6 or RepeatScout 1.0.5 software; aligning the protein sequences encoded by the CDS to the NR database with Diamond 0.9.14 or BLASTX software and annotating their functions; and, after aligning the predicted protein sequences with BLAST, MCScanX, LAST, Mugsy, Spines or progressiveMauve software, carrying out the genome collinearity analysis.

Further, the specific steps of constructing the phylogenetic tree from single nucleotide polymorphisms in step (6) are as follows: aligning the algal strain to be tested and the genome data of 5-6 representative algae reported in the NCBI database to the D. quartolecta core genome sequence assembled in step (5) with LASTZ 1.02.00 or Mauve 2.3.1 software; extracting, from the aligned collinear blocks, the genotype of each species corresponding to the D. quartolecta core genome; merging, extracting and filtering the genotype information of all species with the D. quartolecta core genome as the template; and detecting the single nucleotide polymorphism data and insertion/deletion site data with BWA 0.7.17 software; then, based on the single nucleotide polymorphism data, constructing a phylogenetic tree with the maximum likelihood algorithm in EasySpeciesTree 1.0, MEGA 5.0, TreeBeST 1.9.2, PHYLIP, Puzzle 5.2 or PHYLO-WIN software, and thereby determining the genetic relationship between the algal strain to be tested and D. quartolecta.

Further, the missing rate allowed in the filtering is no higher than 20%.
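The 20% filtering limit can be illustrated with a minimal sketch. It assumes, as one possible reading of the text, that the limit applies per SNP site across all species, with an uncalled genotype represented as `None`; the function names and data layout are hypothetical and are not taken from any of the software packages named above.

```python
from typing import List, Optional, Sequence

Genotype = Optional[str]  # None marks a missing (uncalled) genotype


def missing_rate(site: Sequence[Genotype]) -> float:
    """Fraction of species with no genotype call at this site."""
    return sum(1 for g in site if g is None) / len(site)


def filter_sites(sites: Sequence[Sequence[Genotype]],
                 max_missing: float = 0.20) -> List[Sequence[Genotype]]:
    """Keep only sites whose missing rate does not exceed max_missing."""
    return [site for site in sites if missing_rate(site) <= max_missing]


if __name__ == "__main__":
    # Toy genotype matrix: one row per site, one column per species.
    sites = [
        ["A", "A", "A", "A", "A"],     # 0% missing, kept
        ["A", None, "G", "A", "A"],    # 20% missing, kept
        ["A", None, None, "G", "A"],   # 40% missing, dropped
    ]
    print(len(filter_sites(sites)))
```

Only the retained sites would then be passed on to tree construction.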

The method provided by the invention does not depend entirely on known Dunaliella whole-genome sequencing results. It sequences the genome of a related species without published genome sequencing data, namely D. quartolecta, and uses an optimized data alignment method and sequence assembly strategy to avoid the drawbacks of traditional genome sequencing, such as long run times, heavy dependence on advanced sequencing platforms and high cost. After obtaining second-generation sequencing data from a domestic sequencing company, an operator can process, assemble and analyze the sequencing data according to the core genome sequence and program commands established by the invention. A wide range of software can be chosen for each step, the program settings in the examples are rigorous, and the computational work is easy to carry out, so the method has broad application prospects in molecular identification of Dunaliella strains, variation detection, phylogenetic analysis and the like.

Given that no whole genome sequencing data for D. quartolecta have been published at home or abroad, the invention constructs the core genome assembly of D. quartolecta for the first time. The sequence contains the most abundant genetic information currently available for D. quartolecta, with relatively high assembly quality, and using it as a reference provides theoretical and informational support for the directed genetic improvement and industrial application of the algal strain.

Compared with the prior art, the invention has the following advantages:

1. The invention constructs the D. quartolecta core genome sequence for the first time by combining second-generation sequencing with de novo genome assembly. The sequence contains the most abundant genetic information currently available for D. quartolecta, with relatively high assembly quality, and fills the gap in genome information for this species.

2. The D. quartolecta core genome sequence constructed by the invention can be applied to the molecular identification of the algal strain. It greatly improves the efficiency of accurate identification of Dunaliella strains and can also serve as a theoretical and technical basis for phylogenetic and evolutionary research on, and identification of, Dunaliella at home and abroad.

3. Compared with the published Dunaliella salina (D. salina) whole genome sequence, the D. quartolecta core genome constructed by the invention has a smaller data volume. Using it as the reference sequence for analyzing the genome sequencing data of the strain to be tested greatly shortens the alignment time and improves the efficiency of obtaining valid single nucleotide polymorphism (SNP) data for the strain to be tested. It is an important reference for genome-level analysis of genetic variation in Dunaliella-related strains, and provides a rich data basis for systematic research on the origin and evolution of lower algae, particularly green algae.

4. With the D. quartolecta core genome sequence constructed by the invention as the reference, researchers can set up experimental and control groups according to their experimental purposes, or compare the algal strain with its closely related strains to mine differential or characteristic genes. This lays a foundation for improving and studying the quality of the algal strain at the molecular level and for promoting its industrial application.

5. The methods of the invention for indoor expanded culture of D. quartolecta and of the algal strain to be tested, the improved CTAB method, the screening of core genome sequencing data and the de novo assembly of sequencing fragments can be widely applied to algae, particularly to the artificial culture of green algae, high-quality whole genome DNA extraction, and optimized processing of genome sequencing data. Compared with traditional methods, they have a shorter experimental cycle, higher efficiency and easier operation, and form a reproducible set of techniques.

Drawings

FIG. 1 is a circular map of the de novo assembled core genome of D. quartolecta; the outermost layer is the nucleotide sequence size coordinate (unit: Mbp); inside it are the de novo assembled fragments arranged by sequence identity (relative to the reference genome Dunsal1 v.2); the lines inside the genome fragments represent gene loci of each type; the innermost layer is the corresponding contig sequencing abundance plot; and the center of the circular map gives basic information on the core genome of the alga;

FIG. 2 shows the morphological observation of the algal strain to be identified (tentatively named Dunaliella sp.) after 30 days of indoor expanded culture, wherein the upper part is the macroscopic view, the lower part is the microscopic view (scale bar: 50 μm), and samples No. 1-4 of the alga are arranged from left to right;

FIG. 3 is a schematic diagram of 1% agarose gel electrophoresis of the whole genome DNA of the sample to be identified; M1 and M2 represent DNA ladders;

FIG. 4 is a scatter plot of the collinearity analysis between the D. quartolecta core genome and the genome sequencing fragments of the strain to be identified; the dots represent collinear blocks between the genomes of the two species, and A and B mark 2 densely distributed collinear regions between D. quartolecta and the genome of the strain to be identified;

FIG. 5 is a phylogenetic tree of 7 different algae constructed from single nucleotide polymorphism (SNP) data; the tree was constructed by the maximum likelihood method with the bootstrap value set to 1000, and the data at each branch node represent the support rate and the genetic similarity percentage, respectively;

FIG. 6 is a circle plot of the collinearity analysis within the core genome of the identified Dunaliella strain Dq_SX; the connecting lines between segments inside the circle represent possible duplication events during the evolution of the species' genome, and the numbers on the circle are core genome contig numbers;

FIG. 7 is a histogram of the frequency distribution of Ka/Ks values of the identified Dunaliella strain Dq_SX; the data on the histogram are the frequencies in the different intervals, Ka is the nonsynonymous nucleotide substitution rate, and Ks is the synonymous nucleotide substitution rate;

FIG. 8 is a histogram of the COG (orthologous protein database) annotation statistics for proteins in the core genome of the identified Dunaliella strain Dq_SX; the histogram shows the top 20 functional categories of the homologous protein annotation;

FIG. 9 is a transmembrane domain prediction diagram for a transcription regulatory factor in the identified Dunaliella strain Dq_SX; the different lines represent the transmembrane region, the intramembrane region and the extramembrane region, respectively, the vertical axis is the predicted probability for the region, and the horizontal axis is the amino acid position;

FIG. 10 is a signal peptide structure prediction diagram for a transcription regulatory factor in the identified Dunaliella strain Dq_SX; C-score, S-score and Y-score are the cleavage site score, the signal peptide score and the combined score, respectively;

FIG. 11 is a Venn diagram of the metabolic pathways of D. quartolecta and Dq_SX; the intersection is the set of metabolic pathways shared by the two algal strains, and the metabolic pathways of both strains were predicted from the KEGG database, i.e. the Kyoto Encyclopedia of Genes and Genomes (Japan);

FIG. 12 is an enrichment bubble plot of the top 20 metabolic pathways unique to D. quartolecta; the metabolic pathway information is from the KEGG database, i.e. the Kyoto Encyclopedia of Genes and Genomes (Japan); a larger bubble means more genes are involved in the pathway, a darker bubble means higher confidence in the pathway (lower Q value), and the degree of enrichment (significance) is expressed as the enrichment ratio, which is the number of genes divided by the total number of genes annotated by the KEGG pathway;

FIG. 13 is an enrichment bubble plot of the top 20 metabolic pathways unique to the identified strain Dq_SX; the metabolic pathway information is from the KEGG database, i.e. the Kyoto Encyclopedia of Genes and Genomes (Japan); a larger bubble means more genes are involved in the pathway, a darker bubble means higher confidence in the pathway (lower Q value), and the degree of enrichment (significance) is expressed as the enrichment ratio, which is the number of genes divided by the total number of genes annotated by the KEGG pathway;

FIG. 14 is a GO enrichment analysis histogram of the top 20 significantly enriched metabolic pathways of D. quartolecta; GO is the database established by the Gene Ontology Consortium; the more GO entries and the higher the corresponding -log10(Q value) (the higher the confidence), the greater the involvement of the genes in the biological function;

FIG. 15 is a GO enrichment analysis histogram of the top 20 significantly enriched metabolic pathways of the identified strain Dq_SX; GO is the database established by the Gene Ontology Consortium; the more GO entries and the higher the corresponding -log10(Q value) (the higher the confidence), the greater the involvement of the genes in the biological function;

FIG. 16 is a phylogenetic tree constructed from the ITS genes of 21 Dunaliella algae; the tree was constructed by the maximum likelihood method with the bootstrap value set to 1000, and the data at each branch node represent the support rate and the genetic similarity percentage, respectively;

FIG. 17 is a phylogenetic tree constructed from SSR markers of 21 Dunaliella algae; the tree was constructed by the maximum likelihood method with the bootstrap value set to 1000, and the data at each branch node represent the support rate and the genetic similarity percentage, respectively;

FIG. 18 is a phylogenetic tree constructed from genome SNPs of 21 Dunaliella algae; the tree was constructed by the maximum likelihood method with the bootstrap value set to 1000, and the data at each branch node represent the support rate and the genetic similarity percentage, respectively.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

A method for whole genome sequencing of D. quartolecta and de novo assembly of its core genome sequence fragments comprises the following steps:

step 1, picking single clones of algal cells of a D. quartolecta strain under aseptic conditions and, after they pass microscopic examination, carrying out indoor expanded culture under aseptic conditions, wherein the indoor expanded culture conditions are: photoperiod 18 h : 6 h, light intensity 19000 lx, temperature 23 ± 3 ℃, with a sterile ventilated environment maintained; shaking the culture dish every 5 days to prevent the algal cells from adhering to the walls, and taking 0.5-1 mL of algal solution for microscopic examination; the following culture medium is prepared for the indoor expanded culture of the algal strain, with the formula:

30 g/L NaCl, 1.5 g/L NaNO₃, 1.4 g/L K₂HPO₄, 1.75 g/L MgSO₄·7H₂O, 1.36 g/L CaCl₂·7H₂O, 1.2 g/L Na₂CO₃, 0.006 g/L FeC₆H₅O₇, 0.005 g/L NaH₂PO₄·2H₂O, 0.5 g/L Co(NO₃)₂·6H₂O, 0.8 g/L CuSO₄·5H₂O, 2.3 g/L ZnSO₄·7H₂O, 0.03 g/L H₃BO₃, 4.0 g/L Na₂MoO₄·2H₂O, 0.02 g/L MnCl₂·4H₂O, 0.5 g/L VB₁, 0.5 g/L VB₁₂, 0.5 g/L VH, made up to 1 L with ultrapure water;

step 2, extracting the whole genome DNA of D. quartolecta by the improved CTAB method of the invention, ensuring that the DNA concentration is not lower than 150 ng/μL, that OD₂₆₀/OD₂₈₀ is between 1.8 and 1.9, and that the DNA is free of protein, salt ion and RNA contamination; the specific procedure is as follows: taking 600-800 mg of algal cells from the indoor expanded culture, centrifuging at 8000 r/min and 4 ℃ for 1.5 min, adding liquid nitrogen and grinding for 15 sec, adding 800 μL of 2% (w/v) CTAB solution preheated at 20 ℃ and 1 μL of 1% (v/v) beta-mercaptoethanol, mixing well, incubating in a 60 ℃ water bath for 1.5 h with shaking once every 20 min, adding 800 μL of Tris-saturated phenol, mixing well, centrifuging at 12000 r/min and 4 ℃ for 2.5 min, taking the supernatant, adding a mixture of Tris-saturated phenol, chloroform and isoamyl alcohol in a volume ratio of 25:24:2, vortexing, standing at 4 ℃ for 10 min, mixing 2-3 times, adding 800 μL of ddH₂O treated with 0.1% (v/v) DEPC, incubating in a 60 ℃ water bath for 30 min, centrifuging at 12000 r/min and 4 ℃ for 4 min, taking the supernatant, adding 150 mL of 3 mol/L sodium acetate and 250 mL of absolute ethanol at 4-5 ℃, precipitating at -20 ℃ for 50 min, centrifuging at 10000 r/min and 4 ℃ for 3 min, discarding the supernatant, adding 1 mL of 70% (v/v) ethanol solution precooled to 4-5 ℃, vortexing for 20 sec, discarding the supernatant, evaporating the remaining liquid in a nucleic acid vacuum drying system, and adding an appropriate amount of 100× TE buffer (10 mmol/L Tris-HCl, 1 mmol/L EDTA) to dissolve the precipitate;

step 3, fragmenting the whole genome DNA 5 times at high energy (80-100 W) with a non-contact ultrasonic crusher (6 sec/time, On/6 s Off, once every 3 sec) to obtain short DNA fragments of the required length (300-400 bp);

step 4, recovering the DNA fragments from a 1.5% TBE agarose gel and purifying and size-selecting them with magnetic beads (Agencourt AMPure XP Beads are used in the invention), further screening to obtain a sample of 300-400 bp, and checking the quality of the sample to ensure that the genomic DNA meets the quality standard of step (1);

step 5, repairing the ends of the qualified DNA sample with T4 DNA polymerase and Klenow polymerase to prepare blunt ends, and adding an A base at the 3' end; preparing a ligation reaction system: 1 μL of T4 DNA ligase, 1 μL of T vector, 5 μL of 1× ligation reaction buffer, 5 μL of linker (10 μmol/L), 5 μL of DNA sample, and sterile water to a final volume of 20 μL; incubating in a 16 ℃ water bath overnight to obtain the ligation product, and purifying the product according to the instructions of the Agencourt AMPure XP kit; after competent cell transformation and blue-white screening, subjecting the purified product to bacterial-liquid PCR verification and sequencing (this step can be carried out by a sequencing company), selecting positive clones, and checking the amplification products with an Agilent 2100 Bioanalyzer; after the positive amplification product is denatured at 96 ℃ for 30 sec, preparing a DNA circularization system: 2 μL of DNA sample, 4 μL of 5× Rapid ligation buffer, 1 μL of ligase, and double distilled water to a final volume of 20 μL; incubating the system in a 25 ℃ water bath for 15 min, then adding a linear DNA digestion enzyme and digesting for 10 min to obtain the DNA sequencing library; measuring the library concentration with an Agilent SureSelectQXT WGS instrument, ensuring that the library concentration does not exceed 2 nmol/L and the volume is not less than 12 μL;

Step 6, performing gradient PCR on the sequencing library obtained in step 5, with the following amplification system: μL of the library sample to be tested, 1 μL of each primer of the pair (a second-generation sequencing adapter primer kit may be used), 0.5 μL of DNA polymerase, 2.5 μL of dNTPs, 1.5 μL of MgCl₂, 2.5 μL of buffer, and ddH₂O to a final volume of 25 μL; the PCR amplification program is: 96 ℃ for 3 min; 40 cycles of 96 ℃ for 30 sec, annealing with the temperature decreased by 1 ℃ every 0.5 sec down to 56 ℃, and 72 ℃ for 45 sec; 72 ℃ for 8 min; storage at 4 ℃; the amplified fragments are subjected to high-throughput sequencing by combinatorial probe-anchor synthesis (cPAS), and this step is carried out by a sequencing company with the relevant technical qualification;

Step 7, filtering the raw sequencing data of D. tertiolecta obtained in step 6: low-quality sequencing data (short sequences with a length of less than 5 kb, sequences with an average quality of less than 8, and linker sequences) are filtered out with NGS QC Toolkit 2.3.3, the resulting high-quality sequencing data are stored in FASTQ format in a file named Dq.fq, and core fragment screening and assembly are performed on the D. tertiolecta whole genome sequencing data (Dq.fq).
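The filtering logic of this step can be illustrated with a minimal Python sketch. It is not the NGS QC Toolkit itself: the function names, the Phred+33 quality encoding and the way the linker is matched (exact substring) are assumptions made for illustration, and only the "average quality less than 8" cut-off comes from the text.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: List[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        yield lines[i].strip(), lines[i + 1].strip(), lines[i + 3].strip()


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a read; Phred+33 encoding is an assumption."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(reads: Iterable[Read], min_len: int,
                 min_mean_q: float = 8.0, linker: str = "") -> List[Read]:
    """Drop reads that are too short, of low mean quality, or contain the linker."""
    kept = []
    for header, seq, qual in reads:
        if len(seq) < min_len:
            continue  # short sequence
        if mean_quality(qual) < min_mean_q:
            continue  # average quality below the cut-off
        if linker and linker in seq:
            continue  # linker (adapter) contamination
        kept.append((header, seq, qual))
    return kept
```

A call such as `filter_reads(parse_fastq(open("Dq_raw.fq").read().splitlines()), min_len=5000)` would apply the length cut-off stated in the step; the file name here is hypothetical.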

Step 8, the specific steps of core genome fragment screening and assembly are as follows: a sequencing dataset with a sequencing depth of 50-80X, an average length of 12-15 K and an N50 length greater than 18 K is screened from the D. tertiolecta sequencing data (Dq.fq) and mapped back to the Dunaliella salina (D. salina) reference genome Dunsal1 v.2; quality control is performed on the mapping result with Picard software, with the alignment rate set to ≥ 90% and the alignment parameter set to 1e-10, and the sequences meeting these conditions are taken as candidate data for the D. tertiolecta genome core sequence; the remaining D. tertiolecta genome sequencing data are compared with the core sequence candidate data by BLASTn using Burrows-Wheeler Aligner (BWA) software with the comparison parameter set to 1e-8, error correction is performed with Falcon software, the overlapping regions between the aligned data are obtained, and contig assembly is performed with SOAPdenovo 2.04 software, with the following program commands:

# maximal read length
max_rd_len=100
[LIB]
# average insert size
avg_ins=300
# if sequence needs to be reversed
reverse_seq=0
# in which part(s) the reads are used
asm_flags=3
# use only first 100 bps of each read
rd_len_cutoff=100
# in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
# minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
# a pair of fastq file, read 1 file should always be followed by read 2 file
q1=/vol3/agis/xiaoyutao_group/wangjingchun/yanzao/Dq_1.fq
# SOAPdenovo-63mer all -s config.txt -p 10 -K 55 -M 3 -F -u -o
# SOAPdenovo-63mer all -s config.txt -p 40 -K 27 -D 1 -N 500m -o ./result/MDCZ_27 > MDCZ_27.log
SOAPdenovo-63mer all -s /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/soapdenovo/config.txt -p 10 -K 55 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/soapdenovo/test
qsub -l nodes=1 -q queue8 ./soap.sh

Step 9, reassembling the contigs with ABySS 2.2.3 software, with the following program commands:

conda install -c conda-forge -c bioconda -c defaults ABySS
ABYSS -k 31 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/ABySS/31_contigs.fa /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/Dq.fq
qsub -l nodes=1 -q queue6 ./ABySS.sh

Step 10, evaluating the quality of the assembled D. tertiolecta genome sequence with BUSCO 2.0 software, and selecting the assembly with a complete gene ratio of ≥ 20%, a single-copy gene ratio of 15%, a multi-copy gene ratio of ≥ 12% and a missing/gap ratio of ≤ 3% as the D. tertiolecta core genome sequence. The program command is:

python /public/home/wangjingchun/miniconda2/bin/run_BUSCO.py -i /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/02busco/Dq_contig.fa -m geno -l /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/02busco/eukaryota_odb10 -o results_Dq
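The selection thresholds of this step can be written as a small check. The sketch below is only illustrative: the class and function names are hypothetical, and the single-copy figure of 15% is read here as a lower bound, which is an assumption because the text does not state a direction for that threshold.

```python
from dataclasses import dataclass


@dataclass
class BuscoSummary:
    """BUSCO percentages for one assembly (hypothetical container)."""
    complete: float
    single_copy: float
    multi_copy: float
    missing: float


def is_core_assembly(s: BuscoSummary) -> bool:
    """Apply the thresholds stated in step 10.

    Assumption: the single-copy ratio of 15% is treated as a minimum.
    """
    return (
        s.complete >= 20.0
        and s.single_copy >= 15.0
        and s.multi_copy >= 12.0
        and s.missing <= 3.0
    )


# Values reported for the assembly in this example.
reported = BuscoSummary(complete=23.65, single_copy=15.18,
                        multi_copy=13.76, missing=1.89)
print(is_core_assembly(reported))
```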

Step 11, performing functional gene CDS prediction on the screened core genome assembly data with Augustus 3.3.3 software, with the following program command:

augustus --strand=both --genemodel=partial --singlestrand=false --protein=on --introns=on --start=on --stop=on --cds=on --codingseq=on --alternatives-from-evidence=true --gff3=on --UTR=false --outfile=/vol3/agis/xiaoyutao_group/wangjingchun/yanzao/04gene/Dqaugustus/out.gff --species=volvox /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/04gene/Dq/Dq_masked.fa

Step 12, constructing a circular map of the core genome of the alga with Circos software, with the following configuration:

# circos.conf
karyotype = data/karyotype/karyotype.Dq.txt
<ideogram>
<spacing>
default = 0.005r
</spacing>
radius = 0.9r
thickness = 20p
fill = yes
</ideogram>
# The remaining content is standard and required. It is imported
# from default files in the Circos distribution.
# These should be present in every Circos configuration file and
# overridden as required. To see the content of these files,
# look in etc/ in the Circos distribution.
<image>
# Included from Circos distribution.
<<include etc/image.conf>>
</image>
# RGB/HSV color definitions, color lists, location of fonts, fill patterns.
# Included from Circos distribution.
<<include etc/colors_fonts_patterns.conf>>
# Debugging, I/O and other system parameters
# Included from Circos distribution.
<<include etc/housekeeping.conf>>

According to the genome assembly quality evaluation results, a core genome sequence can be screened from D. tertiolecta: its size is 6592916 bp, the number of contigs is 3000, the maximum contig length is 1133322 bp, the average contig length is 2197.64 bp, the contig N50 is 15270, the proportion of complete genes is 23.65%, the proportion of single-copy genes is 15.18%, the proportion of multi-copy genes is 13.76%, the proportion of gaps/deletions is 1.89%, and the predicted CDS proportion is 38.03%. The core genome circular map is shown in FIG. 1.
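The assembly indexes quoted above (total size, number of contigs, maximum and average contig length, N50) follow from the list of contig lengths. A minimal sketch of how they can be computed is given below; the function names are hypothetical and the short length list is toy data, not the actual assembly.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(lengths: List[int]) -> Dict[str, float]:
    """Summary statistics of a contig set."""
    return {
        "total_bp": sum(lengths),
        "contigs": len(lengths),
        "max_bp": max(lengths),
        "mean_bp": round(sum(lengths) / len(lengths), 2),
        "n50": n50(lengths),
    }


# Toy contig lengths (hypothetical), for illustration only.
print(assembly_stats([10, 9, 8, 7, 6, 5, 4, 3, 2]))

# The reported average is consistent with the reported size and contig count.
print(round(6592916 / 3000, 2))  # 2197.64
```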

Example 2

A method for strain identification using the core genome sequence of Dunaliella tertiolecta (D. tertiolecta), comprising the following steps:

step 1, sample collection, purification and culture: collecting the algal strain to be tested (tentatively named Dunaliella sp.), purifying it, and then carrying out indoor expanded culture, specifically: picking single clones of algal cells of the strain to be tested under aseptic conditions and, after they pass microscopic examination, carrying out indoor expanded culture under aseptic conditions with the following culture conditions: photoperiod 18 h:6 h, light intensity 19000 lx, temperature 23 ± 3 ℃ in a sterile ventilated environment; the culture dish is shaken every 5 days to prevent the algal cells from adhering to the wall, and 0.5-1 mL of algal solution is taken for microscopic examination; the following culture medium is prepared for the indoor expanded culture of the strain to be tested:

30 g/L NaCl, 1.5 g/L NaNO₃, 1.4 g/L K₂HPO₄, 1.75 g/L MgSO₄·7H₂O, 1.36 g/L CaCl₂·7H₂O, 1.2 g/L Na₂CO₃, 0.006 g/L FeC₆H₅O₇, 0.005 g/L NaH₂PO₄·2H₂O, 0.5 g/L Co(NO₃)₂·6H₂O, 0.8 g/L CuSO₄·5H₂O, 2.3 g/L ZnSO₄·7H₂O, 0.03 g/L H₃BO₃, 4.0 g/L Na₂MoO₄·2H₂O, 0.02 g/L MnCl₂·4H₂O, 0.5 g/L VB₁, 0.5 g/L VB₁₂ and 0.5 g/L VH, made up to 1 L with ultrapure water; the algal strain obtained by the expanded culture is divided into 4 samples (Nos. 1 to 4).

Step 2, extracting whole genome DNA: algal solution in the mature period (about 30 days; FIG. 2) is taken from each sample and centrifuged at 4 ℃ for 1.5 min (8000 r/min) to enrich the algal cells, which are quick-frozen in liquid nitrogen and rapidly ground for 15 sec; whole genome DNA is then extracted from each sample by the improved CTAB method, with the following procedure: 800 μL of 2% (W/V) CTAB solution preheated at 20 ℃ and 1 μL of 1% (V/V) β-mercaptoethanol are added to the ground powder and mixed gently, followed by a 60 ℃ water bath for 1.5 h; 800 μL of Tris-saturated phenol is added and mixed gently, the mixture is centrifuged at 12000 r/min and 4 ℃ for 2.5 min, the supernatant is taken, a mixed solution of Tris-saturated phenol, chloroform and isoamyl alcohol at a volume ratio of 25:24:2 is added, and the mixture is vortexed, left to stand at 4 ℃ for 10 min and mixed gently 2-3 times; 800 μL of ddH₂O treated with 0.1% (V/V) DEPC is added, followed by a 60 ℃ water bath for 30 min and centrifugation at 12000 r/min and 4 ℃ for 4 min; the supernatant is taken, 150 mL of 3 mol/L sodium acetate and 250 mL of absolute ethanol precooled at 4-5 ℃ are added, and the DNA is precipitated at -20 ℃ for 50 min and centrifuged at 10000 r/min and 4 ℃ for 3 min; the supernatant is discarded, 1 mL of 70% (V/V) ethanol solution precooled at 4-5 ℃ is added, the mixture is vortexed for 20 sec, the supernatant is removed and the residual liquid is evaporated in a nucleic acid vacuum drying system, and 100 μL of 100× TE buffer (10 mmol/L Tris-HCl, 1 mmol/L EDTA) is added to dissolve the precipitate. The quality of the genomic DNA is checked by 1% (W/V) agarose gel electrophoresis combined with a fluorescence quantifier, ensuring that the DNA concentration is not lower than 150 ng/μL, that the OD₂₆₀/OD₂₈₀ ratio is between 1.8 and 1.9, and that the DNA is free of protein, salt ion and RNA contamination. The agarose gel electrophoresis results (FIG. 3) show that samples No. 1 and No. 4 have a higher DNA concentration and better integrity; the fluorescent quantitative results (Table 1) also show that samples No. 1 and No. 4 have a higher DNA concentration and less contamination, and are suitable as candidate samples for the subsequent library construction.

TABLE 1 Fluorescent quantitative determination of the whole genome DNA quality of the algal samples to be identified

Sample No.	Dilution factor (X)	Sample volume (μL)	Detected concentration (ng/μL)	OD₂₆₀/OD₂₈₀
1	1	1	204.6	1.85
2	1	1	152.0	1.69
3	1	1	72.2	1.62
4	1	1	384.1	1.89
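The acceptance criteria of step 2 (concentration not lower than 150 ng/μL and an OD₂₆₀/OD₂₈₀ ratio between 1.8 and 1.9) can be applied to the Table 1 values with a short sketch. The function name and the dictionary layout are hypothetical; the numbers are those of Table 1.

```python
from typing import Dict, List, Tuple


def passes_dna_qc(conc_ng_per_ul: float, od_ratio: float,
                  min_conc: float = 150.0,
                  ratio_range: Tuple[float, float] = (1.8, 1.9)) -> bool:
    """True when both the concentration and the OD260/OD280 criteria are met."""
    low, high = ratio_range
    return conc_ng_per_ul >= min_conc and low <= od_ratio <= high


# Sample number -> (detected concentration in ng/uL, OD260/OD280), from Table 1.
table1: Dict[int, Tuple[float, float]] = {
    1: (204.6, 1.85),
    2: (152.0, 1.69),
    3: (72.2, 1.62),
    4: (384.1, 1.89),
}

qualified: List[int] = [n for n, (conc, ratio) in table1.items()
                        if passes_dna_qc(conc, ratio)]
print(qualified)  # [1, 4]
```

Sample No. 2 meets the concentration criterion but not the ratio criterion, which is consistent with only samples No. 1 and No. 4 being carried forward.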

Step 3, constructing a DNA sequencing library: about 2.0 μg of whole genome DNA is fragmented 5 times at high energy with an 80-100 W non-contact ultrasonic crusher (6 sec/time, On/6 s Off, once every 3 sec) to obtain short DNA fragments of the required length (300-400 bp); agarose gel electrophoresis is then carried out (1% agarose gel, 150 V), the gel is stained with EB after 30 min of electrophoresis, and fragments of about 300-400 bp are excised under an ultraviolet lamp and recovered; 10 μL of silica-based magnetic beads with an adsorption range of 300-400 bp (Agencourt AMPure XP Beads are used in the invention) are added to the dissolved gel recovery solution and mixed uniformly, the mixture is placed on a magnetic rack for separation, the separated beads are washed 2-3 times with 150 μL of 80% ethanol, 15 μL of 0.1× TE is added and mixed, the mixture is left at room temperature for 10 min, the centrifuge tube is placed on the magnetic rack, and the supernatant is collected after about 8 min. After the sample passes fluorescent quantitative detection, the ends of the qualified DNA sample are repaired with T4 DNA polymerase and Klenow polymerase to prepare blunt ends, and an A base is added at the 3' end; a ligation reaction system is prepared: 1 μL of T4 DNA ligase, 1 μL of T vector, 5 μL of 1× ligation reaction buffer, 5 μL of linker (10 μmol/L), 5 μL of DNA sample, and sterile water to a final volume of 20 μL. The ligation product is obtained after a 16 ℃ water bath overnight and is purified according to the instructions of the Agencourt AMPure XP kit; after transformation and screening, the purified product is subjected to bacterial-liquid PCR verification and sequencing (this step can be carried out by a sequencing company), positive clones are selected, and the amplification products are checked with an Agilent 2100 Bioanalyzer; the amplification product is denatured at 96 ℃ for 30 sec and then placed on ice, and a DNA circularization system is prepared: 2 μL of DNA sample, 4 μL of 5× Rapid ligation buffer, 1 μL of ligase, and double distilled water to a final volume of 20 μL. The system is incubated in a 25 ℃ water bath for 15 min, a linear DNA digestion enzyme is added for digestion at room temperature for 10 min, and the DNA sequencing library is obtained; the library concentration is measured with an Agilent SureSelectQXT WGS instrument to ensure that the concentration of a single library does not exceed 2 nmol/L and the volume is not less than 12 μL.

Step 4, performing gradient PCR on the sequencing library obtained in step 3, with the following amplification system: μL of the library sample to be tested, 1 μL of each primer of the pair (a second-generation sequencing adapter primer kit may be used), 0.5 μL of DNA polymerase, 2.5 μL of dNTPs, 1.5 μL of MgCl₂, 2.5 μL of buffer, and ddH₂O to a final volume of 25 μL; the PCR amplification program is: 96 ℃ for 3 min; 40 cycles of 96 ℃ for 30 sec, annealing with the temperature decreased by 1 ℃ every 0.5 sec down to 56 ℃, and 72 ℃ for 45 sec; 72 ℃ for 8 min; storage at 4 ℃; the amplified fragments are subjected to high-throughput sequencing by combinatorial probe-anchor synthesis (cPAS) to obtain the whole genome sequencing data of the strain to be tested (this step can be carried out by a sequencing company with the relevant technical qualification).

Step 5, performing quality control on the raw sequencing data of the strain to be tested obtained in step 4 (Q20 greater than 96% and GC content greater than 45%) and filtering the data: low-quality sequencing data (short sequences with a length of less than 5 kb, sequences with an average quality of less than 8, and linker sequences) are filtered out with NGS QC Toolkit 2.3.3 software with the filtering parameters set to '-l 20 -Q 0.5 -n 0.03 -A 0.28', and the resulting high-quality sequencing data (Table 2) are stored in FASTQ format in a file named Dsp.fq.

TABLE 2 Sequencing statistics of the strains to be identified after filtering

Sample No.	Number of reads after filtering	Number of bases after filtering	Read length	Q20 (%)	GC (%)
1	238,959	23,895,898	100	97.90	49.11
4	155,286	15,528,625	100	95.36	47.47

As can be seen from Table 2, the quality control results show that sample No. 1 of the strain to be identified has the better sequencing quality (higher Q20 and GC content) and can be used for the data comparison and analysis in the next step.
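The quality control criteria of step 5 (Q20 greater than 96% and GC content greater than 45%) can be checked against the Table 2 values as follows. This is a minimal sketch with a hypothetical function name; the figures are taken from Table 2.

```python
from typing import Dict, Tuple


def passes_sequencing_qc(q20_pct: float, gc_pct: float,
                         min_q20: float = 96.0, min_gc: float = 45.0) -> bool:
    """True when Q20 and GC content both exceed the step 5 thresholds."""
    return q20_pct > min_q20 and gc_pct > min_gc


# Sample number -> (Q20 %, GC %), from Table 2.
table2: Dict[int, Tuple[float, float]] = {
    1: (97.90, 49.11),
    4: (95.36, 47.47),
}

for sample, (q20, gc) in table2.items():
    verdict = "pass" if passes_sequencing_qc(q20, gc) else "fail"
    print(f"sample {sample}: Q20={q20}%, GC={gc}% -> {verdict}")
```

Under these thresholds only sample No. 1 passes, since the Q20 of sample No. 4 is below 96%.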

Step 6, the genome sequencing data (Dsp.fq) of the strain to be identified obtained in step 5 are collected together with the genome sequencing data of 5 representative algae published in the NCBI database, namely stonewort (Chara braunii), Chlamydomonas eustigma, Microcystis aeruginosa, Microcystis paniformis and Volvox carteri; with the D. tertiolecta core genome sequence assembled in Example 1 as the reference, the genome data of these algae are compared with the D. tertiolecta core genome data using LASTZ 1.02.00 software, the genotype of each species corresponding to D. tertiolecta is extracted from the resulting collinearity blocks (A and B in FIG. 4), and the genotype information is merged, extracted and filtered (missing rate of ≤ 20% after filtering).

Step 7, with the D. tertiolecta core genome sequence as the reference, Single Nucleotide Polymorphism (SNP) and insertion/deletion (InDel) sites among the species in step 5 are detected using BWA 0.7.17 software; the program commands for the data of the strain to be tested are as follows:

# build the index
bwa index -a bwtsw Dq
bwa aln -t 2 -f /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp_R1.sai /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dq.fna /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/00data/Dsp_1.fq
bwa aln -t 2 -f /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp_R2.sai /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dq.fna /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/00data/Dsp_2.fq
bwa sampe -f /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.sam /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dq.fna /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp_R1.sai /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp_R2.sai /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/00data/Dsp_1.fq /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/00data/Dsp_2.fq
samtools view -@ 20 -b -S /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.sam -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.bam
samtools sort -@ 20 -m 150G /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.bam -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.sort.bam
samtools rmdup -S /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.sort.bam /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.rmdup.bam
samtools index /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.rmdup.bam /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.rmdup.bam.bai
samtools mpileup -gf /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dq.fna /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.rmdup.bam > /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.bcf
bcftools view -A ./Dsp.bcf > Dsp.vcf
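Because the commands above repeat the same absolute paths many times, the intermediate file names are easy to get out of step. The sketch below shows one possible way to derive the alignment and de-duplication commands from a few parameters so that each output feeds the next command. It is an assumption-laden illustration: the function name and its arguments are hypothetical, the commands are only built as argument lists and not executed, and only options that appear in the listing above are used.

```python
from pathlib import PurePosixPath
from typing import List


def build_alignment_commands(ref: str, fq1: str, fq2: str, outdir: str,
                             sample: str = "Dsp",
                             threads: int = 20) -> List[List[str]]:
    """Return the bwa/samtools argument lists for one paired-end sample."""
    out = PurePosixPath(outdir)
    sai1 = str(out / f"{sample}_R1.sai")
    sai2 = str(out / f"{sample}_R2.sai")
    sam = str(out / f"{sample}.sam")
    bam = str(out / f"{sample}.bam")
    sorted_bam = str(out / f"{sample}.sort.bam")
    rmdup_bam = str(out / f"{sample}.rmdup.bam")
    t = str(threads)
    return [
        ["bwa", "aln", "-t", "2", "-f", sai1, ref, fq1],
        ["bwa", "aln", "-t", "2", "-f", sai2, ref, fq2],
        ["bwa", "sampe", "-f", sam, ref, sai1, sai2, fq1, fq2],
        ["samtools", "view", "-@", t, "-b", "-S", sam, "-o", bam],
        ["samtools", "sort", "-@", t, "-m", "150G", bam, "-o", sorted_bam],
        ["samtools", "rmdup", "-S", sorted_bam, rmdup_bam],
        ["samtools", "index", rmdup_bam, rmdup_bam + ".bai"],
    ]


# Hypothetical relative paths, for illustration only.
for cmd in build_alignment_commands("Dq.fna", "Dsp_1.fq", "Dsp_2.fq", "Dsp_results"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a system where bwa and samtools are installed.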

Step 8, when the sequencing read length is greater than 100, the BWA mem algorithm is used in place of aln, and the program commands are as follows:

samtools view -@ 20 -b -S ./result/SRR2602391.sam -o ./result/Dsp.fq.bam
samtools sort -@ 20 -m 150G ./result/Dsp.fq.bam -o ./result/Dsp.fq.sort.bam
samtools rmdup -S ./result/Dsp.fq.sort.bam ./result/Dsp.fq.rmdup.bam
samtools index ./result/SRR2602391.rmdup.bam ./result/Dsp.fq.rmdup.bam.bai
samtools mpileup -gf ./database/grape.fa ./result/*.rmdup.bam > Vitis_2.bcf
bcftools call -Avm Vitis.bcf > Vitis.vcf

Step 9, SNP and InDel detection for the genomes of the other representative algae uses the same programs and algorithms as for the strain to be identified.

Step 10, effective Single Nucleotide Polymorphism (SNP) and insertion/deletion (InDel) data are detected using BWA 0.7.17 software; when detecting SNPs and InDels, repeated segments are first marked and ignored, the regions near InDels are then realigned, and the SNPs and InDels are finally obtained by screening. The statistics for the strain to be identified, Dq_SX (Table 3), show that the SNPs in its genome are mainly nucleotide transitions, and that transversions mainly occur between adenine (A) and thymine (T).

TABLE 3 Statistics of SNPs and InDels of the strain to be identified

Species	Strain to be identified (Dunaliella sp.)
Number of SNPs	968,450
Number of InDels	61,140
SNP type 1	T→C transition (number: 167,620)
SNP type 2	A→G transition (number: 167,120)
SNP type 3	G→A transition (number: 167,060)
SNP type 4	C→T transition (number: 266,320)
SNP type 5	A→T transversion (number: 200,330)
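The statement that transitions dominate can be quantified from Table 3 with a transition/transversion (Ts/Tv) ratio. The sketch below uses the Table 3 counts; the function name and the "X>Y" key notation are hypothetical conventions chosen for illustration.

```python
from typing import Dict, Tuple

# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {"A>G", "G>A", "C>T", "T>C"}

# Counts from Table 3.
snp_counts: Dict[str, int] = {
    "T>C": 167_620,
    "A>G": 167_120,
    "G>A": 167_060,
    "C>T": 266_320,
    "A>T": 200_330,
}


def ts_tv(counts: Dict[str, int]) -> Tuple[int, int, float]:
    """Return (transitions, transversions, Ts/Tv ratio)."""
    ts = sum(n for change, n in counts.items() if change in TRANSITIONS)
    tv = sum(counts.values()) - ts
    return ts, tv, ts / tv


ts, tv, ratio = ts_tv(snp_counts)
print(ts, tv, round(ratio, 2))  # 768120 200330 3.83
```

The five categories add up to the 968,450 SNPs reported in the table, and transitions outnumber transversions by a factor of about 3.8.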

Step 11, using EasySpeciesTree 1.0 software, a phylogenetic tree is constructed from the obtained effective SNP data (FIG. 5) to further determine the genetic relationship between the algal strain to be tested and D. tertiolecta; the maximum likelihood algorithm is adopted with a bootstrap value of 1000, and the program commands are set as follows:

orthofinder -f orthsp1 -M msa -S diamond -t 16 -a 16
orthofinder -f /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/06tree -M msa -S diamond -t 10 -a 10 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/06tree/results
# input file -in1: second column of OrthoFinder/Results_Sep25/WorkingDirectory/SpeciesIDs.txt; -in4: cat
# input files -in2 and -in3 come from xxx/orthsp1/OrthoFinder/Results_Sep25/Orthologs
# -in2
cp Orthogroups_SingleCopyOrthologues.txt ../../../easy/SingleCopyOrthologues.txt
# -in3
cp Orthogroups.tsv ../../../easy/Orthogroups.csv
python2.7 /vol1/agis/xiaoyutao_group/wangjingchun/software/EasySpeciesTree/EasySpeciesTree.py -in1 SpeciesIDs.txt -in2 SingleCopyOrthologues.txt -in3 Orthogroups.csv -in4 all.pep.fa -t 2

Step 12, whether the algal strain to be tested belongs to D. tertiolecta is determined from the branch support rate and the percentage of genetic similarity in the constructed phylogenetic tree: when the support rate between the strain to be tested and D. tertiolecta is 0.99-1.00, the similarity percentage is ≥ 99% and the genome coverage is ≥ 55%, the strain is determined to be D. tertiolecta. As can be seen from FIG. 4, the support rate between the strain to be identified (Dunaliella sp.) and D. tertiolecta is 1.00, the similarity percentage is 100% and the genome coverage is 56.8%, so the strain can be identified as D. tertiolecta.
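The decision rule of step 12 combines three thresholds and can be expressed as a single predicate. The following is a minimal sketch; the function name and argument names are hypothetical, while the thresholds and the example values are those given in the step.

```python
def is_target_strain(support: float, similarity_pct: float,
                     coverage_pct: float) -> bool:
    """Apply the step 12 criteria.

    support        -- branch support between the test strain and the reference (0-1)
    similarity_pct -- percentage of genetic similarity
    coverage_pct   -- genome coverage in percent
    """
    return (
        0.99 <= support <= 1.00
        and similarity_pct >= 99.0
        and coverage_pct >= 55.0
    )


# Values reported for the strain to be identified in this example.
print(is_target_strain(support=1.00, similarity_pct=100.0, coverage_pct=56.8))
```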

Example 3

Analysis of the genetic variation and evolutionary characteristics of the genome of an identified algal strain, Dq_SX, with the D. tertiolecta core genome data as the reference, comprising the following steps:

step 1, following the method of the invention for screening and assembling D. tertiolecta core genome sequencing data (see Example 1), core fragment screening and de novo assembly are performed with SOAPdenovo 2.04 software on the whole genome sequencing data of an identified Dunaliella strain (tentatively named Dq_SX; the method is described in Examples 1 and 2); the main indexes of the screened and assembled core genome sequence of this strain are shown in Table 4.

Step 2, collinearity analysis is performed with LASTZ 1.02.00 software on the Dq_SX core genome assembly data constructed in step 1, to obtain the duplicated segments arising from duplication events between different regions of the genome of this species (FIG. 6).

Step 3, with the D. tertiolecta core genome sequence constructed by the invention as the reference template, it is compared with the Dq_SX core genome data assembled in step 1 using TBtools software, and the homologous genes between the two are screened from the comparison result using OrthoFinder 2.3.11 software, with the screening conditions set as: p-value < 10^-50 and score > 80.

Step 4, with the homologous gene information screened in step 3 as the analysis dataset, synonymous and non-synonymous mutation sites are detected with PAML 4.8 software, the non-synonymous substitution rate (Ka) and the synonymous substitution rate (Ks) are calculated, and the evolutionary selection pressure on the identified strain Dq_SX is estimated from the Ka/Ks value (FIG. 7).

TABLE 4 Core genome assembly data of the identified Dunaliella strain Dq_SX and their quality evaluation

As can be seen from Table 4, the core genome assembly of the identified Dunaliella strain Dq_SX is relatively complete, with incomplete fragments accounting for only 16.12% and gaps or deletions for only 1.54%. As can be seen from FIG. 6, the identified strain Dq_SX may have undergone a large number of duplication events between different regions of its genome during evolution, with 1007 pairs of segments involved, which suggests the complexity of the evolutionary process of this species. As can be seen from FIG. 7, relative to the D. tertiolecta core genome constructed according to the invention, 80.52% of the genes in the core genome of the identified strain Dq_SX have a Ka/Ks ratio of less than 1.0 (the mean Ka/Ks is 0.47; the frequency is highest, 0.108, when the Ka/Ks ratio is in the range of 0.35-0.45), suggesting that most genes of this strain were under purifying selection during evolution.
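The interpretation of Ka/Ks used above follows the usual convention: a ratio below 1 indicates purifying selection, a ratio above 1 indicates positive selection, and a ratio of 1 indicates neutral evolution. A minimal sketch of this classification is shown below; the function names are hypothetical and the Ka and Ks values in the example are invented for illustration, not taken from the Dq_SX data.

```python
from typing import Iterable, Tuple


def classify_selection(ka: float, ks: float) -> str:
    """Classify one gene pair by its Ka/Ks ratio."""
    if ks <= 0:
        return "undefined"  # the ratio cannot be computed without synonymous changes
    ratio = ka / ks
    if ratio < 1.0:
        return "purifying"
    if ratio > 1.0:
        return "positive"
    return "neutral"


def fraction_purifying(pairs: Iterable[Tuple[float, float]]) -> float:
    """Share of gene pairs with Ka/Ks below 1."""
    labels = [classify_selection(ka, ks) for ka, ks in pairs]
    return labels.count("purifying") / len(labels)


# Hypothetical (Ka, Ks) values for four homologous gene pairs.
example = [(0.047, 0.1), (0.2, 0.1), (0.03, 0.1), (0.05, 0.1)]
print(fraction_purifying(example))  # 0.75
```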

Example 4

Repeat sequence prediction, functional annotation of the predicted proteins and structural feature analysis of the core genome of the identified Dunaliella strain Dq_SX, comprising the following steps:

step 1, repeat sequence analysis is performed with RepeatMasker 4.0.9 software on the Dq_SX core genome assembly data identified in Example 3; a database of the sequences to be tested is first constructed (BuildDatabase -name Dq_SX_contig.fa), and the program commands are set as follows:

RepeatModeler -pa 10 -database /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/03repeat/Dq_SX/Dq_SX -engine ncbi -recoverDir /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/03repeat/Dq_SX
qsub -l nodes=1 -q queue8 ./repeatmodeler.sh

Step 2, the files consensi.fa.masked and families.stk are obtained in the output directory.

Step 3, the FASTA file contains the consensus repeat sequences obtained by training, with the family noted after the '#' in each sequence ID; a family that cannot be classified is marked as 'Unknown'. families.stk is a seed alignment file in Dfam-compatible Stockholm format, which can be uploaded to the Dfam_consensus database with the tool 'RepeatModeler/util/dfamConsensusTool.pl' shipped in the RepeatModeler installation path.

Step 4, the repeat sequences in the Dq_SX core genome are searched, with the following program commands:

RepeatMasker -pa 4 -gff -lib /public/home/wangjingchun/RM_Dq_SX/consensi.fa -dir /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/03repeat/Dq_SX/Repeatmasker/lib_result /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/03repeat/Dq_SX/Dq_SX_contig.fa
qsub -l nodes=1 -q queue8 ./repeatmasker2.sh

Step 5, the protein sequences encoded by the CDS of the core genome of the identified strain are compared by BLASTp against the non-redundant protein database (NR) using Diamond 0.9.14 software to obtain functional annotations of the proteins, with the comparison parameter set to e-value ≤ 10^-5. The program commands are set as follows:

diamond makedb --in nr_eukaryon.fasta -d nr_eukaryon_20200805
diamond blastx --db nr_eukaryon_20200805 --query reads.fq.gz --out reads.tab
diamond blastp --db nr_eukaryon_20200805 --query proteins.fasta --out nr.tab --outfmt 6 --sensitive --max-target-seqs 20 --evalue 1e-5 --id 30 --block-size 20.0 --tmpdir /dev/shm --index-chunks 1

Step 6, collinearity analysis of the repeated segments of the core genome of the identified strain is performed with MCScanX software, with the following program commands:

makeblastdb -in Dq_SX.fa -dbtype prot -out Dq_SX
blastp -query /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/07circos/Dq_SX.fa -db /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/07circos/Dq_SX -num_threads 10 -evalue 1e -outfmt 6 -out /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/07circos/Dq_SX.blastp
MCScanX ./Dq_SX

Step 7, because repeat sequences are poorly conserved between species, predicting the repeat sequences of a given species requires a species-specific repeat database. The sequencing assembly data of the core genome of the identified strain Dq_SX are therefore aligned against the RepBase data using RepeatMasker v4.0.6 software to find possible interspersed repeats in the strain, and the core genome data are annotated with RepeatModeler, LTR-Finder and RepeatScout software to obtain tandem repeats (including microsatellite sequences and the like).

Step 8, the redundant parts of the results are filtered out to obtain the final non-redundant repeat sequence annotation (Table 5).

Step 9, the core genome data of the identified strain Dq_SX are compared with the NR database, and the comparison results are screened (e-value < 10^-5).
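The screening of comparison results by e-value can be sketched as a filter over a 12-column tabular alignment file, which is the layout produced by `--outfmt 6` (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score). The function name is hypothetical, the identity threshold of 30 is borrowed from the `--id 30` option of the Diamond command in step 5, and the example rows are invented for illustration.

```python
import csv
import io
from typing import List, Tuple

Hit = Tuple[str, str, float, float]  # (query, subject, percent identity, e-value)


def filter_hits(tab_text: str, max_evalue: float = 1e-5,
                min_identity: float = 30.0) -> List[Hit]:
    """Keep hits with e-value below max_evalue and identity at or above min_identity."""
    hits: List[Hit] = []
    for row in csv.reader(io.StringIO(tab_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        pident = float(row[2])
        evalue = float(row[10])
        if evalue < max_evalue and pident >= min_identity:
            hits.append((row[0], row[1], pident, evalue))
    return hits


# Hypothetical alignment rows, for illustration only.
example = (
    "p1\tnr1\t85.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t350\n"
    "p2\tnr2\t25.0\t180\t40\t2\t1\t180\t5\t184\t1e-30\t120\n"
    "p3\tnr3\t60.0\t90\t20\t1\t1\t90\t3\t92\t1e-3\t45\n"
)
print(filter_hits(example))
```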

And step 10, performing COG functional annotation on the screened homologous protein sequences by utilizing eggNOG software, performing annotation on the protein sequences by using an emapper. py script in eggNOG, and performing classification statistics on the top20 (top20) protein cluster in the annotation result (FIG. 8).

Step 11, run eggNOG software to perform COG functional annotation of the homologous proteins encoded by the genes; the program command is set as follows:

python /public/home/wangjingchun/miniconda2/envs/qiime1/bin/emapper.py -i /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/new/04cog/Dq_SX_protein.fa --output /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/new/04cog/out -m diamond --data_dir /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/new/04cog/database --cpu 20

Step 12, perform transmembrane domain prediction on the top-ranked protein in the top 20 protein clusters using the online TMHMM 2.0 analysis tool (FIG. 9); predict the signal peptide of this protein using the online SignalP 4.1 analysis tool, with the number of amino acids in each protein sequence limited to no more than 6000 (FIG. 10); the output format is set to "Extensive, with graphics", and all other parameters are left at their defaults.
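The screening in steps 5 and 9 (e-value below 10^-5, identity of at least 30%) can be illustrated with a minimal Python sketch that filters tabular (outfmt 6) alignment output. The function name `filter_hits` and the three example hits are hypothetical and are not part of the described pipeline; the column positions follow the standard 12-column tabular layout.

```python
from typing import Iterable, Iterator, List


def filter_hits(lines: Iterable[str],
                max_evalue: float = 1e-5,
                min_identity: float = 30.0) -> Iterator[List[str]]:
    """Yield outfmt 6 hits whose e-value and percent identity pass the cutoffs."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete 12-column tabular record
        identity = float(fields[2])   # column 3: percent identity
        evalue = float(fields[10])    # column 11: e-value
        if evalue < max_evalue and identity >= min_identity:
            yield fields


# Hypothetical hits, for illustration only.
example = [
    "prot1\tsubjA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
    "prot2\tsubjB\t25.0\t100\t75\t0\t1\t100\t1\t100\t1e-8\t60",
    "prot3\tsubjC\t45.0\t120\t66\t0\t1\t120\t1\t120\t0.01\t30",
]
kept = [fields[0] for fields in filter_hits(example)]
print(kept)  # only prot1 passes both cutoffs
```

In practice the same cutoffs are already applied by the `--evalue` and `--id` options of the alignment command, so a filter of this kind is mainly useful for re-screening an existing result file with stricter thresholds.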

TABLE 5 Classification statistics of repetitive sequences in the core genome of the identified Dunaliella strain Dq_SX

Repetitive sequence type	Repeat size (bp)	Proportion of genome (%)
LINE	165380	0.26
LTR	118737	0.19
SINE	984126	1.57
Others	1007445	1.60
Total	2275688	3.62

As can be seen from Table 5, repetitive sequences totaling 2275688 bp were found in the core genome of the identified Dunaliella strain Dq_SX, accounting for about 3.62% of the whole genome. As can be seen from FIG. 6, among the functional proteins annotated in the Dq_SX core genome, transcriptional regulators (88) and dynein heavy chains (87) are the most numerous classes. The transmembrane domain prediction for the transcriptional regulator shows that amino acids 60-110 are probably located outside the membrane (probability about 0.8), the portion after amino acid 130 is probably located inside the membrane (probability 0.82), and the probability of a membrane-spanning region does not exceed 0.4 (FIG. 9). The signal peptide prediction for this factor (FIG. 10) shows that, around amino acids 25 to 26, the C value is at its maximum, the S value changes steeply and the Y value is at its highest, suggesting that this is a signal peptide cleavage site.
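As a quick consistency check on the figures in Table 5, the per-class sizes and percentages can be summed and compared with the reported totals. The following Python sketch uses only the values listed in the table; the variable names are illustrative.

```python
# Values copied from Table 5.
repeats_bp = {"LINE": 165380, "LTR": 118737, "SINE": 984126, "Others": 1007445}
repeats_pct = {"LINE": 0.26, "LTR": 0.19, "SINE": 1.57, "Others": 1.60}
reported_total_bp = 2275688
reported_total_pct = 3.62

total_bp = sum(repeats_bp.values())
total_pct = round(sum(repeats_pct.values()), 2)

print(total_bp == reported_total_bp)    # per-class sizes add up to the total
print(total_pct == reported_total_pct)  # per-class percentages add up to the total

# Share of each class within the repeats themselves.
for name, size in repeats_bp.items():
    print(name, f"{100 * size / total_bp:.1f}% of all repeats")
```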

Example 5

The comparative analysis of differential metabolic pathways and the mining of characteristic genes based on the core genome data of Dunaliella D. tertiolecta and the identified strain Dq_SX comprise the following steps:

Step 1, the protein sequences predicted from the core genomes of Dunaliella D. tertiolecta and of the identified strain Dq_SX from Example 3 (see Example 3 for how the Dq_SX core genome sequencing assembly data were obtained, and Example 4 for how the protein sequences were obtained) are aligned by BLASTp against the KEGG database (Kyoto Encyclopedia of Genes and Genomes, Japan) to obtain the metabolic pathways in which the gene products may participate. The program commands are set as follows:

1) diamond makedb --in /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/ko.pep.fasta -d /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/kegg

2) diamond blastp -d /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/kegg --query /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/Dq_protein.fa -f 6 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/Dq.blastp -p 30 -e 0.00005

3) diamond blastp -d /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/kegg --query /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/Dq_SX_protein.fa -f 6 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/Dq_SX.blastp -p 30 -e 0.00005

Step 2, according to the KO numbers assigned to each metabolic pathway in step 1, perform an intersection analysis of the KEGG pathway predictions for D. tertiolecta and the identified strain Dq_SX, construct a Venn diagram (FIG. 11), and screen out the metabolic pathways unique to each.

Step 3, from the top 20 unique metabolic pathways of Dunaliella D. tertiolecta and of the identified strain Dq_SX obtained in step 2 (that is, the 20 highest-ranked KEGG pathways outside the intersection in the statistics of step 2), screen out the characteristic genes with the highest degree of enrichment, i.e. the highest enrichment ratio (FIGS. 12 and 13).

Step 4, query the most highly (significantly) enriched metabolic pathway genes of Dunaliella D. tertiolecta and of the identified strain Dq_SX in the GO (Gene Ontology) database to obtain the top 20 GO functional annotation enrichment results for each (FIGS. 14 and 15), and screen characteristic genes of interest to researchers from the gene sets with a higher degree of enrichment, i.e. a larger number of GO terms and a higher corresponding -log10(Q value) (confidence).

As can be seen from FIG. 11, based on the core genome sequencing data of Dunaliella D. tertiolecta and Dq_SX, a total of 608 KEGG pathways were predicted, of which 141 are shared metabolic pathways and 467 are unique metabolic pathways (85 in Dunaliella D. tertiolecta and 382 in Dq_SX). As can be seen from FIGS. 12 and 13, the most enriched unique metabolic pathway of Dunaliella D. tertiolecta is spliceosome-associated metabolism, and the most enriched unique pathway of Dq_SX is metabolism associated with cellular component synthesis. As can be seen from FIGS. 14 and 15, most of the genes involved in the spliceosome metabolic pathway of Dunaliella D. tertiolecta have functions closely related to RNA transport, processing and synthesis, and most of the genes involved in the anabolism of Dq_SX membrane components have functions related to protein structure and processing.
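The intersection analysis of step 2 amounts to a set comparison of the pathway identifiers assigned to the reference strain and to Dq_SX. A minimal Python sketch follows; the function name `partition_pathways` and the small example identifier sets are hypothetical placeholders, while the counts checked at the end are the ones reported for FIG. 11.

```python
from typing import Set, Tuple


def partition_pathways(ref: Set[str], strain: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split two pathway sets into (shared, unique to ref, unique to strain)."""
    return ref & strain, ref - strain, strain - ref


# Hypothetical pathway identifiers, for illustration only.
ref_pathways = {"ko03040", "ko00010", "ko00020"}
strain_pathways = {"ko00010", "ko00020", "ko04144"}

shared, only_ref, only_strain = partition_pathways(ref_pathways, strain_pathways)
print(sorted(shared), sorted(only_ref), sorted(only_strain))

# Counts reported for FIG. 11: shared plus the two unique groups give the total.
n_shared, n_only_ref, n_only_strain = 141, 85, 382
print(n_only_ref + n_only_strain)             # unique pathways: 467
print(n_shared + n_only_ref + n_only_strain)  # all predicted pathways: 608
```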

Example 6

Comparative analysis of three different molecular identification techniques for Dunaliella D. tertiolecta.

Collect 20 algal strains to be tested and the Dunaliella D. tertiolecta identified in Example 1, and perform molecular identification of the strains to be tested using the ITS gene, SSR molecular markers and genome sequencing data, with the following specific steps:

Step 1, extract the genomic DNA of each strain by the improved CTAB method (see Example 1 for details), and design and synthesize Dunaliella ITS gene amplification primers as shown in SEQ ID NO.1 and SEQ ID NO.2:

SEQ ID NO.1：5'-GAAGGAGAAGTCGTAACAAG-3'；

SEQ ID NO.2：5'-CCTCCCTTATTGATATGC-3'；

Prepare the ITS gene PCR amplification system: 2.0 μL dNTPs (2 mmol/L), 1.0 μL Mg²⁺ (25 mmol/L), 1.0 μL DNA, 0.3 μL Taq enzyme (5 U/μL), 2.5 μL 10× Buffer and 1.0 μL of each of the above primers, made up to 25 μL with ddH₂O. Set the PCR program: 95 ℃ for 3 min; 35 cycles of 95 ℃ for 30 sec, 52 ℃ for 40 sec and 72 ℃ for 1 min; then a final extension at 72 ℃ for 10 min. Check the products by 1.2% agarose gel electrophoresis, collect the specific amplification product of 800-1000 bp, and send it to a sequencing company for sequencing.

Step 2, based on the sequencing results returned by the sequencing company, construct the ITS gene phylogenetic tree of the 21 algal strains by the maximum likelihood method in MEGA 5.0 software with the bootstrap value set to 1000, and identify D. tertiolecta among the strains to be tested from the support rate and genetic similarity percentage of each branch node in the tree (FIG. 16).

Step 3, based on 9 sets of Dunaliella tertiolecta transcriptome sequencing data obtained by the inventors (NCBI accession numbers: SRR8393723, SRR8393722, SRR8393725, SRR8393724, SRR8393727, SRR8393726, SRR8393729, SRR8393728 and SRR8393721), 15 specific markers were screened from 24311 SSR markers, and 10 pairs of polymorphic amplification primers were designed from the marker information. The primer information is as follows:

CL1007：SEQ ID NO.3：5'-CTAAATCCATGCGTTCTTCTTTC-3'；

SEQ ID NO.4：5'-ACAGTACAACCAGAGGCTTTGAA-3'；

CL1008：SEQ ID NO.5：5'-AACAATGTCACCTCTCATTTGCT-3'；

SEQ ID NO.6：5'-TCGTTTTGTTGTTGTTCTTCAAA-3'；

CL102：SEQ ID NO.7：5'-GCCAATTCCAAAAAGTTAAAATCT-3'；

SEQ ID NO.8：5'-ATTGTGGTTTTCTTCCTGGTTTT-3'；

CL1041：SEQ ID NO.9：5'-AGGCAAGCAGTGCATTTGTA-3'；

SEQ ID NO.10：5'-GGCTCTCTATGAGTCGATGTGTC-3'；

CL1047：SEQ ID NO.11：5'-GCAGTGGAAACACACTTCCTTAC-3'；

SEQ ID NO.12：5'-TCTCTCAAATCAAAGGTGCTTTC-3'；

CL1157：SEQ ID NO.13：5'-GAGATCGAACTTGAGGCTTAGAA-3'；

SEQ ID NO.14：5'-AAAATAGAAGCCATCATGAAACG-3'；

CL1160：SEQ ID NO.15：5'-GGATACAGATTTCCACACTGCTC-3'；

SEQ ID NO.16：5'-CTATCTGGCTGAAGGTCATGTTT-3'；

CL1168：SEQ ID NO.17：5'-CGTTTTTGGAACTGATTTCTTTG-3'；

SEQ ID NO.18：5'-TTCTTGTAATACATCGCAGGAAG-3'；

CL1322：SEQ ID NO.19：5'-AACAGAGGAAATTCTGATGATGC-3'；

SEQ ID NO.20：5'-CTTGCAAGAAGGAACAACTCACT-3'；

CL1627：SEQ ID NO.21：5'-GTGGTCACCAGGAAGAGACAG-3'；

SEQ ID NO.22：5'-ACGGTACTGACAGTGGAAACAAT-3'；

the sizes of the amplified products are 155bp, 131bp, 139bp, 121bp, 158bp, 136bp, 118bp, 149bp, 160bp and 127bp in sequence;

Step 4, send the SSR primers to a biotechnology company for synthesis, and prepare the SSR-PCR amplification system: 2.5 μL dNTPs (2 mmol/L), 1.2 μL Mg²⁺ (25 mmol/L), 1.0 μL DNA (obtained in step 1), 0.4 μL Taq enzyme (5 U/μL), 2.5 μL 10× Buffer and 0.8 μL of each of the above primers, made up to 25 μL with ddH₂O. The SSR-PCR program is: 94 ℃ for 5 min; 35 cycles (94 ℃ for 45 sec, 57 ℃ for 35 sec, 72 ℃ for 1 min); 72 ℃ for 8 min. Separate the amplified SSR products by electrophoresis on 4% denaturing polyacrylamide gel, silver-stain for 30 min, develop for 15 min and fix for 20 min, then score the electropherogram as '1' (band present) or '0' (band absent). Perform cluster analysis of the strains to be tested by the UPGMA method in NTSYSpc 2.2 software, and construct the SSR-marker phylogenetic tree (FIG. 17).

Step 5, taking the core genome assembly data of Dunaliella D. tertiolecta constructed in the invention as the reference, build a sequencing library from the whole genome DNA obtained in step 1; library construction can follow Example 1. The genomes of the strains to be tested are sequenced, and the sequencing fragments do not need to be assembled de novo; this step can be completed by a qualified sequencing company.

Step 6, taking the Dunaliella D. tertiolecta core genome data constructed by the invention as the reference, detect single nucleotide polymorphism (SNP) and insertion/deletion (InDel) data among the strains to be tested using BWA 0.7.17 software. When detecting SNPs and InDels, repetitive fragments are first marked and ignored, the regions around InDels are then realigned, and the SNPs and InDels are finally screened; for the program commands, see Example 2.

Step 7, using EasySpeciesTree 1.0 software, construct a phylogenetic tree from the obtained SNP data (FIG. 18) with the maximum likelihood algorithm and the bootstrap value set to 1000; for the program commands, see Example 2.

Step 8, compare and analyze the three molecular identification results; the technical advantages and disadvantages are shown in Table 6.
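The '1'/'0' band scoring described for the SSR step produces a binary matrix from which pairwise similarities are computed before UPGMA clustering. The following Python sketch shows one way such a similarity could be calculated. The choice of the Dice coefficient is an assumption (the text does not state which coefficient was used in NTSYSpc), and the strain names and band patterns are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def dice_similarity(x: Sequence[int], y: Sequence[int]) -> float:
    """Dice coefficient between two 0/1 band profiles of equal length."""
    shared = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    total = sum(x) + sum(y)
    return 2 * shared / total if total else 0.0


# Hypothetical band scores for 10 SSR loci (1 = band present, 0 = band absent).
bands = {
    "reference": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "strainA":   [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "strainB":   [0, 1, 1, 0, 1, 0, 1, 1, 0, 0],
}

for a, b in combinations(bands, 2):
    print(a, b, round(dice_similarity(bands[a], bands[b]), 2))
```

The resulting pairwise values would form the similarity matrix that a clustering program takes as input; the clustering itself is not reproduced here.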

As can be seen from FIG. 16, strain Dsp11 clusters with Dunaliella D. tertiolecta with a support rate of 0.99 and a genetic similarity of 99%, so it can be identified as D. tertiolecta. As can be seen from FIG. 17, strain Dsp4 clusters with Dunaliella D. tertiolecta with a support rate of 0.99 and a genetic similarity of 99%, and strain Dsp11 clusters with strain Dsp4 and D. tertiolecta with a support rate of 1.00 and a genetic similarity of 99%, so these strains can also be identified as D. tertiolecta. As can be seen from FIG. 18, Dsp11 and Dsp4 cluster together with D. tertiolecta with a support rate of 1.00 and a genetic similarity of 100%, and can be identified as D. tertiolecta. As can be seen from Table 6, compared with the other two molecular identification methods, performing simplified genome sequencing on the strains to be tested and obtaining SNP data with the Dunaliella core genome data constructed by the invention as the reference allows D. tertiolecta to be identified accurately within a short period (7-10 days) and at low cost, and provides abundant bioinformatic data for later in-depth research.
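The identification criterion applied to these trees (the strain clusters with the reference, the branch support rate is 0.99-1.00, and the genetic similarity is at least 99%) can be written as a simple rule. The Python sketch below is illustrative only; the function name `matches_reference` is hypothetical, and the last example uses made-up values for a strain that would not qualify.

```python
def matches_reference(clusters_with_reference: bool,
                      support: float,
                      similarity_pct: float) -> bool:
    """Apply the stated thresholds: clustering with the reference,
    branch support >= 0.99 and genetic similarity >= 99%."""
    return clusters_with_reference and support >= 0.99 and similarity_pct >= 99.0


# Values reported in the text for the ITS tree and the SNP tree.
print(matches_reference(True, 0.99, 99.0))   # ITS result for Dsp11
print(matches_reference(True, 1.00, 100.0))  # SNP result for Dsp11 and Dsp4

# Hypothetical strain with weak support, for contrast.
print(matches_reference(True, 0.80, 95.0))
```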

TABLE 6 Comparison of three molecular identification methods for Dunaliella D. tertiolecta

Example 7

The comparison between the Dunaliella D. tertiolecta core genome sequencing and assembly technique established by the invention and conventional genome sequencing techniques comprises the following steps:

Step 1, extract the genomic DNA of the Dunaliella D. tertiolecta identified in Example 3; the procedure can follow Example 1.

Step 2, send DNA samples that pass quality control (DNA concentration ≥ 150 ng/μL, bright electrophoresis band with no degradation, OD₂₆₀/OD₂₈₀ = 1.8-1.9) to a sequencing company for DNA sequencing library construction, sequencing, core fragment screening and de novo assembly, with Nanopore, PacBio and HiSeq selected in turn as the sequencing and analysis platform (this step can be entrusted to a company that has the relevant sequencing platforms).

Step 3, compare the key indexes of the independently constructed core genome sequencing fragment assembly data of Dunaliella D. tertiolecta (see the operation steps of Example 1 for details) with the assembly data from each sequencing platform obtained in step 2.

Step 4, align the D. tertiolecta core genome sequencing data obtained from each technical platform against the reference genome Dunsal1 v.2 published by NCBI, and analyze the differences between the techniques from the alignment results (Table 7).

TABLE 7 analysis of alignment results during core genome sequencing data Assembly

Step 5, use SOAPsnp software to detect single nucleotide polymorphisms (SNPs) in the uniquely aligned sequencing fragments obtained in step 4; during detection, filter out repetitive fragments, realign the regions around insertion/deletion (InDel) sites, and screen for valid high-quality SNPs. Align and cluster the short reads in the sequencing data against the reference genome to detect InDels, with the gap length set to 1 to 10 bases. The mean numbers of valid SNPs and InDels obtained by the four techniques were compared (Table 8).

TABLE 8 comparative analysis of SNP and InDel statistical results

Step 6, using the sequencing fragments obtained in step 4 together with the repetitive fragment prediction method of Example 4, calculate the proportion of repetitive sequences in the total sequencing fragments of the strain under each technical platform (Table 9).

TABLE 9 comparative analysis of repeat sequence ratios

Technique	Proportion of repetitive sequences in total sequencing fragments (%)
Autonomous technique	1.45
Nanopore	15.27
PacBio	12.99
HiSeq	3.58

As can be seen from Table 7, under the technical conditions established by the invention, the genome coverage, the aligned sequences and the identity ratio of the sequencing fragments of the tested strain are all higher than with the other three sequencing techniques. As can be seen from Table 8, the numbers of valid SNPs and InDels detected under the technical conditions of the invention are higher than with the other three techniques, and the error rate is the lowest. As can be seen from Table 9, the proportion of repetitive sequences detected under the conditions of the invention is lower than with the other three techniques. In conclusion, the overall performance of the Dunaliella D. tertiolecta core genome sequencing fragment assembly technique created by the invention is superior to that of Nanopore, PacBio and HiSeq.

While there have been shown and described what are at present considered to be the basic principles and essential features of the invention and advantages thereof, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, but is capable of other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Sequence listing

<110> Shanxi University

<120> method for strain identification based on Dunaliella core genome sequence

<160> 22

<170> SIPOSequenceListing 1.0

<210> 1

<211> 20

<212> DNA

<213> ITS gene upstream primer (ITS-F)

<400> 1

gaaggagaag tcgtaacaag 20

<210> 2

<211> 18

<212> DNA

<213> ITS gene downstream primer (ITS-R)

<400> 2

cctcccttat tgatatgc 18

<210> 3

<211> 23

<212> DNA

<213> CL1007 upstream primer (CL1007-F)

<400> 3

ctaaatccat gcgttcttct ttc 23

<210> 4

<211> 23

<212> DNA

<213> CL1007 downstream primer (CL1007-R)

<400> 4

acagtacaac cagaggcttt gaa 23

<210> 5

<211> 23

<212> DNA

<213> CL1008 upstream primer (CL1008-F)

<400> 5

aacaatgtca cctctcattt gct 23

<210> 6

<211> 23

<212> DNA

<213> CL1008 downstream primer (CL1008-R)

<400> 6

tcgttttgtt gttgttcttc aaa 23

<210> 7

<211> 24

<212> DNA

<213> CL102 upstream primer (CL102-F)

<400> 7

gccaattcca aaaagttaaa atct 24

<210> 8

<211> 23

<212> DNA

<213> CL102 downstream primer (CL102-R)

<400> 8

attgtggttt tcttcctggt ttt 23

<210> 9

<211> 20

<212> DNA

<213> CL1041 upstream primer (CL1041-F)

<400> 9

aggcaagcag tgcatttgta 20

<210> 10

<211> 23

<212> DNA

<213> CL1041 downstream primer (CL1041-R)

<400> 10

ggctctctat gagtcgatgt gtc 23

<210> 11

<211> 23

<212> DNA

<213> CL1047 upstream primer (CL1047-F)

<400> 11

gcagtggaaa cacacttcct tac 23

<210> 12

<211> 23

<212> DNA

<213> CL1047 downstream primer (CL1047-R)

<400> 12

tctctcaaat caaaggtgct ttc 23

<210> 13

<211> 23

<212> DNA

<213> CL1157 upstream primer (CL1157-F)

<400> 13

gagatcgaac ttgaggctta gaa 23

<210> 14

<211> 23

<212> DNA

<213> CL1157 downstream primer (CL1157-R)

<400> 14

aaaatagaag ccatcatgaa acg 23

<210> 15

<211> 23

<212> DNA

<213> CL1160 upstream primer (CL1160-F)

<400> 15

ggatacagat ttccacactg ctc 23

<210> 16

<211> 23

<212> DNA

<213> CL1160 downstream primer (CL1160-R)

<400> 16

ctatctggct gaaggtcatg ttt 23

<210> 17

<211> 23

<212> DNA

<213> CL1168 upstream primer (CL1168-F)

<400> 17

cgtttttgga actgatttct ttg 23

<210> 18

<211> 23

<212> DNA

<213> CL1168 downstream primer (CL1168-R)

<400> 18

ttcttgtaat acatcgcagg aag 23

<210> 19

<211> 23

<212> DNA

<213> CL1322 upstream primer (CL1322-F)

<400> 19

aacagaggaa attctgatga tgc 23

<210> 20

<211> 23

<212> DNA

<213> CL1322 downstream primer (CL1322-R)

<400> 20

cttgcaagaa ggaacaactc act 23

<210> 21

<211> 21

<212> DNA

<213> CL1627 upstream primer (CL1627-F)

<400> 21

gtggtcacca ggaagagaca g 21

<210> 22

<211> 23

<212> DNA

<213> CL1627 downstream primer (CL1627-R)

<400> 22

acggtactga cagtggaaac aat 23

Claims

1. The method for strain identification based on the core genome sequence of the dunaliella is characterized by comprising the following steps:

(1) collecting, purifying and culturing a sample: collecting an algal strain to be tested and a Dunaliella tertiolecta (D. tertiolecta) strain, purifying the algal strain to be tested, and then carrying out indoor expanded culture, with the following specific steps: picking single clones of algal cells of the strain to be tested under aseptic conditions and, after they pass microscopic examination, carrying out indoor expanded culture under aseptic conditions, wherein the indoor expanded culture conditions are: photoperiod 18 h : 6 h, illumination intensity 19000 lx, temperature 23 ± 3 ℃, in a sterile ventilated environment, with the culture dish shaken every 5 days to prevent the algal cells from adhering to the walls and 0.5-1 mL of algal solution taken for microscopic examination; and preparing the following culture medium for indoor expanded culture of the algal strain to be tested, the formula of the culture medium being:

30 g/L NaCl, 1.5 g/L NaNO₃, 1.4 g/L K₂HPO₄, 1.75 g/L MgSO₄·7H₂O, 1.36 g/L CaCl₂·7H₂O, 1.2 g/L Na₂CO₃, 0.006 g/L FeC₆H₅O₇, 0.005 g/L NaH₂PO₄·2H₂O, 0.5 g/L Co(NO₃)₂·6H₂O, 0.8 g/L CuSO₄·5H₂O, 2.3 g/L ZnSO₄·7H₂O, 0.03 g/L H₃BO₃, 4.0 g/L Na₂MoO₄·2H₂O, 0.02 g/L MnCl₂·4H₂O, 0.5 g/L VB₁, 0.5 g/L VB₁₂ and 0.5 g/L VH, made up to a constant volume of 1 L with ultrapure water;

(2) extracting whole genome DNA: extracting the whole genome DNA of the algal strain to be tested and of the D. tertiolecta strain separately by an improved CTAB method, and storing it frozen; the improved CTAB method comprises the following specific steps: taking 600-800 mg of the algae to be tested, washing with ultrapure water 2-3 times, centrifuging at 4 ℃ and 8000 r/min for 1.5 min, adding liquid nitrogen and grinding for 15 sec, adding 800 μL of 2% W/V CTAB solution preheated at 20 ℃ and 1 μL of 1% V/V β-mercaptoethanol, mixing well, incubating in a 60 ℃ water bath for 1.5 h with shaking once every 20 min, adding 800 μL of Tris-saturated phenol, centrifuging at 4 ℃ and 12000 r/min for 2.5 min, taking the supernatant, adding a mixture of Tris-saturated phenol, chloroform and isoamyl alcohol in a volume ratio of 25:24:2, vortexing, standing at 4 ℃ for 10 min, mixing 2-3 times, adding 800 μL of ddH₂O treated with 0.1% V/V DEPC, incubating in a 60 ℃ water bath for 30 min, centrifuging at 4 ℃ and 12000 r/min for 4 min, taking the supernatant, adding 150 mL of 3 mol/L sodium acetate and 250 mL of absolute ethanol precooled to 4-5 ℃, precipitating at -20 ℃ for 50 min, centrifuging at 4 ℃ and 10000 r/min for 3 min, discarding the supernatant, adding 1 mL of 70% V/V ethanol solution precooled to 4-5 ℃, vortexing for 20 sec, discarding the supernatant and evaporating the remaining liquid in a nucleic acid vacuum drying system, adding 100×TE buffer solution to dissolve the precipitate, and examining the genomic DNA by 1% W/V agarose gel electrophoresis combined with a fluorescence quantifier, ensuring that the DNA concentration is ≥ 150 ng/μL, the electrophoresis band is bright with no degradation, and OD₂₆₀/OD₂₈₀ = 1.8-1.9, with no contamination;

(5) taking the whole genome data of Dunaliella salina (D. salina) published by NCBI as the reference, aligning the whole genome sequencing data of Dunaliella D. tertiolecta obtained in step (4) against it, and obtaining the core genome sequence of Dunaliella D. tertiolecta after screening, de novo assembly and quality evaluation, wherein the size of the core genome sequence is 6592916 bp, the number of contigs is 3000, the maximum contig length is 1133322 bp, the average contig length is 2197.64 bp, the contig N50 is 15270, the proportion of complete genes is 23.65%, the proportion of single-copy genes is 15.18%, the proportion of multi-copy genes is 13.76%, the proportion of gaps/deletions is 1.89%, and the proportion of incomplete fragments is 17.45%; constructing a circular map of the de novo assembled D. tertiolecta core genome; and then performing gene component analysis, protein function annotation and genome contig collinearity analysis on the D. tertiolecta core genome sequence;

2. The method for strain identification based on a Dunaliella core genome sequence according to claim 1, wherein the specific steps of constructing the DNA sequencing library in step (3) are as follows: fragmenting the whole genome DNA by high-intensity ultrasound at 80-100 W for 6 sec per pulse, repeated once every 3 sec, for 5 ultrasonic treatments in total, with the fragmentation parameter set to 300-400 bp; subjecting the fragments to agarose gel electrophoresis and recovering the 300-400 bp target fragments from the agarose gel; adsorbing and recovering the target fragments with silica-based magnetic beads, and checking the quality of the recovered target fragments with a fluorescence quantifier; repairing the DNA ends and adding A at the 3' end; adding adapters for the ligation reaction, then purifying, transforming and PCR-verifying the ligation product; and denaturing the positive product at 95 ℃ for 20 sec, carrying out the single-stranded DNA circularization reaction, and purifying the product to construct a whole genome DNA sequencing library ready for loading onto the sequencer.

3. The method for strain identification based on a Dunaliella core genome sequence according to claim 1, wherein the specific steps of obtaining the core genome sequence of Dunaliella D. tertiolecta after screening, assembly and quality evaluation in step (5) are as follows: screening high-quality sequences from the sequencing platform, and taking fragments with a sequencing depth of 50-80X, an average length of 12-15 K and an N50 length greater than 18 K as query sequences; mapping the query sequences back onto the reported Dunaliella salina reference genome using SOAPaligner or BWA software, and further screening sequencing fragments with sequence identity ≥ 90% and an alignment E value less than 1E-10 as candidate data for the D. tertiolecta genome core sequence; aligning all remaining sequencing fragments against the candidate data set to obtain the overlapping regions between the aligned data; correcting errors in the alignment results using Falcon or Pilon software, and assembling contigs using SOAPdenovo 2.04, Mecat, HERA or Canu software; determining the order of the contigs using BySS 2.2.3, Velvet 1.2.10 or ABySS 2.2.3 software; calculating whole genome coverage using BAMStats or GATK DepthOfCoverage software, and screening contigs with reference genome coverage ≥ 50% and a continuous arrangement number ≥ 2000; evaluating the assembly quality of the screened contigs using BUSCO 2.0 or Quast software, and selecting the assembled sequence with a complete gene ratio ≥ 20%, a single-copy gene ratio of 15%, a multi-copy gene ratio ≥ 12% and a deletion/gap ratio ≤ 3% as the core genome sequence of Dunaliella D. tertiolecta; and constructing the circular map of the core genome of this species using Circos software.

4. The method for strain identification based on a Dunaliella core genome sequence according to claim 1, wherein the gene component analysis, protein function annotation and genome contig collinearity analysis performed on the Dunaliella D. tertiolecta core genome sequence in step (5) comprise the following steps: performing CDS prediction on the assembly data using Augustus 3.3.3, ESTScan 3.0.1, TransDecoder 2.0.1 or Prodigal 2.6.1 software; performing repetitive sequence analysis on the assembly data using RepeatMasker 4.0.9, RepeatProteinMask 3.2.2, LTR-FINDER, Piler 1.0.6 or RepeatScout 1.0.5 software; aligning the protein sequences encoded by the CDS against the NR database using Diamond 0.9.14 or BLASTX software and annotating their functions; and, after aligning the predicted protein sequences by BLASTp, performing genome collinearity analysis using MCScanX, LAST, Mugsy, Spines or progressiveMauve software.

5. The method for strain identification based on a Dunaliella core genome sequence according to claim 1, wherein the specific steps of constructing the phylogenetic tree using single nucleotide polymorphisms in step (6) are as follows: aligning the genome data of the algal strain to be tested and of 5-6 representative algae reported in the NCBI database against the Dunaliella D. tertiolecta core genome sequence assembled in step (5), using LASTZ 1.02.00 or Mauve 2.3.1 software; extracting, from the collinear blocks of the alignment, the genotypes of each species corresponding to the D. tertiolecta genome; merging, extracting and filtering the genotype information of all species with the D. tertiolecta core genome as the template; detecting single nucleotide polymorphism data and insertion/deletion site data using BWA 0.7.17 software; and, based on the single nucleotide polymorphism data, constructing a phylogenetic tree using the maximum likelihood algorithm in EasySpeciesTree 1.0, MEGA 5.0, TreeBeST 1.9.2, PHYLIP, Puzzle 5.2 or PHYLO-WIN software, thereby determining the genetic relationship between the algal strain to be tested and Dunaliella D. tertiolecta.

6. The method of claim 5, wherein the deletion rate of the filtering is no greater than 20%.