CN112349350B - Method for strain identification based on Dunaliella core genome sequence
- Publication number
- CN112349350B (application CN202011238521.2A)
- Authority
- CN
- China
- Prior art keywords
- dunaliella
- genome
- strain
- sequencing
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6888—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
- C12Q1/6895—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
Abstract
The invention belongs to the technical field of plant molecular identification, and particularly relates to a method for strain identification based on a Dunaliella core genome sequence. The method mainly comprises the following steps: collecting, purifying and culturing samples; extracting whole-genome DNA; constructing DNA sequencing libraries; obtaining whole-genome sequencing data of the algal strain to be tested and of Dunaliella quartolecta (D. quartolecta); screening and de novo assembling the core genome sequencing fragments of D. quartolecta, and performing gene component annotation, protein function annotation and genome contig collinearity analysis on the assembled core genome sequence; and constructing a phylogenetic tree from single nucleotide polymorphisms. When the algal strain to be tested clusters with D. quartolecta, the branch support is 0.99-1.00 and the genetic similarity is greater than or equal to 99%, the algal strain to be tested is identified as D. quartolecta.
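The identification criterion stated above can be sketched as a small predicate. This is a minimal illustration only: the `TreePlacement` record and the function name are hypothetical, and the three inputs are assumed to have been read off a phylogenetic tree and a similarity calculation produced by whatever software is used.

```python
from dataclasses import dataclass


@dataclass
class TreePlacement:
    """Hypothetical summary of where the test strain lands in the SNP tree."""
    clusters_with_reference: bool   # test strain and D. quartolecta form one cluster
    branch_support: float           # support value of that branch, on a 0-1 scale
    genetic_similarity_pct: float   # genetic similarity to D. quartolecta, in percent


def is_d_quartolecta(p: TreePlacement) -> bool:
    """Apply the stated thresholds: same cluster, support 0.99-1.00, similarity >= 99%."""
    return (
        p.clusters_with_reference
        and 0.99 <= p.branch_support <= 1.00
        and p.genetic_similarity_pct >= 99.0
    )


if __name__ == "__main__":
    print(is_d_quartolecta(TreePlacement(True, 1.00, 99.6)))
    print(is_d_quartolecta(TreePlacement(True, 0.95, 99.6)))
```

All three conditions must hold together; a strain that clusters with the reference but falls below either threshold is not called D. quartolecta by this rule.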
Description
Technical Field
The invention belongs to the technical field of plant molecular identification, and particularly relates to a method for strain identification based on a Dunaliella core genome sequence.
Background
Dunaliella quartolecta (D. quartolecta) is a eukaryotic unicellular microalga living in oceans, salt lakes and other extreme environments. It belongs to Chlorophyta, Chlorophyceae, Volvocales, Dunaliella; it has strong stress resistance and no cell wall, contains a chloroplast (chromatophore) and a pyrenoid, and bears flagella at the apex of the cell. D. quartolecta is rich in bioactive substances such as glycerol, beta-carotene and algal polysaccharides, and is an economically important microalga. Characteristic D. quartolecta strains can be used as bioreactors for extracting active substances at industrial scale, with important application prospects in food processing, medical care, biodiesel and other fields. However, the 23 Dunaliella species identified to date at home and abroad are morphologically similar and share broad salt tolerance, so D. quartolecta is difficult to identify by morphology alone. Identification using DNA markers, gene markers and protein markers has improved efficiency, but its accuracy is still limited by the choice of molecular marker, the conservation of the fragments, and the non-universality of amplification or experimental procedures. Conventional molecular identification of closely related algal strains typically suffers from few candidate amplification fragments, poor specificity of universal markers, long development cycles for new markers and specific primers, and the need to optimize PCR (polymerase chain reaction) amplification procedures, and the results often include false positives. Because D. quartolecta is an important characteristic species with high added value within the genus Dunaliella, molecular identification of D. quartolecta resources is critical.
Therefore, there is a need to develop a more accurate, rapid and universal method for the molecular identification of D. quartolecta within Dunaliella.
With the rapid development of next-generation DNA sequencing technologies, molecular identification at the whole-genome level has become possible. Compared with traditional molecular identification, whole-genome identification provides more genetic information, a wider detection range, more effective discrimination of related species, and richer genetic variation information. Whole-genome sequencing data for many model species have been published. Although the reference genome of Dunaliella salina (D. salina) was published in 2017 (Dunsal1 v.2), no whole-genome sequencing of D. quartolecta, another typical Dunaliella species, has been reported. Sequencing a species' whole genome with the currently popular combination of second- and third-generation sequencing yields complete genetic information, but still has the following drawbacks: (1) all sequencing fragments must be fully aligned, so run times are long and data output is huge, consuming large amounts of computer time and resources and delaying molecular identification; (2) genome assembly and bioinformatic analysis depend heavily on the second- and third-generation high-throughput sequencing platforms of domestic and foreign sequencing companies, such as Illumina, Nanopore and PacBio, and are also limited by genome size and platform computing capacity, so turnaround is long and costs are high, often beyond what an ordinary laboratory can bear; (3) molecular identification of related species depends heavily on the quality of whole-genome re-sequencing, which in turn is closely tied to the quality of the reference genome: if the reference genome's sequencing depth is insufficient or its assembly quality is low, re-sequencing of the species to be tested is affected and species identification is biased.
Therefore, how to provide an accurate, efficient and economical method for identifying D. quartolecta among strains to be tested is an urgent technical problem to be solved in the field.
Disclosure of Invention
The invention provides a method for strain identification based on a Dunaliella core genome sequence.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method for strain identification based on the core genome sequence of the dunaliella salina comprises the following steps:
(1) collecting, purifying and culturing a sample: collecting an alga strain to be detected and a Dunaliella tertiolecta D.quartz, purifying the alga strain to be detected, and then carrying out indoor expanded culture;
(2) extracting whole genome DNA: respectively extracting the whole genome DNA of the to-be-detected alga strain and the D.quartolecta by using an improved CTAB method, and freezing and storing;
(3) respectively constructing a DNA sequencing library after breaking and purifying the whole genome DNA of the alga strain to be detected and the Dunaliella D.quartz ectca in the step (2);
(4) sequencing the DNA sequencing libraries in the step (3) by adopting a high-throughput sequencing method respectively to obtain second-generation sequencing data of the to-be-detected alga strain and the D.quartolecta whole genome;
(5) taking the saline Dunaliella salina whole genome data published by NCBI as reference, comparing the D.quatolytica whole genome sequencing data obtained in the step (4) with the data, obtaining the D.quatolytica core genome sequence of the Dunaliella salina through screening, de novo assembly and quality evaluation, wherein the size of the core genome sequence is 6592916bp, the number of contigs is 3000, the length of the maximum contig is 1133322bp, the average length of the contig is 2197.64bp, the length of the contig N50 is 15270, the proportion of the complete gene is 23.65%, the proportion of the single copy gene is 15.18%, the proportion of the multi-copy gene is 13.76%, the proportion of vacancy/deletion is 1.89%, and the proportion of the incomplete fragment is 17.45%, constructing a Dunaliella salinolytica core genome circular map which is assembled de, and then performing gene component, protein function annotation and genome overlap collinearity analysis on the D.quatolytica core genome sequence of the Dunaliella salinolytica;
(6) Taking the Dunaliella D. quartolecta core genome sequence constructed in step (5) as the reference, aligning against it the whole-genome sequencing data of the algal strain to be tested obtained in step (4) together with published genome sequencing data of representative algae, detecting single nucleotide polymorphisms and insertion/deletion sites among the species, and constructing a phylogenetic tree from the single nucleotide polymorphisms, wherein, when the algal strain to be tested clusters with Dunaliella D. quartolecta with a branch support rate of 0.99-1.00 and a genetic similarity percentage of not less than 99%, the algal strain to be tested is identified as Dunaliella D. quartolecta.
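The identification criterion in step (6) can be read as a simple decision rule. The following is a minimal sketch of that rule only; the function name and argument names are hypothetical, and the inputs are assumed to have been read off the phylogenetic tree beforehand.

```python
def is_d_quartolecta(clusters_with_reference: bool,
                     branch_support: float,
                     similarity_pct: float) -> bool:
    """Apply the step (6) criterion.

    clusters_with_reference: the strain falls in one cluster with D. quartolecta
    branch_support: support rate of that branch (0.0 to 1.0)
    similarity_pct: genetic similarity percentage (0 to 100)
    """
    return (clusters_with_reference
            and 0.99 <= branch_support <= 1.00
            and similarity_pct >= 99.0)


# Hypothetical values, for illustration only.
print(is_d_quartolecta(True, 1.00, 99.4))
print(is_d_quartolecta(True, 0.95, 99.4))
```

All three conditions must hold together; a high similarity alone is not treated as sufficient in this reading.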
Further, the indoor expanded culture in step (1) comprises the following specific steps: picking single clones of algal cells of the strain to be tested under aseptic conditions and, after they pass microscopic examination, carrying out indoor expanded culture under aseptic conditions, wherein the culture conditions are as follows: photoperiod 18 h : 6 h, light intensity 19000 lx, temperature 23 ± 3 ℃, with an aseptic ventilated environment maintained; shaking the culture dish every 5 days to prevent the algal cells from adhering to the walls, and taking 0.5-1 mL of algal liquid for microscopic examination; the algal strain to be tested is cultured in a medium prepared according to the following formula:
30 g/L NaCl, 1.5 g/L NaNO3, 1.4 g/L K2HPO4, 1.75 g/L MgSO4·7H2O, 1.36 g/L CaCl2·7H2O, 1.2 g/L Na2CO3, 0.006 g/L FeC6H5O7, 0.005 g/L NaH2PO4·2H2O, 0.5 g/L Co(NO3)2·6H2O, 0.8 g/L CuSO4·5H2O, 2.3 g/L ZnSO4·7H2O, 0.03 g/L H3BO3, 4.0 g/L Na2MoO4·2H2O, 0.02 g/L MnCl2·4H2O, 0.5 g/L VB1, 0.5 g/L VB12 and 0.5 g/L VH, made up to a volume of 1 L with ultrapure water.
Further, the improved CTAB method in step (2) comprises the following specific steps: taking 600-800 mg of the algae to be tested, washing 2-3 times with ultrapure water, centrifuging at 8000 r/min and 4 ℃ for 1.5 min, adding liquid nitrogen and grinding for 15 sec; adding 800 μL of 2% W/V CTAB solution preheated at 20 ℃ and 1 μL of 1% V/V β-mercaptoethanol, mixing well, incubating in a 60 ℃ water bath for 1.5 h and shaking once every 20 min; adding 800 μL of Tris-saturated phenol, centrifuging at 12000 r/min and 4 ℃ for 2.5 min, taking the supernatant, and adding a mixture of Tris-saturated phenol, chloroform and isoamyl alcohol in a volume ratio of 25:24:2; after vortexing, standing at 4 ℃ for 10 min and mixing 2-3 times; adding 800 μL of ddH2O treated with 0.1% V/V DEPC, incubating in a 60 ℃ water bath for 30 min, centrifuging at 12000 r/min and 4 ℃ for 4 min, and taking the supernatant; adding 150 mL of 3 mol/L sodium acetate and 250 mL of absolute ethanol precooled to 4-5 ℃, precipitating at -20 ℃ for 50 min, centrifuging at 10000 r/min and 4 ℃ for 3 min, and discarding the supernatant; adding 1 mL of 70% V/V ethanol solution precooled to 4-5 ℃, vortexing for 20 sec, discarding the supernatant and evaporating the residual liquid in a nucleic acid vacuum drying system; adding 100× TE buffer to dissolve the precipitate; and examining the genomic DNA by 1% W/V agarose gel electrophoresis combined with a fluorescence quantifier, ensuring that the DNA concentration is not less than 150 ng/μL, that the electrophoresis band is bright and shows no degradation, that OD260/OD280 is 1.8 to 1.9, and that there is no contamination.
Further, the specific steps of constructing the DNA sequencing library in step (3) are as follows: fragmenting the whole-genome DNA with high-intensity ultrasound at 80-100 W for 6 sec per pulse, repeated at intervals of 3 sec for a total of 5 pulses, with the fragmentation target set to 300-400 bp; subjecting the fragments to agarose gel electrophoresis and recovering the 300-400 bp target fragments from the gel; adsorbing and recovering the target fragments with silica-based magnetic beads, and checking the quality of the recovered fragments with a fluorescence quantifier; repairing the DNA ends and adding an A to the 3' end; adding adapters for the ligation reaction, then purifying and transforming the ligation product and verifying it by PCR; denaturing the positive product at 95 ℃ for 20 sec and carrying out a single-stranded DNA circularization reaction; and purifying the product to obtain a whole-genome DNA sequencing library ready for sequencing.
Further, the specific steps of obtaining the Dunaliella D. quartolecta core genome sequence through screening, assembly and quality evaluation in step (5) are as follows: screening high-quality sequences from the sequencing platform output, and taking fragments with a sequencing depth of 50-80×, an average length of 12-15 K and an N50 length greater than 18 K as query sequences; mapping the query sequences back to the published Dunaliella salina reference genome (Dunsal1 v.2) with SOAPaligner or BWA software, and further selecting sequencing fragments with a sequence identity of not less than 90% and an alignment E value of less than 1E-10 as candidate data for the Dunaliella D. quartolecta core genome sequence; aligning all remaining sequencing fragments against the candidate data set to obtain the overlapping regions between the aligned data; error-correcting the alignment results with Falcon or Pilon software, and assembling contigs with SOAPdenovo 2.04, Mecat, HERA or Canu software; determining the order of the contigs with BySS 2.2.3, Velvet 1.2.10 or ABySS 2.2.3 software; calculating whole-genome coverage with BAMStats or GATK DepthOfCoverage software, and selecting core sequences with a reference genome coverage of not less than 50% and not fewer than 2000 contiguously arranged contigs; evaluating the assembly quality of the selected contigs with BUSCO 2.0 or Quast software, and selecting the assembly with a complete gene ratio of not less than 20%, a single-copy gene ratio of 15%, a multi-copy gene ratio of not less than 12% and a deletion/gap ratio of not more than 3% as the Dunaliella D. quartolecta core genome sequence; and constructing the circular map of the core genome of this species with Circos software.
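The candidate screening described above (sequence identity not less than 90% and alignment E value below 1E-10) amounts to a filter over alignment records. The sketch below assumes the per-fragment identity and E value have already been extracted from the aligner output into simple records; the `Hit` type, the function name and the example values are hypothetical and do not correspond to any aligner's output format.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    read_id: str          # sequencing fragment identifier
    identity_pct: float   # sequence identity to the reference, in percent
    e_value: float        # alignment E value


def select_candidates(hits: List[Hit],
                      min_identity: float = 90.0,
                      max_e_value: float = 1e-10) -> List[str]:
    """Keep fragments with identity >= 90% and E value < 1E-10."""
    return [h.read_id for h in hits
            if h.identity_pct >= min_identity and h.e_value < max_e_value]


# Hypothetical records, for illustration only.
example_hits = [
    Hit("frag_001", 95.2, 1e-30),   # passes both thresholds
    Hit("frag_002", 89.9, 1e-30),   # identity too low
    Hit("frag_003", 92.0, 1e-5),    # E value too high
    Hit("frag_004", 90.0, 1e-12),   # boundary identity, passes
]
print(select_candidates(example_hits))  # ['frag_001', 'frag_004']
```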
Further, the gene composition analysis, protein function annotation and genome contig collinearity analysis of the Dunaliella D. quartolecta core genome sequence in step (5) comprise the following specific steps: predicting CDS in the assembly data with Augustus 3.3.3, ESTScan 3.0.1, TransDecoder 2.0.1 or Prodigal 2.6.1 software; analysing repetitive sequences in the assembly data with RepeatMasker 4.0.9, RepeatProteinMask 3.2.2, LTR-FINDER, Piler 1.0.6 or RepeatScout 1.0.5 software; aligning the protein sequences encoded by the CDS against the NR database with Diamond 0.9.14 or BLASTX software and annotating their functions; and, after aligning the predicted protein sequences with BLAST, MCScanX, Last, Mugsy, Spines or progressiveMauve software, carrying out the genome contig collinearity analysis.
Further, the specific steps of constructing the phylogenetic tree from the single nucleotide polymorphisms in step (6) are as follows: aligning the genome data of the algal strain to be tested and of 5-6 representative algae reported in the NCBI database against the Dunaliella D. quartolecta core genome sequence assembled in step (5) with LASTZ 1.02.00 or Mauve 2.3.1 software; extracting, from the resulting collinear blocks, the genotype of each species corresponding to the Dunaliella D. quartolecta core genome; merging, extracting and filtering the genotype information of all species with the Dunaliella D. quartolecta core genome as the template; detecting single nucleotide polymorphism data and insertion/deletion site data with BWA 0.7.17 software; and, based on the single nucleotide polymorphism data, constructing a phylogenetic tree with the maximum likelihood algorithm in EasySpeciesTree 1.0, MEGA 5.0, TreeBeST 1.9.2, PHYLIP, Puzzle 5.2 or PHYLO-WIN software, thereby determining the genetic relationship between the algal strain to be tested and Dunaliella D. quartolecta.
Further, the missing rate allowed in the filtering is not higher than 20%.
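One way to read the 20% missing-rate limit is as a per-site filter over the merged genotype matrix: a site is kept only if the fraction of species lacking a genotype call at that site does not exceed 0.20. The sketch below illustrates that reading under stated assumptions; the function name, the use of "-" for a missing call and the toy sequences are hypothetical, and the software used in the patent may define the rate differently.

```python
from typing import Dict


def filter_sites(genotypes: Dict[str, str],
                 max_missing: float = 0.20,
                 missing_char: str = "-") -> Dict[str, str]:
    """Drop alignment columns whose missing-call fraction exceeds max_missing.

    genotypes maps a species name to its genotype string; all strings
    are assumed to have the same length (one character per site).
    """
    species = list(genotypes)
    n_sites = len(genotypes[species[0]])
    keep = [
        i for i in range(n_sites)
        if sum(genotypes[s][i] == missing_char for s in species) / len(species)
        <= max_missing
    ]
    return {s: "".join(genotypes[s][i] for i in keep) for s in species}


# Toy matrix: site 2 is missing in 2 of 5 species (40%) and is dropped;
# site 4 is missing in 1 of 5 species (20%) and is kept.
toy = {"sp1": "ACGT", "sp2": "ACG-", "sp3": "A-GT", "sp4": "A-GT", "sp5": "ACGT"}
print(filter_sites(toy))
```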
The method provided by the invention does not depend entirely on known Dunaliella whole-genome sequencing results: the genome of a related strain without published genome sequencing data, namely Dunaliella D. quartolecta, is sequenced, and an optimized data alignment method and sequence assembly strategy avoid the drawbacks of traditional genome sequencing, such as long turnaround time, heavy dependence on advanced sequencing platforms and high cost. After obtaining second-generation sequencing data from a domestic sequencing company, an operator can process, assemble and analyse the sequencing data using the core genome sequence and program commands provided by the invention. A wide range of software can be chosen for each step, the program settings in the examples are rigorous and easy to run, and the method has broad application prospects in molecular identification of Dunaliella strains, variation detection, phylogenetic analysis and the like.
Given that no whole-genome sequencing data for Dunaliella D. quartolecta have been published at home or abroad, the invention constructs the D. quartolecta core genome assembly sequence for the first time. The sequence contains the most abundant genetic information currently available for the D. quartolecta core genome at a relatively high assembly quality, and, with D. quartolecta as the reference, provides theoretical and informational support for targeted genetic improvement and industrial application of the algal strain.
Compared with the prior art, the invention has the following advantages:
1. The invention constructs the Dunaliella D. quartolecta core genome sequence for the first time by combining second-generation sequencing with de novo genome assembly; the sequence contains the most abundant genetic information currently available for the D. quartolecta core genome at a relatively high assembly quality, and fills the gap in genome information for this species.
2. The Dunaliella D. quartolecta core genome sequence constructed by the invention can be applied to molecular identification of the algal strain; it greatly improves the efficiency of accurate Dunaliella strain identification and can also serve as a theoretical and technical basis for phylogenetic and evolutionary research on, and identification of, Dunaliella at home and abroad.
3. Compared with the published whole-genome sequence of Dunaliella salina (D. salina), the Dunaliella D. quartolecta core genome constructed by the invention has a smaller data volume. Using it as the reference sequence for analysing the genome sequencing data of a strain to be tested greatly shortens the alignment time and improves the efficiency of obtaining valid single nucleotide polymorphism (SNP) data for that strain; it has important reference value for genome-level analysis of genetic variation in Dunaliella-related strains, and provides a rich data basis for systematic research on the origin and evolution of lower algae, particularly green algae.
4. Taking the Dunaliella D. quartolecta core genome sequence constructed by the invention as the reference, researchers can set up experimental and control groups according to their experimental purposes, or compare the algal strain with its close relatives to mine differential or characteristic genes, which lays a foundation for improving and studying the quality of the algal strain at the molecular level and for promoting its industrial application.
5. The methods of the invention for indoor expanded culture of Dunaliella D. quartolecta and the algal strains to be tested, the improved CTAB method, the screening of core genome sequencing data and the de novo assembly of sequencing fragments can be widely applied to algae, particularly green algae, in artificial culture, high-quality whole-genome DNA extraction, optimized processing of genome sequencing data and the like; compared with traditional methods they offer a shorter experimental period, higher efficiency and easier operation, and constitute a readily reproducible set of techniques.
Drawings
FIG. 1 is the de novo assembled circular map of the Dunaliella D. quartolecta core genome; the outermost layer is the nucleotide sequence coordinate (unit: Mbp), inside it are the de novo assembled fragments arranged by sequence identity (relative to the reference genome Dunsal1 v.2), the lines inside the genome fragments represent gene loci of each type, the innermost layer is the sequencing abundance map of the corresponding contigs, and the centre of the map gives basic information on the core genome of the alga;
FIG. 2 shows the morphology of the algal strain to be identified (tentatively named Dunaliella sp.) after 30 days of indoor expanded culture; the upper part is the macroscopic view, the lower part is the microscopic view (scale bar: 50 μm), and samples No. 1-4 are arranged from left to right;
FIG. 3 shows the 1% agarose gel electrophoresis of the whole-genome DNA of the samples to be identified; M1 and M2 are DNA ladders;
FIG. 4 is a collinearity scatter plot between the Dunaliella D. quartolecta core genome and the genome sequencing fragments of the strain to be identified; the dots represent collinear blocks between the genomes of the two species, and A and B mark 2 regions in which collinear blocks between D. quartolecta and the genome of the strain to be identified are densely distributed;
FIG. 5 is a phylogenetic tree of 7 different algae constructed from single nucleotide polymorphism (SNP) data; the tree was built by the maximum likelihood method with the bootstrap value set to 1000, and the numbers at the branch nodes are the support rate and the genetic similarity percentage, respectively;
FIG. 6 is a collinearity circle plot within the core genome of the identified Dunaliella strain Dq_SX; the lines connecting segments inside the circle represent possible duplication events during the evolution of the species' genome, and the numbers on the circle are core genome contig numbers;
FIG. 7 is a frequency distribution histogram of the Ka/Ks values of the identified Dunaliella strain Dq_SX; the numbers on the bars are the frequencies in the different intervals, Ka is the non-synonymous nucleotide substitution rate, and Ks is the synonymous nucleotide substitution rate;
FIG. 8 is a histogram of COG (orthologous protein database) annotation statistics for proteins in the core genome of the identified Dunaliella strain Dq_SX; the histogram covers the top 20 functional categories by number of annotated homologous proteins;
FIG. 9 shows the transmembrane domain prediction for a transcription regulatory factor in the identified Dunaliella strain Dq_SX; the different lines represent the transmembrane region, the intramembrane region and the extramembrane region, respectively, the vertical axis is the predicted probability for each region, and the horizontal axis is the amino acid position;
FIG. 10 shows the signal peptide structure prediction for a transcription regulatory factor in the identified Dunaliella strain Dq_SX; C-score, S-score and Y-score are the cleavage site score, the signal peptide score and the combined score, respectively;
FIG. 11 is a Venn diagram of the metabolic pathways of Dunaliella D. quartolecta and Dq_SX; the intersection is the set of metabolic pathways shared by the two algal strains, and the pathway prediction for both strains is based on the KEGG database (Kyoto Encyclopedia of Genes and Genomes);
FIG. 12 is a bubble plot of the top 20 enriched metabolic pathways unique to Dunaliella D. quartolecta; the pathway information is from the KEGG database (Kyoto Encyclopedia of Genes and Genomes); the larger the bubble, the more genes are involved in the pathway; the darker the bubble colour, the higher the confidence of the pathway (the lower the Q value); the degree of enrichment (significance) is expressed as the enrichment ratio, i.e. the number of genes in the pathway divided by the total number of genes annotated to KEGG pathways;
FIG. 13 is a bubble plot of the top 20 enriched metabolic pathways unique to the identified strain Dq_SX; the pathway information is from the KEGG database (Kyoto Encyclopedia of Genes and Genomes); the larger the bubble, the more genes are involved in the pathway; the darker the bubble colour, the higher the confidence of the pathway (the lower the Q value); the degree of enrichment (significance) is expressed as the enrichment ratio, i.e. the number of genes in the pathway divided by the total number of genes annotated to KEGG pathways;
FIG. 14 is a GO enrichment analysis histogram of the top 20 significantly enriched metabolic pathways of D. quartolecta; GO is the database established by the Gene Ontology Consortium; the more GO entries and the higher the corresponding -log10(Q value) (i.e. the higher the confidence), the greater the involvement of the genes in that biological function;
FIG. 15 is a GO enrichment analysis histogram of the top 20 significantly enriched metabolic pathways of the identified strain Dq_SX; GO is the database established by the Gene Ontology Consortium; the more GO entries and the higher the corresponding -log10(Q value) (i.e. the higher the confidence), the greater the involvement of the genes in that biological function;
FIG. 16 is a phylogenetic tree constructed from the ITS genes of 21 Dunaliella; the tree was built by the maximum likelihood method with the bootstrap value set to 1000, and the numbers at the branch nodes are the support rate and the genetic similarity percentage, respectively;
FIG. 17 is a phylogenetic tree constructed from SSR markers of 21 Dunaliella; the tree was built by the maximum likelihood method with the bootstrap value set to 1000, and the numbers at the branch nodes are the support rate and the genetic similarity percentage, respectively;
FIG. 18 is a phylogenetic tree constructed from genome SNPs of 21 Dunaliella; the tree was built by the maximum likelihood method with the bootstrap value set to 1000, and the numbers at the branch nodes are the support rate and the genetic similarity percentage, respectively.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A method for whole genome sequencing of Dunaliella D.quartolecta and de novo assembly of core genome sequence fragments thereof comprises the following steps:
30 g/L NaCl, 1.5 g/L NaNO3, 1.4 g/L K2HPO4, 1.75 g/L MgSO4·7H2O, 1.36 g/L CaCl2·7H2O, 1.2 g/L Na2CO3, 0.006 g/L FeC6H5O7, 0.005 g/L NaH2PO4·2H2O, 0.5 g/L Co(NO3)2·6H2O, 0.8 g/L CuSO4·5H2O, 2.3 g/L ZnSO4·7H2O, 0.03 g/L H3BO3, 4.0 g/L Na2MoO4·2H2O, 0.02 g/L MnCl2·4H2O, 0.5 g/L VB1, 0.5 g/L VB12 and 0.5 g/L VH, made up to a volume of 1 L with ultrapure water;
Step 2, extracting the whole-genome DNA of Dunaliella D. quartolecta using the improved CTAB method of the invention, ensuring that the DNA concentration is not lower than 150 ng/μL, that OD260/OD280 is between 1.8 and 1.9, and that the DNA is free of protein, salt ion and RNA contamination; the specific procedure is as follows: taking 600-800 mg of algal cells from the indoor expanded culture, centrifuging at 8000 r/min and 4 ℃ for 1.5 min, adding liquid nitrogen and grinding for 15 sec; adding 800 μL of 2% (W/V) CTAB solution preheated at 20 ℃ and 1 μL of 1% (V/V) β-mercaptoethanol, mixing well, incubating in a 60 ℃ water bath for 1.5 h and shaking once every 20 min; adding 800 μL of Tris-saturated phenol, mixing well, centrifuging at 12000 r/min and 4 ℃ for 2.5 min, taking the supernatant, and adding a mixture of Tris-saturated phenol, chloroform and isoamyl alcohol in a volume ratio of 25:24:2; after vortexing, standing at 4 ℃ for 10 min and mixing 2-3 times; adding 800 μL of ddH2O treated with 0.1% (V/V) DEPC, incubating in a 60 ℃ water bath for 30 min, centrifuging at 12000 r/min and 4 ℃ for 4 min, and taking the supernatant; adding 150 mL of 3 mol/L sodium acetate and 250 mL of absolute ethanol at 4-5 ℃, precipitating at -20 ℃ for 50 min, centrifuging at 10000 r/min and 4 ℃ for 3 min, and discarding the supernatant; adding 1 mL of 70% (V/V) ethanol solution precooled to 4-5 ℃, vortexing for 20 sec, removing the supernatant and evaporating the residual liquid in a nucleic acid vacuum drying system; and adding an appropriate amount of 100× TE buffer (10 mmol/L Tris-HCl, 1 mmol/L EDTA) to dissolve the precipitate;
Step 5, repairing the ends of the qualified DNA sample with T4 DNA polymerase and Klenow polymerase to produce blunt ends, and adding an A base to the 3' end; preparing the ligation reaction system: 1 μL of T4 DNA ligase, 1 μL of T vector, 5 μL of 1× ligation reaction buffer, 5 μL of linker (10 μmol/L) and 5 μL of DNA sample, made up to 20 μL with sterile water; incubating in a 16 ℃ water bath overnight to obtain the ligation product, and purifying the product according to the instructions of the Agencourt AMPure XP kit; after transformation of competent cells and blue-white screening, verifying the purified product by PCR on the bacterial culture and by sequencing (this step can be carried out by a sequencing company), selecting the positive clones, and checking the amplification products with an Agilent 2100 Bioanalyzer; denaturing the positive amplification product at 96 ℃ for 30 sec and then preparing the DNA circularization system: 2 μL of DNA sample, 4 μL of 5× Rapid ligation buffer and 1 μL of ligase, made up to 20 μL with double-distilled water; incubating the system in a 25 ℃ water bath for 15 min, then adding linear-DNA digestion enzyme and digesting for 10 min to obtain the final DNA sequencing library; measuring the library concentration with an Agilent SureSelectQXT WGS instrument, ensuring that the library concentration does not exceed 2 nmol/L and the volume is not less than 12 μL;
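The two reaction systems in step 5 are each made up to 20 μL with water, so the water volume follows from the listed components. The short sketch below only performs that arithmetic; the helper name is hypothetical and the component labels simply restate the volumes given in step 5.

```python
from typing import Dict


def water_to_volume(components_ul: Dict[str, float], final_ul: float) -> float:
    """Return the volume of water (in μL) needed to reach final_ul."""
    used = sum(components_ul.values())
    if used > final_ul:
        raise ValueError("components already exceed the final volume")
    return final_ul - used


# Ligation reaction system from step 5 (volumes in μL).
ligation = {
    "T4 DNA ligase": 1,
    "T vector": 1,
    "1x ligation reaction buffer": 5,
    "linker (10 umol/L)": 5,
    "DNA sample": 5,
}
# DNA circularization system from step 5 (volumes in μL).
circularization = {
    "DNA sample": 2,
    "5x Rapid ligation buffer": 4,
    "ligase": 1,
}

print(water_to_volume(ligation, 20))         # 3
print(water_to_volume(circularization, 20))  # 13
```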
Step 6, performing gradient PCR on the sequencing library obtained in step 5 with the following amplification system: μL of the library sample to be tested, 1 μL of each primer of the pair (a second-generation sequencing adapter primer kit may be used), 0.5 μL of DNA polymerase, 2.5 μL of dNTPs, 1.5 μL of MgCl2 and 2.5 μL of buffer, made up to 25 μL with ddH2O; the PCR amplification program is: 96 ℃ for 3 min; 40 cycles of 96 ℃ for 30 sec, annealing with the temperature lowered by 1 ℃ every 0.5 sec down to 56 ℃, and 72 ℃ for 45 sec; 72 ℃ for 8 min; and storage at 4 ℃; the amplified fragments are subjected to high-throughput sequencing by combinatorial probe-anchor synthesis (cPAS), and this step is carried out by a sequencing company with the relevant technical qualifications;
Step 7, filtering the raw Dunaliella D. quartolecta sequencing data obtained in step 6: low-quality sequencing data (short sequences less than 5 kb in length, sequences with an average quality below 8, and linker sequences) are filtered out with ngsQCToolkit 2.3.3, the resulting high-quality sequencing data are saved in FASTQ format in a file named Dq.fq, and core fragment screening and assembly are then performed on the Dunaliella D. quartolecta whole-genome sequencing data (Dq.fq).
#maximal read length
max_rd_len=100
[LIB]
#average insert size
avg_ins=300
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 100 bps of each read
rd_len_cutoff=100
#in which order the reads are used while scaffolding
rank=1
#cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=/vol3/agis/xiaoyutao_group/wangjingchun/yanzao/Dq_1.fq
#SOAPdenovo-63mer all -s config.txt -p 10 -K 55 -M 3 -F -u -o
#SOAPdenovo-63mer all -s config.txt -p 40 -K 27 -D 1 -N 500m -o ./result/MDCZ_27 > MDCZ_27.log
SOAPdenovo-63mer all -s /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/soapdenovo/config.txt -p 10 -K 55 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/soapdenovo/test
qsub -l nodes=1 -q queue8 ./soap.sh
Step 9, reassembling the contigs with ABySS 2.2.3 software, with the following program commands:
conda install -c conda-forge -c bioconda -c defaults ABySS
ABYSS -k 31 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/ABySS/31_contigs.fa /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/Dq.fq
qsub -l nodes=1 -q queue6 ./ABySS.sh
Step 10, evaluating the quality of the Dunaliella D. quartolecta genome assembly with BUSCO 2.0 software, and selecting the assembly with a complete gene ratio of not less than 20%, a single-copy gene ratio of 15%, a multi-copy gene ratio of not less than 12% and a deletion/gap ratio of not more than 3% as the Dunaliella D. quartolecta core genome sequence. The program command is:
python /public/home/wangjingchun/miniconda2/bin/run_BUSCO.py -i /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/02busco/Dq_contig.fa -m geno -l /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/02busco/eukaryota_odb10 -o results_Dq
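The selection criteria stated in step 10 can be checked programmatically once the four percentages have been read from the quality evaluation report. The sketch below is a hedged illustration: the function name is hypothetical, the percentages are passed in by hand rather than parsed from any BUSCO output file, and the single-copy criterion, stated in the text simply as 15%, is treated here as "not less than 15%", which is an assumption.

```python
def passes_core_thresholds(complete_pct: float,
                           single_copy_pct: float,
                           multi_copy_pct: float,
                           missing_pct: float) -> bool:
    """Check an assembly against the step 10 selection criteria."""
    return (complete_pct >= 20.0
            and single_copy_pct >= 15.0   # assumption: read as a lower bound
            and multi_copy_pct >= 12.0
            and missing_pct <= 3.0)


# Values reported in this document for the D. quartolecta core genome.
print(passes_core_thresholds(23.65, 15.18, 13.76, 1.89))  # True
```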
Step 11, predicting functional gene CDS in the selected core genome assembly data with Augustus 3.3.3 software, with the following program command:
augustus --strand=both --genemodel=partial --singlestrand=false --protein=on --introns=on --start=on --stop=on --cds=on --codingseq=on --alternatives-from-evidence=true --gff3=on --UTR=false --outfile=/vol3/agis/xiaoyutao_group/wangjingchun/yanzao/04gene/Dqaugustus/out.gff --species=volvox /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/04gene/Dq/Dq_masked.fa
Step 12, constructing the circular map of the core genome of the alga with Circos software, with the following configuration:
#circos.conf
karyotype=data/karyotype/karyotype.Dq.txt
<ideogram>
<spacing>
default=0.005r
</spacing>
radius=0.9r
thickness=20p
fill=yes
</ideogram>
#The remaining content is standard and required. It is imported
#from default files in the Circos distribution.
#These should be present in every Circos configuration file and
#overridden as required. To see the content of these files,
#look in etc/ in the Circos distribution.
<image>
#Included from Circos distribution.
<<include etc/image.conf>>
</image>
#RGB/HSV color definitions, color lists, location of fonts, fill patterns.
#Included from Circos distribution.
<<include etc/colors_fonts_patterns.conf>>
#Debugging, I/O and other system parameters
#Included from Circos distribution.
<<include etc/housekeeping.conf>>
According to the genome assembly quality evaluation results, a core genome sequence can be screened out for Dunaliella D. quartolecta: the core genome sequence is 6592916 bp in size, the number of contigs is 3000, the maximum contig length is 1133322 bp, the average contig length is 2197.64 bp, the contig N50 is 15270 bp, the proportion of complete genes is 23.65%, the proportion of single-copy genes is 15.18%, the proportion of multi-copy genes is 13.76%, the proportion of gaps/deletions is 1.89%, and the predicted CDS proportion is 38.03%; the core genome circular map is shown in FIG. 1.
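The summary statistics above follow standard definitions: the average contig length is the total assembly size divided by the number of contigs, and N50 is the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total. The sketch below illustrates both; the function names and the toy contig lengths are hypothetical, and only the total size and contig count are taken from this document.

```python
from typing import List


def mean_contig_length(total_bp: int, n_contigs: int) -> float:
    """Average contig length in bp, rounded to two decimals."""
    return round(total_bp / n_contigs, 2)


def n50(lengths: List[int]) -> int:
    """Length of the contig at which the running sum, taken from the
    longest contig downwards, first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Reported totals: 6592916 bp in 3000 contigs.
print(mean_contig_length(6592916, 3000))  # 2197.64

# Toy contig lengths (not from the patent): total 30, half is 15,
# and 10 + 8 = 18 reaches it, so N50 is 8.
print(n50([10, 8, 6, 4, 2]))  # 8
```

The first call reproduces the reported average contig length of 2197.64 bp; the reported N50 of 15270 cannot be recomputed here because the individual contig lengths are not listed.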
Example 2
A method for strain identification using the Dunaliella D. quartolecta core genome sequence, comprising the following steps:
30 g/L NaCl, 1.5 g/L NaNO3, 1.4 g/L K2HPO4, 1.75 g/L MgSO4·7H2O, 1.36 g/L CaCl2·7H2O, 1.2 g/L Na2CO3, 0.006 g/L FeC6H5O7, 0.005 g/L NaH2PO4·2H2O, 0.5 g/L Co(NO3)2·6H2O, 0.8 g/L CuSO4·5H2O, 2.3 g/L ZnSO4·7H2O, 0.03 g/L H3BO3, 4.0 g/L Na2MoO4·2H2O, 0.02 g/L MnCl2·4H2O, 0.5 g/L VB1, 0.5 g/L VB12 and 0.5 g/L VH, made up to a volume of 1 L with ultrapure water; the algal strain obtained by the expanded culture was divided into 4 samples (No. 1 to No. 4).
Step 2, extracting whole-genome DNA: taking algal liquid at the mature stage (about 30 days; FIG. 2) from each sample, centrifuging at 8000 r/min and a low temperature of 4 ℃ for 1.5 min to enrich the algal cells, snap-freezing in liquid nitrogen and quickly grinding for 15 sec, and then extracting the whole-genome DNA of each sample by the improved CTAB method, with the following specific procedure: adding 800 μL of 2% (W/V) CTAB solution preheated at 20 ℃ to the ground powder, adding 1 μL of 1% (V/V) β-mercaptoethanol, mixing gently, and incubating in a 60 ℃ water bath for 1.5 h; adding 800 μL of Tris-saturated phenol, mixing gently, centrifuging at 12000 r/min and 4 ℃ for 2.5 min, taking the supernatant, and adding a mixture of Tris-saturated phenol, chloroform and isoamyl alcohol in a volume ratio of 25:24:2; after vortexing, standing at 4 ℃ for 10 min and mixing gently 2-3 times; adding 800 μL of ddH2O treated with 0.1% (V/V) DEPC, incubating in a 60 ℃ water bath for 30 min, centrifuging at 12000 r/min and 4 ℃ for 4 min, and taking the supernatant; adding 150 mL of 3 mol/L sodium acetate and 250 mL of absolute ethanol precooled to 4-5 ℃, precipitating at -20 ℃ for 50 min, centrifuging at 10000 r/min and 4 ℃ for 3 min, and discarding the supernatant; adding 1 mL of 70% (V/V) ethanol solution precooled to 4-5 ℃, vortexing for 20 sec, discarding the supernatant and evaporating the residual liquid in a nucleic acid vacuum drying system; adding 100 μL of 100× TE buffer (10 mmol/L Tris-HCl, 1 mmol/L EDTA) to dissolve the precipitate; and checking the quality of the genomic DNA by 1% (W/V) agarose gel electrophoresis combined with a fluorescence quantifier, ensuring that the DNA concentration is not lower than 150 ng/μL, that OD260/OD280 is between 1.8 and 1.9, and that the DNA is free of protein, salt ion and RNA contamination. The agarose gel electrophoresis results (FIG. 3) show that samples No. 1 and No. 4 have higher DNA concentrations and better integrity; the fluorescence quantification results (Table 1) likewise show that samples No. 1 and No. 4 have higher DNA concentrations and less contamination and are suitable as candidate samples for the subsequent library construction.
TABLE 1 Fluorescence quantification of the whole-genome DNA quality of the algal samples to be identified
Sample No. | Dilution factor (×) | Sample volume (μL) | Measured concentration (ng/μL) | OD260/OD280 |
1 | 1 | 1 | 204.6 | 1.85 |
2 | 1 | 1 | 152.0 | 1.69 |
3 | 1 | 1 | 72.2 | 1.62 |
4 | 1 | 1 | 384.1 | 1.89 |
Step 5, performing quality control on the raw sequencing data of the algal strain to be identified obtained in step 4 (Q20 greater than 96% and GC content greater than 45%), and then filtering the data: low-quality sequencing data (short sequences with a length of less than 5 kb, sequences with an average quality of less than 8, and adapter sequences) were removed with NGSQCToolkit 2.3.3 software, with the filtering parameters set to '-l 20 -Q 0.5 -n 0.03 -A 0.28'; the resulting high-quality sequencing data (Table 2) were saved in FASTQ format and the file was named Dsp.fq.
TABLE 2 Sequencing statistics of the strains to be identified after filtering
Sample No. | Number of fragments after filtering | Number of bases after filtering | Read length | Q20 (%) | GC (%) |
1 | 238,959 | 23,895,898 | 100 | 97.90 | 49.11 |
4 | 155,286 | 15,528,625 | 100 | 95.36 | 47.47 |
As can be seen from Table 2, the quality control results show that sample No. 1 of the strain to be identified has the better sequencing quality (higher Q20 and GC content) and can be used for the data alignment and analysis in the next step.
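The Q20 and GC thresholds used in step 5 can be illustrated with a minimal Python sketch. It is not the NGSQCToolkit implementation; the function names, the Phred+33 quality encoding and the toy reads are assumptions made for illustration only.

```python
def q20_and_gc(records, phred_offset=33):
    """Return (Q20 %, GC %) over all bases in `records`.

    `records` is an iterable of (sequence, quality_string) pairs, as read
    from a FASTQ file. Phred+33 encoding is assumed, not confirmed by the text.
    """
    total = q20 = gc = 0
    for seq, qual in records:
        for base, q in zip(seq.upper(), qual):
            total += 1
            if ord(q) - phred_offset >= 20:
                q20 += 1
            if base in "GC":
                gc += 1
    if total == 0:
        raise ValueError("no bases supplied")
    return 100.0 * q20 / total, 100.0 * gc / total


def passes_qc(q20_pct, gc_pct, min_q20=96.0, min_gc=45.0):
    """Apply the step-5 thresholds (Q20 > 96 %, GC content > 45 %)."""
    return q20_pct > min_q20 and gc_pct > min_gc


# Toy reads (hypothetical): 'I' encodes Q40 and '+' encodes Q10 under Phred+33.
toy = [("GCGCATAT", "IIIIIIII"), ("GGCCAATT", "IIII++++")]
q20_pct, gc_pct = q20_and_gc(toy)
print(q20_pct, gc_pct)
```

Applied to the values in Table 2, `passes_qc(97.90, 49.11)` is true for sample No. 1, while sample No. 4 (Q20 of 95.36%) falls below the Q20 threshold.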
Step 6, collecting the genome sequencing data (Dsp.fq) of the algal strain to be identified obtained in step 5 and the genome sequencing data of 5 representative algae published in the NCBI database, namely Chara braunii, Chlamydomonas eustigma, Microcystis aeruginosa, Microcystis paniformis and Volvox carteri; with the D. quartolecta core genome sequence assembled in Example 1 as reference, the genome data of these algae were aligned to the D. quartolecta core genome data with LASTZ 1.02.00 software, the genotypes of each species corresponding to D. quartolecta were extracted from the resulting collinear blocks (A and B in figure 4), and the genotype information was merged, extracted and filtered (missing rate ≤ 20%).
Step 7, with the D. quartolecta core genome sequence as reference, detecting single nucleotide polymorphisms (SNPs) and insertion/deletion sites (InDels) among the species in step 5 using BWA 0.7.17 software; the program commands for processing the data of the strain to be identified are as follows:
1) Building the index: bwa index -a bwtsw Dq
2) bwa aln -t 2 -f /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp_R1.sai /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dq.fna /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/00data/Dsp_1.fq
3) bwa aln -t 2 -f /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp_R2.sai /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dq.fna /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/00data/Dsp_2.fq
4) bwa sampe -f /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.sam /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dq.fna /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp_R1.sai /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp_R2.sai /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/00data/Dsp_1.fq /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/00data/Dsp_2.fq
5) samtools view -@ 20 -b -S /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.sam -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.bam
6) samtools sort -@ 20 -m 150G /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.bam -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.sort.bam
7) samtools rmdup -S /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.sort.bam /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.rmdup.bam
8) samtools index /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.rmdup.bam /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.rmdup.bam.bai
9) samtools mpileup -gf /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dq.fna /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.rmdup.bam > /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/08snp/Dsp_results/Dsp.bcf
10) bcftools view -A ./Dsp.bcf > Dsp.vcf
1) samtools view -@ 20 -b -S ./result/SRR2602391.sam -o ./result/Dsp.fq.bam
2) samtools sort -@ 20 -m 150G ./result/Dsp.fq.bam -o ./result/Dsp.fq.sort.bam
3) samtools rmdup -S ./result/Dsp.fq.sort.bam ./result/Dsp.fq.rmdup.bam
4) samtools index ./result/SRR2602391.rmdup.bam ./result/Dsp.fq.rmdup.bam.bai
5) samtools mpileup -gf ./database/grape.fa ./result/*.rmdup.bam > Vitis_2.bcf
6) bcftools call -Avm Vitis.bcf > Vitis.vcf
Step 9, the SNP and InDel detection procedures and algorithms for the genomes of the other representative algae are the same as those used for the algal strain to be identified.
TABLE 3 Statistics of SNPs and InDels of the strain to be identified
Item | Strain to be identified (Dunaliella sp.) |
Number of SNPs | 968,450 |
Number of InDels | 61,140 |
T→C transitions | 167,620 |
A→G transitions | 167,120 |
G→A transitions | 167,060 |
C→T transitions | 266,320 |
A→T transversions | 200,330 |
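The kind of tally reported in Table 3 can be sketched in Python as follows. This is only an illustration of how SNPs, InDels and substitution types might be counted from VCF records such as those written to Dsp.vcf; it is not the script used by the inventors, and the helper names and toy records are hypothetical. The five substitution classes listed in Table 3 do add up to the reported SNP total (968,450).

```python
from collections import Counter

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_variant(ref, alt):
    """Label one REF/ALT allele pair as 'transition', 'transversion' or 'indel'."""
    if len(ref) == 1 and len(alt) == 1:
        return "transition" if (ref, alt) in TRANSITIONS else "transversion"
    return "indel"


def summarize_vcf(lines):
    """Count SNPs, InDels and substitution types in VCF text lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        for alt in fields[4].upper().split(","):
            if alt in (".", "*") or alt.startswith("<"):
                continue  # no alternate allele, or a symbolic allele
            kind = classify_variant(ref, alt)
            if kind == "indel":
                counts["InDel"] += 1
            else:
                counts["SNP"] += 1
                counts[kind] += 1
                counts[ref + ">" + alt] += 1
    return counts


def record(chrom, pos, ref, alt):
    """Build a minimal tab-separated VCF data line (toy helper)."""
    return "\t".join([chrom, str(pos), ".", ref, alt, "50", ".", "."])


# Hypothetical records, for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    record("contig1", 10, "T", "C"),
    record("contig1", 25, "A", "T"),
    record("contig1", 40, "AT", "A"),
    record("contig2", 7, "G", "A,<*>"),
]
summary = summarize_vcf(toy_vcf)
print(summary["SNP"], summary["InDel"], summary["transition"], summary["transversion"])
```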
Step 11, using EasySpeciesTree 1.0 software, constructing a phylogenetic tree (fig. 5) from the obtained effective SNP data to further determine the genetic relationship between the algal strain to be identified and D. quartolecta, using the maximum likelihood algorithm with a bootstrap value of 1000; the program commands are set as follows:
1) orthofinder -f orthsp1 -M msa -S diamond -t 16 -a 16
2) orthofinder -f /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/06tree -M msa -S diamond -t 10 -a 10 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/06tree/results
3) Input file -in1: OrthoFinder/Results_Sep25/WorkingDirectory/SpeciesIDs.txt; input file -in4: produced with cat from the files listed in the second column of -in1
4) Input files -in2 and -in3 come from xxx/orthsp1/OrthoFinder/Results_Sep25/Orthologs
5) -in2: cp Orthogroups_SingleCopyOrthologues.txt ../../../easy/SingleCopyOrthologues.txt
6) -in3: cp Orthogroups.tsv ../../../easy/Orthogroups.csv
7) python2.7 /vol1/agis/xiaoyutao_group/wangjingchun/software/EasySpeciesTree/EasySpeciesTree.py -in1 SpeciesIDs.txt -in2 SingleCopyOrthologues.txt -in3 Orthogroups.csv -in4 all.pep.fa -t 2
Step 12, determining whether the algal strain to be identified belongs to D. quartolecta from the support rate between branches of the constructed phylogenetic tree and the percentage of genetic similarity: when the support rate between the strain to be identified and D. quartolecta is 0.99-1.00, the percentage of similarity is ≥ 99% and the genome coverage is ≥ 55%, the strain is determined to be D. quartolecta. As can be seen from fig. 4, the support rate between the strain to be identified (Dunaliella sp.) and D. quartolecta is 1.00, the percentage of similarity is 100% and the genome coverage is 56.8%, so the strain can be identified as D. quartolecta.
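The decision rule of step 12 can be written as a small predicate. The sketch below is illustrative only; the function and parameter names are assumptions, while the three thresholds are those stated in step 12.

```python
def is_d_quartolecta(support, similarity_pct, coverage_pct):
    """Apply the three step-12 criteria to one candidate strain.

    support        -- branch support between the strain and D. quartolecta (0-1)
    similarity_pct -- percentage of genetic similarity
    coverage_pct   -- genome coverage in percent
    """
    return (
        0.99 <= support <= 1.00
        and similarity_pct >= 99.0
        and coverage_pct >= 55.0
    )


# Values reported for the strain to be identified (Dunaliella sp.) in step 12.
print(is_d_quartolecta(1.00, 100.0, 56.8))
```

All three conditions must hold together: a strain with full branch support but a genome coverage below 55% would not be assigned to D. quartolecta under this rule.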
Example 3
Analysis of the genetic variation and evolutionary characteristics of the genome of an identified algal strain, Dq_SX, with the D. quartolecta core genome data as reference, comprising the following steps:
Step 4, taking the homologous gene information screened in step 3 as the data analysis set, detecting synonymous and non-synonymous mutation sites with PAML 4.8 software, calculating the non-synonymous substitution rate (Ka) and the synonymous substitution rate (Ks), and estimating the evolutionary selection pressure on the identified strain Dq_SX from the Ka/Ks value (figure 7).
TABLE 4 Core genome assembly data of the identified Dunaliella strain Dq_SX and their quality evaluation
As can be seen from table 4, the core genome assembly of the identified strain Dq_SX is relatively complete, with incomplete fragments accounting for only 16.12% and gaps or deletions for only 1.54%. As can be seen from FIG. 6, a large number of duplication events may have occurred in different regions of the Dq_SX genome during evolution, involving 1007 pairs of segments, which points to the complexity of the evolutionary process of this species. As can be seen from FIG. 7, relative to the D. quartolecta core genome constructed according to the present invention, 80.52% of the genes in the Dq_SX core genome have a Ka/Ks ratio of less than 1.0 (the mean Ka/Ks is 0.47; the frequency peaks at 0.108 for Ka/Ks in the range 0.35-0.45), suggesting that most genes of the strain were under purifying selection pressure during evolution.
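The Ka/Ks interpretation above can be sketched in Python. The code assumes Ka and Ks have already been estimated (for example by PAML) and only illustrates the usual reading of the ratio: below 1 suggests purifying selection, above 1 suggests positive selection. The function names and the toy values are hypothetical.

```python
def selection_class(ka, ks):
    """Classify one gene pair by its Ka/Ks ratio."""
    if ks <= 0:
        return "undefined"  # ratio cannot be computed when Ks is zero
    ratio = ka / ks
    if ratio < 1.0:
        return "purifying"
    if ratio > 1.0:
        return "positive"
    return "neutral"


def summarize_kaks(pairs):
    """Return (fraction of genes with Ka/Ks < 1, mean Ka/Ks) over defined ratios."""
    ratios = [ka / ks for ka, ks in pairs if ks > 0]
    if not ratios:
        raise ValueError("no gene pair with Ks > 0")
    below_one = sum(r < 1.0 for r in ratios) / len(ratios)
    return below_one, sum(ratios) / len(ratios)


# Toy (Ka, Ks) values for four gene pairs, for illustration only.
toy_pairs = [(0.1, 0.5), (0.2, 0.4), (0.6, 0.3), (0.05, 0.0)]
fraction, mean_ratio = summarize_kaks(toy_pairs)
print([selection_class(ka, ks) for ka, ks in toy_pairs], fraction, mean_ratio)
```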
Example 4
Repeat fragment prediction, functional annotation of the predicted proteins and analysis of the structural features of the core genome of the identified Dunaliella strain Dq_SX, comprising the following steps:
1) RepeatModeler -pa 10 -database /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/03repeat/Dq_SX/Dq_SX -engine ncbi -recoverDir /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/03repeat/Dq_SX
2) qsub -l nodes=1 -q queue8 ./repeatmodeler.sh
Step 3, the output is a FASTA file of the consensus repeat sequences obtained by training; the family of each consensus sequence is given after the '#' that follows the sequence ID, and a family that cannot be classified is marked 'Unknown'. The .stk file is a seed alignment file in the Dfam-compatible Stockholm format, which can be uploaded to the Dfam_consensus database with the tool 'RepeatModeler/util/dfamConsensusTool.pl' shipped in the RepeatModeler installation path.
Step 4, searching for repetitive sequences in the Dunaliella Dq_SX core genome, with the program commands set as follows:
1) RepeatMasker -pa 4 -gff -lib /public/home/wangjingchun/RM_Dq_SX/consensi.fa -dir /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/03repeat/Dq_SX/Repeatmasker/lib_result /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/03repeat/Dq_SX/Dq_SX_contig.fa
2) qsub -l nodes=1 -q queue8 ./repeatmasker2.sh
1) $ diamond makedb --in nr_eukaryon.fasta -d nr_eukaryon_20200805
2) $ diamond blastx --db nr_eukaryon_20200805 --query reads.fq.gz --out reads.tab
3) $ diamond blastp --db nr_eukaryon_20200805 --query proteins.fasta --out nr.tab --outfmt 6 --sensitive --max-target-seqs 20 --evalue 1e-5 --id 30 --block-size 20.0 --tmpdir /dev/shm --index-chunks 1
Step 6, performing collinearity analysis of the repeat fragments of the core genome of the identified strain with MCScanX software, with the program commands set as follows:
1) makeblastdb -in Dq_SX.fa -dbtype prot -out Dq_SX
2) blastp -query /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/07circos/Dq_SX.fa -db /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/07circos/Dq_SX -num_threads 10 -evalue 1e -outfmt 6 -out /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/07circos/Dq_SX.blastp
3) MCScanX ./Dq_SX
Step 7, because repetitive sequences are relatively poorly conserved between species, predicting the repetitive sequences of a particular species requires querying a dedicated repeat database. The sequencing assembly data of the core genome of the identified strain Dq_SX were therefore aligned against Repbase with RepeatMasker v4.0.6 software to find possible interspersed repeats in the strain. The core genome data of the identified strain Dq_SX were also annotated with RepeatModeler, LTR-Finder and RepeatScout software to obtain tandem repeats (including microsatellite sequences and the like).
Step 8, removing the redundant parts of the above results to obtain the final non-redundant repeat annotation (Table 5).
Step 9, aligning the core genome data of the identified algal strain Dq_SX against the NR database and screening the alignment results (e-value < 1e-5).
Step 10, performing COG functional annotation of the screened homologous protein sequences with eggNOG software, the protein sequences being annotated with the emapper.py script of eggNOG, and compiling classification statistics for the top 20 protein clusters in the annotation results (FIG. 8).
Step 11, running eggNOG software to perform COG functional annotation of the homologous proteins encoded by the genes; the program command is set as follows:
python /public/home/wangjingchun/miniconda2/envs/qiime1/bin/emapper.py -i /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/new/04cog/Dq_SX_protein.fa --output /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/new/04cog/out -m diamond --data_dir /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/new/04cog/database --cpu 20
Step 12, performing transmembrane domain prediction on the top-ranked protein of the top 20 protein clusters with the online TMHMM 2.0 analysis tool (FIG. 9); predicting the signal peptide of the protein with the online SignalP 4.1 analysis tool, with the number of amino acids in each protein sequence limited to no more than 6000 (figure 10); the output format is set to 'Extensive, with graphics' and all other parameters are left at their defaults.
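The e-value screening of step 9 can be illustrated with a short Python sketch that filters tabular alignment output (the 12-column `--outfmt 6` layout written by DIAMOND and BLAST). It is an assumption-laden illustration rather than the inventors' script: the function names, the best-hit selection by bit score and the toy hits are all hypothetical.

```python
def filter_hits(lines, max_evalue=1e-5):
    """Keep tabular (outfmt 6) hits whose e-value is below `max_evalue`.

    Columns assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError("expected 12 tab-separated columns: " + line)
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.append((fields[0], fields[1], float(fields[2]), evalue, float(fields[11])))
    return kept


def best_hit_per_query(hits):
    """Return {query: hit} keeping the hit with the highest bit score."""
    best = {}
    for hit in hits:
        query, bitscore = hit[0], hit[4]
        if query not in best or bitscore > best[query][4]:
            best[query] = hit
    return best


# Hypothetical hits, for illustration only.
toy_hits = [
    "\t".join(["g1", "protA", "85.0", "200", "30", "0", "1", "200", "1", "200", "1e-50", "300"]),
    "\t".join(["g1", "protB", "40.0", "180", "108", "2", "5", "184", "3", "180", "2e-8", "60"]),
    "\t".join(["g2", "protC", "31.0", "90", "62", "1", "1", "90", "10", "99", "0.002", "35"]),
]
screened = filter_hits(toy_hits)
best = best_hit_per_query(screened)
print(sorted(best))
```

In the toy data, the only hit for `g2` has an e-value of 0.002 and is therefore discarded, so `g2` receives no annotation.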
TABLE 5 Classification statistics of the repetitive sequences in the core genome of the identified Dunaliella strain Dq_SX
Repeat type | Repeat size (bp) | Proportion of genome (%) |
LINE | 165380 | 0.26 |
LTR | 118737 | 0.19 |
SINE | 984126 | 1.57 |
Others | 1007445 | 1.60 |
Total | 2275688 | 3.62 |
As can be seen from Table 5, the repetitive sequences found in the core genome of the identified Dunaliella strain Dq_SX have a total length of 2275688 bp, accounting for about 3.62% of the whole genome. As can be seen from FIG. 6, among the functional proteins annotated in the Dq_SX core genome, transcriptional regulators (88) and dynein heavy chains (87) are the most numerous classes. The transmembrane domain prediction for the transcriptional regulator shows that amino acids 60-110 are probably located outside the membrane (probability about 0.8), that the part after amino acid 130 is probably located inside (probability 0.82), and that the probability of a transmembrane location does not exceed 0.4 (FIG. 9). The signal peptide prediction for this factor (FIG. 10) shows that around amino acids 25 to 26 the C value is largest, the S value changes steeply and the Y value is highest, suggesting that this is a signal peptide cleavage site.
Example 5
Comparative analysis of differential metabolic pathways and mining of characteristic genes based on the core genome data of D. quartolecta and of the identified strain Dq_SX, comprising the following steps:
1) diamond makedb --in /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/ko.pep.fasta -d /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/kegg
2) diamond blastp -d /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/kegg --query /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/Dq_protein.fa -f 6 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/Dq.blastp -p 30 -e 0.00005
3) diamond blastp -d /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/kegg --query /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/Dq_SX_protein.fa -f 6 -o /vol3/agis/xiaoyutao_group/wangjingchun/yanzao/09kegg/Dq_SX.blastp -p 30 -e 0.00005
Step 2, according to the KO numbers assigned to each metabolic pathway in step 1, performing intersection analysis of the KEGG pathway predictions for D. quartolecta and for the identified strain Dq_SX, constructing a Venn diagram (figure 11), and screening the metabolic pathways unique to each.
Step 3, from the top 20 unique metabolic pathways of D. quartolecta and of the identified strain Dq_SX obtained in step 2 (that is, the 20 highest-ranked KEGG pathways outside the intersection in the statistics of step 2), screening the characteristic genes with the highest enrichment degree, namely the highest enrichment ratio (figs. 12 and 13).
Step 4, querying the GO (Gene Ontology) database with the most highly (significantly) enriched metabolic pathway genes of D. quartolecta and of the identified strain Dq_SX respectively, obtaining the top 20 GO functional annotation enrichment results for each (figures 14 and 15), and screening characteristic genes of interest to researchers from the gene sets with higher enrichment, namely a larger number of GO terms and a higher corresponding -log10(Q value) (confidence).
As can be seen from fig. 11, based on the core genome sequencing data of D. quartolecta and Dq_SX, a total of 608 KEGG pathways were predicted, of which 141 are shared metabolic pathways and 467 are unique metabolic pathways (85 for D. quartolecta and 382 for Dq_SX). As can be seen from figs. 12 and 13, the most enriched unique metabolic pathway of D. quartolecta is spliceosome-associated metabolism, and the most enriched unique pathway of Dq_SX is metabolism associated with the synthesis of cellular components. As can be seen from figs. 14 and 15, most of the genes involved in the spliceosome metabolic pathway of D. quartolecta have functions closely related to RNA transport, processing and synthesis, and most of the genes involved in the anabolism of membrane components in Dq_SX have functions related to protein structure and processing.
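The intersection analysis behind the Venn diagram reduces to set operations on the pathway identifiers of the two genomes. The Python sketch below is illustrative; the function name and the toy identifiers are hypothetical, and only the reported counts (608, 141, 85, 382) come from the text.

```python
def venn_counts(pathways_a, pathways_b):
    """Return shared and unique pathway counts for two collections of pathway IDs."""
    a, b = set(pathways_a), set(pathways_b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "total": len(a | b),
    }


# Hypothetical pathway identifiers, for illustration only.
toy_dq = ["path01", "path02", "path03", "path04"]
toy_dq_sx = ["path03", "path04", "path05"]
counts = venn_counts(toy_dq, toy_dq_sx)
print(counts)

# Consistency of the reported figures: shared + unique to each = total.
reported = {"shared": 141, "only_dq": 85, "only_dq_sx": 382, "total": 608}
print(reported["shared"] + reported["only_dq"] + reported["only_dq_sx"] == reported["total"])
```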
Example 6
Comparative analysis of three different molecular identification techniques for D. quartolecta:
collecting 20 algal strains to be tested and the D. quartolecta identified in Example 1, and performing molecular identification of the algal strains to be tested using the ITS gene, SSR molecular markers and genome sequencing data, specifically comprising the following steps:
SEQ ID NO.1:5'-GAAGGAGAAGTCGTAACAAG-3';
SEQ ID NO.2:5'-CCTCCCTTATTGATATGC-3';
preparing the ITS gene PCR amplification system: 2.0 μL dNTPs (2 mmol/L), 1.0 μL Mg2+ (25 mmol/L), 1.0 μL DNA, 0.3 μL Taq enzyme (5 U/μL), 2.5 μL 10×Buffer, 1.0 μL of each of the above primers, and ddH2O to 25 μL; setting the PCR reaction program: 95 ℃ for 3 min; 35 cycles of 95 ℃ for 30 sec, 52 ℃ for 40 sec and 72 ℃ for 1 min; then a final extension at 72 ℃ for 10 min; detecting by 1.2% agarose gel electrophoresis, recovering the specific amplification product of 800-1000 bp and sending it to a sequencing company for sequencing.
Step 2, from the sequencing results returned by the sequencing company, constructing an ITS gene phylogenetic tree of the 21 algal strains with MEGA 5.0 software by the maximum likelihood method with a bootstrap value of 1000, and identifying D. quartolecta among the algal strains to be tested according to the support rate and the percentage of genetic similarity at each branch node of the tree (figure 16).
CL1007:SEQ ID NO.3:5'-CTAAATCCATGCGTTCTTCTTTC-3';
SEQ ID NO.4:5'-ACAGTACAACCAGAGGCTTTGAA-3';
CL1008:SEQ ID NO.5:5'-AACAATGTCACCTCTCATTTGCT-3';
SEQ ID NO.6:5'-TCGTTTTGTTGTTGTTCTTCAAA-3';
CL102:SEQ ID NO.7:5'-GCCAATTCCAAAAAGTTAAAATCT-3';
SEQ ID NO.8:5'-ATTGTGGTTTTCTTCCTGGTTTT-3';
CL1041:SEQ ID NO.9:5'-AGGCAAGCAGTGCATTTGTA-3';
SEQ ID NO.10:5'-GGCTCTCTATGAGTCGATGTGTC-3';
CL1047:SEQ ID NO.11:5'-GCAGTGGAAACACACTTCCTTAC-3';
SEQ ID NO.12:5'-TCTCTCAAATCAAAGGTGCTTTC-3';
CL1157:SEQ ID NO.13:5'-GAGATCGAACTTGAGGCTTAGAA-3';
SEQ ID NO.14:5'-AAAATAGAAGCCATCATGAAACG-3';
CL1160:SEQ ID NO.15:5'-GGATACAGATTTCCACACTGCTC-3';
SEQ ID NO.16:5'-CTATCTGGCTGAAGGTCATGTTT-3';
CL1168:SEQ ID NO.17:5'-CGTTTTTGGAACTGATTTCTTTG-3';
SEQ ID NO.18:5'-TTCTTGTAATACATCGCAGGAAG-3';
CL1322:SEQ ID NO.19:5'-AACAGAGGAAATTCTGATGATGC-3';
SEQ ID NO.20:5'-CTTGCAAGAAGGAACAACTCACT-3';
CL1627:SEQ ID NO.21:5'-GTGGTCACCAGGAAGAGACAG-3';
SEQ ID NO.22:5'-ACGGTACTGACAGTGGAAACAAT-3';
the sizes of the amplified products are 155bp, 131bp, 139bp, 121bp, 158bp, 136bp, 118bp, 149bp, 160bp and 127bp in sequence;
Step 4, sending the SSR primers to a biotechnology company for synthesis, and preparing the SSR-PCR amplification system: 2.5 μL dNTPs (2 mmol/L), 1.2 μL Mg2+ (25 mmol/L), 1.0 μL DNA (obtained in step 1), 0.4 μL Taq enzyme (5 U/μL), 2.5 μL 10×Buffer, 0.8 μL of each of the above primers, and ddH2O to 25 μL; the SSR-PCR reaction program is: 94 ℃ for 5 min; 35 cycles of 94 ℃ for 45 sec, 57 ℃ for 35 sec and 72 ℃ for 1 min; 72 ℃ for 8 min; the amplified SSR products are separated by electrophoresis on 4% denaturing polyacrylamide gel, silver-stained for 30 min, developed for 15 min and fixed for 20 min, and the bands on the electrophoretogram are then scored as '1' (band present) or '0' (band absent); cluster analysis of the algal strains to be tested is carried out by the UPGMA method with NTSYSpc 2.2 software, and a phylogenetic tree based on the SSR markers is constructed (figure 17).
Step 6, with the D. quartolecta core genome data constructed by the invention as reference, detecting single nucleotide polymorphism (SNP) and insertion/deletion (InDel) data among the algal strains to be tested with BWA 0.7.17 software; when detecting SNPs and InDels, duplicate fragments are first marked and ignored, the regions near InDels are then realigned, and the SNPs and InDels are finally obtained by screening; the program commands follow Example 2.
Step 7, using EasySpeciesTree 1.0 software, constructing a phylogenetic tree (fig. 18) from the obtained SNP data with the maximum likelihood algorithm and a bootstrap value of 1000; the program commands follow Example 2.
Step 8, comparing and analysing the three molecular identification results; the technical advantages and disadvantages are shown in Table 6.
As can be seen from fig. 16, the strain Dsp11 clusters with D. quartolecta with a support rate of 0.99 and a genetic similarity of 99%, and can be identified as D. quartolecta. As can be seen from fig. 17, the strain Dsp4 clusters with D. quartolecta with a support rate of 0.99 and a genetic similarity of 99%, and the strain Dsp11 clusters with Dsp4 and D. quartolecta with a support rate of 1.00 and a genetic similarity of 99%, so they can also be identified as D. quartolecta. As can be seen from fig. 18, Dsp11 and Dsp4 cluster together with D. quartolecta with a support rate of 1.00 and a genetic similarity of 100%, and can be identified as D. quartolecta. As can be seen from Table 6, compared with the other two molecular identification methods, performing simplified genome sequencing on the algal strains to be tested and obtaining SNP data with the D. quartolecta core genome data constructed by the invention as reference allows D. quartolecta to be identified accurately within a short period (7-10 days) at low cost, and provides abundant bioinformatic data for later in-depth research.
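The SSR scoring and UPGMA clustering described above can be sketched in plain Python. This is not NTSYSpc: the simple-matching coefficient is an assumed choice of similarity measure (the text does not name one), and the strain labels and band profiles below are hypothetical.

```python
from itertools import combinations


def simple_matching(a, b):
    """Fraction of marker positions at which two 0/1 band profiles agree."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def upgma(profiles):
    """Cluster strains by UPGMA on distance = 1 - simple-matching similarity.

    Returns the merge history as (cluster_1, cluster_2, distance) tuples,
    each cluster being a sorted tuple of strain names.
    """
    names = sorted(profiles)
    dist = {
        frozenset((a, b)): 1.0 - simple_matching(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def cluster_distance(c1, c2):
        # UPGMA: unweighted mean of all pairwise distances between members
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return merges


# Hypothetical band scores at 10 SSR loci ('1' band present, '0' band absent).
toy_profiles = {
    "Dq": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "DspA": [1, 0, 1, 1, 0, 1, 0, 1, 1, 1],
    "DspB": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
}
merge_history = upgma(toy_profiles)
for left, right, distance in merge_history:
    print(left, right, round(distance, 3))
```

In the toy data, the reference profile and `DspA` differ at a single locus and are merged first; `DspB` joins last at the average of its two pairwise distances.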
TABLE 6 Comparison of three molecular identification methods for D. quartolecta
Example 7
Comparison of the D. quartolecta core genome sequencing and assembly technique established by the invention with conventional genome sequencing techniques, comprising the following steps:
Step 3, comparing key indexes between the self-constructed D. quartolecta core genome sequencing fragment assembly data (see the operation steps of Example 1) and the assembly data of each sequencing platform obtained in step 2.
Step 4, with Dunsal1 v.2 published by NCBI as reference, aligning the D. quartolecta core genome sequencing data obtained on each technical platform, and analysing the differences between the techniques from the alignment results (Table 7).
TABLE 7 analysis of alignment results during core genome sequencing data Assembly
TABLE 8 comparative analysis of SNP and InDel statistical results
Step 6, using the sequencing fragments obtained in step 4 together with the repeat fragment prediction method of Example 4, calculating the proportion of repetitive sequences in the total sequencing fragments of the algal strain under each technical platform (Table 9).
TABLE 9 comparative analysis of repeat sequence ratios
Technique | Proportion of repeat sequences in total sequencing fragments (%) |
Self-developed technique | 1.45% |
Nanopore | 15.27% |
PacBio | 12.99% |
HiSeq | 3.58% |
As can be seen from Table 7, under the technical conditions established by the invention, the genome coverage, the aligned sequences and the identification ratio of the sequencing fragments of the strain to be tested are all higher than with the other three sequencing techniques. As can be seen from Table 8, the numbers of effective SNPs and InDels detected under the technical conditions of the invention are higher than with the other three techniques, and the error rate is the lowest. As can be seen from Table 9, the proportion of repeat sequences detected under the conditions of the invention is lower than with the other three techniques. In conclusion, the overall performance of the D. quartolecta core genome sequencing fragment assembly technique created by the invention is superior to that of Nanopore, PacBio and HiSeq.
While there have been shown and described what are at present considered to be the basic principles and essential features of the invention and advantages thereof, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, but is capable of other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution; the description is written this way for clarity only, and those skilled in the art should take the description as a whole, the technical solutions in the embodiments being combinable as appropriate to form other embodiments understandable to those skilled in the art.
Sequence listing
<110> Shanxi University
<120> method for strain identification based on Dunaliella core genome sequence
<160> 22
<170> SIPOSequenceListing 1.0
<210> 16
<211> 20
<212> DNA
<213> ITS Gene upstream primer (ITS-F)
<400> 16
gaaggagaag tcgtaacaag 20
<210> 17
<211> 18
<212> DNA
<213> ITS Gene downstream primer (ITS-R)
<400> 17
cctcccttat tgatatgc 18
<210> 18
<211> 23
<212> DNA
<213> CL1007 upstream primer (CL1007-F)
<400> 18
ctaaatccat gcgttcttct ttc 23
<210> 19
<211> 23
<212> DNA
<213> CL1007 downstream primer (CL1007-R)
<400> 19
acagtacaac cagaggcttt gaa 23
<210> 20
<211> 23
<212> DNA
<213> CL1008 upstream primer (CL1008-F)
<400> 20
aacaatgtca cctctcattt gct 23
<210> 21
<211> 23
<212> DNA
<213> CL1008 downstream primer (CL1008-R)
<400> 21
tcgttttgtt gttgttcttc aaa 23
<210> 22
<211> 24
<212> DNA
<213> upstream primer of CL102 (CL102-F)
<400> 22
gccaattcca aaaagttaaa atct 24
<210> 23
<211> 23
<212> DNA
<213> CL102 downstream primer (CL102-R)
<400> 23
attgtggttt tcttcctggt ttt 23
<210> 24
<211> 20
<212> DNA
<213> upstream primer of CL1041 (CL1041-F)
<400> 24
<210> 25
<211> 23
<212> DNA
<213> downstream primer of CL1041 (CL1041-R)
<400> 25
ggctctctat gagtcgatgt gtc 23
<210> 26
<211> 23
<212> DNA
<213> upstream primer of CL1047 (CL1047-F)
<400> 26
gcagtggaaa cacacttcct tac 23
<210> 27
<211> 23
<212> DNA
<213> downstream primer of CL1047 (CL1047-R)
<400> 27
tctctcaaat caaaggtgct ttc 23
<210> 28
<211> 23
<212> DNA
<213> upstream primer of CL1157 (CL1157-F)
<400> 28
gagatcgaac ttgaggctta gaa 23
<210> 29
<211> 23
<212> DNA
<213> CL1157 downstream primer (CL1157-R)
<400> 29
aaaatagaag ccatcatgaa acg 23
<210> 30
<211> 23
<212> DNA
<213> CL1160 upstream primer (CL1160-F)
<400> 30
ggatacagat ttccacactg ctc 23
<210> 31
<211> 23
<212> DNA
<213> CL1160 downstream primer (CL1160-R)
<400> 31
ctatctggct gaaggtcatg ttt 23
<210> 32
<211> 23
<212> DNA
<213> CL1168 upstream primer (CL1168-F)
<400> 32
cgtttttgga actgatttct ttg 23
<210> 33
<211> 23
<212> DNA
<213> CL1168 downstream primer (CL1168-R)
<400> 33
ttcttgtaat acatcgcagg aag 23
<210> 34
<211> 23
<212> DNA
<213> CL1322 upstream primer (CL1322-F)
<400> 34
aacagaggaa attctgatga tgc 23
<210> 35
<211> 23
<212> DNA
<213> CL1322 downstream primer (CL1322-R)
<400> 35
cttgcaagaa ggaacaactc act 23
<210> 36
<211> 21
<212> DNA
<213> upstream primer of CL1627 (CL1627-F)
<400> 36
gtggtcacca ggaagagaca g 21
<210> 37
<211> 23
<212> DNA
<213> downstream primer of CL1627 (CL1627-R)
<400> 37
acggtactga cagtggaaac aat 23
Claims (6)
1. The method for strain identification based on the core genome sequence of the dunaliella is characterized by comprising the following steps:
(1) collecting, purifying and culturing samples: collecting an algal strain to be tested and a Dunaliella quartolecta strain, purifying the algal strain to be tested and then carrying out indoor expanded culture, specifically: picking single clones of algal cells of the strain to be tested under aseptic conditions and, after they pass microscopic examination, carrying out indoor expanded culture under aseptic conditions, the indoor expanded culture conditions being: photoperiod 18 h : 6 h, illumination intensity 19000 lx, temperature 23 ± 3 ℃, with an aseptic ventilated environment maintained; shaking the culture dish every 5 days to prevent the algal cells from adhering to the walls, and taking 0.5-1 mL of algal culture for microscopic examination; the following culture medium is prepared for the indoor expanded culture of the algal strain to be tested, the formula of the culture medium being:
30 g/L NaCl, 1.5 g/L NaNO3, 1.4 g/L K2HPO4, 1.75 g/L MgSO4·7H2O, 1.36 g/L CaCl2·7H2O, 1.2 g/L Na2CO3, 0.006 g/L FeC6H5O7, 0.005 g/L NaH2PO4·2H2O, 0.5 g/L Co(NO3)2·6H2O, 0.8 g/L CuSO4·5H2O, 2.3 g/L ZnSO4·7H2O, 0.03 g/L H3BO3, 4.0 g/L Na2MoO4·2H2O, 0.02 g/L MnCl2·4H2O, 0.5 g/L VB1, 0.5 g/L VB12 and 0.5 g/L VH, made up to a final volume of 1 L with ultrapure water;
(2) extracting whole genome DNA: respectively extracting the whole genome DNA of the to-be-detected alga strain and the D.quartolecta strain by using an improved CTAB method, and freezing and storing; the improved CTAB method comprises the following specific steps: taking 600-800 mg of algae to be tested, washing with ultrapure water for 2-3 times, centrifuging at 4 ℃ 8000r/min for 1.5min, adding liquid nitrogen, grinding for 15sec, adding 800 mu L of 2% W/V CTAB solution preheated at 20 ℃ and 1 mu L of 1% V/V beta-mercaptoethanol, uniformly mixing, carrying out water bath at 60 ℃ for 1.5h, shaking for 1 time every 20min, adding 800 mu L of LTris saturated phenol, centrifuging at 4 ℃ 12000r/min for 2.5min, taking supernatant, adding the mixture into the mixture, and adding the mixture into the mixture in a volume ratio of 25: 24: 2, mixing Tris saturated phenol, chloroform and isoamylol, standing for 10min at 4 ℃ after vortex oscillation, uniformly mixing for 2-3 times, and adding 800 mu L of ddH treated by 0.1% V/V DEPC2O, water bath at 60 ℃ for 30min, centrifuging at 4 ℃ for 4min at 12000r/min, taking supernatant, adding 150mL of 3mol/L sodium acetate and 250mL of 4-5 ℃ precooled absolute ethanol, precipitating at-20 ℃ for 50min, centrifuging at 4 ℃ for 3min at 10000r/min, discarding supernatant, adding 1mL of 4-5 ℃ precooled 70% V/V ethanol solution, carrying out vortex oscillation for 20sec, volatilizing liquid in a nucleic acid vacuum drying system after discarding supernatant, adding 100 xTE buffer solution to dissolve precipitate so as to ensure that the DNA concentration is more than or equal to 150 ng/mu L and the 1% W/V agarose gel electrophoresis combined fluorescence quantifier is used for detecting genome DNA, ensuring that an electrophoresis strip is bright and has no degradation, and OD is not degraded 260/OD2801.8-1.9, no pollution;
(3) fragmenting and purifying the whole-genome DNA of the algal strain to be tested and of Dunaliella D. quartolecta from step (2), and constructing a DNA sequencing library for each;
(4) sequencing each DNA sequencing library from step (3) by a high-throughput sequencing method to obtain second-generation whole-genome sequencing data of the algal strain to be tested and of Dunaliella D. quartolecta;
(5) taking the whole-genome data of Dunaliella salina (D. salina) published by NCBI as the reference, aligning the whole-genome sequencing data of Dunaliella D. quartolecta obtained in step (4) against it, and obtaining the core genome sequence of Dunaliella D. quartolecta after screening, de novo assembly and quality evaluation, wherein the core genome sequence is 6592916 bp in size, the number of contigs is 3000, the maximum contig length is 1133322 bp, the average contig length is 2197.64 bp, the contig N50 is 15270, the proportion of complete genes is 23.65%, the proportion of single-copy genes is 15.18%, the proportion of multi-copy genes is 13.76%, the proportion of gaps/deletions is 1.89%, and the proportion of incomplete fragments is 17.45%; constructing a circular map of the de novo assembled core genome of Dunaliella D. quartolecta; and then performing gene composition analysis, protein function annotation and genome contig collinearity analysis on the core genome sequence of Dunaliella D. quartolecta;
(6) taking the core genome sequence of Dunaliella D. quartolecta constructed in step (5) as the reference, aligning against it the whole-genome sequencing data of the algal strain to be tested obtained in step (4) and published genome sequencing data of representative algae, detecting single nucleotide polymorphisms and insertion/deletion sites among the species, and constructing a phylogenetic tree from the single nucleotide polymorphisms, wherein when the algal strain to be tested and Dunaliella D. quartolecta cluster together, the branch support is 0.99-1.00 and the genetic similarity percentage is ≥99%, the algal strain to be tested is identified as Dunaliella D. quartolecta.
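The identification criterion at the end of step (6) can be sketched as a small predicate. This is only an illustration: the `TreeResult` record and the function name are hypothetical, and the sketch assumes the clustering outcome, branch support (on a 0-1 scale) and genetic similarity (in percent) have already been obtained from the phylogenetic analysis.

```python
from dataclasses import dataclass


@dataclass
class TreeResult:
    """Hypothetical summary of the phylogenetic analysis for one test strain."""
    clusters_with_reference: bool   # test strain and D. quartolecta fall in one cluster
    branch_support: float           # support value of that branch, on a 0-1 scale
    genetic_similarity_pct: float   # genetic similarity to the reference, in percent


def is_d_quartolecta(result: TreeResult,
                     min_support: float = 0.99,
                     min_similarity_pct: float = 99.0) -> bool:
    """Apply the three conditions of step (6); all of them must hold."""
    return (result.clusters_with_reference
            and min_support <= result.branch_support <= 1.0
            and result.genetic_similarity_pct >= min_similarity_pct)


if __name__ == "__main__":
    # Illustrative values only, not data from the patent.
    print(is_d_quartolecta(TreeResult(True, 0.995, 99.4)))
    print(is_d_quartolecta(TreeResult(True, 0.95, 99.9)))
```

The default thresholds are the ones stated in step (6); keeping them as parameters makes it easy to see which condition rejects a given strain.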
2. The method for strain identification based on a Dunaliella core genome sequence according to claim 1, wherein the specific steps of constructing the DNA sequencing library in step (3) are as follows: fragmenting the whole-genome DNA with the high-intensity ultrasound setting at 80-100 W, each pulse lasting 6 sec and repeated once every 3 sec, for 5 pulses in total, with the fragmentation parameter set to 300-400 bp; running the fragments on an agarose gel and recovering the 300-400 bp target fragments from the gel; adsorbing and recovering the target fragments with silica-based magnetic beads, and checking the quality of the recovered target fragments with a fluorescence quantifier; repairing the DNA ends and adding an A at the 3' end; adding adapters for the ligation reaction, and purifying, converting and PCR-verifying the ligation product; denaturing the positive product at 95 ℃ for 20 sec and then carrying out a single-stranded DNA circularization reaction, and purifying the product to obtain a whole-genome DNA sequencing library ready for loading on the sequencer.
3. The method for strain identification based on a Dunaliella core genome sequence according to claim 1, wherein the specific steps of obtaining the core genome sequence of Dunaliella D. quartolecta after screening, assembly and quality evaluation in step (5) are as follows: screening high-quality sequences from the sequencing platform, taking fragments with a sequencing depth of 50-80X, an average length of 12-15K and an N50 length greater than 18K as query sequences; mapping the query sequences onto the reported Dunaliella salina reference genome using SOAPaligner or BWA software, and further screening the sequencing fragments with sequence identity ≥90% and alignment E value <1E-10 as candidate data for the core genome sequence of Dunaliella D. quartolecta; aligning all remaining sequencing fragments against the candidate data set to obtain the overlapping regions between the aligned data; performing error correction on the alignment results using Falcon or Pilot software, and assembling contigs using SOAPdenovo 2.04, MECAT, HERA or Canu software; determining the order of the contigs using BySS 2.2.3, Velvet 1.2.10 or ABySS 2.2.3 software; calculating whole-genome coverage using BAMStats or GATK DepthOfCoverage software, and screening contigs with reference genome coverage ≥50% and a continuous arrangement number ≥2000; evaluating the assembly quality of the screened contigs using BUSCO 2.0 or QUAST software, and selecting the assembled sequence with a complete gene proportion ≥20%, a single-copy gene proportion of 15%, a multi-copy gene proportion ≥12% and a deletion/gap proportion ≤3% as the core genome sequence of Dunaliella D. quartolecta; and constructing the circular map of the core genome of this species using Circos software.
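The numeric screens in claim 3 can be sketched as two simple checks. This is an illustration under stated assumptions, not the patented implementation: the function and class names are hypothetical, the inputs are assumed to have been parsed already from the aligner and quality-evaluation reports, and the single-copy criterion, which the claim states only as "15%", is read here as a lower bound.

```python
from dataclasses import dataclass


def passes_read_screen(identity_pct: float, e_value: float) -> bool:
    """Keep a mapped sequencing fragment as core-genome candidate data.

    Thresholds from claim 3: sequence identity >= 90 % and E value < 1e-10.
    """
    return identity_pct >= 90.0 and e_value < 1e-10


@dataclass
class AssemblyMetrics:
    """Hypothetical container for quality-evaluation percentages of one assembly."""
    complete_pct: float      # proportion of complete genes
    single_copy_pct: float   # proportion of single-copy genes
    multi_copy_pct: float    # proportion of multi-copy genes
    missing_pct: float       # proportion of deletions/gaps


def passes_assembly_criteria(m: AssemblyMetrics) -> bool:
    """Apply the selection criteria of claim 3 to an assembly.

    Assumption: the single-copy criterion is treated as ">= 15 %".
    """
    return (m.complete_pct >= 20.0
            and m.single_copy_pct >= 15.0
            and m.multi_copy_pct >= 12.0
            and m.missing_pct <= 3.0)


if __name__ == "__main__":
    # Values reported for the core genome in claim 1, step (5).
    core = AssemblyMetrics(23.65, 15.18, 13.76, 1.89)
    print(passes_assembly_criteria(core))
```

Checked against the figures reported in claim 1 (23.65%, 15.18%, 13.76% and 1.89%), the assembly satisfies all four criteria under this reading.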
4. The method for strain identification based on a Dunaliella core genome sequence according to claim 1, wherein the gene composition analysis, protein function annotation and genome contig collinearity analysis of the core genome sequence of Dunaliella D. quartolecta in step (5) comprise the following steps: performing CDS prediction on the assembly data using Augustus 3.3.3, ESTScan 3.0.1, TransDecoder 2.0.1 or Prodigal 2.6.1 software; performing repeat sequence analysis on the assembly data using RepeatMasker 4.0.9, RepeatProteinMask 3.2.2, LTR-FINDER, Piler 1.0.6 or RepeatScout 1.0.5 software; aligning the protein sequences encoded by the CDS to the NR database using DIAMOND 0.9.14 or BLASTX software and annotating their functions; and, after aligning the predicted protein sequences with BLASTp, performing genome collinearity analysis using MCScanX, Last, Mugsy, Spines or progressiveMauve software.
5. The method for strain identification based on a Dunaliella core genome sequence according to claim 1, wherein the specific steps of constructing the phylogenetic tree from single nucleotide polymorphisms in step (6) are as follows: aligning the algal strain to be tested and the genome data of 5-6 representative algae reported in the NCBI database against the core genome sequence of Dunaliella D. quartolecta assembled in step (5), each using LASTZ 1.02.00 or Mauve 2.3.1 software; extracting, from the resulting collinear blocks, the genotypes of each species corresponding to the Dunaliella D. quartolecta genome; merging, extracting and filtering the genotype information of all species with the core genome of Dunaliella D. quartolecta as the template; detecting single nucleotide polymorphism data and insertion/deletion site data using BWA 0.7.17 software; and, based on the single nucleotide polymorphism data, constructing a phylogenetic tree with the maximum likelihood algorithm in EasySpeciesTree 1.0, MEGA 5.0, TreeBeST 1.9.2, PHYLIP, Puzzle 5.2 or PHYLO-WIN software, thereby determining the genetic relationship between the algal strain to be tested and Dunaliella D. quartolecta.
6. The method of claim 5, wherein the deletion rate of the filtering is no greater than 20%.
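The filtering threshold of claim 6 can be illustrated with a short sketch. It assumes that the "deletion rate" refers to the per-site fraction of species with no genotype call, which the claim does not spell out; the function name, the site keys and the use of `None` for a missing call are likewise hypothetical.

```python
from typing import Dict, List, Optional

Genotypes = Dict[str, List[Optional[str]]]


def filter_sites_by_missing_rate(genotypes: Genotypes,
                                 max_missing: float = 0.20) -> Genotypes:
    """Keep only sites whose fraction of missing genotype calls is <= max_missing.

    `genotypes` maps a site identifier to one call per species;
    a missing call is represented by None (an assumption of this sketch).
    """
    kept: Genotypes = {}
    for site, calls in genotypes.items():
        if not calls:
            continue  # a site with no species column carries no information
        missing_rate = sum(1 for c in calls if c is None) / len(calls)
        if missing_rate <= max_missing:
            kept[site] = calls
    return kept


if __name__ == "__main__":
    # Toy matrix for five species; values are illustrative, not patent data.
    sites = {
        "ctg1:100": ["A", "A", "G", "A", "A"],    # 0 of 5 missing
        "ctg1:200": ["A", None, None, "A", "A"],  # 2 of 5 missing
        "ctg2:50":  ["T", "T", None, "C", "T"],   # 1 of 5 missing
    }
    print(sorted(filter_sites_by_missing_rate(sites)))
```

With five species, a site missing one call sits exactly at the 20% limit and is retained, while a site missing two calls is dropped.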
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011238521.2A CN112349350B (en) | 2020-11-09 | 2020-11-09 | Method for strain identification based on Dunaliella core genome sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011238521.2A CN112349350B (en) | 2020-11-09 | 2020-11-09 | Method for strain identification based on Dunaliella core genome sequence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112349350A (en) | 2021-02-09 |
CN112349350B (en) | 2022-07-19 |
Family
ID=74428639
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011238521.2A Active CN112349350B (en) | 2020-11-09 | 2020-11-09 | Method for strain identification based on Dunaliella core genome sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112349350B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160893B (en) * | 2021-06-09 | 2022-08-19 | 中国科学院昆明植物研究所 | Mining plant ITSs sequence from second generation sequencing data and using the same for identifying variety families |
CN113549620B (en) * | 2021-07-13 | 2022-09-23 | 山西大学 | Multi-type Dunaliella salt stress response miRNAs and application thereof |
CN114664379A (en) * | 2022-04-12 | 2022-06-24 | 桂林电子科技大学 | Third generation sequencing data self-correction error correction method based on deep learning |
CN115810393B (en) * | 2022-12-22 | 2023-08-25 | 南京普恩瑞生物科技有限公司 | Sequencing sample homology detection method and system based on SNPs library of construction crowd |
CN116705155A (en) * | 2023-08-03 | 2023-09-05 | 海南大学三亚南繁研究院 | Definition method of whole-gene DNA data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013177615A1 (en) * | 2012-06-01 | 2013-12-05 | Agriculture Victoria Services Pty Ltd | Selection of symbiota by screening multiple host-symbiont associations |
CN106282330A (en) * | 2015-12-02 | 2017-01-04 | 香港中文大学深圳研究院 | A kind of method developing Caulis et Folium Ammopiptanthi Mongolici Plant Genome simple repeated sequence molecular marker |
WO2018190170A1 (en) * | 2017-04-12 | 2018-10-18 | 花王株式会社 | Method for improving resistance to nitrate substrate analogue in microalga |
Family Cites Families (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101504697B (en) * | 2008-12-12 | 2010-09-08 | 深圳华大基因研究院 | Construction method and system for genome sequencing equipment and its fragment connection stand |
WO2011143231A2 (en) * | 2010-05-10 | 2011-11-17 | The Broad Institute | High throughput paired-end sequencing of large-insert clone libraries |
US9506167B2 (en) * | 2011-07-29 | 2016-11-29 | Ginkgo Bioworks, Inc. | Methods and systems for cell state quantification |
WO2013170235A1 (en) * | 2012-05-11 | 2013-11-14 | University Of Hawaii | Ultrasound mediated delivery of substances to algae |
US10777301B2 (en) * | 2012-07-13 | 2020-09-15 | Pacific Biosciences For California, Inc. | Hierarchical genome assembly method using single long insert library |
WO2016192772A1 (en) * | 2015-06-02 | 2016-12-08 | Siemens Healthcare Gmbh | Genetic testing for predicting resistance of shigella species against antimicrobial agents |
WO2017012659A1 (en) * | 2015-07-22 | 2017-01-26 | Curetis Gmbh | Genetic testing for predicting resistance of salmonella species against antimicrobial agents |
WO2017016600A1 (en) * | 2015-07-29 | 2017-02-02 | Curetis Gmbh | Genetic testing for predicting resistance of enterobacter species against antimicrobial agents |
WO2017117633A1 (en) * | 2016-01-07 | 2017-07-13 | Commonwealth Scientific And Industrial Research Organisation | Plants with modified traits |
CN107190003A (en) * | 2017-06-09 | 2017-09-22 | 武汉天问生物科技有限公司 | A kind of method of efficient quick separating T DNA insertion point flanking sequences and application thereof |
CN111052250A (en) * | 2017-06-28 | 2020-04-21 | 西奈山伊坎医学院 | High resolution microbiological analysis method |
CN110042148B (en) * | 2018-01-16 | 2023-01-31 | 深圳华大基因科技有限公司 | Method for efficiently acquiring chloroplast DNA sequencing data and application thereof |
CN108034706B (en) * | 2018-01-16 | 2021-03-26 | 浙江大学 | Method for rapidly determining insertion site of transgenic strain by using re-sequencing technology |
US11913006B2 (en) * | 2018-03-16 | 2024-02-27 | Nuseed Global Innovation Ltd. | Plants producing modified levels of medium chain fatty acids |
CN109295185B (en) * | 2018-09-05 | 2022-03-22 | 暨南大学 | Method for determining genome size of unicellular eukaryotic algae |
CN109355410A (en) * | 2018-10-30 | 2019-02-19 | 厦门极元科技有限公司 | A method of identification and parting are carried out to the salmonella in macro genome based on the analysis of two generation sequencing datas |
CN111276185B (en) * | 2020-02-18 | 2023-11-03 | 上海桑格信息技术有限公司 | Microorganism identification analysis system and device based on second-generation high-throughput sequencing |
CN111363706A (en) * | 2020-04-13 | 2020-07-03 | 天津中医药大学 | Ecliptae herba endophytic bacteria, eclipta alba composition and application thereof |
CN111647680A (en) * | 2020-06-18 | 2020-09-11 | 北京市园林科学研究院 | Method for rapidly identifying and tracing sedge variety at whole genome level based on second-generation high-throughput sequencing |
Also Published As
Publication number | Publication date |
---|---|
CN112349350A (en) | 2021-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112349350B (en) | Method for strain identification based on Dunaliella core genome sequence | |
CN104164479B (en) | Heterozygous genes group processing method | |
US20180258421A1 (en) | Compositions, methods and uses for multiplex protein sequence activity relationship mapping | |
CN105740650B (en) | A method of quick and precisely identifying high-throughput genomic data pollution sources | |
CN105112569A (en) | Virus infection detection and identification method based on metagenomics | |
CN103088120A (en) | Large-scale genetic typing method based on SLAF-seq (Specific-Locus Amplified Fragment Sequencing) technology | |
Mark et al. | Barcoding lichen-forming fungi using 454 pyrosequencing is challenged by artifactual and biological sequence variation | |
CN106868116A (en) | A kind of mulberry tree pathogen high throughput identification and kind sorting technique and its application | |
CN106947827A (en) | One kind obtains flathead sex specific molecular marker and its screening technique and application | |
CN108103235A (en) | A kind of SNP marker, primer and its application of apple rootstock cold hardness evaluation | |
CA3114759A1 (en) | Sequence-graph based tool for determining variation in short tandem repeat regions | |
CN109402241A (en) | Identification and the method for analyzing ancient DNA sample | |
CN109112217A (en) | A kind of and pig body length and the significantly associated genetic marker of number of nipples and application | |
Lemoine et al. | Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data | |
Xu et al. | Genome reconstruction and haplotype phasing using chromosome conformation capture methodologies | |
CN111197050A (en) | Ribosomal RNA gene of mulberry pseudoblight pathogenic bacteria and application thereof | |
Olds et al. | Applying a modified metabarcoding approach for the sequencing of macrofungal specimens from fungarium collections | |
CN110438244A (en) | A kind of molecular labeling of quick raising duck group blueness shell rate and application | |
US20220243267A1 (en) | Compositions and methods related to quantitative reduced representation sequencing | |
CN113564266B (en) | SNP typing genetic marker combination, detection kit and application | |
CN107354151A (en) | STR molecular labelings and its application based on the exploitation of sika deer full-length genome | |
CN102102129B (en) | Method for detecting single nucleotide polymorphism or small insertions and deletions by utilizing MutS proteins in genome range | |
CN104357563A (en) | Method for performing high-throughput sequencing on haplotype of genome subjected to two-time DNA fragmentation | |
Yang et al. | A new perspective on codon usage, selective pressure, and phylogenetic implications of the plastomes in the Telephium clade (Crassulaceae) | |
Kust et al. | Model cyanobacterial consortia reveal a consistent core microbiome independent of inoculation source or cyanobacterial host species |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | | Effective date of registration: 20231108. Address after: No. 9 Fulong Road, Shinan District, Qingdao, Shandong Province, 266000, 317. Patentee after: Qingdao Aixin Biotechnology Co.,Ltd. Address before: 030006 No. 92, Hollywood Road, Taiyuan, Shanxi. Patentee before: SHANXI University |