CN113957130A - Method for identifying transgenic event based on high-throughput sequencing and probe enrichment - Google Patents

Method for identifying transgenic event based on high-throughput sequencing and probe enrichment

Info

Publication number
CN113957130A
Authority
CN
China
Prior art keywords
sequence
target
effective
reference genome
matched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111133102.7A
Other languages
Chinese (zh)
Other versions
CN113957130B (en)
Inventor
陈利红
彭海
李甜甜
周俊飞
高利芬
李论
方治伟
肖华锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jianghan University
Original Assignee
Jianghan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jianghan University
Priority to CN202111133102.7A
Publication of CN113957130A
Application granted
Publication of CN113957130B
Legal status: Active
Anticipated expiration

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/6895 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/13 Plant traits

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Zoology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Wood Science & Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Genetics & Genomics (AREA)
  • Biochemistry (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Botany (AREA)
  • Mycology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application relates to the technical field of bioinformatics, in particular to a method for rapidly identifying transgenic events based on high-throughput sequencing and probe enrichment. The method comprises the following steps: obtaining a target exogenous gene sequence of a target organism, a capture probe for the target exogenous gene sequence, a plurality of DNA fragments to be identified and a reference genome sequence; ligating the DNA fragments to a linker sequence to obtain linker-containing DNA fragments; capturing the target exogenous gene in the linker-containing DNA fragments with the capture probe and constructing a library to obtain an enriched library; performing high-throughput sequencing on the enriched library to obtain high-throughput sequencing data; and comparing the DNA sequences in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene respectively to determine the position at which the target exogenous gene is inserted into the genome of the target organism. The method does not need whole genome sequencing, greatly saves cost, and can be applied to transgenic materials whose transferred exogenous gene or vector is unknown.

Description

Method for identifying transgenic event based on high-throughput sequencing and probe enrichment
Technical Field
The application relates to the technical field of bioinformatics, in particular to a method for rapidly identifying transgenic events based on high-throughput sequencing and probe enrichment.
Background
Transgenic food (genetically modified food) refers to food obtained by transferring exogenous genes into recipient organisms (such as animals, plants or microorganisms) by means of genetic engineering, so as to change the genetic characteristics of the organisms and obtain traits, nutritional value or quality characteristics that the original species does not possess.
In recent years, the number and diversity of transgenic plants and transgenic crops on the market have increased dramatically. To protect the public's right to choose and right to know, labeling management of transgenic products has gradually been established and strengthened in order to regulate and trace transgenic food and feed. For this reason, transgenic developers must provide the molecular characterization of each newly authorized transgenic organism. Conventional molecular characterization mainly uses Southern hybridization to analyze the number of exogenous genes inserted into a recipient organism, chromosome walking to determine the sequence of the junction between the exogenous gene or inserted vector and the recipient genome, and in situ hybridization to determine the chromosomal position at which the exogenous gene or vector is integrated in the recipient organism. However, these methods are cumbersome, relatively time consuming, and require elaborate, custom-designed experiments for each new plant variety.
With the rapid development of next-generation sequencing technologies, whole genome sequencing is beginning to be used to detect transgenic events. However, this approach must sequence the whole recipient genome to a certain depth in order to scan for the insertion position of the exogenous gene, and for large recipient genomes such as corn and wheat it needs a larger volume of sequencing data, so the sequencing cost increases markedly. If the volume of whole genome sequencing data is small, the insertion site of the target exogenous gene or vector in the target organism may not be covered by sequencing reads, so that the event is missed and the false negative rate increases.
Disclosure of Invention
The application provides a method for rapidly identifying a transgenic event based on high-throughput sequencing and probe enrichment, which aims to solve the problems existing in the prior art for detecting the transgenic event.
In a first aspect, the present application provides a method for identifying transgenic events based on high throughput sequencing and probe enrichment, the method comprising the steps of:
obtaining the target exogenous gene sequence of the target organism;
obtaining a capture probe for the target exogenous gene sequence;
obtaining a plurality of DNA fragments to be identified and a reference genome sequence of the target organism;
ligating the DNA fragments to be identified to a linker sequence to obtain linker-containing DNA fragments;
capturing the target exogenous gene in the linker-containing DNA fragments with the capture probe and constructing a library to obtain an enriched library;
performing high-throughput sequencing on the enriched library to obtain high-throughput sequencing data;
and comparing the DNA sequences in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene respectively, and determining the position at which the target exogenous gene is inserted into the genome of the target organism, so as to identify a transgenic event.
Optionally, the high throughput sequencing comprises single-ended sequencing or double-ended sequencing.
Optionally, comparing the DNA sequence in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene respectively, and determining the position at which the target exogenous gene is inserted into the genome of the target organism, comprises:
if double-ended sequencing is performed, determining whether the paired-end reads in the high-throughput sequencing data contain overlapping fragments;
if so, splicing the paired-end reads that contain overlapping fragments to obtain a spliced sequence;
aligning the spliced sequence with the reference genome sequence and with the target exogenous gene sequence at least once each,
wherein, in the first alignment, if one end of the spliced sequence matches at least N1 bases of the target exogenous gene sequence, a preliminarily matched spliced sequence is obtained; and if the preliminarily matched spliced sequence matches at least N1 bases of the reference genome sequence, a first effective sequence is obtained;
if not, aligning the paired-end reads in the high-throughput sequencing data with the target exogenous gene sequence and with the reference genome sequence at least once each;
if one end of a paired-end read matches at least N3 bases of the target exogenous gene sequence and the other end matches at least N3 bases of the reference genome sequence, a second effective sequence is obtained;
screening the first effective sequence and the second effective sequence to obtain a first target effective sequence;
and determining whether the first target effective sequence has a boundary site between the portion that matches the reference genome sequence and the portion that does not; if so, and if the number of sequences covering the boundary site is greater than or equal to N4, extending the boundary site by N5 bp both upstream and downstream to obtain the insertion position and insertion direction of the target exogenous gene in the reference genome, wherein N1, N3, N4 and N5 are positive integers, N1 ≥ 30, N3 ≥ 30, N4 ≥ 5, and N5 ≥ 20.
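The boundary-site rule above can be illustrated with a minimal sketch. It assumes that an upstream alignment step (not shown) has already reduced each target effective sequence to a hypothetical `(chromosome, position)` pair marking where its match to the reference genome ends; the function name and the output dictionary layout are illustrative and do not come from the application. The defaults use the smallest values the text allows (N4 = 5, N5 = 20).

```python
from collections import Counter

def call_insertion_sites(boundaries, n4=5, n5=20):
    """Keep boundary sites covered by at least n4 sequences and
    report a window extended n5 bp upstream and downstream.

    boundaries: iterable of (chrom, pos) tuples, one per effective
    sequence (hypothetical input format, 1-based positions).
    """
    support = Counter(boundaries)
    sites = []
    for (chrom, pos), count in sorted(support.items()):
        if count >= n4:
            sites.append({
                "chrom": chrom,
                "boundary": pos,
                "support": count,
                # clamp the upstream end so it never falls below position 1
                "window": (max(1, pos - n5), pos + n5),
            })
    return sites

# Six sequences support chr1:1000, only two support chr2:500.
reads = [("chr1", 1000)] * 6 + [("chr2", 500)] * 2
sites = call_insertion_sites(reads)
```

In this toy input only the chr1 site reaches the N4 threshold, so the chr2 candidate is discarded as insufficiently supported.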
Optionally, aligning the spliced sequence with the reference genome sequence and with the target exogenous gene sequence at least once each further comprises:
in a second alignment, aligning the first effective sequence with the target exogenous gene sequence and with the reference genome sequence respectively, and, if the first effective sequence matches more than N2 bases of the target exogenous gene sequence and more than N2 bases of the reference genome sequence, obtaining a third effective sequence, wherein N2 is a positive integer and N2 ≥ 30.
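As a rough illustration of the two-sided check, the sketch below tests whether one end of a spliced read matches the exogenous gene and the other end matches the genome, each over more than N2 bases. Exact substring matching stands in for a real aligner, which is a simplifying assumption; the function names are hypothetical and not from the application.

```python
def longest_prefix_in(read, ref):
    """Length of the longest prefix of read that occurs exactly in ref."""
    best = 0
    for k in range(1, len(read) + 1):
        if read[:k] in ref:
            best = k
        else:
            break  # a longer prefix cannot match if a shorter one fails
    return best

def longest_suffix_in(read, ref):
    """Length of the longest suffix of read that occurs exactly in ref."""
    best = 0
    for k in range(1, len(read) + 1):
        if read[-k:] in ref:
            best = k
        else:
            break
    return best

def is_junction_read(read, transgene, genome, n2=30):
    """True if the read matches more than n2 bases of the exogenous gene
    at one end and more than n2 bases of the genome at the other end."""
    gene_then_genome = (longest_prefix_in(read, transgene) > n2
                        and longest_suffix_in(read, genome) > n2)
    genome_then_gene = (longest_prefix_in(read, genome) > n2
                        and longest_suffix_in(read, transgene) > n2)
    return gene_then_genome or genome_then_gene

# Toy sequences: 40 bp of exogenous gene joined to 40 bp of genome.
transgene = "ACGT" * 20
genome = "TTGCA" * 20
junction = transgene[-40:] + genome[:40]
too_short = transgene[-40:] + genome[:10]
```

A production pipeline would use a gapped, mismatch-tolerant aligner and also consider the reverse strand; this sketch only shows the shape of the N2 test.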
Optionally, screening the first effective sequence and the second effective sequence to obtain a first target effective sequence comprises:
screening the first effective sequence, and taking it as a first target effective sequence
if the number of bases of the first effective sequence matched to the target exogenous gene sequence and to the reference genome sequence is more than N8 bp,
if the number of bases of the first effective sequence mismatched with the target exogenous gene sequence and the reference genome sequence is less than N9 bp,
if, where the sequencing read length of the first effective sequence is 130-150 bp, the sum of the numbers of bases of the first effective sequence matched to the target exogenous gene sequence and to the reference genome sequence is more than 80 bp,
if the number of bases of the first effective sequence matched simultaneously to both the target exogenous gene sequence and the reference genome sequence is less than 10 bp,
and if the number of bases of the first effective sequence matched to neither the target exogenous gene sequence nor the reference genome sequence is less than N10 bp;
screening the second effective sequence, and taking it as a first target effective sequence
if the number of bases of the second effective sequence matched to the target exogenous gene sequence and to the reference genome sequence is more than N8 bp,
if the number of bases of the second effective sequence mismatched with the target exogenous gene sequence and the reference genome sequence is less than N9 bp,
if, where the sequencing read length of the second effective sequence is 130-150 bp, the sum of the numbers of bases of the second effective sequence matched to the target exogenous gene sequence and to the reference genome sequence is more than 80 bp,
if the number of bases of the second effective sequence matched simultaneously to both the target exogenous gene sequence and the reference genome sequence is less than 10 bp,
and if the number of bases of the second effective sequence matched to neither the target exogenous gene sequence nor the reference genome sequence is less than N10 bp;
wherein N8, N9 and N10 are positive integers, N8 ≥ 30, N9 ≤ 10, and N10 ≤ 20.
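The screening conditions can be read as a set of per-sequence thresholds. The sketch below encodes one possible reading in which all conditions must hold together; that conjunctive reading, the `Candidate` record and its field names are assumptions made for illustration, since the text does not spell out how the conditions combine or how the counts are stored.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Alignment summary for one effective sequence (hypothetical layout)."""
    read_len: int         # sequencing read length in bp
    match_transgene: int  # bases matched to the exogenous gene
    match_genome: int     # bases matched to the reference genome
    mismatches: int       # mismatched bases within the aligned parts
    overlap: int          # bases matched to both references at once
    unaligned: int        # bases matched to neither reference

def passes_screen(c, n8=30, n9=10, n10=20):
    """Apply the thresholds as a conjunction (assumed reading)."""
    if c.match_transgene <= n8 or c.match_genome <= n8:
        return False
    if c.mismatches >= n9:
        return False
    # the 80 bp total applies only to reads of 130-150 bp
    if 130 <= c.read_len <= 150 and c.match_transgene + c.match_genome <= 80:
        return False
    if c.overlap >= 10:
        return False
    return c.unaligned < n10

good = Candidate(read_len=150, match_transgene=70, match_genome=75,
                 mismatches=2, overlap=3, unaligned=5)
weak = Candidate(read_len=150, match_transgene=25, match_genome=110,
                 mismatches=0, overlap=0, unaligned=15)
```

Here `good` passes every threshold, while `weak` is rejected because only 25 bases match the exogenous gene.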
Optionally, comparing the DNA sequence in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene respectively, and determining the position at which the target exogenous gene is inserted into the genome of the target organism, further comprises:
if single-ended sequencing is performed, performing a third alignment of the DNA sequences in the high-throughput sequencing data with the target exogenous gene sequence and with the reference genome sequence respectively;
if one end of a DNA sequence matches at least N7 bases of the target exogenous gene sequence, obtaining a preliminarily matched read;
aligning the preliminarily matched read with the reference genome sequence, and obtaining a fourth effective sequence if the preliminarily matched read matches at least N6 bases of the reference genome sequence;
screening the fourth effective sequence to obtain a second target effective sequence;
and determining whether the second target effective sequence has a boundary site between the portion that matches the reference genome sequence and the portion that does not; if so, and if the number of sequences covering the boundary site is greater than or equal to N4, extending the boundary site by N5 bp both upstream and downstream to obtain the insertion position and insertion direction of the target exogenous gene in the reference genome, wherein N6 and N7 are positive integers, N6 ≥ 30, and N7 ≥ 30.
Optionally, before comparing the DNA sequences in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene sequence, the method further comprises: removing impurity sequences.
Optionally, removing impurity sequences includes:
removing the sequencing adaptors from the sequences in the high-throughput sequencing data;
removing sequences that do not meet preset standards, which comprise:
single-ended sequences whose 3' end contains preset-quality bases numbering more than 1/3 of the sequence's own length, wherein a preset-quality base is a base with a quality value less than or equal to 20;
and removing sequences with a length of less than 80 bp.
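A minimal sketch of the quality and length filters described above, assuming FASTQ quality strings in Phred+33 encoding (an assumption; the text does not name the encoding). Adaptor trimming is not shown, and for simplicity low-quality bases are counted over the whole read rather than only toward the 3' end.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) to integer scores."""
    return [ord(ch) - offset for ch in qual]

def keep_read(seq, qual, low_q=20, min_len=80):
    """Return True if the read survives the length and quality filters."""
    if len(seq) < min_len:
        return False
    low = sum(1 for q in phred_scores(qual) if q <= low_q)
    # reject when low-quality bases exceed one third of the read length
    return low * 3 <= len(seq)

clean = keep_read("A" * 90, "I" * 90)                # all bases Q40
noisy = keep_read("A" * 90, "I" * 50 + "#" * 40)     # 40 bases at Q2
short = keep_read("A" * 70, "I" * 70)                # below 80 bp
```

The three calls exercise a read that passes, one with too many low-quality bases, and one that is too short.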
Optionally, the molar proportion of guanine plus cytosine (GC content) in the capture probe is 30%-80%; any sequence segment that the capture probe shares identically with a non-target sequence is shorter than 40 bp, and the sequence homology between the capture probe and the non-target sequence is less than 85%.
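Two of the probe criteria above lend themselves to a short check: GC content between 30% and 80%, and no identical segment of 40 bp or more shared with a non-target sequence. The sketch below covers only these two; the 85% homology criterion needs a real alignment and is left out. Function names are illustrative assumptions.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_shared_run(a, b):
    """Length of the longest identical contiguous segment of a and b
    (classic longest-common-substring dynamic programme)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def probe_passes(probe, non_targets, gc_min=0.30, gc_max=0.80, max_shared=40):
    """Check GC content and the shared-segment limit for one probe."""
    if not gc_min <= gc_fraction(probe) <= gc_max:
        return False
    return all(longest_shared_run(probe, nt) < max_shared for nt in non_targets)

probe = "ACGT" * 25                           # 100 bp, 50% GC
off_target = "GG" + probe[:50] + "GG"         # shares a long identical run
ok = probe_passes(probe, ["T" * 200])
rejected = probe_passes(probe, [off_target])
```

The quadratic dynamic programme is adequate for probe-sized inputs; screening against a whole genome would call for an indexed search instead.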
In a second aspect, the present application provides a use of the method of any one of claims 1-9 for any one of plants, animals and microorganisms for identifying transgenic events based on high throughput sequencing and probe enrichment.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
According to the method provided by the embodiments of the application, fragmented DNA is end-repaired and then ligated to a linker sequence, and a sample barcode is added to each sample to be identified, so that multiple samples can be pooled for probe capture and high-throughput sequencing. A capture probe is designed from the target exogenous gene sequence; the linker-ligated DNA fragments are captured and enriched with the capture probe to obtain an enriched library; the enriched library is subjected to high-throughput sequencing to obtain high-throughput sequencing data; and the high-throughput sequencing data are compared with the target exogenous gene sequence and the reference genome sequence to determine the position at which the target exogenous gene is inserted into the genome of the target organism. Compared with identifying transgenic events by whole genome sequencing, the method only needs to enrich the transferred exogenous gene (vector) and its adjacent flanking sequences, so whole genome sequencing is unnecessary and cost is greatly reduced. The method can be applied to transgenic materials whose transferred exogenous gene or vector is unknown, and compared with conventional PCR-based methods (such as chromosome walking) it is rapid, highly repeatable and gives stable results.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method for identifying a transgenic event based on high throughput sequencing and probe enrichment provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of the principle of the method of the present application for identifying transgenic events in plants using targeted sequencing techniques;
FIG. 3 is an analytical flow chart of the method of the present application for identifying plant transgenic events using targeted sequencing techniques;
FIG. 4 is a schematic diagram of the foreign gene insert (vector) in example 1;
FIG. 5 is a top alignment of the p35s promoter and maize genomic sequences found in the transgenic event analysis in the MON810 transformant of example 1;
FIG. 6 is a schematic diagram showing the structure of the insertion of the foreign gene insert (vector) into the genome of maize in example 1;
FIG. 7 is a schematic diagram of the foreign gene insert of example 2;
FIG. 8 is a flow chart of the experimental analysis of the method of the present invention for identifying the flanking sequences of the inserted gene of Chlamydomonas using the technique of targeted sequencing;
FIG. 9 is the alignment information of a valid read with the foreign insert and the inserted genome in example 3;
FIG. 10 is a diagram showing the details of the insertion of an insertion mutant into the genome of Chlamydomonas in example 3 (precise insertion);
FIG. 11 is the alignment information of one valid read with the foreign insert and the inserted genome in example 4;
FIG. 12 is a schematic diagram showing the specific case of the insertion of an insertion mutant of Chlamydomonas into the Chlamydomonas genome in example 4 (non-precise insertion).
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a first aspect, the present application provides a method for identifying transgenic events based on high throughput sequencing and probe enrichment, as shown in fig. 1, comprising the steps of:
S1, obtaining a target exogenous gene sequence of a target organism;
In the embodiments of this application, if the sequence of the inserted target exogenous gene (vector) is known, the method can quickly identify the insertion position, copy number, orientation and flanking sequence information of the exogenous gene in the recipient species. If the inserted target exogenous gene (vector) is unknown, the method can still identify this information effectively on the basis of collected sequence information for commonly used exogenous genes (vectors). The method can simultaneously detect the insertion position, copy number, orientation and flanking sequence information of multiple identical or different target exogenous genes.
In the embodiments of the present application, the target exogenous gene sequence and the reference genome of the target organism may be provided by the unit that provides the target organism, or may be collected from commonly used or published target exogenous gene sequences and reference genomes of the target organism.
S2, obtaining a capture probe for the target exogenous gene sequence;
In the embodiment of the application, a capture probe for the target exogenous gene sequence is obtained. If the capture probes are 50-80 bp long, they cover the target region at high density (that is, adjacent probes overlap), and the overlap compensates for the slightly weaker capture efficiency of short probes. If the capture probes are 100-120 bp long, they capture well even without overlap. Probe design must avoid SSR regions and N regions; conventionally, capture probes must not be identical to one another over more than 40 bp.
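The two tiling regimes described above (short overlapping probes versus long end-to-end probes) can be sketched as a sliding window over the target sequence. The function name, the default probe length and step, and the choice to skip any window containing N are illustrative assumptions; SSR avoidance and the 40 bp identity limit are not implemented here.

```python
def tile_probes(target, probe_len=60, step=30):
    """Slide a window of probe_len across target in increments of step.

    step < probe_len gives overlapping (high-density) probes;
    step == probe_len gives end-to-end probes without overlap.
    Windows containing N are skipped. Returns (start, sequence) pairs.
    """
    probes = []
    for start in range(0, len(target) - probe_len + 1, step):
        window = target[start:start + probe_len].upper()
        if "N" in window:
            continue
        probes.append((start, window))
    return probes

target = "ACGT" * 50                                       # 200 bp toy target
dense = tile_probes(target, probe_len=60, step=30)         # 30 bp overlap
sparse = tile_probes(target, probe_len=100, step=100)      # no overlap
gapped = tile_probes("ACGT" * 10 + "N" + "ACGT" * 10, probe_len=40, step=40)
```

Note that a trailing stretch shorter than one step can remain uncovered (positions 180-199 in the dense example); a real design would add a final probe anchored at the end of the target.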
In an embodiment of the present application, the method for preparing the capture probe comprises: and designing by using the target exogenous gene sequence to obtain a capture probe.
S3, obtaining a reference genome sequence of the target organism and a plurality of DNA fragments to be identified;
S4, ligating the DNA fragments to be identified to a linker sequence to obtain linker-containing DNA fragments;
S5, capturing the target exogenous gene in the linker-containing DNA fragments with the capture probe and constructing a library to obtain an enriched library;
S6, performing high-throughput sequencing on the enriched library to obtain high-throughput sequencing data;
In the embodiment of the application, a molecule-specific barcode is added to the original template of each DNA molecule before the sequencing library is constructed, and a program uses these barcodes to count the original DNA template molecules of an internal reference gene. The high-throughput sequencing data are compared with the reference genome, and when the actual number of original DNA template molecules is less than a specific value, the experiment is repeated for that sample; this specific value may be 100, 105, etc., and is at least 100. When the number of original DNA template molecules of the reference gene is greater than the specific value, the enriched library of the target organism is judged to have been obtained.
The reference gene is selected according to the following conditions: (1) the reference gene is relatively conserved in plants; (2) the number of probe design sites on the selected conserved gene should not be too large, for example 2-3 sites; and the selected reference gene is verified in a variety of plants to ensure its universality in plants.
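The library quality check based on molecular barcodes amounts to counting distinct barcodes among the reads that map to the internal reference gene and comparing the count with a threshold. The sketch below assumes the barcodes have already been extracted into a list of strings; the function names are hypothetical, and treating a count exactly equal to the threshold as a pass is an assumption, since the text only covers the less-than and greater-than cases.

```python
def count_template_molecules(read_barcodes):
    """Each distinct molecular barcode is taken to mark one original
    DNA template molecule, however many reads carry it."""
    return len(set(read_barcodes))

def library_passes(read_barcodes, min_molecules=100):
    """True if enough original template molecules were sampled;
    otherwise the sample should be re-run.
    A count equal to the threshold passes here (assumed)."""
    return count_template_molecules(read_barcodes) >= min_molecules

# 1000 reads drawn from 120 and from 50 distinct barcodes respectively.
deep = [f"UMI{i % 120}" for i in range(1000)]
shallow = [f"UMI{i % 50}" for i in range(1000)]
```

In practice barcodes that differ only by a sequencing error are usually collapsed before counting; that correction is omitted here.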
S7, comparing the DNA sequence in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene respectively, and determining the position of the target exogenous gene inserted into the target organism genome to identify a transgenic event.
In the embodiment of the present application, obtaining the target foreign gene sequence of the target organism may be performed after obtaining several DNA fragments to be identified and a reference genome sequence of the target organism.
As an alternative embodiment, the high throughput sequencing comprises single-ended sequencing or double-ended sequencing.
Currently, three high-throughput sequencing technologies, Roche 454, Solexa and ABI SOLiD, support two modes, namely single-ended sequencing and double-ended sequencing. In de novo genome sequencing, the single-ended read length of Roche 454 can reach 400 bp and is often used for genome backbone assembly, while Solexa and ABI SOLiD double-ended sequencing can be used for scaffold assembly and gap filling. Single-ended sequencing (Single-read) and double-ended sequencing (Paired-end and Mate-pair) are described below with Solexa as an example. Single-read, Paired-end and Mate-pair differ mainly in how the sequencing library is constructed.
In the embodiment of the application, the high-throughput sequencing library is subjected to high-throughput sequencing to obtain single-end (200-.
As an alternative embodiment, as shown in fig. 2 and fig. 3, comparing the DNA sequence in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene respectively, and determining the position at which the target exogenous gene is inserted into the genome of the target organism, comprises:
if double-ended sequencing is performed, determining whether the paired-end reads in the high-throughput sequencing data contain overlapping fragments;
if so, splicing the paired-end reads that contain overlapping fragments to obtain a spliced sequence;
aligning the spliced sequence with the reference genome sequence and with the target exogenous gene sequence at least once each, wherein, in the first alignment, if one end of the spliced sequence matches at least N1 bases of the target exogenous gene sequence, a preliminarily matched spliced sequence is obtained; and if the preliminarily matched spliced sequence matches at least N1 bases of the reference genome sequence, a first effective sequence is obtained;
if not, aligning the paired-end reads in the high-throughput sequencing data with the target exogenous gene sequence and with the reference genome sequence at least once each;
if one end of a paired-end read matches at least N3 bases of the target exogenous gene sequence and the other end matches at least N3 bases of the reference genome sequence, a second effective sequence is obtained;
screening the first effective sequence and the second effective sequence to obtain a first target effective sequence;
and determining whether the first target effective sequence has a boundary site between the portion that matches the reference genome sequence and the portion that does not; if so, and if the number of sequences covering the boundary site is greater than or equal to N4, extending the boundary site by N5 bp both upstream and downstream to obtain the insertion position and insertion direction of the target exogenous gene in the reference genome, wherein N1, N3, N4 and N5 are positive integers, N1 ≥ 30, N3 ≥ 30, N4 ≥ 5, and N5 ≥ 20.
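The overlap test and splicing of paired-end reads described above can be sketched as follows. The second read is reverse-complemented, and the longest suffix of the first read that equals a prefix of that mate is taken as the overlap. Requiring an exact match and a minimum overlap of 10 bp are assumptions made for the sketch; dedicated read mergers tolerate mismatches and weigh base qualities.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_pair(read1, read2, min_overlap=10):
    """Splice a read pair if the 3' end of read1 overlaps the
    reverse complement of read2; return None when no overlap is found."""
    mate = revcomp(read2)
    # try the longest possible overlap first
    for k in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-k:] == mate[:k]:
            return read1 + mate[k:]
    return None

# A 42 bp fragment read 30 bp from each end, giving an 18 bp overlap.
fragment = "ATGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTGCA"
read1 = fragment[:30]
read2 = revcomp(fragment[-30:])
merged = merge_pair(read1, read2)
unmerged = merge_pair("A" * 12, "C" * 12)
```

Pairs for which `merge_pair` returns None correspond to the "if not" branch above, where the two reads are aligned separately.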
As an alternative embodiment, aligning the spliced sequence with the reference genome sequence and with the target exogenous gene sequence at least once each further comprises:
in a second alignment, aligning the first effective sequence with the target exogenous gene sequence and with the reference genome sequence respectively, and, if the first effective sequence matches more than N2 bases of the target exogenous gene sequence and more than N2 bases of the reference genome sequence, obtaining a third effective sequence, wherein N2 is a positive integer and N2 ≥ 30.
As an alternative embodiment, screening the first effective sequence and the second effective sequence to obtain the first target effective sequence comprises:
screening the first effective sequence, wherein the first effective sequence is retained as a first target effective sequence if all of the following hold:
the number of bases of the first effective sequence matched with the target exogenous gene sequence and with the reference genome sequence is more than N8 bp;
the number of bases of the first effective sequence mismatched with the target exogenous gene sequence and the reference genome sequence is less than N9 bp;
when the sequencing read length of the first effective sequence is 130-150 bp, the sum of the numbers of bases of the first effective sequence matched with the target exogenous gene sequence and with the reference genome sequence is more than 80 bp;
the number of bases of the first effective sequence matched simultaneously with both the target exogenous gene sequence and the reference genome sequence is less than 10 bp;
the number of bases of the first effective sequence matched with neither the target exogenous gene sequence nor the reference genome sequence is less than N10 bp;
screening the second effective sequence, wherein the second effective sequence is retained as a first target effective sequence if all of the following hold:
the number of bases of the second effective sequence matched with the target exogenous gene sequence and with the reference genome sequence is more than N8 bp;
the number of bases of the second effective sequence mismatched with the target exogenous gene sequence and the reference genome sequence is less than N9 bp;
when the sequencing read length of the second effective sequence is 130-150 bp, the sum of the numbers of bases of the second effective sequence matched with the target exogenous gene sequence and with the reference genome sequence is more than 80 bp;
the number of bases of the second effective sequence matched simultaneously with both the target exogenous gene sequence and the reference genome sequence is less than 10 bp;
the number of bases of the second effective sequence matched with neither the target exogenous gene sequence nor the reference genome sequence is less than N10 bp;
wherein N8, N9 and N10 are positive integers, N8 is greater than or equal to 30, N9 is less than or equal to 10, and N10 is less than or equal to 20.
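The screening conditions above can be expressed as a single predicate. The following Python sketch is one possible reading, not the applicant's program: it assumes that the per-read counts have already been extracted from the alignments, that "more than N8" applies to the exogenous-gene match and the genome match separately, and that mismatches are counted jointly. The Candidate type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Alignment summary of one effective sequence (hypothetical structure)."""
    read_len: int       # sequencing read length in bp
    match_foreign: int  # bases matched to the target exogenous gene
    match_genome: int   # bases matched to the reference genome
    mismatches: int     # mismatched bases over both alignments (assumed joint count)
    overlap: int        # bases matched to both sequences at the same time
    unaligned: int      # bases matched to neither sequence


def passes_screen(c, n8=30, n9=10, n10=20):
    """Return True if the candidate meets all five screening conditions."""
    if c.match_foreign <= n8 or c.match_genome <= n8:
        return False  # each side must match over more than N8 bp
    if c.mismatches >= n9:
        return False  # fewer than N9 mismatched bases
    if 130 <= c.read_len <= 150 and c.match_foreign + c.match_genome <= 80:
        return False  # summed matches must exceed 80 bp for 130-150 bp reads
    if c.overlap >= 10:
        return False  # fewer than 10 bp matched to both sequences at once
    if c.unaligned >= n10:
        return False  # fewer than N10 bp matched to neither sequence
    return True


if __name__ == "__main__":
    print(passes_screen(Candidate(150, 70, 75, 2, 3, 5)))
    print(passes_screen(Candidate(150, 35, 40, 2, 3, 5)))
```

The second call fails only on the 80 bp sum rule, which shows why that condition is listed separately from the per-side N8 threshold.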
As an alternative embodiment, determining the position at which the target exogenous gene is inserted into the genome of the target organism by aligning the DNA sequences in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene respectively further comprises:
in the case of single-end sequencing, performing a third alignment of the DNA sequences in the high-throughput sequencing data with the target exogenous gene sequence and the reference genome sequence respectively;
if one end of a DNA sequence matches the target exogenous gene sequence over at least N7 bases, obtaining a preliminarily matched read;
aligning the preliminarily matched read with the reference genome sequence, and obtaining a fourth effective sequence if the preliminarily matched read matches the reference genome sequence over at least N6 bases;
screening the fourth effective sequence to obtain a second target effective sequence;
and judging whether the second target effective sequence has a boundary site between the portion that matches the reference genome sequence and the portion that does not; if so, and if the number of sequences covering the boundary site is greater than or equal to N4, extending the boundary site by N5 bp upstream and downstream to obtain the insertion position and the insertion direction of the target exogenous gene in the reference genome, wherein N6 and N7 are positive integers, N6 is greater than or equal to 30, and N7 is greater than or equal to 30.
As an alternative embodiment, before aligning the high-throughput sequencing data with the reference genome sequence and the target exogenous gene sequence, the method further comprises: removing impurity sequences.
As an alternative embodiment, removing the impurity sequences comprises:
removing sequencing adapters from the high-throughput sequencing data;
removing sequences that do not meet preset standards, the sequences that do not meet the preset standards comprising:
single-end sequences in which the number of preset-quality bases at the 3' end exceeds 1/3 of the sequence's own length, wherein the preset-quality bases are bases with a quality value less than or equal to 20; and
sequences with a length of less than 80 bp.
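A minimal Python sketch of the two read-level filters above follows. It assumes that adapters have already been trimmed and that qualities are available as Phred integers; it also reads "preset-quality bases at the 3' end" as the contiguous run of bases with quality at most 20 that ends at the 3' terminus, which is an interpretation rather than something the application spells out. The function names are hypothetical.

```python
def low_quality_tail(quals, q_cutoff=20):
    """Length of the contiguous 3'-terminal run of bases with quality <= q_cutoff."""
    run = 0
    for q in reversed(quals):
        if q > q_cutoff:
            break
        run += 1
    return run


def keep_read(quals, min_len=80, q_cutoff=20):
    """Keep a read unless it is shorter than min_len or its low-quality
    3' tail exceeds one third of its own length."""
    if len(quals) < min_len:
        return False
    return low_quality_tail(quals, q_cutoff) * 3 <= len(quals)


if __name__ == "__main__":
    print(keep_read([30] * 100))              # uniformly high quality
    print(keep_read([30] * 60 + [10] * 40))   # 40 of 100 bases form a poor 3' tail
```

Integer arithmetic (tail * 3 <= length) is used instead of a floating-point 1/3 so that the comparison at the exact one-third boundary is unambiguous.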
In an alternative embodiment, the molar proportion of guanine plus cytosine in the capture probe is 30% to 80%; any sequence segment that the capture probe shares identically with a non-target sequence is shorter than 40 bp, and the sequence homology between the capture probe and the non-target sequence is less than 85%; the capture probes preferably form no dimer or hairpin structures and have similar annealing temperatures.
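Two of the probe criteria above lend themselves to a direct check: the GC proportion and the length of the longest segment shared identically with a non-target sequence. The Python sketch below is illustrative only; the function names are hypothetical, the shared-segment search is a plain dynamic-programming longest common substring, and the homology, dimer, hairpin and annealing-temperature criteria are not covered.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a probe sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_shared_run(probe, other):
    """Length of the longest identical segment shared by probe and other
    (longest common substring, O(len(probe) * len(other)))."""
    best = 0
    prev = [0] * (len(other) + 1)
    for base in probe:
        cur = [0] * (len(other) + 1)
        for j, other_base in enumerate(other, start=1):
            if base == other_base:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def probe_ok(probe, non_targets, gc_low=0.30, gc_high=0.80, max_shared=40):
    """GC proportion within 30-80% and every shared identical segment shorter than 40 bp."""
    if not gc_low <= gc_fraction(probe) <= gc_high:
        return False
    return all(longest_shared_run(probe, nt) < max_shared for nt in non_targets)


if __name__ == "__main__":
    print(gc_fraction("GGCCAATT"))
    print(longest_shared_run("ACGTACGT", "TTACGTAA"))
```

For genome-scale non-target sets a quadratic comparison would be too slow, and an aligner or k-mer index would normally replace longest_shared_run; the sketch only makes the criterion concrete.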
In a second aspect, the present application provides use of the method of the first aspect for identifying transgenic events, based on high-throughput sequencing and probe enrichment, in any of transgenic plants, transgenic animals and transgenic microorganisms.
The method has the following main advantages: high throughput: a sample-specific barcode is added to each sample in advance, so that hundreds to thousands of exogenous gene insertion mutants can be detected in one run; low cost: compared with detecting exogenous gene insertion sites by whole-genome sequencing, capturing the target sequence with probes allows the target sequence to be sequenced to a greater depth at a markedly lower cost, and the cost is roughly comparable even to traditional TAIL-PCR or genome walking; the results are stable, reproducible and easier to interpret; the function of the gene in which the insertion mutation occurs can be determined by comparing the phenotypes of the transgenic material and the wild-type material; and multiple transgenic events can be detected at once, with the insertion position, copy number and direction of the exogenous gene and the flanking sequences of the insertion position on the genome all obtained in a single analysis by means of bioinformatics.
The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Unless otherwise specified, the examples are carried out under conventional experimental conditions; for example, conventional molecular experiments can be carried out according to Sambrook J & Russell DW, Molecular Cloning: A Laboratory Manual, 2001, or according to the manufacturer's instructions.
Example 1
Experimental materials: MON810 transgenic maize was purchased from the Institute for Reference Materials and Measurements, the supplier of European Union reference materials (IRMM, Geel, Belgium). The exogenous fragment carried by the transgenic sample is shown in figure 4; it contains the cauliflower mosaic virus 35S promoter (p35S), an intron sequence of the maize heat shock protein HSP70, and the insect-resistance gene cry1Ab. This transgenic sample served as the research material.
Extraction and fragmentation of DNA: the plant genome was extracted with the high-efficiency plant genomic DNA extraction kit (DP350) of Tiangen Biochemical Technology (Beijing) Co., Ltd. The plant material from which DNA is extracted may be seeds, fresh plant material such as roots, stems or leaves, a mixture of such organs, or seedlings that have just germinated. This experiment used MON810 powder produced by IRMM with a transgene content of 10%. After DNA extraction with the Tiangen kit, 0.5-1 µg of DNA was sheared in an ultrasonicator (Covaris, Woburn, MA, USA), fragmenting the genomic DNA into 200-500 bp fragments.
Design of the probes for capturing the exogenous fragment: the probes meeting the conditions were synthesized by the Kinry organism company. In this example, probe capture targeted the exogenous insertion elements shown in FIG. 4, with the main aim of identifying the insertion position of the exogenous gene in the recipient crop. Probes were designed to cover the p35s promoter (SEQ ID NO: 1) and the cry1Ab gene (SEQ ID NO: 2) in full, while for the maize HSP70 intron (SEQ ID NO: 1), located in the middle, probes were designed only at both ends of the target sequence; 26 probes were designed in total, and the probe sequences are shown in Table 1. The probes are received as dry powder; 100 µL of probe diluent is added to dilute each probe to 2 pM, all probes are mixed in equal amounts into the probe mix, and 2 µL of probe mix is used per sample for sample detection.
Table 1 probe sequences used in example 1.
Figure BDA0003281027850000111
Figure BDA0003281027850000121
Figure BDA0003281027850000131
Figure BDA0003281027850000141
1. Constructing and sequencing a probe capture library: genomic DNA from each sample was used to construct an Illumina-compatible NGS library. Briefly, DNA fragmented with a Covaris S220 was subjected to end repair and A-tailing, and then ligated with an excess of DNA-molecule-specific barcodes (barcode) and sample-specific barcodes (index). Compared with the 5' universal adapter sequence, the 3' adapter sequence contained an additional 8 bp DNA-molecule-specific barcode (UMI). A DNA library was then constructed according to the protocol of the GenoBaits DNA library preparation kit (DL002, MolBreeding Biotechnology Co., Ltd., China). The Illumina-compatible GenoBaits libraries (100 ng each) were pooled, and target sequence capture and enrichment were performed with the GenoBaits DNA library preparation kit (DL001, MolBreeding Biotechnology Co., Ltd., China) according to its instructions. The enriched library was finally subjected to paired-end high-throughput sequencing on an Illumina HiSeq X Ten (Illumina, Inc., San Diego, CA) with a sequencing read length of 2 × 150 bp. After the sequencing data came off the instrument, quality control was performed with cutadapt 2.4 to remove impurity sequences from the high-throughput sequencing data.
2. Performing quality control according to the number of internal reference gene molecules in the high-throughput sequencing data, to judge whether a usable library of the target organism has been obtained: (1) the number of original DNA template molecules of the internal reference gene is calculated by a program from the molecule-specific barcode added to each original DNA template molecule before the sequencing library is constructed; (2) when there are fewer than 100 original template DNA molecules, the sample needs to be re-tested. In this example the internal reference gene gave 197999 raw sequencing reads, which reduced to 48782 original DNA template molecules after collapsing by DNA molecular barcode, showing that probe capture and library construction were successful and that the subsequent bioinformatics analysis could proceed.
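The barcode-based reduction in step 2 can be illustrated with a short Python sketch. The application only states that a program counts original template molecules from the molecular barcodes; the grouping key used here (UMI plus mapping chromosome and start position) and the function names are assumptions made for illustration.

```python
def count_template_molecules(reads):
    """Count distinct original template molecules.

    reads: iterable of (umi, chrom, start) tuples for reads mapped to the
    internal reference gene. Reads sharing all three values are treated as
    PCR copies of one template (assumed grouping key).
    """
    return len({(umi, chrom, start) for umi, chrom, start in reads})


def needs_retest(n_templates, minimum=100):
    """A sample is re-tested when fewer than `minimum` template molecules are found."""
    return n_templates < minimum


if __name__ == "__main__":
    demo = [
        ("ACGTACGT", "chr1", 1000),
        ("ACGTACGT", "chr1", 1000),  # duplicate of the first read
        ("TTGCAAGC", "chr1", 1000),  # same position, different UMI
        ("ACGTACGT", "chr1", 1042),  # same UMI, different position
    ]
    n = count_template_molecules(demo)
    print(n, needs_retest(n))
```

With the figures reported in this example, 48782 template molecules is well above the threshold of 100, so needs_retest(48782) is False.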
3. The target effective sequences were screened as described in this application; a screened target effective sequence must meet the following five conditions:
a, the number of bases of the target effective sequence matched with the exogenous gene and with the plant reference genome is more than 30 bp;
b, the number of bases of the target effective sequence mismatched with the exogenous gene and the plant reference genome is less than 10 bp;
c, the sum of the numbers of bases of the target effective sequence matched with the exogenous gene and with the plant reference genome is more than 80 bp (when the sequencing read length is between 130 and 150 bp);
d, the number of bases of the target effective sequence matched simultaneously with both the exogenous gene and the plant reference genome is less than 10 bp;
e, the number of bases of the target effective sequence matched with neither the exogenous gene nor the plant reference genome is less than 20 bp;
4. Determination of insertion site and flanking sequence: it is judged whether the target effective sequence has a boundary site between the portion that matches the reference genome sequence and the portion that does not; if so, and if the number of sequences covering the boundary site is greater than or equal to 5, the boundary site is extended by 20 bp upstream and downstream to obtain the flanking sequence and the insertion direction of the inserted exogenous gene.
Screening according to the above steps, as shown in FIG. 3, found reads aligned to the p35s promoter at one end and to a maize chromosome at the other end, which determined the insertion position of the p35s promoter in maize as position 55879236 on maize chromosome 5. Aligning one of these sequences at NCBI indeed found the 5' end sequence of the maize MON810 transgenic line (GenBank number: JQ406879.1); the alignment result is shown in FIG. 5. As shown in FIG. 6, the insertion of the exogenous fragment introduced a 9 bp sequence at the right border of the cry1Ab insertion, whereas no nucleotide sequence was inserted or deleted at the left border of p35s. The insertion position of cry1Ab in the maize genome was also found on maize chromosome 5, but at some distance from p35s. This may be because the maize reference genome used here is B73, whereas the transgene recipient used by Monsanto was maize Hi-II, so that the positions found for p35s and cry1Ab in the maize genome lie some distance apart. When the sequences that determine the cry1Ab insertion position were searched further at NCBI, they could indeed be aligned to the 3' end sequence of the maize MON810 line in NCBI (GenBank number: JQ406878.1). These results demonstrate the reliability of the method of the invention for identifying transgenic events.
Example 2
1. Experimental materials: the JMJ705-pU1301 vector (see figure 7), which carries an exogenous gene fragment containing the hygromycin-resistance selection marker gene Hyg, was transferred into the rice line Zhonghua ("middle flower") by Agrobacterium-mediated transformation. The transformants obtained were selected on hygromycin-containing plates, and the resulting stable line was used as the research material.
2. Extraction and fragmentation of DNA: the plant genome was extracted with the high-efficiency plant genomic DNA extraction kit (DP350) of Tiangen Biochemical Technology (Beijing) Co., Ltd. The plant material from which DNA is extracted may be seeds, fresh plant material such as roots, stems or leaves, a mixture of such organs, or seedlings that have just germinated. This experiment used transgenic material produced in our laboratory, and DNA was extracted from fresh leaves of the transgenic line. After DNA extraction with the Tiangen kit, 0.5-1 µg of DNA was sheared in an ultrasonicator (Covaris, Woburn, MA, USA), fragmenting the genomic DNA into 200-500 bp fragments.
3. Design of the probes for capturing exogenous fragments: the exogenous probe sequences were designed to meet the conditions of the present application, and the probes meeting the conditions were synthesized by Kinsley BioCorp. In this example, probe capture targeted the 2 exogenous insertion elements shown in FIG. 7. Because the main aim was to identify the insertion position of the exogenous gene in the recipient crop, 4 probes were designed to cover the t35s terminator (SEQ ID NO: 4) and the tNOS terminator (SEQ ID NO: 5) in full; the probe sequences are shown in Table 2. The probes are received as dry powder; 100 µL of probe diluent is added to dilute each probe to 2 pM, all probes are mixed in equal amounts into the probe mix, and 2 µL of probe mix is used per sample for sample detection.
Table 2 probe sequences used in example 2.
Figure BDA0003281027850000161
Figure BDA0003281027850000171
4. Constructing and sequencing a probe capture library: genomic DNA from each sample was used to construct an Illumina-compatible NGS library. Briefly, DNA fragmented with a Covaris S220 was subjected to end repair and A-tailing, and then ligated with an excess of DNA-molecule-specific barcodes (barcode) and sample-specific barcodes (index). Compared with the 5' universal adapter sequence, the 3' adapter sequence contained an additional 8 bp DNA-molecule-specific barcode (UMI). A DNA library was then constructed according to the protocol of the GenoBaits DNA library preparation kit (DL002, MolBreeding Biotechnology Co., Ltd., China). The Illumina-compatible GenoBaits libraries (100 ng each) were pooled, and target sequence capture and enrichment were performed with the GenoBaits DNA library preparation kit (DL001, MolBreeding Biotechnology Co., Ltd., China) according to its instructions. Finally, the enriched library was subjected to paired-end high-throughput sequencing on an Illumina HiSeq X Ten (Illumina, Inc., San Diego, CA) with a sequencing read length of 2 × 150 bp. After the sequencing data came off the instrument, quality control was performed with cutadapt 2.4 to remove impurity sequences from the high-throughput sequencing data.
5. Performing quality control according to the number of internal reference gene molecules in the high-throughput sequencing data:
(1) the number of original DNA template molecules of the internal reference gene is calculated by a program from the molecule-specific barcode added to each original DNA template molecule before the sequencing library is constructed;
(2) when there are fewer than 100 original template DNA molecules, the sample needs to be re-tested. In this example the internal reference gene gave 53364 raw sequencing reads, which reduced to 7964 original DNA template molecules after collapsing by DNA molecular barcode, showing that probe capture and library construction were successful and that the subsequent bioinformatics analysis could proceed.
6. The target effective sequences were screened as described in this application; a screened target effective sequence must meet the following five conditions:
a, the number of bases of the target effective sequence matched with the exogenous gene and with the plant reference genome is more than 30 bp;
b, the number of bases of the target effective sequence mismatched with the exogenous gene and the plant reference genome is less than 10 bp;
c, the sum of the numbers of bases of the target effective sequence matched with the exogenous gene and with the plant reference genome is more than 80 bp (when the sequencing read length is between 130 and 150 bp);
d, the number of bases of the target effective sequence matched simultaneously with both the exogenous gene and the plant reference genome is less than 10 bp;
e, the number of bases of the target effective sequence matched with neither the exogenous gene nor the plant reference genome is less than 20 bp.
7. Determination of insertion site and flanking sequence: it is judged whether the target effective sequence has a boundary site between the portion that matches the reference genome sequence and the portion that does not; if so, and if the number of sequences covering the boundary site is greater than or equal to 5, the boundary site is extended by 20 bp upstream and downstream to obtain the flanking sequence and the insertion direction of the inserted exogenous gene.
Screening according to the above steps found reads aligned to the tNOS terminator at one end and to a rice chromosome at the other end, which determined the insertion position of the tNOS terminator in rice (table 3) as position 21340946 on rice chromosome 2. The insertion position of t35s in the rice genome was found at the same time, at position 21340892 on rice chromosome 2. Four bases were introduced at the 21340892 end and 14 bases at the 21340946 end, and the insertion of the exogenous fragment deleted 54 bases of the genome. There is one copy in total on the rice chromosomes. The method effectively identified this novel transgenic event together with the insertion position, copy number and insertion direction of the exogenous gene, and these results show that the method for identifying transgenic events is feasible.
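The reported deletion size can be cross-checked against the two boundary coordinates. The short Python sketch below assumes that the deleted length is taken as the plain difference between the two boundary positions on rice chromosome 2, which reproduces the 54 bases stated above; the function name is hypothetical.

```python
def deleted_length(left_boundary, right_boundary):
    """Genomic bases lost between the two insertion boundaries
    (assumed to be the simple coordinate difference)."""
    if right_boundary < left_boundary:
        raise ValueError("right boundary lies upstream of left boundary")
    return right_boundary - left_boundary


if __name__ == "__main__":
    # t35s boundary and tNOS boundary on rice chromosome 2, as reported in this example
    print(deleted_length(21340892, 21340946))  # 54
```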
The method provided by the invention adds a sample barcode to each Chlamydomonas reinhardtii mutant generated by insertion of an exogenous gene, pools multiple samples, captures the target sequence with the designed probes and performs high-throughput sequencing; combined with subsequent bioinformatics analysis, it can effectively identify the insertion position, insertion direction, copy number and flanking sequence of the target exogenous gene in each mutant genome (as shown in figure 8). The method is described below with reference to example 3 and example 4.
Example 3
1. Experimental materials: after carrying on the restriction enzyme of pJMG-aphVIII carrier of foreign gene fragment, utilize the electric transformation method to transfer to the wild type algal strain 21gr of Chlamydomonas reinhardtii (Chlamydomonas reinhardtii). Wherein, the exogenous gene fragment contains a selection marker gene AphVIII with batroxobin resistance. The obtained transformant strain is screened on a plate containing paromomycin to obtain a stable strain which is used as a research material.
Extraction and fragmentation of DNA: the extraction of Chlamydomonas reinhardtii genome adopts 0.5ug-1ug of high-efficiency plant genome DNA extraction kit (DP350) of Tiangen Biochemical technology (Beijing) Co., Ltd.) to break the genome DNA into 200-fold 500bp fragments by an ultrasonic disruptor (Covaris, Woburn, MA, USA).
3. The design of the capture exogenous fragment probe, and finally the synthesis of the probe meeting the conditions in Wuhan Pongzi biotechnology limited company. In this example, only 3 probe sequences at both ends of the foreign fragment were used for the test, and the probe sequences were SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8.
and (3) probe hybridization: corresponding reagents or samples, following hybridization solution, were added according to the following system, denatured at 95 ℃ for 10min, and hybridized at 67 ℃ for 1h in a hybridization chamber. As shown in table 2.
Table 2 probe hybridization system.
Component                                   Volume
Probe (10 µmol/L, i.e. 10 pmol/µL)          12 µL
Disrupted DNA                               18 µL (1 µg)
20× SSC (final concentration 6× SSC)        15 µL
500 mM EDTA (final 5 mM)                    0.5 µL
10% SDS (final 0.1%)                        0.5 µL
50× Denhardt's reagent (final 1×)           1 µL
ddH2O                                       3 µL
Total volume                                50 µL
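The final concentrations quoted in the hybridization system follow from the usual dilution relation (final = stock × volume / total volume). The Python sketch below recomputes them from the stock concentrations and volumes listed in the table as a consistency check; the dictionary layout and function names are illustrative choices.

```python
TOTAL_UL = 50.0

# component: (stock concentration in its own unit, volume in µL)
BUFFERED = {
    "SSC (x)": (20.0, 15.0),
    "EDTA (mM)": (500.0, 0.5),
    "SDS (%)": (10.0, 0.5),
    "Denhardt's reagent (x)": (50.0, 1.0),
}
# components listed by volume only
OTHER_UL = {"probe": 12.0, "disrupted DNA": 18.0, "ddH2O": 3.0}


def final_concentration(stock, volume_ul, total_ul=TOTAL_UL):
    """Concentration after diluting volume_ul of stock into total_ul."""
    return stock * volume_ul / total_ul


def total_volume():
    """Sum of all component volumes in µL."""
    return sum(vol for _, vol in BUFFERED.values()) + sum(OTHER_UL.values())


if __name__ == "__main__":
    for name, (stock, vol) in BUFFERED.items():
        print(name, final_concentration(stock, vol))
    print("total volume (µL):", total_volume())
```

The computed values agree with the table: 6× SSC, 5 mM EDTA, 0.1% SDS and 1× Denhardt's reagent in a 50 µL reaction.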
Capturing the exogenous fragment: capture was performed with streptavidin magnetic beads (SA) from NEB (NEB, cat # S1420S) according to the instructions. The streptavidin magnetic beads were taken out of the refrigerator in advance and equilibrated at room temperature for 30 min.
4. High-throughput library construction and sequencing: a library was constructed from the captured target fragments with the Ion Plus Fragment Library Kit (Life Technologies, USA, Cat. No. 4471252) according to the instructions. The sequencing kit was the Ion S5 Precision ID Chef & Sequencing Kit (Life Technologies, USA, Cat. No. A33208). The constructed sequencing libraries were quantified by the TaqMan probe method, mixed in equimolar amounts, and sequenced in single-end mode on an Ion S5 sequencer (A27212, Thermo Fisher Scientific, Waltham, MA, USA) with a read length of 400 bp. Quality control was performed with fastx_toolkit to remove impurity sequences from the high-throughput sequencing data.
5. Performing quality control according to the number of internal reference gene molecules in the high-throughput sequencing data:
4751 effective sequencing fragments were obtained, showing that probe capture and library construction were successful and that the subsequent bioinformatics analysis could proceed.
6. The target effective sequences were screened as described in this application; a screened target effective sequence must meet the following five conditions:
a, the number of bases of the target effective sequence matched with the exogenous gene and with the Chlamydomonas reference genome is more than 30 bp;
b, the number of bases of the target effective sequence mismatched with the exogenous gene and the Chlamydomonas reference genome is less than 10 bp;
c, the sum of the numbers of bases of the target effective sequence matched with the exogenous gene and with the Chlamydomonas reference genome is more than 80 bp (when the sequencing read length is between 130 and 150 bp);
d, the number of bases of the target effective sequence matched simultaneously with both the exogenous gene and the Chlamydomonas reinhardtii reference genome is less than 10 bp;
e, the number of bases of the target effective sequence matched with neither the exogenous gene nor the Chlamydomonas reinhardtii reference genome is less than 20 bp;
7. Determination of insertion site and flanking sequence: it is judged whether the target effective sequence has a boundary site between the portion that matches the reference genome sequence and the portion that does not; if so, and if the number of sequences covering the boundary site is greater than or equal to 5, the boundary site is extended by 20 bp upstream and downstream to obtain the flanking sequence and the insertion direction of the inserted exogenous gene.
The BLAST alignment of one of the obtained reads against the exogenous insert and the Chlamydomonas reinhardtii reference genome is shown in FIG. 9; the way in which the exogenous fragment is inserted into the Chlamydomonas reinhardtii genome is shown in FIG. 10. The BLAST alignment results are listed in Table 3, from which it can be seen that the insertion of the exogenous fragment did not change the sequence at the insertion position in the Chlamydomonas reinhardtii genome.
Table 3.
Figure BDA0003281027850000201
Example 4
1. Experimental materials: after carrying on the restriction enzyme of pJMG-aphVIII carrier of foreign gene fragment, utilize the electric transformation method to transfer to the wild type algal strain 21gr of Chlamydomonas reinhardtii (Chlamydomonas reinhardtii). Wherein the exogenous gene fragment contains a screening marker gene AphVIII with barnacin resistance, and the obtained transformant is screened on a barnacin-containing plate to obtain a stable strain which is used as a research material.
Extraction and fragmentation of DNA: the extraction of the chlamydomonas reinhardtii genome adopts a high-efficiency plant genome DNA extraction kit (DP350) of Tiangen Biochemical technology (Beijing) Co., Ltd. The 0.5ug-1ug mutant was fragmented into 200-300bp fragments from genomic DNA using a sonicator (Covaris, Woburn, MA, USA).
3. Design of the probes for capturing exogenous fragments: the exogenous probe sequences are designed to meet the conditions of the application:
(1) the probe length is designed to be 30-80 bp, with the target region covered at high density (that is, adjacent probes overlap);
or the probe length is designed to be 100-120 bp, with no overlap between probes;
(2) SSR regions and N regions are avoided during probe design;
(3) the guanine and cytosine content of every probe sequence is calculated, and the molar proportion of guanine plus cytosine in the capture probe is 30%-80%;
(4) any sequence segment that the capture probe shares identically with a non-target sequence is shorter than 40 bp, and the sequence homology between the capture probe and the non-target sequence is less than 85%; the capture probes preferably form no dimer or hairpin structures and have similar annealing temperatures.
Finally, the probes meeting the conditions were synthesized by Wuhan Pongziaceae Biotechnology Ltd. In this example, only 3 probe sequences at the two ends of the exogenous fragment were used for the test; the probe sequences were SEQ ID NO: 9, SEQ ID NO: 10 and SEQ ID NO: 11.
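Design rule (1) above describes two tiling modes: short probes that overlap, or long probes placed end to end. The Python sketch below shows one way such tiles could be generated; the function name, the step parameter and the choice to add a final tile flush with the 3' end of the target are assumptions made for illustration and are not part of the application.

```python
def tile_probes(target, probe_len=60, step=30):
    """Return (start, sequence) tiles over target.

    step < probe_len gives overlapping probes (e.g. 30-80 bp probes at high
    density); step == probe_len gives non-overlapping probes (e.g. 100-120 bp).
    A last tile is added so that the 3' end of the target is always covered.
    """
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    if len(target) <= probe_len:
        return [(0, target)]
    last_start = len(target) - probe_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, target[s:s + probe_len]) for s in starts]


if __name__ == "__main__":
    demo_target = "ACGT" * 40  # 160 bp toy sequence, not a real target
    for start, probe in tile_probes(demo_target, probe_len=60, step=30):
        print(start, len(probe))
```

Each generated tile would still have to pass the remaining rules (SSR and N avoidance, GC proportion, specificity, secondary structure) before synthesis.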
Probe hybridization: the reagents and samples were added according to the hybridization system below, denatured at 95 ℃ for 10 min, and hybridized at 67 ℃ for 1 h in a hybridization chamber. The hybridization system is shown in Table 4.
Table 4 components of the hybridization system.
Component                                   Volume
Probe (10 µmol/L, i.e. 10 pmol/µL)          12 µL
Disrupted DNA                               18 µL (1 µg)
20× SSC (final concentration 6× SSC)        15 µL
500 mM EDTA (final 5 mM)                    0.5 µL
10% SDS (final 0.1%)                        0.5 µL
50× Denhardt's reagent (final 1×)           1 µL
ddH2O                                       3 µL
Total volume                                50 µL
Capturing the exogenous fragment: capture was performed with streptavidin magnetic beads (SA) from NEB (NEB, cat # S1420S) according to the instructions. The streptavidin magnetic beads were taken out of the refrigerator in advance and equilibrated at room temperature for 30 min.
4. High-throughput library construction and sequencing: a library was constructed from the captured target fragments with the NEBNext DNA library preparation premix kit (NEB, cat # E6040S) according to the protocol. The constructed library was subjected to paired-end sequencing on the Illumina platform with a read length of 150 bp per read. Quality control was performed with cutadapt 2.4 to remove impurity sequences from the high-throughput sequencing data.
5. Performing quality control according to the number of internal reference gene molecules in the high-throughput sequencing data:
the effective sequencing fragments obtained show that probe capture and library construction were successful and that the subsequent bioinformatics analysis could proceed.
6. The target effective sequences were screened as described in this application.
Removing impurity sequences comprises the following three steps:
1) removing adapters from the sequencing data;
2) removing sequences that do not meet preset standards: a single-end read is removed in its entirety when the number of low-quality bases at its 3' end exceeds one third of its full length, wherein low-quality bases are bases with a quality value less than or equal to 20;
3) removing sequences with a read length of less than 80 bp.
The screened target effective sequence must meet the following five conditions:
a, the number of bases of the target effective sequence matched with the exogenous gene and with the Chlamydomonas reference genome is more than 30 bp;
b, the number of bases of the target effective sequence mismatched with the exogenous gene and the Chlamydomonas reference genome is less than 10 bp;
c, the sum of the numbers of bases of the target effective sequence matched with the exogenous gene and with the Chlamydomonas reference genome is more than 80 bp (when the sequencing read length is between 130 and 150 bp);
d, the number of bases of the target effective sequence matched simultaneously with both the exogenous gene and the Chlamydomonas reinhardtii reference genome is less than 10 bp;
e, the number of bases of the target effective sequence matched with neither the exogenous gene nor the Chlamydomonas reinhardtii reference genome is less than 20 bp;
7. Determination of the insertion site and flanking sequence: it was determined whether the target effective sequence contains a boundary site between the portion that matches the reference genome sequence and the portion that does not; where such a boundary site was present and the number of sequences covering it was greater than or equal to 5, the boundary site was extended by 20 bp upstream and downstream, respectively, to obtain the flanking sequence and the insertion direction of the exogenous inserted gene;
The BLAST alignment of one of the reads screened according to the above steps against the exogenous insert and the Chlamydomonas reinhardtii reference genome is shown in FIG. 11; the insertion of the exogenous fragment into the Chlamydomonas reinhardtii genome is shown in FIG. 12. The BLAST alignment results are shown in Table 5, from which it can be seen that the insertion of the exogenous fragment introduced a 15 bp sequence into chromosome 4 of the Chlamydomonas reinhardtii genome.
Table 5. BLAST alignment results
[Table provided as image BDA0003281027850000221 in the original publication]
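The boundary-site rule in step 7 (at least 5 covering sequences, then a 20 bp extension upstream and downstream) can be sketched as follows. This is a minimal illustration, not the procedure used to produce Table 5. It assumes each screened read has already been reduced to a hypothetical `(chromosome, position)` pair, where `position` is the 0-based index of the first reference base downstream of the boundary, and that the reference genome is held in memory as a dictionary of strings. Determining the insertion direction is not covered.

```python
from collections import Counter
from typing import Dict, List, Tuple

MIN_SUPPORTING_READS = 5  # boundary must be covered by at least 5 sequences
FLANK_LENGTH = 20         # extension upstream and downstream, in bp

Site = Tuple[str, int]


def supported_boundaries(boundaries: List[Site]) -> List[Site]:
    """Keep boundary sites reported by at least MIN_SUPPORTING_READS reads."""
    counts = Counter(boundaries)
    return sorted(site for site, n in counts.items() if n >= MIN_SUPPORTING_READS)


def flanking_sequences(reference: Dict[str, str], chrom: str,
                       position: int) -> Tuple[str, str]:
    """Return the (upstream, downstream) flanks around a boundary site."""
    seq = reference[chrom]
    upstream = seq[max(0, position - FLANK_LENGTH):position]
    downstream = seq[position:position + FLANK_LENGTH]
    return upstream, downstream
```

Boundaries close to a chromosome end yield flanks shorter than 20 bp in this sketch; a production pipeline would need to decide how to report such cases.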
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for identifying a transgenic event based on high-throughput sequencing and probe enrichment, the method comprising the steps of:
obtaining a target exogenous gene sequence of a target organism;
obtaining a capture probe for the target exogenous gene sequence;
obtaining a plurality of DNA fragments to be identified and a reference genome sequence of the target organism;
ligating the DNA fragments to be identified to a linker sequence to obtain linker-containing DNA fragments;
capturing the target exogenous gene in the linker-containing DNA fragments using the capture probe and constructing a library to obtain an enriched library;
performing high-throughput sequencing on the enriched library to obtain high-throughput sequencing data; and
comparing the DNA sequences in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene, respectively, and determining the position at which the target exogenous gene is inserted into the genome of the target organism, so as to identify the transgenic event.
2. The method of claim 1, wherein the high-throughput sequencing comprises single-end sequencing or double-end sequencing.
3. The method of claim 2, wherein comparing the DNA sequences in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene, respectively, and determining the position at which the target exogenous gene is inserted into the genome of the target organism comprises:
in the case of double-end sequencing, determining whether the double-end reads in the high-throughput sequencing data contain overlapping fragments;
if so, splicing the double-end reads that contain the overlapping fragments to obtain a spliced sequence;
comparing the spliced sequence with the reference genome sequence and the target exogenous gene sequence, respectively, at least once, wherein, in a first comparison, a preliminary matched spliced sequence is obtained if one end of the spliced sequence matches at least N1 bases of the target exogenous gene sequence, and a first effective sequence is obtained if the preliminary matched spliced sequence matches at least N1 bases of the reference genome sequence;
if not, comparing the double-end reads in the high-throughput sequencing data with the target exogenous gene sequence and the reference genome sequence, respectively, at least once,
wherein a second effective sequence is obtained if one end of a double-end read matches at least N3 bases of the target exogenous gene sequence and the other end matches at least N3 bases of the reference genome sequence;
screening the first effective sequence and the second effective sequence to obtain a first target effective sequence; and
determining whether the first target effective sequence contains a boundary site between the portion that matches the reference genome sequence and the portion that does not, and, if so and the number of sequences covering the boundary site is greater than or equal to N4, extending the boundary site by N5 bp upstream and downstream, respectively, to obtain the insertion position and the insertion direction of the target exogenous gene in the reference genome, wherein N1, N3, N4 and N5 are positive integers, N1 is greater than or equal to 30, N3 is greater than or equal to 30, N4 is greater than or equal to 5, and N5 is greater than or equal to 20.
4. The method of claim 3, wherein comparing the spliced sequence with the reference genome sequence and the target exogenous gene sequence at least once further comprises:
in a second comparison, comparing the first effective sequence with the target exogenous gene sequence and the reference genome sequence, respectively, and obtaining a third effective sequence if the first effective sequence matches more than N2 bases of the target exogenous gene sequence and more than N2 bases of the reference genome sequence, wherein N2 is a positive integer and N2 is greater than or equal to 30.
5. The method of claim 4, wherein screening the first effective sequence and the second effective sequence to obtain a first target effective sequence comprises:
screening the first effective sequence, wherein the first target effective sequence is obtained
if the number of bases of the first effective sequence matched to the target exogenous gene sequence and to the reference genome sequence is greater than N8 bp,
if the number of bases of the first effective sequence mismatched with the target exogenous gene sequence and with the reference genome sequence is less than N9 bp,
if, when the sequencing read length of the first effective sequence is 130-150 bp, the sum of the numbers of bases of the first effective sequence matched to the target exogenous gene sequence and to the reference genome sequence is greater than 80 bp,
if the number of bases of the first effective sequence matched to both the target exogenous gene sequence and the reference genome sequence is less than 10 bp, and
if the number of bases of the first effective sequence matched to neither the target exogenous gene sequence nor the reference genome sequence is less than N10 bp; and
screening the second effective sequence, wherein the first target effective sequence is obtained
if the number of bases of the second effective sequence matched to the target exogenous gene sequence and to the reference genome sequence is greater than N8 bp,
if the number of bases of the second effective sequence mismatched with the target exogenous gene sequence and with the reference genome sequence is less than N9 bp,
if, when the sequencing read length of the second effective sequence is 130-150 bp, the sum of the numbers of bases of the second effective sequence matched to the target exogenous gene sequence and to the reference genome sequence is greater than 80 bp,
if the number of bases of the second effective sequence matched to both the target exogenous gene sequence and the reference genome sequence is less than 10 bp, and
if the number of bases of the second effective sequence matched to neither the target exogenous gene sequence nor the reference genome sequence is less than N10 bp;
wherein N8, N9 and N10 are positive integers, N8 is greater than or equal to 30, N9 is less than or equal to 10, and N10 is less than or equal to 20.
6. The method of claim 4, wherein comparing the DNA sequences in the high-throughput sequencing data with the reference genome sequence and the target exogenous gene, respectively, and determining the position at which the target exogenous gene is inserted into the genome of the target organism further comprises:
in the case of single-end sequencing, performing a third comparison of the DNA sequences in the high-throughput sequencing data with the target exogenous gene sequence and the reference genome sequence, respectively;
obtaining a preliminary matched read if one end of a DNA sequence matches at least N7 bases of the target exogenous gene sequence;
comparing the preliminary matched read with the reference genome sequence, and obtaining a fourth effective sequence if the preliminary matched read matches at least N6 bases of the reference genome sequence;
screening the fourth effective sequence to obtain a second target effective sequence; and
determining whether the second target effective sequence contains a boundary site between the portion that matches the reference genome sequence and the portion that does not, and, if so and the number of sequences covering the boundary site is greater than or equal to N4, extending the boundary site by N5 bp upstream and downstream, respectively, to obtain the insertion position and the insertion direction of the target exogenous gene in the reference genome, wherein N6 and N7 are positive integers, N6 is greater than or equal to 30, and N7 is greater than or equal to 30.
7. The method of claim 1, further comprising, prior to comparing the high-throughput sequencing data with the reference genome sequence and the target exogenous gene sequence, removing impurity sequences.
8. The method of claim 7, wherein removing the impurity sequences comprises:
removing sequencing adapters from the high-throughput sequencing data;
removing sequences that do not meet a preset standard, the sequences that do not meet the preset standard comprising single-end sequences in which the number of preset-quality bases at the 3' end exceeds one third of the length of the sequence itself, wherein the preset-quality bases are bases with a quality value less than or equal to 20; and
removing sequences with a length of less than 80 bp.
9. The method of claim 1, wherein the combined molar proportion of guanine and cytosine in the capture probe is 30% to 80%; any sequence segment that the capture probe shares identically with a non-target sequence is less than 40 bp in length; and the sequence homology between the capture probe and the non-target sequence is less than 85%.
10. Use of the method for identifying a transgenic event based on high-throughput sequencing and probe enrichment of any one of claims 1-9 in any of transgenic plants, transgenic animals and transgenic microorganisms.
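Two of the probe design criteria recited in claim 9 (GC proportion of 30% to 80%, and identical segments shared with a non-target sequence shorter than 40 bp) lend themselves to a short illustrative check. The Python sketch below is not part of the claimed method. The function names are hypothetical, only the given strand is compared, and the 85% homology criterion is omitted because it would require a sequence aligner.

```python
from typing import Iterable


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_shared_segment(a: str, b: str) -> int:
    """Length of the longest identical contiguous segment of a and b."""
    a, b = a.upper(), b.upper()
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common segment ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def probe_passes(probe: str, non_targets: Iterable[str]) -> bool:
    """Check the GC range and the shared-segment limit for one probe."""
    if not 0.30 <= gc_fraction(probe) <= 0.80:
        return False
    return all(longest_shared_segment(probe, nt) < 40 for nt in non_targets)
```

The dynamic-programming comparison is quadratic in sequence length, which is acceptable for individual probes but would be replaced by an indexed search when screening against a whole genome.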
CN202111133102.7A 2021-09-27 2021-09-27 Method for identifying transgenic event based on high-throughput sequencing and probe enrichment Active CN113957130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111133102.7A CN113957130B (en) 2021-09-27 2021-09-27 Method for identifying transgenic event based on high-throughput sequencing and probe enrichment

Publications (2)

Publication Number Publication Date
CN113957130A true CN113957130A (en) 2022-01-21
CN113957130B CN113957130B (en) 2023-12-22

Family

ID=79462617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111133102.7A Active CN113957130B (en) 2021-09-27 2021-09-27 Method for identifying transgenic event based on high-throughput sequencing and probe enrichment

Country Status (1)

Country Link
CN (1) CN113957130B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014005329A1 (en) * 2012-07-06 2014-01-09 深圳华大基因科技有限公司 Method and system for determining integration manner of foreign gene in human genome
CN105567830A (en) * 2016-01-29 2016-05-11 江汉大学 Method for detecting transgenic ingredients of plant
CN110172504A (en) * 2019-04-19 2019-08-27 武汉明了生物科技有限公司 A kind of detection method and kit of foreign gene
CN110556165A (en) * 2019-09-12 2019-12-10 浙江大学 method for rapidly identifying transgene or gene editing material and insertion site thereof by using whole genome re-sequencing data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN L等: "GmoDetector: An accurate and efficient GMO identification approach and its applications", 《FOOD RES INT》, no. 149, pages 110662 *
FRANK U等: "A T-DNA mutant screen that combines high-throughput phenotyping with the efficient identification of mutated genes by targeted genome sequencing", 《BMC PLANT BIOL》, vol. 19, no. 1, pages 539 *

Also Published As

Publication number Publication date
CN113957130B (en) 2023-12-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant