CN117737216A

CN117737216A - A method for detecting genomic information based on restriction endonucleases

Info

Publication number: CN117737216A
Application number: CN202410122596.6A
Authority: CN
Inventors: 汤富酬; 文路; 王艳; 陈怡珺
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2024-01-30
Filing date: 2024-01-30
Publication date: 2024-03-22

Abstract

The invention discloses a method for detecting genomic information based on restriction endonucleases. The method comprises cutting the genome of a sample with a restriction endonuclease to obtain genomic DNA fragments of different lengths, enriching long fragments from the amplified or unamplified DNA, sequencing the enriched long genomic DNA fragments on a long-read sequencing platform, and finally analyzing the sequencing data computationally. The method significantly increases the probability of detecting both alleles at the same time, significantly reduces the allelic dropout rate, detects heterozygous tumor mutations better, and is of great significance for the early diagnosis and treatment of tumors.

Description

A method for detecting genomic information based on restriction endonucleases

Technical field

The invention belongs to the technical field of genome detection and specifically relates to a method for detecting genomic information based on restriction endonucleases.

Background

Cells are the basic building blocks of organisms, and in each cell genetic information is stored in the form of chromosomes. It is generally assumed that all cells of an individual share the same genome, so genome research can be conducted at the species or individual level. In the following situations, however, the genome must be studied at the single-cell scale: (1) the cells are precious and scarce, such as human oocytes, embryonic cells and circulating tumor cells; (2) different cells have their own unique genomes, for example, sperm cells of the same individual carry different genomes because of meiotic homologous recombination; (3) cell lineage tracing, in which the genome of a single cell changes over time and those changes can be used to reflect how cells evolve over time; (4) single-cell genomes are heterogeneous, as in tumors, the nervous system, the immune system and chimeras. Single-cell genome sequencing technology emerged to meet these needs.

A single-cell genome contains only about 6 pg of DNA, far less than the amount required for high-throughput sequencing, so uniform amplification is needed before sequencing. The development of whole-genome amplification (WGA) technology has made it possible to amplify enough genomic DNA from a single cell for sequencing, and thus to study the genetic heterogeneity of cells, including single-nucleotide variations (SNVs), copy-number variations (CNVs) and structural variations (SVs). A variety of single-cell whole-genome amplification technologies based on next-generation sequencing (NGS) platforms have been developed, such as degenerate oligonucleotide-primed polymerase chain reaction (DOP-PCR), multiple displacement amplification (MDA), multiple annealing and looping-based amplification cycles (MALBAC), linear amplification via transposon insertion (LIANTI), primary template-directed amplification (PTA) and multiplexed end-tagging amplification of complementary strands (META-CS). Among them, the META-CS method published by Professor Xie Xiaoliang's group uses the complementarity of the two DNA strands to eliminate almost all false positives in SNV detection: only variant sites supported by both the forward and reverse strands are called as SNVs, achieving the highest accuracy to date.

Because NGS platforms have high sequencing accuracy, the technologies above are very powerful for detecting CNVs and SNVs, but they are limited by short read lengths and therefore perform poorly in detecting SVs. SVs include deletions, insertions, duplications and translocations, and are an important class of variation in many heritable diseases, such as cancer. Studying SVs at single-cell resolution is therefore a crucial problem.

Based on long-read sequencing platforms, that is, third-generation sequencing (TGS) platforms, our research group developed single-molecule real-time sequencing of long fragments amplified through transposon insertion (SMOOTH-seq), which uses a low concentration of Tn5 transposase to randomly fragment single-cell genomic DNA and achieve relatively uniform genome amplification. In addition to CNVs and SNVs, SMOOTH-seq can also detect SVs effectively. However, in diploid cells the two alleles are rarely covered at the same time, so SMOOTH-seq has a high false-negative rate when detecting heterozygous SNPs (hetSNPs).

Allelic dropout is a major problem for single-cell whole-genome amplification technologies. At a heterozygous site in a diploid cell, if one of the two alleles fails to be amplified and detected, allelic dropout occurs, and this is the main cause of false-negative SNVs. Previous single-cell whole-genome methods rely either on random-primer amplification or on random Tn5 fragmentation, and such random fragmentation or amplification does not favor capturing both alleles together. For example, for a pair of alleles A and B, if the genome coverage is n%, so that allele A and allele B are each captured with probability n%, then the probability of capturing both alleles at the same time is n% × n%, that is, n²/10,000. The probability of capturing both alleles simultaneously is therefore very low.
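The arithmetic in the example above can be checked with a short sketch. It assumes, as the paragraph does, that the two alleles are captured independently, each with probability equal to the genome coverage; the function name `both_alleles_probability` is hypothetical and used only for illustration.

```python
def both_alleles_probability(coverage_percent: float) -> float:
    """Fraction of heterozygous sites where both alleles are captured.

    Assumes each allele is captured independently with probability
    coverage_percent / 100 (the simplification used in the text).
    """
    p = coverage_percent / 100.0
    return p * p


# Illustrative coverage values only; they are not measurements from the patent.
for n in (10, 30, 50):
    print(f"coverage {n}% -> both alleles {both_alleles_probability(n):.2%}")
```

Under this model, even 50% coverage captures both alleles at only a quarter of heterozygous sites, which is the motivation for making the capture of the two alleles correlated rather than independent.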

Because the two alleles in a diploid genome usually share the same restriction endonuclease recognition sites, the homologous DNA fragments produced by digestion usually have the same length and are more likely to be amplified together than fragments produced by random Tn5 transposition or random-primer amplification. On this basis, the present invention provides Refresh-seq (Restriction fragments ligation-based genome amplification and third-generation sequencing), a single-cell long-read whole-genome sequencing technology based on a restriction endonuclease cleavage and ligation strategy, which significantly increases the probability of detecting both alleles simultaneously. Allelic dropout causes false negatives in prenatal diagnosis: only one allele may be detected while the other is missed. For an abnormal embryo carrying a monogenic disease (heterozygous mutation), if only the normal allele is detected and the disease-causing mutant allele is missed because of allelic dropout, the embryo will be misjudged as normal. The present method significantly reduces the allelic dropout rate, which reduces such misjudgments and allows healthy embryos to be selected, thereby promoting healthy births. In addition, tumor cells carry a high mutational burden and their mutations are usually heterozygous. Previous methods tend to underestimate the mutations in tumor genomes because of allelic dropout, whereas the present method detects heterozygous tumor mutations better, which is of great significance for the early diagnosis and treatment of tumors.

Summary of the invention

The object of the present invention is to provide a method for detecting genomic information based on restriction endonucleases.

A method for detecting genomic information based on restriction endonucleases comprises the following steps:

(1) Cut the genome of the sample with a restriction endonuclease to obtain genomic DNA fragments of different lengths; at this stage, the alleles on homologous chromosomes are usually cut into DNA fragments of the same length.

The invention simulates restriction digestion of the genome of the target species and infers the distribution of genomic fragments after digestion, so that a suitable restriction endonuclease can be selected (Figure 3). Cells are lysed in a small-volume reaction system to release the genomic DNA.

The restriction endonuclease is one that recognizes a specific sequence of 4-10 bp, preferably one that recognizes a specific sequence of 6 bp or 8 bp; more preferably, the restriction endonuclease is EcoRI, SacI or AsiSI.

For the human genome, the fragments produced by restriction endonucleases with 6 bp recognition sequences are mostly 1-8 kb long, whereas those produced by restriction endonucleases with 8 bp recognition sequences are mostly between 15 kb and 16 Mb long (Figure 3). Therefore, a restriction endonuclease with a 6 bp recognition sequence, such as EcoRI or SacI, is chosen when higher coverage is desired, and an endonuclease with an 8 bp recognition sequence, such as AsiSI, is chosen when stronger enrichment is desired. When higher coverage is desired, the digested fragments should be distributed as narrowly as possible, that is, the resulting DNA fragments should have similar lengths concentrated between 1 and 3 kb. This gives better amplification uniformity and balances genome coverage with the detection rate of both alleles.
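The digestion simulation used to choose an enzyme can be sketched as follows. This is a minimal illustration and not the patent's actual pipeline: the function names `digest` and `expected_fragment_length` are hypothetical, cuts are placed at the start of each recognition site rather than at the enzyme's true cut position, and the 4^k estimate assumes a random sequence with equal base frequencies, which real genomes do not have.

```python
import re
from typing import List

# Recognition sequences of the three enzymes named in the text.
RECOGNITION_SITES = {"EcoRI": "GAATTC", "SacI": "GAGCTC", "AsiSI": "GCGATCGC"}


def digest(sequence: str, site: str) -> List[int]:
    """Return fragment lengths after cutting at every occurrence of `site`.

    Simplification: the cut is placed at the first base of the site.
    """
    seq = sequence.upper()
    # A zero-width lookahead also reports overlapping occurrences.
    cuts = [m.start() for m in re.finditer("(?=" + re.escape(site) + ")", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def expected_fragment_length(site: str) -> int:
    """Mean spacing of a k-bp site in a uniform random sequence: 4**k bp."""
    return 4 ** len(site)


toy = "AAAGAATTCTTTTGAATTCGG"  # toy sequence containing two EcoRI sites
print(digest(toy, RECOGNITION_SITES["EcoRI"]))  # [3, 10, 8]
for name, site in RECOGNITION_SITES.items():
    print(name, expected_fragment_length(site))
```

The random-sequence estimate gives about 4 kb for a 6 bp site and about 65 kb for an 8 bp site, which is at least in line with the ranges quoted above. Running `digest` over a real reference genome would give the actual fragment-length distribution for a candidate enzyme.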

In the cell lysis step above, the cells may be derived from any of humans, animals, plants and microorganisms.

(2) Enrich long genomic DNA fragments from the amplified or unamplified genomic sample.

(3) Sequence the enriched long genomic DNA fragments on a sequencing platform.

(4) Analyze the sequencing data computationally: the long genomic DNA fragments are mapped back to genomic regions, and the sequence information of the sample in those regions is obtained through alignment and calculation. The sequence information includes genetic and epigenetic information.

The genomic sample is cell-free DNA, DNA released into culture medium by cells (such as embryos or eggs), one or more cells or cell nuclei, viruses, mitochondria, chloroplasts, or the genome of another sample.

The restriction endonuclease used in step (1) is selected by simulating restriction digestion of the genome of the target species and inferring the distribution of genomic fragments after digestion.

Preferably, in step (2) the genomic DNA fragments are end-repaired, A-tailed and ligated to adapters, then amplified by PCR, and the long genomic DNA fragments are enriched after amplification. The adapters may be non-barcoded or barcoded.

When non-barcoded adapters are used, subsequent purification and library construction are carried out separately for each PCR tube, and the 5' and 3' adapters are added during PCR amplification. When barcoded adapters are used (that is, the 5' adapter is added at the ligation step), the sample tubes carrying different barcodes are pooled and purified after adapter ligation, amplified in a single tube, and the 3' adapter is added through amplification.

The long genomic DNA fragments in step (2) are fragments longer than 700 base pairs, preferably longer than 1000 base pairs.

The amplification in step (2) is a polymerase chain reaction; long genomic DNA fragments are enriched by polymerase chain reaction combined with size selection, where the size selection is gel-based or magnetic bead-based.

For samples with a large amount of starting material, size selection of genomic fragments is performed directly after restriction endonuclease cleavage. Restriction digestion gives each genomic region a fixed fragment size, and size selection then enriches fragments of a specific size. The corresponding regions are thereby enriched, and sequencing is concentrated on the genomic regions that yield fragments of that size: the sequencing depth of regions with the selected fragment length increases, while regions with other fragment lengths receive less depth or are not detected. Allelic information in the enriched regions can therefore be detected with higher sensitivity.

For samples with little starting material and for single-cell samples, digestion is followed by adapter ligation, amplification, and gel-based or magnetic bead-based size selection. Adapter ligation and PCR amplification also contribute to size selection: overly long DNA fragments are ligated to adapters less efficiently, and PCR preferentially amplifies shorter fragments, so overly long fragments are filtered out. Gel-based or bead-based size selection then removes small fragments. Samples that undergo adapter ligation and amplification are therefore size-selected more stringently, and the final fragment lengths of the library are mainly distributed between 1 and 3 kb. In addition, because the amplification efficiencies of the two allelic fragments tend to be the same, PCR further enriches the alleles and increases the probability of detecting both alleles at the same time.
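The effect of size selection on which part of the genome remains sequenceable can be sketched as a simple filter over simulated fragment lengths. This is an idealized model under stated assumptions: a hard 1-3 kb window (taken from the library size range given above) stands in for the gradual cut-offs of gel or bead selection and PCR bias, and the function name `size_select` and the example lengths are hypothetical.

```python
from typing import List, Tuple


def size_select(
    fragment_lengths: List[int], min_len: int = 1000, max_len: int = 3000
) -> Tuple[List[int], float]:
    """Keep fragments within [min_len, max_len] bp.

    Returns the kept fragment lengths and the fraction of total bases
    they represent, i.e. the share of the digested genome that stays
    in the library under this idealized hard cut-off.
    """
    kept = [n for n in fragment_lengths if min_len <= n <= max_len]
    total = sum(fragment_lengths)
    fraction = sum(kept) / total if total else 0.0
    return kept, fraction


# Made-up fragment lengths in bp, for illustration only.
kept, fraction = size_select([500, 1500, 2500, 8000])
print(kept, f"{fraction:.0%} of bases retained")
```

Because both alleles of a locus normally give fragments of the same length, they fall on the same side of the window together, which is why the selection concentrates depth on regions where both alleles can be seen.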

The sequencing platform in step (3) is a long-read sequencing platform; optionally, it is a Nanopore sequencing platform, a PacBio sequencing platform, or another long-read sequencing platform developed subsequently.

In NGS, sequencing quality problems caused by amplification errors introduced by PCR during library construction, together with the read-length limit (usually less than 500 bp), make it difficult for NGS to meet the higher demands of some modern biological questions, such as determining long repetitive regions in DNA, DNA/RNA methylation modifications and structural variations. The emergence of long-read sequencing technology compensates for these shortcomings. At present there are two main long-read sequencing platforms: single-molecule real-time (SMRT) sequencing from Pacific Biosciences (PacBio) and nanopore sequencing from Oxford Nanopore Technologies (ONT).

PacBio SMRT sequencing is based on the properties of the zero-mode waveguide (ZMW). A ZMW is a nanophotonic confinement structure consisting of a circular hole in an aluminum film deposited on a transparent silica substrate. The ZMW hole is about 70 nm in diameter and about 100 nm deep. Because the aperture is so small, the light field decays exponentially as light passes through it. Within an illuminated ZMW, the activity of a DNA polymerase incorporating a single nucleotide can be readily detected. PacBio SMRT sequencing uses topologically circular DNA molecules as the template library (called SMRTbell). A SMRTbell consists of a double-stranded DNA insert with hairpin adapters ligated to both ends, forming a closed single-stranded circle. The length of the inserted DNA fragment can range from 1 to more than several hundred thousand bases, which allows long sequencing reads to be generated. Once assembled, the SMRTbell is bound by a DNA polymerase and loaded onto a SMRT Cell, which contains up to 8 million ZMWs. In each ZMW a single polymerase is immobilized at the bottom; it can bind either hairpin adapter of the SMRTbell and begin replication. During sequencing by synthesis, the polymerase processes along the SMRTbell template and incorporates four fluorescently labeled deoxynucleoside triphosphates, each with a distinct emission spectrum, into the nascent strand. Excitation light at the bottom of the well excites the fluorescent label on the nucleotide substrate, and the detection system records the fluorescence signal to obtain the base information. The DNA molecules do not need to undergo PCR amplification at any point in the sequencing process, so each DNA molecule is sequenced individually. PacBio sequencing currently has two commonly used modes: continuous long reads (CLR) and circular consensus sequencing (CCS).

ONT sequencing is based on electrical signals, and its core is the protein nanopore. The basic working principle is as follows. Two electrolyte chambers are separated by an impermeable membrane, and a nanoscale pore connects them. The protein nanopores, numbering from hundreds to thousands, are embedded in a synthetic membrane and immersed in an electrophysiological solution (the synthetic membrane has very high electrical resistance, and the protein nanopore essentially forms a channel through it). When a voltage is applied across the electrolyte chambers, a steady-state ionic current flows through the pore. A macromolecule passing through the pore causes a transient change in the ion flux, so monitoring the current through the pore enables molecular sensing. These current fluctuations convey many properties of the sample, including the size, concentration and structure of the biomolecules. By controlling the pore size, its surface properties, the applied voltage and the solution conditions, different nanopores can be tailored to detect different types of biomolecules. Because nanopore sensing requires no modification, labeling or surface immobilization of the biomolecules, the technology can be used to detect a wide range of molecules and complexes. ONT uses linear DNA molecules, typically one to several hundred kilobases long, although some reach several megabases. In ONT sequencing, double-stranded DNA molecules are first ligated to sequencing adapters preloaded with a motor protein. The motor protein unwinds the double-stranded DNA and, together with the electric current, drives the negatively charged DNA through the pore at a controlled rate. As the DNA passes through the nanopore it causes characteristic disruptions in the current, which are analyzed in real time to determine the base sequence of the DNA strand. Long-read data can currently be generated on any of three standard ONT platforms: MinION, GridION and PromethION. Another read type generated on ONT platforms is the ONT ultra-long read. These reads, first generated by Josh Quick, are usually longer than 100 kb and can be several megabases long.

Genomic structural variations (SVs) mainly include deletions, insertions and duplications of large DNA segments in the genome, among other types. Studies have shown that SVs are associated with a variety of complex genetic diseases such as cancer, autism and neurodevelopmental disorders, and they have received sustained attention in medicine and genetics in recent years. Because of its read-length limit, NGS is severely restricted in SV detection. Advances in long-read genome sequencing and its wider adoption have allowed large numbers of structural variations to be discovered and studied, and some strongly pathogenic structural variations have gradually been validated. The present method is based on long-read sequencing platforms; it detects SVs efficiently and can phase SVs into haplotypes across entire chromosomes with high accuracy.

Long-read sequencing platforms also make it easier to detect the linkage of SNPs and other variants. Because NGS reads are short, most reads carry at most one variant, whereas long-read sequencing can detect several variants on the same read and can therefore be used to study the linkage of SNPs, SVs and other variants. Linkage information is crucial for assessing disease. For a recessive genetic disease, for example, suppose a gene carries mutations at two different sites. If the two mutations lie on the same chromosome, that is, they are linked, then the gene copy on that chromosome loses its function, but a normal copy remains on the other chromosome, so the cell does not show the mutant phenotype. If the two mutations lie on different chromosomes, that is, both alleles are mutated, the disease state is manifested. In addition, linkage information helps determine whether the mutated gene in a genetic disease comes from the father or the mother. Genomic imprinting refers to the occurrence of different clinical phenotypes depending on the parental origin (paternal or maternal) of the disease-causing gene: some genes are transcriptionally active only when inherited from the father, while the maternal copy is not expressed; conversely, some genes are transcriptionally active only when inherited from the mother, while the paternal copy is not expressed. In such cases, determining whether the mutated gene comes from the father or the mother shows whether the disease-causing gene will be expressed, and thus indicates the health status of the embryo.
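The cis/trans distinction described above can be illustrated with a small sketch that classifies two mutations from long reads spanning both sites. It is a toy majority-vote model and not the patent's analysis pipeline: the function name `classify_phase` and the input representation (one pair of booleans per spanning read) are hypothetical, and sequencing errors are handled only by the vote.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_phase(reads: Iterable[Tuple[bool, bool]]) -> str:
    """Classify two mutations as 'cis' (same chromosome) or 'trans'.

    Each item describes one read that spans both sites:
    (carries mutation 1, carries mutation 2).
    Reads carrying both mutations support cis; reads carrying exactly
    one support trans. Reads carrying neither do not separate the two
    cases and are ignored.
    """
    counts = Counter(reads)
    cis = counts[(True, True)]
    trans = counts[(True, False)] + counts[(False, True)]
    if cis == 0 and trans == 0:
        return "uninformative"
    if cis == trans:
        return "ambiguous"
    return "cis" if cis > trans else "trans"


# Both mutations on one chromosome: one functional copy remains.
print(classify_phase([(True, True), (False, False), (True, True)]))
# One mutation on each chromosome: both copies are affected.
print(classify_phase([(True, False), (False, True), (True, False)]))
```

Short reads that cover only one of the two sites cannot contribute to either count, which is the limitation of NGS that the paragraph describes.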

Long-read sequencing has the potential to read DNA/RNA modifications directly (without antibodies or chemical treatment), which is of great significance. Modifications change the efficiency of nucleotide pairing, and SMRT sequencing infers modifications at single-nucleotide resolution by detecting differences in the time that the differently labeled dNTPs/NTPs take to bind the target nucleotide. Modifications also change the electrical signal of a nucleotide, and Nanopore sequencing infers the modifications carried by detecting the electrical signal as nucleotides pass through the nanopore. On this principle, when the present method is applied to samples with a large amount of starting material, no amplification is needed, so epigenetic modification information is retained and can be read directly by long-read sequencing. The method can therefore detect and compare allele-specific epigenetic modifications in bulk samples, which NGS cannot achieve, and it can be used to explore the linkage between different types of genomic variation and epigenetic state.

The fragment information in step (4) includes one or more of the following: 1) fragment length information; 2) fragment abundance information; 3) heterozygous single-nucleotide polymorphism information; 4) genomic structural variation information, including one or more of insertions, deletions, duplications, inversions and translocations; 5) repetitive sequence information, including one or more of short interspersed elements, long interspersed elements, long terminal repeat elements, DNA repeat elements, simple repeats, satellites and other repetitive elements; 6) genomic copy-number variation information; 7) allele information; 8) linkage relationships of allele information; 9) epigenetic information, including DNA methylation and DNA hydroxymethylation.

The method of the invention can detect genomic information from as little as a single cell or cell nucleus, with high sensitivity and a high probability of detecting both alleles at the same time. The invention names the method Restriction fragments ligation-based genome amplification and third-generation sequencing (Refresh-seq), a third-generation single-cell whole-genome amplification method based on restriction endonucleases and fragment ligation, hereinafter referred to as Refresh-seq. The variant in which barcoded adapters are ligated is called Refresh-seq (multiplexed).

本发明中的术语：Terms used in this invention:

限制性内切酶（restriction endonuclease）：生物体内能识别并切割特异的双链DNA序列的一种内切核酸酶，包括I型限制性内切酶和II型限制性内切酶，Ⅰ型限制性内切酶既能催化宿主DNA的甲基化，又催化非甲基化的DNA的水解；而Ⅱ型限制性内切酶只催化非甲基化的DNA的水解。限制性内切酶一般是以微生物属名的第一个字母和种名的前两个字母组成，第四个字母表示菌株。例如，从Bacillus amylolique faciensH中提取的限制性内切酶称为BamH，在同一品系细菌中得到的识别不同碱基顺序的几种不同特异性的酶，可以编成不同的号，如HindII、HindIII、HpaI、HpaII等。Restriction endonuclease: an endonuclease that can recognize and cut specific double-stranded DNA sequences in organisms, including type I restriction endonuclease and type II restriction endonuclease, type I restriction enzyme Sexual endonucleases can catalyze both methylation of host DNA and hydrolysis of unmethylated DNA; whereas type II restriction endonucleases can only catalyze the hydrolysis of unmethylated DNA. Restriction enzymes are generally composed of the first letter of the microorganism's genus name and the first two letters of the species name, and the fourth letter indicates the strain. For example, the restriction endonuclease extracted from Bacillus amylolique faciens H is called Bam H. Several enzymes with different specificities that recognize different base sequences obtained from the same strain of bacteria can be compiled into different numbers, such as Hind II, Hind III, Hpa I, Hpa II, etc.

Homologous chromosomes: two chromosomes of the same length and centromere position observed at metaphase of mitosis, or chromosomes that pair with each other during meiosis. Of each pair of homologous chromosomes, one comes from the father and one from the mother; their shape, size and structure are generally the same.

Allele: one of the genes located at the same position on a pair of homologous chromosomes that control different forms of the same trait.

Allele information: in this patent, allele information covers all types of variation at alleles on homologous chromosomes, including SNPs, SVs, repeat sequence information (short interspersed elements, long interspersed elements, long terminal repeat elements, DNA repeat elements, simple repeats, satellites), epigenetic information, and the like.

Epigenetic information: epigenetic modification refers to the regulation of gene expression through chemical modification of the DNA and proteins of chromosomes. Such modification can act at multiple levels, including transcription, splicing, stability, translation, nucleosome assembly and chromatin structure, thereby affecting the physiological and pathological processes of cells as well as the phenotype of offspring. Common epigenetic modifications include DNA methylation, histone modification, non-coding RNA, RNA modification and chromatin remodeling.

DNA methylation: DNA methylation is the addition of a methyl group to the DNA molecule, which changes the chemical properties and structure of the DNA and affects gene expression. It usually occurs at CpG dinucleotides and can repress gene expression. The modification is catalyzed by DNA methyltransferases (DNMTs). Humans have three main DNMTs: DNMT1, DNMT3A and DNMT3B. DNMT1 is mainly responsible for maintaining methylation patterns, whereas DNMT3A and DNMT3B are responsible for de novo methylation.

DNA hydroxymethylation: DNA hydroxymethylation is the oxidation of 5-methylcytosine (5mC) to 5-hydroxymethylcytosine (5hmC), catalyzed by TET family enzymes. 5hmC has important biological functions: it participates in chromosome reprogramming and the transcriptional regulation of gene expression, and it plays an important role in DNA demethylation. Studies also show that 5hmC is closely associated with tumorigenesis.

Adapter: a sequence that allows two DNA molecules, or the two ends of one DNA molecule, to pair after enzymatic digestion and then be covalently joined by a ligase.

Magnetic bead fragment size selection: under suitable conditions, magnetic beads bind DNA. In a solution with relatively high concentrations of PEG and NaCl, PEG strips water from the hydration layer surrounding the DNA molecule, so the hydration layer is disrupted, the DNA molecules aggregate and precipitate, and the negatively charged phosphate groups are exposed. Sodium ions then form a "salt bridge" (also called an "electric bridge") between these phosphate groups and the carboxyl groups on the bead surface, so that the DNA is adsorbed onto the beads. The longer the DNA, the more negatively charged phosphate groups are exposed and the stronger the negative charge of the whole molecule, so it adsorbs to the beads more readily and can be recovered at lower concentrations of PEG and NaCl. The shorter the DNA, the higher the concentrations of PEG and NaCl needed to disrupt its hydration layer thoroughly enough to expose sufficient phosphate groups for adsorption and recovery. Therefore, DNA fragments of different lengths can be selected by controlling the concentrations of PEG and NaCl and the amount of magnetic beads.

Beneficial effects of the present invention: the present invention is the first to combine a restriction endonuclease digestion and ligation strategy with a third-generation single-cell genome sequencing platform, yielding the long-read genome sequencing technology Refresh-seq. Compared with SMOOTH-seq, which relies on random fragmentation, Refresh-seq increases genome coverage and uniformity. It raises the probability of detecting both alleles of a diploid cell simultaneously and achieves a considerable detection probability even at very shallow sequencing depth, so it has great potential for medical applications such as preimplantation genetic diagnosis. The method can be tuned by choosing different restriction endonucleases to meet different needs. In general, digestion with restriction endonucleases that have 6 bp recognition sequences (for example EcoRI and SacI) gives relatively high genome coverage and is the first choice for single-cell whole-genome sequencing; digestion with a restriction endonuclease that has an 8 bp recognition sequence (for example AsiSI) enriches reads in specific genomic regions at the same sequencing volume (Figure 1), thereby enabling reduced-representation genome sequencing. Because it is based on a third-generation sequencing platform, Refresh-seq can effectively detect structural variants and repetitive elements. Refresh-seq also has limitations. Owing to the efficiency of the ligation reaction, its amplicons are only 2-3 kb long, much shorter than the approximately 6 kb amplicons of SMOOTH-seq, so Refresh-seq cannot capture very long insertion events. Library construction for Refresh-seq can be completed within one day, at a cost of 20 yuan per cell for the single-tube version and 12 yuan per cell for the multi-tube version. The present invention successfully applies Refresh-seq to study meiosis in single germ cells of male and female B6D2F1 mice. At a sequencing depth of 0.1-0.3×, the average coverage of sperm, PG oocytes and PB2 was approximately 5%, and the average coverage of oocytes and PB2 was 7.7%, consistent with the coverage of MALBAC-amplified sperm and oocytes. The inventors obtained high-resolution genetic maps of male and female meiotic recombination at low sequencing depth and revealed differences between females and males. Because Refresh-seq offers high uniformity and a low allele dropout rate, it has good prospects for screening aneuploid sperm and egg cells. Because its reads are longer than those of NGS platforms, it also has advantages in detecting SVs in highly repetitive or low-complexity genomic regions. Using Refresh-seq data, the inventors successfully performed whole-chromosome hetSV phasing on sperm cells and on female haploid germ cells, and analyzed the repetitive element characteristics of these SVs.

Brief description of the drawings

Figure 1 is a flow chart of library construction in Example 1.

Figure 2 shows the test results of enriching long genomic DNA fragments by PCR.

Figure 3 shows the simulated restriction fragments.

In the figure: a, distribution of simulated EcoRI digestion fragments; b, distribution of simulated SacI digestion fragments; c, distribution of simulated AsiSI digestion fragments.

Figure 4 shows the cross-contamination test for Refresh-seq (multiplexed).

Figure 5 shows the simultaneous detection of both alleles.

In the figure: a, sequencing volume and proportion of heterozygous SNPs detected for each HG002 cell; b, quantification of the proportion of heterozygous SNPs detected by the three methods at 0.25× sequencing depth; c, allele dropout rate calculated over regions covered by more than 5 reads.

Figure 6 shows the performance of Refresh-seq with different endonucleases and in different cell lines.

In the figure: a, sequencing volume and genome coverage of HG001 cells amplified by Refresh-seq (EcoRI/SacI) and Refresh-seq (multiplexed) (EcoRI/SacI/AsiSI), with SMOOTH-seq data from the HG002 cell line; b, sequencing volume and heterozygous SNP detection rate of HG001 cells amplified by Refresh-seq (EcoRI/SacI) and Refresh-seq (multiplexed) (EcoRI/SacI/AsiSI), with SMOOTH-seq data from the HG002 cell line; c, sequencing volume and genome coverage of HG002 cells amplified by Refresh-seq (EcoRI/SacI), Refresh-seq (multiplexed) (EcoRI/SacI/AsiSI) and SMOOTH-seq; d, sequencing volume and heterozygous SNP detection rate of HG002 cells amplified by Refresh-seq (EcoRI/SacI), Refresh-seq (multiplexed) (EcoRI/SacI/AsiSI) and SMOOTH-seq; e, sequencing depth of HG002 cells processed by SMOOTH-seq and by Refresh-seq and Refresh-seq (multiplexed) with different restriction endonucleases (EcoRI/SacI/AsiSI).

Figure 7 shows the application of Refresh-seq to sperm.

In the figure: a, schematic of meiosis in hybrid mouse sperm and of single-sperm Refresh-seq: mature sperm of B6D2F1 (B6×DBA F1 heterozygote) that have undergone meiotic homologous recombination are obtained, sorted by flow cytometry, and each single sperm is subjected to Refresh-seq; b, sequencing data volume and genome coverage of each sperm; sperm with genome coverage greater than 1% are selected for subsequent analysis, and the cutoff is marked with a red dashed line; c, sequencing data volume and genome coverage of each sperm cell that passed quality control, with a linear regression fitted and its 95% confidence interval; d, distribution of the average read length of single sperm amplified by Refresh-seq; e, distribution of the number of hetSNPs covered in each sperm; f, identification of diploid cells by the discontinuity score of each sperm (excluding the autosome with the highest frequency); the discontinuity score of diploid cells is far higher than that of haploid sperm, the red dashed line marks an inflection point, and cells beyond this point are labeled as potential diploid cells; g, distinguishing X sperm from Y sperm by the number of reads on the X and Y chromosomes.

Figure 8 shows the identification of aneuploid sperm by Refresh-seq.

In the figure: a, discontinuity score over all chromosomes in each sperm; diploid cells are labeled D1-D12 and aneuploid sperm cells are labeled A1-A7; b-h, discontinuity score of specific chromosomes in each sperm; diploid cells have higher discontinuity scores on most chromosomes, whereas aneuploid sperm have higher discontinuity scores only on the abnormally segregated chromosomes; i, the proportion of hetSNPs on the 19 autosomes of the 7 aneuploid sperm cells compared with the gold standard; blue dots indicate loss of a chromatid, red dots indicate gain of a chromatid, dot size indicates the deviation from the mean ratio, and verified aneuploid chromosomes are highlighted with rectangles; sperm A7 is more likely an unevenly amplified sample (technical error) than a true aneuploid; j, rate at which both alleles are covered in sperm cells; the heat map shows the heterozygosity rate of the 19 autosomes in 12 diploid cells (2N), 7 aneuploid sperm cells (1N±m) and several haploid sperm cells (1N).

Figure 9 shows the identification of structural variants and chromosome-scale phasing by Refresh-seq.

In the figure: a, distribution of the number of true-positive structural variants per sperm; b, length distribution of the identified true-positive SVs, with local peaks in SV length marked by orange dashed lines; c, accuracy of the SVs (deletions and insertions) detected by Refresh-seq and percentage of true-positive SVs for different numbers of supporting cells; d, accuracy of genome-wide SV phasing at chromosome scale; e, recall of correctly phased SVs; f, proportion of different element types among phased deletion events; g, proportion of different element types among phased insertion events.

Figure 10 shows the application of Refresh-seq to egg cells and polar bodies.

In the figure: a, schematic of sampling from hybrid female mice: B6D2F1 MII oocytes that have undergone meiotic homologous recombination are fertilized by DBA males or parthenogenetically activated to induce PB2 extrusion, yielding haploid PB2 and parthenogenetically activated oocytes as well as diploid PB1, MII oocytes and fertilized eggs, which are separated with a capillary; b, number and ploidy of the different cell types; c, sequencing data volume and genome coverage of each cell; d, distribution of crossover number in haploid female germ cells; e, resolution of crossover determination in female haploid cells; f, density plot of crossover positions over all chromosomes in male and female mice, showing crossover density from centromere to telomere.

Detailed description of the embodiments

To facilitate understanding of the present invention, it is described more fully below. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of the invention will be understood more thoroughly and completely.

Example 1

Fragment lengths were simulated from the human genome sequence and the recognition sequences of the selected restriction endonucleases. As shown in Figure 3, most fragments from simulated EcoRI and SacI digestion are 1-3 kb long, which makes these enzymes suitable for whole-genome amplification and sequencing. The AsiSI recognition sequence is 8 bp long and is more sparsely distributed across the genome, so building Refresh-seq libraries with AsiSI enables reduced-representation genome sequencing, in which specific regions are sequenced deeply with the same amount of data.
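As a non-limiting illustration, the fragment-length simulation described above can be sketched in Python. The sketch assumes a complete digest of a linear sequence; the function name `fragment_lengths` and the `ENZYMES` table are hypothetical helpers and not part of the disclosed method, while the recognition sequences and cut positions are the standard ones for EcoRI, SacI and AsiSI.

```python
import re
from typing import List

# Recognition sequence and top-strand cut offset (bases from the start of the
# site to the cut). Standard values: G^AATTC, GAGCT^C, GCGAT^CGC.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "SacI": ("GAGCTC", 5),
    "AsiSI": ("GCGATCGC", 5),
}


def fragment_lengths(seq: str, enzyme: str) -> List[int]:
    """Return fragment lengths from a complete digest of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    # A lookahead pattern reports every site start, including overlapping ones.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


def fraction_in_range(lengths: List[int], low: int = 1000, high: int = 3000) -> float:
    """Fraction of fragments whose length lies in [low, high] (1-3 kb by default)."""
    if not lengths:
        return 0.0
    return sum(low <= n <= high for n in lengths) / len(lengths)
```

Applied chromosome by chromosome to a reference genome, the resulting length lists could be plotted as histograms comparable to those described for Figure 3.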

The specific steps of Refresh-seq are shown in Figure 1. This example uses two human cell lines (HG002 and HG001). The cells were washed three times with PBS containing 0.1% BSA, and single cells were then picked with a mouth pipette or sorted by flow cytometry into 8-tube PCR strips containing 2.5 μL of lysis buffer. Histones were digested at 50°C for 3 h, and the protease was inactivated at 70°C for 30 min. The lysis buffer consisted of 10 mM Tris-EDTA (1 M Tris + 0.1 M EDTA), 1 mg/mL Qiagen protease, 0.3% Triton X-100, 20 mM KCl and 15 mM DTT.

After single-cell lysis, 0.5 μL of 10× digestion buffer, 1.9 μL of water and 0.1 μL of restriction endonuclease were added to digest the single-cell gDNA. The reaction program was adjusted to the restriction endonuclease used. For EcoRI (NEW ENGLAND BioLabs, cat# R3101L) and SacI (NEW ENGLAND BioLabs, cat# R3156S), digestion was carried out at 37°C for 15 min, followed by 65°C for 20 min to inactivate the enzyme. For AsiSI (NEW ENGLAND BioLabs, cat# R0630S), digestion was carried out at 37°C for 1 h, followed by 80°C for 20 min to inactivate the enzyme. End repair and A-tailing were then performed (Kapa Biosystems, Kapa HyperPrep kit, cat# KK8504), dsDNA adapters (NEBNext Singleplex Oligos for Illumina) were ligated to the A-tailed molecules, and USER enzyme (uracil-specific excision reagent, NEW ENGLAND BioLabs, cat# M5505L) was added to open the loop adapter into a "Y"-shaped adapter. Each sample was purified with 1× AMPure XP (BECKMAN COULTER, cat# A63882) and amplified with Barcode-P5 (GCTA-[24 bp P5-barcode 81-96]-TACACTCTTTCCCTACACGACGCTCTTCCGATCT) and Barcode-P3 (ATCG-[24 bp P3-barcode 1-24]-GACTGGAGTTCAGACGTGTGCT). The PCR program was 98°C for 45 s and 98°C for 15 s, followed by 20 cycles of 98°C for 15 s, 65°C for 30 s and 72°C for 5 min. The product was then purified twice with 0.7× AMPure XP (twice with 0.65× AMPure XP for haploid cells). The purified amplicons were quantified with the Equalbit 1× dsDNA HS Assay Kit. Samples were pooled according to sequencing requirements and then sequenced.

For Refresh-seq (multiplexed), this example uses barcoded adapters to increase throughput. The paired single-stranded oligonucleotides were first annealed into "Y"-shaped adapters: NEB same-A (5'-phosphorylated, GATCGGAAGAGCACACGTCTGAACTCCAGTC) and Barcoded-B (ACACTCTTTCCCTACACGAC-[24 bp adaptor-barcode 31-46]-GCTCTTCCGATC*T) were each dissolved in water to a 100 μM stock solution, mixed 1:1 and annealed by gradual cooling, giving barcoded adapters at a concentration of 50 μM. The genomic DNA fragmented by the restriction endonuclease was end-repaired, A-tailed and ligated to the barcoded adapters. Cells ligated to different barcoded adapters were pooled, purified with 1× AMPure XP, and amplified with Common-P5 (ACACTCTTTCCCTACACGAC) and Barcode-P3 (ATCG-[24 bp P3-barcode 1-24]-GACTGGAGTTCAGACGTGTGCT). The product was then purified twice with 0.7× AMPure XP. The purified amplicons were quantified with the Equalbit 1× dsDNA HS Assay Kit. Samples were pooled according to sequencing requirements and then sequenced (Figure 2).

Table 1. Primer sequences used in Refresh-seq

* denotes a phosphorothioate modification.

After the Refresh-seq library was sequenced with ONT nanopore sequencing and the raw sequencing data were obtained, the inventors' basic data processing consisted of aligning the reads to the reference genome, which included the following steps:

The raw data generated by ONT sequencing were converted to fastq format. Following the dual-end barcode structure of the Refresh-seq library, this example demultiplexed each single cell in two consecutive rounds with nanoplexer v0.1, and used Cutadapt v3.4 to remove the adapter sequences at both ends of the reads and to discard reads shorter than 500 bp. The reads were then aligned with minimap2 v2.24 to the human reference genome hg38 or the mouse reference genome mm10. This example used samtools v1.14 to filter out reads with a mapping quality below 30 and to remove PCR duplicates.
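As a non-limiting illustration, the two numerical thresholds stated above (read length of at least 500 bp, mapping quality of at least 30) can be expressed as a small pure-Python filter over SAM text records. This is only a sketch of the criteria: in the example itself these steps are performed with Cutadapt and samtools, the length filter is applied before alignment, and PCR duplicate removal is not reproduced here. The function name `filter_sam_records` is a hypothetical helper.

```python
from typing import Iterable, Iterator

MIN_READ_LEN = 500    # reads shorter than this are discarded
MIN_MAPQ = 30         # alignments below this mapping quality are discarded
FLAG_UNMAPPED = 0x4   # SAM flag bit for an unmapped read


def filter_sam_records(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and alignment lines that pass both thresholds."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: mapping quality
        seq = fields[9]         # column 10: read sequence
        if flag & FLAG_UNMAPPED:
            continue
        # Assumption: SEQ holds the full read sequence (not "*" and not hard-clipped).
        if mapq < MIN_MAPQ or len(seq) < MIN_READ_LEN:
            continue
        yield line
```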

Cross-contamination assessment: to assess cross-contamination in Refresh-seq (multiplexed), this example mapped reads against a combined human-mouse genome built from hg38 and mm10. The combined genome was indexed with minimap2 using the parameter '-I 10G'. For each single cell, the number and proportion of reads aligned to the mm10 and hg38 genomes were calculated. A cell was assigned to the species of the reference genome receiving the majority (more than 90%) of its reads. If more than 10% of a cell's reads aligned to the genome of the minor species, the cell was classified as cross-contaminated. The results are shown in Figure 4: none of the cells that passed quality control was classified as a human-mouse mixture, indicating that cross-contamination in Refresh-seq (multiplexed) is very low.
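As a non-limiting illustration, the species assignment rule described above can be sketched as follows. The function name `classify_cell` and its return labels are hypothetical; only the 10% minor-species threshold comes from the text.

```python
def classify_cell(human_reads: int, mouse_reads: int, minor_max: float = 0.10) -> str:
    """Assign a cell to a species, or flag it as cross-contaminated.

    A cell is flagged when more than `minor_max` of its reads map to the
    minor species; otherwise it is assigned to the majority species.
    """
    total = human_reads + mouse_reads
    if total == 0:
        return "no_reads"
    minor_fraction = min(human_reads, mouse_reads) / total
    if minor_fraction > minor_max:
        return "cross_contaminated"
    return "human" if human_reads > mouse_reads else "mouse"
```

For instance, a cell with 990 reads on hg38 and 10 reads on mm10 would be assigned to human, whereas a cell with 700 and 300 reads would be flagged as cross-contaminated.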

SNP site heterozygosity analysis: this example used WhatsHap v1.5 to calculate the likelihoods of all three genotypes (0/0, 0/1, 1/1) at given heterozygous SNP sites and to write them, together with the genotype predictions, to a VCF file. The command run was "whatshap genotype --reference ref.fasta -o genotyped.vcf variants.vcf reads.bam". The variants.vcf file is the HG002 or HG001 SNP benchmark set downloaded from GIAB.

The results are shown in Figure 5. Compared with SMOOTH-seq, which amplifies genomic fragments produced by random Tn5 cutting, EcoRI-based Refresh-seq achieves better amplification uniformity, higher whole-genome coverage and a higher biallelic detection rate at single-nucleotide polymorphism sites. At a sequencing depth of ~0.25×, Refresh-seq detected ~1.64% of heterozygous SNPs, about 5 times the ~0.33% detected by SMOOTH-seq. Among heterozygous SNP sites covered by more than 5 reads, the average biallelic capture rate of Refresh-seq was 62%, significantly higher than the 10% of SMOOTH-seq.
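As a non-limiting illustration, the biallelic capture rate over well-covered heterozygous SNP sites can be computed as sketched below. The sketch assumes that per-site reference and alternate read counts are already available and simply checks whether both alleles were observed; the example itself relies on WhatsHap genotype calls, and the function name `biallelic_capture_rate` is hypothetical.

```python
from typing import Iterable, Tuple


def biallelic_capture_rate(sites: Iterable[Tuple[int, int]], min_reads: int = 6) -> float:
    """Fraction of known heterozygous SNP sites at which both alleles are seen.

    `sites` holds (ref_read_count, alt_read_count) per site. Only sites covered
    by at least `min_reads` reads (that is, more than 5 reads) are considered.
    """
    covered = [(ref, alt) for ref, alt in sites if ref + alt >= min_reads]
    if not covered:
        return float("nan")
    both = sum(1 for ref, alt in covered if ref > 0 and alt > 0)
    return both / len(covered)
```

Under this definition, the allele dropout rate over the same sites is one minus the returned value.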

Figure 6 shows that Refresh-seq and Refresh-seq (multiplexed) perform consistently on HG001 and HG002 cells. Refresh-seq and Refresh-seq (multiplexed) with EcoRI or SacI give higher genome coverage than SMOOTH-seq and detect more heterozygous SNPs, whereas Refresh-seq (multiplexed) with AsiSI reaches a greater sequencing depth at the same sequencing volume.

The above experiments confirm the general applicability and advantages of Refresh-seq: it offers better genome coverage and a higher probability of detecting both alleles simultaneously. In addition, Refresh-seq with an endonuclease that has a long recognition sequence enriches reads and thereby enables reduced-representation genome sequencing.

Example 2 Application of Refresh-seq to single-sperm sequencing

In this example, Refresh-seq was performed with EcoRI. The library construction steps were the same as in Example 1, except that the amplified library was purified twice with 0.65× AMPure XP. A total of 676 sperm cells were amplified with Refresh-seq (single-tube version) and 152 sperm cells with Refresh-seq (multiplexed). Because Refresh-seq and Refresh-seq (multiplexed) showed no difference in the detection of crossover events, the two versions are not distinguished in the subsequent analysis.

The experimental results are shown in Figure 7: Refresh-seq achieves sufficient genome coverage at low sequencing volume. At a sequencing depth of ~0.1-0.3×, 700 of the 828 sperm passed quality control with genome coverage above 1% (Figure 7b). Genome coverage increased almost linearly with the amount of sequencing data, and at a sequencing volume of 0.1-1 Gb the average coverage was approximately 5% (Figure 7c). The average read length was 1.9 kb (Figure 7d), and the average number of reads per sperm was 143,914. On average, ~250,000 hetSNPs were detected per sperm (Figure 7e), and the accuracy of SNP detection exceeded 98.9%. By defining a discontinuity score (the frequency at which consecutive SNPs switch between paternal and maternal origin), Refresh-seq can efficiently screen out contaminating diploid cells and accurately distinguish X sperm from Y sperm. The 700 sperm cells that passed quality control comprised 688 haploid sperm cells and 12 contaminating diploid cells (Figure 7f). The diploid cells were labeled D1 to D12 (Figure 7f), and their identity was verified with distribution plots. X sperm cells and Y sperm cells were then distinguished by the number and proportion of reads mapped to the X and Y chromosomes (Figure 7g). In total, 344 X sperm cells and 329 Y sperm cells were identified, and 8 sperm cells could not be assigned (gain or loss of a sex chromosome). The ratio of X sperm to Y sperm was close to 1:1, consistent with Mendel's law of segregation.
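As a non-limiting illustration, the discontinuity score defined above can be sketched as the fraction of adjacent SNP pairs whose parental origin differs. The exact normalization used by the inventors is not stated, so dividing by the number of adjacent pairs is an assumption, and the function name `discontinuity_score` is hypothetical.

```python
from typing import Sequence


def discontinuity_score(origins: Sequence[str]) -> float:
    """Fraction of adjacent SNP pairs that switch parental origin.

    `origins` lists the parental origin of each detected SNP (for example "B"
    for B6 and "D" for DBA), ordered by genomic position on one chromosome.
    """
    if len(origins) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(origins, origins[1:]) if a != b)
    return switches / (len(origins) - 1)
```

A haploid chromosome with a single crossover gives a score close to zero, whereas a chromosome present in two parental copies, where dropout makes the observed allele alternate at random, gives a much higher score.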

Example 3 Application of Refresh-seq to the identification of aneuploidy

Because Refresh-seq has a stronger ability to detect both alleles simultaneously, aneuploidy is first screened by calculating the discontinuity score of each chromosome. The heterozygosity of SNP sites is then used to confirm chromosome gain events: if the copy number of a chromosome has increased, both alleles are often detected at the same SNP site on that chromosome in the gamete of the heterozygous animal, so a chromosome gain is confirmed by a sharp rise in the number of sites at which both alleles are detected within 1 Mb windows. When a chromosome is lost, the number of detectable SNP sites drops sharply compared with normal, so chromosome loss events can be determined from the change in the number of SNPs detected per 1 Mb window.
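As a non-limiting illustration, the per-window counts used above can be tabulated as follows. The sketch assumes 0-based SNP positions on a single chromosome and an input in which each SNP records how many distinct alleles were observed; the function name `bin_snps` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_SIZE = 1_000_000  # 1 Mb windows, as in the text


def bin_snps(snps: Iterable[Tuple[int, int]]) -> Dict[int, Tuple[int, int]]:
    """Count detected SNPs and biallelic SNPs in each 1 Mb window.

    `snps` holds (position, alleles_seen) per detected SNP, where alleles_seen
    is 1 or 2. Returns {window_index: (snps_detected, snps_with_both_alleles)}.
    """
    detected: Dict[int, int] = defaultdict(int)
    biallelic: Dict[int, int] = defaultdict(int)
    for position, alleles_seen in snps:
        window = position // BIN_SIZE
        detected[window] += 1
        if alleles_seen == 2:
            biallelic[window] += 1
    return {w: (detected[w], biallelic[w]) for w in sorted(detected)}
```

A run of windows with a sharply higher biallelic count would point to a gain, and a run with a sharply lower detected count would point to a loss.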

The experimental results are shown in Figure 8. Refresh-seq can identify aneuploidy in several ways. The genome of a haploid sperm contains only one of the two parental sets of chromosomes. When a chromosome gain occurs, that chromosome carries both parental copies. Ideally, both the paternal and the maternal genotype would be detected at every SNP site. Because allele dropout is widespread, however, most SNP sites yield only one genotype, so within the gained region the paternal and maternal genotypes alternate at random along the chromosome. Parental genotypes therefore alternate far more often there than in haploid regions; that is, the discontinuity score rises significantly. The first method accordingly calculates the discontinuity score of each chromosome (Figure 8a-h), and it screened out single sperm cells A1, A2, A3, A4 and A6 as carrying chromosome gains.

Under random amplification, a chromosome gain means that more DNA fragments are captured and a chromosome loss means that fewer are captured. In the sequencing data, gained regions are covered by more reads and yield more SNPs, whereas lost regions are covered by fewer reads and yield fewer SNPs. The second method therefore identifies chromosome gains and losses from how far the SNP count of a chromosome deviates from the mean SNP count of all other chromosomes (Figure 8i).

The third method follows the same principle as the first and exploits the ability of Refresh-seq to capture both alleles at once: more SNP sites with two genotypes are detected on a gained chromosome, that is, heterozygosity increases, whereas a chromosome loss appears as reduced heterozygosity (Figure 8j).

The aneuploid chromosomes found by the three methods corroborate one another and can be further verified with chromosome distribution maps and CNV analysis. In the end, 6 sperm with autosomal aneuploidy were found: A1, A3 and A6 had chromosome gains, A5 had a chromosome loss, and A4 and A6 had both a gain and a loss on the same chromosome (chr3). Sperm A7 is more likely an unevenly amplified sample (a technical artifact) than a true aneuploid.
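The first two screening methods can be illustrated with a short Python sketch. The text does not give an exact formula for the discontinuity score, so the definition used here (the fraction of adjacent SNP pairs whose parental origin differs) is an assumption, as are the function names and the use of a z-score for the SNP count deviation. The SNP counts are assumed to be already normalized for chromosome length.

```python
from statistics import mean, pstdev


def discontinuity_score(genotypes):
    """Fraction of adjacent SNP pairs whose parental origin differs.

    genotypes: parental labels of the SNPs ordered along one chromosome,
    e.g. "PPPMMM" (P = paternal, M = maternal). Assumed definition.
    """
    if len(genotypes) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(genotypes, genotypes[1:]))
    return switches / (len(genotypes) - 1)


def snp_count_deviation(counts, chrom):
    """Deviation of one chromosome's SNP count from the mean of all others.

    counts: dict mapping chromosome name to a length-normalized SNP count
    (assumed input). Returns a z-score, or 0.0 if the others do not vary.
    """
    others = [n for c, n in counts.items() if c != chrom]
    mu = mean(others)
    sd = pstdev(others)
    return (counts[chrom] - mu) / sd if sd > 0 else 0.0


haploid_like = discontinuity_score("PPPPMMMM")  # one switch: a crossover
gain_like = discontinuity_score("PMPMPMPM")     # random alternation
z = snp_count_deviation({"chr1": 200, "chr2": 90, "chr3": 110}, "chr1")
```

A haploid chromosome with a single crossover gives a low score, whereas a gained chromosome whose parental genotypes alternate gives a score close to 1. A strongly positive deviation suggests a gain and a strongly negative one a loss.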

Example 4 Application of Refresh-seq to the identification of structural variations in sperm

In this example, Refresh-seq uses cuteSV, a sensitive and fast bioinformatics tool for third-generation sequencing data, to detect structural variations (SVs) in the long reads generated by Nanopore. The Nanopore-specific default parameters were used, with the minimum number of supporting reads set to 1 to reach single-cell, single-molecule resolution. To assess how the accuracy of SV calls depends on the number of supporting cells, SURVIVOR was first used to merge the SVs of all cells, and the accuracy of SVs supported by different numbers of cells was calculated as accuracy = true positives / (true positives + false positives). The reference set was built from third-generation Nanopore sequencing data generated from a large number of single sperm cells.

For haplotype phasing of SVs, a 0/1 matrix was first built from the parental genotype of each reference-set SV in each single sperm, where 0 denotes the maternal C57 type and 1 denotes the paternal DBA type. An SV was considered consistent with a reference-set SV if its length was similar to that of the reference SV and its position lay within ±100 bp; in that case it was given the genotype of the matching reference SV, and otherwise the opposite genotype. The resulting matrix was filtered with `hapiFrameSelection` from the R package Hapi to remove SVs supported by fewer than 5 sperm, and the 100 cells with the most SVs were then selected as the draft framework for subsequent phasing. To improve phasing accuracy, the draft framework was corrected with an HMM: at each position, if more than half of the cells indicated an error, the genotype was flipped. This yields the basic framework, in which missing genotypes were filled in iteratively with `imputationFun1` by reference to the other cells. An initial haplotype phasing was then performed with `hapiPhase`, and after correction with `hapiBlockMPR`, `hapiAssemble` was used to assemble high-resolution, highly consistent haplotypes.
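The reference matching rule and the accuracy formula above can be sketched in Python as follows. This is an illustrative sketch only: the dictionary layout of an SV record, the requirement that the SV types match, and the `min_len_ratio` threshold used to decide that two lengths are "similar" are assumptions, since the text specifies only the ±100 bp position tolerance.

```python
from collections import defaultdict


def matches(sv, ref, max_shift=100, min_len_ratio=0.7):
    """Return True if sv agrees with a reference SV.

    max_shift is the ±100 bp tolerance from the text. min_len_ratio is a
    hypothetical cutoff for "similar length"; matching on svtype is also an
    assumption of this sketch.
    """
    if sv["chrom"] != ref["chrom"] or sv["svtype"] != ref["svtype"]:
        return False
    if abs(sv["pos"] - ref["pos"]) > max_shift:
        return False
    short, long_ = sorted((abs(sv["length"]), abs(ref["length"])))
    return long_ > 0 and short / long_ >= min_len_ratio


def accuracy_by_support(merged_svs, reference):
    """accuracy = TP / (TP + FP), grouped by number of supporting cells.

    merged_svs: SV records carrying an extra "n_cells" key (assumed layout),
    for example as derived from a merged call set.
    """
    tp = defaultdict(int)
    total = defaultdict(int)
    for sv in merged_svs:
        total[sv["n_cells"]] += 1
        if any(matches(sv, ref) for ref in reference):
            tp[sv["n_cells"]] += 1
    return {k: tp[k] / total[k] for k in sorted(total)}


reference = [{"chrom": "chr1", "pos": 1000, "svtype": "DEL", "length": 180}]
merged = [
    {"chrom": "chr1", "pos": 1050, "svtype": "DEL", "length": 170, "n_cells": 3},
    {"chrom": "chr1", "pos": 5000, "svtype": "DEL", "length": 180, "n_cells": 1},
    {"chrom": "chr1", "pos": 1090, "svtype": "DEL", "length": 200, "n_cells": 1},
]
result = accuracy_by_support(merged, reference)
```

Grouping by the number of supporting cells makes it possible to read off how accuracy changes as support increases, which is the quantity reported for this example.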

The results are shown in Figure 9. Refresh-seq can identify structural variations in sperm, detecting 973 structural variation events per cell on average (Figure 9a). The length distribution of all detected structural variations shows two peaks, at about 180 bp and at 6-7 kb, which correspond to B1 elements (the mouse equivalent of human Alu) and LINE1 elements, respectively (Figure 9b). The accuracy of the structural variation events detected by Refresh-seq reaches 80% when they are supported by more than three cells (Figure 9c), and the accuracy of chromosome-scale haplotype phasing is as high as 98% (Figure 9d). The phased structural variations were also successfully annotated with genomic elements (Figure 9e-g).

Example 5 Application of Refresh-seq to the identification of egg cells and polar bodies

In this example, egg cells and polar bodies were obtained by fertilization and parthenogenetic activation, and Refresh-seq was performed with EcoRI; the library construction steps were the same as in Example 1. In total, 185 second polar bodies, 87 parthenogenetically activated egg cells, 132 second polar bodies, 33 cells at the second meiotic division and 26 zygotes were collected. The second polar bodies and parthenogenetically activated eggs are haploid cells, in which 14 crossover events were detected on average.

The experimental results are shown in Figure 10. Beyond male sperm cells, Refresh-seq also performs well on female germ cells. It likewise achieves sufficient genome coverage at low sequencing depth; diploid cells reach higher genome coverage at the same sequencing depth, and genome coverage keeps rising as more sequencing data are added, which indicates that coverage has not saturated (Figure 10c). An average of 14 crossover events was detected in second polar bodies and parthenogenetically activated eggs (Figure 10d), with 6 to 25 crossover events per cell. The median crossover resolution is 283 kb, so Refresh-seq also yields high-resolution crossover data from female haploid germ cells with shallow sequencing. The crossover density plot shows that in female mice crossovers are relatively sparse near centromeres and more frequent near subtelomeres, although the subtelomeric enrichment is weaker in females than in males (Figure 10f).
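To make the notions of a crossover event and of crossover resolution concrete, the following Python sketch calls crossovers on one chromosome of a haploid cell. It is a simplified illustration under stated assumptions, not the procedure used in this example: a crossover is taken to be any change of parental label between consecutive informative SNPs, and the resolution is taken to be the distance between the two flanking SNPs. The labels "A" and "B" and the function names are hypothetical. A practical caller would also require several consecutive supporting SNPs on each side to guard against genotyping errors.

```python
from statistics import median


def call_crossovers(snps):
    """Return the intervals in which the parental label changes.

    snps: list of (position, label) sorted by position for one chromosome of
    a haploid cell, where label is the parental haplotype ("A" or "B").
    """
    return [(p1, p2)
            for (p1, h1), (p2, h2) in zip(snps, snps[1:])
            if h1 != h2]


def crossover_resolution(intervals):
    """Median distance between the SNPs flanking each crossover."""
    return median(right - left for left, right in intervals)


snps = [(100, "A"), (5_000, "A"), (300_000, "B"),
        (900_000, "B"), (1_500_000, "A")]
intervals = call_crossovers(snps)
resolution = crossover_resolution(intervals)
```

The denser the informative SNPs around a switch point, the narrower the interval and the better the resolution, which is why SNP coverage at shallow depth matters for this analysis.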

The embodiments described above represent only several implementations of the present invention. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and all of these fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims

1. A method for detecting genomic information based on restriction endonucleases, characterized by comprising the following steps:

(1) Use restriction endonucleases to cut the genome of the sample to obtain genomic DNA fragments of different lengths;

(2) Enrich long genomic DNA fragments from amplified or non-amplified genomic samples;

(3) Sequence the enriched long genomic DNA fragments on a long-read sequencing platform;

(4) Analyze the sequencing data by computer: map the long genomic DNA fragments to genomic regions, and obtain the sequence information of the sample in those regions through alignment and calculation.

2. The method for detecting genomic information based on restriction endonucleases according to claim 1, characterized in that the restriction endonuclease is a restriction endonuclease that recognizes a specific sequence of 4-10 bp, preferably a restriction endonuclease that recognizes a specific sequence of 6 bp or 8 bp; more preferably, EcoRI and SacI are selected when the goal is to obtain higher coverage, and AsiSI is selected when the goal is to achieve a better enrichment effect; when the goal is to obtain higher coverage, the resulting DNA fragments are of similar length and concentrated between 1-3 kb.

3. The method for detecting genomic information based on restriction endonucleases according to claim 1, characterized in that the genomic sample is cell-free DNA, DNA released by cells into the culture medium, one or more cells or nuclei, viruses, mitochondria or chloroplasts.

4. The method for detecting genomic information based on restriction endonucleases according to claim 1, characterized in that in step (2) the genomic DNA fragments undergo end repair, A-tailing and adapter ligation, the genomic information is amplified by PCR, and long genomic DNA fragments are enriched after amplification.

5. The method for detecting genomic information based on restriction endonucleases according to claim 4, characterized in that the adapter used for amplification in step (2) is an adapter without a barcode or an adapter with a barcode; when adapters without barcodes are used, subsequent purification and library construction are carried out separately for each PCR tube, and the 5'-end and 3'-end adapters are attached during PCR amplification; when adapters with barcodes are used, after adapter ligation the sample tubes carrying different barcodes are pooled, purified and amplified in a single tube, and the 3'-end adapter is attached through amplification.

6. The method for detecting genomic information based on restriction endonucleases according to claim 1, characterized in that the sequencing platform in step (3) is a long-read sequencing platform; optionally, the sequencing platform is the Nanopore sequencing platform or the PacBio sequencing platform.

7. The method for detecting genomic information based on restriction endonucleases according to claim 1, characterized in that the restriction endonuclease in step (1) is selected by simulating restriction digestion of the genome of the target species and inferring the distribution of genomic fragments after digestion.

8. The method for detecting genomic information based on restriction endonucleases according to claim 1, characterized in that the long genomic DNA fragment in step (2) refers to a fragment longer than 700 nucleotide pairs, preferably longer than 1000 nucleotide pairs.

9. The method for detecting genomic information based on restriction endonucleases according to claim 1, characterized in that the amplification in step (2) is a polymerase chain reaction, long genomic DNA fragments are enriched by the polymerase chain reaction together with fragment size selection, and the fragment size selection is gel-based or magnetic-bead-based.

10. The method for detecting genomic information based on restriction endonucleases according to claim 1, characterized in that the sequence information in step (4) includes one or more of the following: 1) fragment length information; 2) fragment abundance information; 3) heterozygous single nucleotide polymorphism information; 4) genome structural variation information, which includes insertions, deletions, duplications, inversions and translocations; 5) repeat sequence information, which includes short interspersed elements, long interspersed elements, long terminal repeat elements, DNA repeat elements, simple repeats and satellites; 6) genome copy number variation information; 7) allele information; 8) linkage relationships of allele information; 9) epigenetic information, which includes DNA methylation and DNA hydroxymethylation.

11. The method for detecting genomic information based on restriction endonucleases according to claim 10, characterized in that the allele information is the type of variation at allelic positions on homologous chromosomes, including allelic SNPs, SVs, repeat sequence information and epigenetic information; the repeat sequence information includes short interspersed elements, long interspersed elements, long terminal repeat elements, DNA repeat elements, simple repeats and satellites; the epigenetic information includes DNA methylation and DNA hydroxymethylation.
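As an illustration of the enzyme selection by simulated digestion recited in claim 7, the following Python sketch digests a sequence in silico and summarizes the resulting fragment lengths. It is a minimal sketch under stated assumptions and not the simulation used in the invention: the sequence is treated as linear, only the top-strand cut position is modelled, and the function names and the default 1-3 kb range (taken from claim 2) are choices made for the example. The recognition sites and cut offsets are the standard ones for EcoRI (G^AATTC), SacI (GAGCT^C) and AsiSI (GCGAT^CGC).

```python
# Recognition site and top-strand cut offset for each enzyme.
SITES = {
    "EcoRI": ("GAATTC", 1),
    "SacI": ("GAGCTC", 5),
    "AsiSI": ("GCGATCGC", 5),
}


def digest(seq, site, cut_offset):
    """Return the fragment lengths after cutting a linear sequence."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def fraction_in_range(lengths, lo=1000, hi=3000):
    """Fraction of bases that fall in fragments of lo..hi bp."""
    total = sum(lengths)
    kept = sum(n for n in lengths if lo <= n <= hi)
    return kept / total if total else 0.0


toy = "AAAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
fragments = digest(toy, *SITES["EcoRI"])
share = fraction_in_range([500, 1500, 2000, 4000])
```

Running `digest` over each chromosome of a reference genome for every candidate enzyme and comparing the resulting `fraction_in_range` values would give one simple way to rank enzymes by the share of the genome that falls into the desired fragment size range.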