CN117126827A

CN117126827A - Fusion protein, base editing system mediated by a uracil-N-glycosylase mutant, and applications thereof

Info

Publication number: CN117126827A
Application number: CN202310733252.4A
Authority: CN
Inventors: 常兴; 何燕
Original assignee: Westlake University
Current assignee: Westlake University
Priority date: 2023-06-20
Filing date: 2023-06-20
Publication date: 2023-11-28

Abstract

The invention discloses a fusion protein, a base editing system mediated by a uracil-N-glycosylase mutant, and applications thereof. The fusion protein comprises a nuclease and a uracil-N-glycosylase mutant, wherein the uracil-N-glycosylase mutant is either linked to the nuclease or inserted into the nuclease. The nuclease is an SpCas9 protein carrying the D10A mutation, an SpryCas9 protein carrying the D10A mutation, or another Cas enzyme whose nuclease activity has been removed while its DNA-unwinding (helicase) activity is retained; the uracil-N-glycosylase mutant is a cytosine-N-glycosylase or a thymine-N-glycosylase. Guided by an sgRNA, the editing system recognizes the target sequence and excises cytosine or thymine within it, generating base mutations to guanine. It enables efficient site-directed modification of the genomic DNA of mammalian cell lines, human primary T cells, mouse embryos and Escherichia coli, and provides a powerful tool for treating genetic diseases caused by gene mutations and for establishing related experimental animal models.

Description

Fusion protein, base editing system mediated by a uracil-N-glycosylase mutant, and applications thereof

Technical Field

The invention belongs to the technical field of gene editing, and particularly relates to a fusion protein, a base editing system (CGBE, TSBE) and application thereof.

Background

Single nucleotide variations account for about two thirds of human genetic diseases, with about 59,813 pathogenic single nucleotide variations identified. For example, in sickle cell anemia, the gene encoding the beta chain of hemoglobin carries a CTT > CAT base substitution that changes glutamic acid to valine, resulting in structural and functional abnormalities of hemoglobin. As another example, 99% of hyperphenylalaninemia or phenylketonuria (PKU) cases are caused by mutations in the PAH gene; more than 20 PAH mutations have been confirmed in China, accounting for about 80% of mutant PAH alleles, among which 259C > T (48.3%) and 286G > A (15.5%) are hot-spot mutations. At present, therapies and palliative drugs for genetic diseases caused by base mutations are very limited and their efficacy is unsatisfactory, so safer, more efficient and more economical treatments are urgently needed.

CRISPR-Cas9 is an adaptive immune defense that bacteria and archaea evolved to combat invading viruses and foreign DNA, and CRISPR-Cas9 gene editing is a technology for making specific DNA modifications in targeted genes. CRISPR-Cas9-based gene editing has great promise in gene therapy, for example in treating blood disorders, tumors and other genetic diseases. CRISPR/Cas9 creates DNA double-strand breaks (DSBs) at the target site and thereby induces the homology-directed repair (HDR) and non-homologous end joining (NHEJ) pathways in cells, enabling site-directed knockout, substitution, insertion and other modifications of genomic DNA. However, DSB-initiated DNA repair rarely yields efficient and stable single-base mutations, which greatly limits the broad application of CRISPR-Cas9. Single base editing systems effectively make up for this deficiency, and researchers have begun using them to create and correct animal models of human diseases, including Duchenne muscular dystrophy, progeria (premature aging syndrome), and age-related macular degeneration.

Single base editing systems (base editing) fuse different base-modifying enzymes to a Cas9 nickase (D10A) and introduce single-nucleotide mutations in a specific region of a gene. The two most widely used single base editors are the cytosine base editor (CBE) and the adenine base editor (ABE), which achieve precise C·G to T·A or A·T to G·C substitutions, respectively, within a window of nucleotides 4-8 (counted from the PAM-distal end) without DNA double-strand breaks. However, CBE and ABE can only produce transition mutations and cannot produce transversion mutations. GBE, which produces C-to-G transversions, and the prime editor (PE), which can produce any base-to-base change, therefore appeared in succession. Nevertheless, the C-to-G editing efficiency of existing GBEs shows site preference and their product purity is not high, while PE does not work well across different cell types. Moreover, existing base editing systems suffer from PAM preference or low targeting efficiency at some sites, and the size of base editor expression plasmids far exceeds the packaging capacity of adenovirus, which hinders clinical research and application. Developing a novel base editor that is free of PAM restrictions, generates transversion mutations efficiently and has a smaller expression plasmid is therefore key to current gene editing research and clinical application.
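The distinction between transition and transversion mutations drawn above can be made concrete with a short Python sketch. The function name and the example list are illustrative only and are not part of the invention; the sketch simply applies the textbook definition (a transition stays within purines or within pyrimidines, a transversion crosses between them).

```python
# Purines and pyrimidines; a transition stays within one class,
# a transversion crosses from one class to the other.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change ref>alt."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unknown base: {base}")
    if ref == alt:
        raise ValueError("ref and alt must differ")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# CBE-type (C>T) and ABE-type (A>G) edits versus the C>G and T>G edits
# that the background section calls transversions.
for ref, alt in [("C", "T"), ("A", "G"), ("C", "G"), ("T", "G")]:
    print(f"{ref}>{alt}: {classify_substitution(ref, alt)}")
```

Under this definition C>T and A>G are transitions, while C>G and T>G are transversions, which is why editors limited to deamination chemistry cannot reach the latter.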

Uracil-DNA glycosylase (UDG), also known as uracil-N-glycosylase (UNG), is the first enzyme recruited in the base excision repair pathway and plays an important role in antibody class switch recombination (CSR) and somatic hypermutation (SHM). Its function is to excise uracil that arises in DNA from cytosine deamination or from misincorporation of deoxyuridine monophosphate (dUMP) during DNA replication, leaving an apyrimidinic (abasic) site. The abasic site is a non-informative template for DNA synthesis, and DNA polymerases cannot replicate across it normally; however, studies have found that REV1 acts as a scaffold molecule that stabilizes UNG and recruits translesion DNA polymerases, creating both transition and transversion mutations. Studies have also shown that introducing mutations at the catalytic site of UNG yields enzymes with different catalytic activities. For example, mutating Asn at position 204 of UNG to Asp yields a cytosine-N-glycosylase (CDG) with cytosine-excising activity, and mutating Tyr at position 147 of UNG to Ala yields a thymine-N-glycosylase (TDG) with thymine-excising activity; in vitro experiments confirm that CDG is more active on single-stranded DNA than on double-stranded DNA, whereas TDG is more active on double-stranded DNA than on single-stranded DNA. In yeast cells lacking the AP endonuclease genes, over-expression of CDG generates mainly C > G transversions, and over-expression of TDG generates mainly T > G transversions. The transversions probably arise because the abasic sites generated by UNG, CDG and TDG in yeast cells tend to pair with C during DNA damage repair and replication, so that after one round of replication the abasic site is converted to a G.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a novel cytosine editor and thymine editor which can generate efficient transversion mutation and are not limited by PAM sequences.

In a first aspect, the invention provides a fusion protein comprising a nuclease and a uracil-N-glycosylase mutant, wherein the uracil-N-glycosylase mutant is linked to the nuclease or inserted into the nuclease. The nuclease is a D10A-mutated SpCas9 protein, a D10A-mutated SpryCas9 protein, or another Cas enzyme (such as Cas12f or IscB) whose nuclease activity has been removed while its DNA-unwinding (helicase) activity is retained; the uracil-N-glycosylase mutant is a cytosine-N-glycosylase or a thymine-N-glycosylase. The fusion protein has a structure shown in one of the following formulas:

A-B-L1-C-L2-A    formula (I);

A-C-L1-B-L2-A    formula (II);

A-C^(1-1046)-L3-B-L3-C^(1063-1367)-L2-A    formula (III);

A-C^(1-1009)-L3-B-L3-C^(1011-1367)-L2-A    formula (IV);

A-C^(1-1028)-L3-B-L3-C^(1030-1367)-L2-A    formula (V);

A-C^(1-1248)-L3-B-L3-C^(1250-1367)-L2-A    formula (VI);

wherein A is a nuclear localization signal or is absent, B is a cytosine-N-glycosylase (CDG) or thymine-N-glycosylase (TDG), C is the nuclease, the superscript denotes the range of amino acid residues of the nuclease, and L1, L2 and L3 are each independently absent or a connecting peptide.
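As an illustration of how the formulas above map onto residue coordinates, the following Python sketch assembles the core of each architecture from placeholder strings. It is only a sketch under stated assumptions: the sequences are made-up placeholders rather than the sequences of any SEQ ID NO, the function and variable names are hypothetical, and the nuclear localization signal (A) and the connecting peptide L2 are omitted for brevity.

```python
# Insertion boundaries read from formulas (III)-(VI): (last nuclease residue
# kept before the insert, first nuclease residue kept after the insert),
# both 1-based and inclusive.
FORMULA_SITES = {
    "III": (1046, 1063),
    "IV": (1009, 1011),
    "V": (1028, 1030),
    "VI": (1248, 1250),
}


def terminal_fusion(nuclease, glycosylase, l1="", glycosylase_first=True):
    """Core of formula (I), B-L1-C, or of formula (II), C-L1-B."""
    if glycosylase_first:
        parts = (glycosylase, l1, nuclease)
    else:
        parts = (nuclease, l1, glycosylase)
    return "".join(parts)


def internal_insertion(nuclease, glycosylase, formula, l3=""):
    """Core of formulas (III)-(VI): C^(1-m)-L3-B-L3-C^(n-end)."""
    left_end, right_start = FORMULA_SITES[formula]
    if right_start > len(nuclease):
        raise ValueError("nuclease is shorter than the insertion site")
    n_part = nuclease[:left_end]          # residues 1..left_end
    c_part = nuclease[right_start - 1:]   # residues right_start..end
    return n_part + l3 + glycosylase + l3 + c_part


# Hypothetical placeholders, NOT real protein sequences.
nuclease = "C" * 1367     # stands in for a 1367-residue nuclease
glycosylase = "B" * 200   # stands in for CDG or TDG
l3 = "GGSGG"              # made-up 5-residue linker

for name in FORMULA_SITES:
    core = internal_insertion(nuclease, glycosylase, name, l3)
    replaced = len(nuclease) + len(glycosylase) + 2 * len(l3) - len(core)
    print(f"formula ({name}): core length {len(core)}, nuclease residues replaced: {replaced}")
```

With these boundaries, formula (III) drops 16 nuclease residues (1047-1062), while formulas (IV) to (VI) each drop a single residue (1010, 1029 and 1249, respectively).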

The fusion protein is designed so that it forms a complex with an sgRNA to constitute a base editing system, in which the sgRNA guides the fusion protein to recognize and cut the target sequence, producing C-to-G or T-to-S single-base conversions. The cytosine-N-glycosylase (CDG) or thymine-N-glycosylase (TDG) is a uracil-N-glycosylase mutant, including human uracil-N-glycosylase mutants (hCDG and hTDG), E. coli uracil-N-glycosylase mutants (eCDG and eTDG), and nematode uracil-N-glycosylase mutants (cCDG and cTDG). The amino acid sequence of hCDG is shown in SEQ ID NO. 1; the amino acid sequence of hTDG is shown in SEQ ID NO. 2, and the modified form hTDG^m comprises the sequences shown in SEQ ID NOs. 7-8; the amino acid sequence of eCDG is shown in SEQ ID NO. 3; the amino acid sequence of eTDG is shown in SEQ ID NO. 4; the amino acid sequence of cCDG is shown in SEQ ID NO. 5; the amino acid sequence of cTDG is shown in SEQ ID NO. 6.

Preferably, the fusion protein further comprises a nuclear localization signal fused to the N-terminus and the C-terminus of the fusion protein.

Preferably, the amino acid sequence of the nuclear localization signal is shown in SEQ ID NO. 9.

Preferably, the fusion protein further comprises connecting peptides L1 and L3 for linking the nuclease and the uracil-N-glycosylase mutant, and/or a connecting peptide L2 for linking the fusion protein and the nuclear localization signal.

The connecting peptide L1 links the nuclease directly to the uracil-N-glycosylase mutant, and its amino acid sequence is shown in SEQ ID NO. 10. The connecting peptide L3 links the nuclease and the uracil-N-glycosylase mutant when the mutant is inserted into the nuclease; it is either 5 amino acids long (SEQ ID NO. 11) or 10 amino acids long (SEQ ID NO. 12). The amino acid sequence of the connecting peptide L2 is SGGS.

Preferably, the fusion protein comprises:

N-CGBE, the amino acid sequence of which is shown in SEQ ID NO. 13; C-CGBE (SEQ ID NO. 14); CE-CGBE-1 (SEQ ID NO. 15); CE-CGBE-2 (SEQ ID NO. 17); CE-CGBE-3 (SEQ ID NO. 18); pTac-CE-CGBE (SEQ ID NO. 19); CE-spryCGBE (SEQ ID NO. 20); CE-TSBE-1 (SEQ ID NO. 16); CE-TSBE-2 (SEQ ID NO. 21); CE-TSBE-3 (SEQ ID NO. 22); CE-TSBE-V206I (SEQ ID NO. 23); CE-TSBE-R260K (SEQ ID NO. 24); CE(1010)-TSBE-R260K (SEQ ID NO. 25); CE(1029)-TSBE-R260K (SEQ ID NO. 26); and CE(1249)-TSBE-R260K (SEQ ID NO. 27), where each SEQ ID NO. in parentheses gives the amino acid sequence of the named fusion protein.

In a second aspect, the invention provides a nucleic acid molecule comprising a gene encoding a fusion protein according to the first aspect. In the invention, the fusion protein can be prepared by inserting the encoding gene of the fusion protein in the first aspect into an expression vector and introducing the encoding gene into cells for expression.

In a third aspect, the invention provides a kit comprising a base editing system mediated by uracil-N-glycosylase mutant, the base editing system comprising a fusion protein according to the first aspect and a sgRNA. Wherein, the base editing system formed by fusion proteins comprising nuclease and cytosine-N-glycosylase (CDG) is named as 'C to G' editor (CGBE); the base editing system consisting of fusion proteins comprising nuclease, thymine-N-glycosylase (TDG) is named 'T to S' editor (TSBE).

In the invention, a designed fusion protein and sgRNA form a base editing system, the fusion protein can form a complex with the sgRNA, and the sgRNA can guide the fusion protein to recognize and cut a target sequence to generate single base conversion of C-to-G or T-to-S.

In a fourth aspect, the present invention provides the use of a fusion protein according to the first aspect, a nucleic acid molecule according to the second aspect or a base editing system according to the third aspect for the preparation of a gene editing product.

In a fifth aspect, the present invention provides a gene editing kit comprising the base editing system of the third aspect.

In a sixth aspect, the present invention provides a base editing method comprising base editing using the base editing system according to the third aspect.

Compared with the prior art, the invention has the following beneficial effects:

(1) The fusion protein forms a complex with an sgRNA to constitute a base editing system; the sgRNA guides the fusion protein to recognize the target sequence and excise cytosine or thymine on it, producing C-to-G or T-to-S single-base conversions. Notably, thymine could not be edited directly by previous base editing tools;

(2) The base editing system can realize single base conversion (C-to-G, C-to-A, T-to-G and T-to-C) on the genome of a Hela immortalized cell line and escherichia coli, so that the variety of editing products on a specific region of a gene is greatly enriched;

(3) The base editing system provided by the invention has higher editing efficiency than the reported base editor GBE, and can generate transversion mutation with higher editing efficiency on the same target site.

Drawings

FIG. 1 is a schematic diagram of the structure of a fusion protein CGBE and a schematic diagram of the action principle when genome editing occurs, wherein A is a schematic diagram of the structure of a part of the fusion protein CGBE; b is a functional principle diagram of editing fusion protein in genome.

FIG. 2 shows the results of CGBE editing at endogenous gene loci in HeLa cells. A-C show the effect of different fusion modes of CDG and the Cas9 protein on editing efficiency: A is a statistical chart of editing by C-CGBE, N-CGBE and CE-CGBE-1 at the Dicer site; B is a statistical chart of the C-to-G product purity of C-CGBE, N-CGBE and CE-CGBE-1 at three sites; C is a statistical chart of the indels generated by C-CGBE, N-CGBE and CE-CGBE-1 at three sites. D-E show the effect of CDG from different species on editing efficiency: D is a statistical chart of the C-to-G product purity of CE-CGBE-1, CE-CGBE-2 and CE-CGBE-3 at three sites; E is a statistical chart of the indels generated by CE-CGBE-1, CE-CGBE-2 and CE-CGBE-3 at three sites. F and G are statistical charts of editing by CE-spryCGBE at sites with non-NGG PAMs.

FIG. 3 is a graph showing the result of editing CGBE at the E.coli endogenous gene locus, wherein A-H are respectively statistical graphs of editing pTac-CE-CGBE at the E.coli 8 endogenous gene loci.

FIG. 4 shows CGBE editing at an endogenous site in human primary T cells, wherein A is a sequencing chromatogram of CGBE editing at the endogenous VEGFA site in human primary T cells, B is a statistical chart of the second-generation sequencing results, and C is a statistical chart of the indels generated by CGBE.

FIG. 5 shows the results of comparing CGBE with existing GBE editors, wherein A is a schematic diagram of the structures of 4 reported GBEs and CE-CGBE-1; B is a heat map of the editing efficiency of the reported GBEs and CE-CGBE-1 at 8 sites; C is a heat map of the C-to-G product purity of the reported GBEs and CE-CGBE-1 at 8 sites; D is a heat map of the indels produced by the reported GBEs and CE-CGBE-1 at 8 sites; E is a heat map of the C-to-T product purity of the reported GBEs and CE-CGBE-1 at 8 sites.

FIG. 6 is a graph of the detection result of CGBE off-target effect.

FIG. 7 is a schematic diagram of the structure of a fusion protein TSBE and a schematic diagram of the action principle when editing occurs on a genome, wherein A is a schematic diagram of the structure of a part of the fusion protein TSBE; b is a functional principle diagram of editing fusion protein in genome.

FIG. 8 is a graph showing the effect of linker length on TSBE, A is the editing efficiency of TSBE-1, TSBE-2 and TSBE-3 at 2 endogenous sites of Hela cells; b is the purity of the products of TSBE-1, TSBE-2 and TSBE-3 at 2 endogenous sites T to G in HeLa cells.

FIG. 9 is a graph of the detection result of TSBE off-target effect.

FIG. 10 is a schematic diagram of the protein evolution of Artificial Intelligence (AI) assisted TSBE and the results verification, A is a schematic diagram of the protein evolution of Artificial Intelligence (AI) assisted TSBE; b is a comparison of the numbers of mutants screened to be higher in editing efficiency than the wild-type TDG by two different sequences, and C, D is a functional verification result graph of the screened mutants respectively.

FIG. 11 shows the effect of different positions of the TDG2 mutant TDG2 (R260K) inserted into spCas9 on the editing efficiency of TSBE, wherein A is the editing efficiency of TSBE-2, CE (1010) -TSBE-R260K, CE (1029) -TSBE-R260K and CE (1249) -TSBE-R260K at the endogenous site Dicer of Hela cells; b is the editing efficiency of TSBE-2, CE (1010) -TSBE-R260K, CE (1029) -TSBE-R260K and CE (1249) -TSBE-R260K at the endogenous site VEGFA of the HeLa cells.

FIG. 12 shows the results of TSBE editing in db/db heterozygous mouse embryos, wherein A is the proportion of pathogenic base mutations that TSBE can correct; B is a sequencing chromatogram of TSBE editing in db/db heterozygous mouse embryos; C is a statistical chart of the second-generation sequencing results of TSBE editing in db/db heterozygous mouse embryos. P-value: * p < 0.05, ** p < 0.01, *** p < 0.001.

Detailed Description

The technical means adopted by the invention and the effects thereof are further described below with reference to the examples and the attached drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof.

EXAMPLE 1 construction of fusion protein CGBE plasmid

As shown in FIG. 1A, the gene editing tool according to the present invention provides fusion proteins including N-CGBE, C-CGBE, CE-CGBE-1, CE-CGBE-2, CE-CGBE-3, pTac-CE-CGBE and CE-spryCGBE.

The fusion protein N-CGBE is prepared as follows: using pCMV-BE3 (#73021) as the base vector, the human cytosine-N-glycosylase hCDG2 sequence is linked to the N-terminus of SpCas9 (D10A) through the 16-amino-acid connecting peptide XTEN, and a nuclear localization signal (NLS) is linked to the C-terminus of SpCas9 (D10A), to construct the fusion protein N-CGBE expression vector. The structure is shown schematically in FIG. 1A, and the amino acid sequence of the fusion protein N-CGBE is SEQ ID NO. 13.

The fusion protein C-CGBE is prepared as follows: using pCMV-BE3 (#73021) as the base vector, the human cytosine-N-glycosylase hCDG2 sequence is linked to the C-terminus of SpCas9 (D10A) through the 16-amino-acid connecting peptide XTEN, and a nuclear localization signal (NLS) is linked to the C-terminus of hCDG2, to construct the fusion protein C-CGBE expression vector. The structure is shown schematically in FIG. 1A, and the amino acid sequence of the fusion protein C-CGBE is SEQ ID NO. 14.

The fusion protein CE-CGBE-1 is prepared as follows: using pCMV-BE3 (#73021) as the base vector, the human cytosine-N-glycosylase hCDG2 sequence is inserted into the middle of SpCas9 (D10A), replacing the 16 amino acids at positions 1047-1062, and a nuclear localization signal (NLS) is linked to the N-terminus and the C-terminus of SpCas9 (D10A), to construct the fusion protein CE-CGBE-1 expression vector. The structure is shown schematically in FIG. 1A, and the amino acid sequence of the fusion protein CE-CGBE-1 is SEQ ID NO. 15.

The preparation method of the fusion protein CE-CGBE-2 comprises the following steps: the preparation method comprises the steps of taking pCMV-BE3 (# 73021) as a basic vector, inserting a cytosine-N-glycosylase eCDG2 sequence derived from escherichia coli into the middle of SpCas9 (D10A), replacing 16 amino acids at 1047-1062 positions of the sequence, connecting a nuclear localization signal NLS at the N end and the C end of the SpCas9 (D10A), and constructing a fusion protein CE-CGBE-2 expression vector, wherein the structural schematic diagram is shown in figure 1A, and the amino acid sequence of the fusion protein CE-CGBE-2 is SEQ ID NO.17.

The preparation method of the fusion protein CE-CGBE-3 comprises the following steps: the preparation method comprises the steps of taking pCMV-BE3 (# 73021) as a basic vector, inserting a cytosine-N-glycosylase cCDG2 sequence derived from nematodes into the middle of SpCas9 (D10A), replacing 16 amino acids at 1047-1062 positions of the sequence, connecting a nuclear localization signal NLS at the N end and the C end of the SpCas9 (D10A), and constructing a fusion protein CE-CGBE-3 expression vector, wherein the structural schematic diagram is shown in figure 1A, and the sequence of the fusion protein CE-CGBE-3 is SEQ ID NO.18.

The preparation method of the fusion protein pTac-CE-CGBE comprises the following steps: the prokaryotic expression vector pTac_ABE_pSC101_Kana is taken as a basic vector, NLS-spCas9n (1-1046) -hCDG2-spCas9n (1063-1367) is used for replacing the AID-spCas9 (dead) sequence in the original vector, a fusion protein pTac-CE-CGBE expression vector is constructed, the structure diagram is shown in figure 1A, the amino acid sequence of the fusion protein pTac-CE-CGBE is SEQ ID NO.19, and the nucleic acid sequence of pTac_ABE_pSC101_Kana is SEQ ID NO.30.

The fusion protein CE-spryCGBE is prepared as follows: using pCMV-BE3 (#73021) as the base vector, the human cytosine-N-glycosylase hCDG2 sequence is inserted into the middle of SpryCas9 (D10A), replacing the single amino acid at position 1010, and a nuclear localization signal (NLS) is linked to the N-terminus and the C-terminus of SpryCas9 (D10A), to construct the fusion protein CE-spryCGBE expression vector. The structure is shown schematically in FIG. 1A, and the amino acid sequence of the fusion protein CE-spryCGBE is SEQ ID NO. 20.

The principle by which the fusion protein edits the genome is shown in FIG. 1B. The core elements of the fusion protein, CDG2 and a nickase such as SpCas9 (D10A) or SpryCas9 (D10A) that retains only single-strand DNA cleavage activity, are fused through a connecting peptide and form a complex with the sgRNA in the cells to be edited. Guided by the sgRNA, the fusion protein precisely recognizes and binds the genomic DNA complementary to the sgRNA sequence, unwinds the DNA double helix, and cuts the DNA strand complementary to the sgRNA to form a nick. Meanwhile, CDG2 binds the single-stranded DNA region of the R-loop formed by the unwound genomic DNA and the sgRNA and excises cytosine (C) located within the active editing window on that single strand, forming an apurinic/apyrimidinic site (AP site) and initiating base excision repair. Two outcomes are then possible. (1) If the AP site is recognized and cleaved by an AP lyase, or breaks spontaneously, a nick forms on that DNA strand; together with the nearby nick made by nCas9 (D10A) on the complementary strand, it can produce a DNA double-strand break (DSB), which triggers non-homologous end joining (NHEJ), and this error-prone repair causes random insertions and deletions (indels). (2) Alternatively, the AP site undergoes a different error-prone repair in which the missing base is restored, with certain probabilities, as any of the four bases (A, T, C and G); because of the 'C rule', the AP site preferentially pairs with a C base, so AP-site-to-G mutations are generated with higher probability.
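The second repair outcome described above can be illustrated with a small Python sketch. It assumes a deliberately simplified model in which whichever nucleotide is placed opposite the AP site later serves as the template for the edited strand, so the edited position ends up as the complement of the inserted nucleotide. The function name and the model are illustrative assumptions and do not come from the patent text.

```python
# Watson-Crick complements.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def outcome_after_bypass(original, inserted_opposite):
    """Describe the edit left at an AP site that replaced `original`.

    `inserted_opposite` is the nucleotide placed across from the AP site;
    after repair or replication the edited strand carries its complement.
    """
    new_base = COMPLEMENT[inserted_opposite]
    if new_base == original:
        return f"{original} restored"
    return f"{original}>{new_base}"


# Excised cytosine (CGBE case) and excised thymine (TSBE case).
for original in ("C", "T"):
    for inserted in "CATG":
        print(f"{original} excised, {inserted} inserted opposite: "
              f"{outcome_after_bypass(original, inserted)}")
```

In this model, inserting C opposite the AP site (the 'C rule') turns an excised C into G and an excised T into G, while inserting G opposite an excised T gives T>C; together these correspond to the C-to-G and T-to-S (G or C) outcomes named in the disclosure.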

Example 2 editing of fusion protein CGBE at the endogenous site of Hela cells

This example uses N-CGBE, C-CGBE, CE-CGBE-1, CE-CGBE-2, CE-CGBE-3 and CE-spryCGBE for base editing in HeLa cells.

Each of N-CGBE, C-CGBE, CE-CGBE-1, CE-CGBE-2, CE-CGBE-3 and CE-spryCGBE was co-transfected with a site-specific sgRNA (for sgRNA sequences, see Koblan, L.W., Arbab, M., Shen, M.W., et al. Efficient C•G-to-G•C base editors developed using CRISPRi screens, target-library analysis, and machine learning. Nat Biotechnol 39, 1414-1425 (2021). https://doi.org/10.1038/s41587-021-00938-z) at a mass ratio of 2:1 (700 ng:350 ng) into the HeLa cell line (1×10⁵ cells per group) using the transfection reagent PEI (2.5 μg). The negative control group (Mock) received only the site-specific sgRNA, with the amount of the single vector matching that of the experimental groups. Twenty-four hours after transfection, the antibiotics puromycin (1 μg/ml) and blasticidin (20 μg/ml) were added for selection; 5 days after transfection, the cells were collected and lysed to extract the genome. Using the cell lysate as template, the fragment containing the target site was amplified by PCR, and the overall base editing at the site was determined by second-generation sequencing.

The frequency of base conversion at C3 of the endogenous Dicer 1 site in HeLa cells for N-CGBE, C-CGBE and CE-CGBE-1 is shown in FIG. 2A (the abscissa shows the group and the ordinate shows the efficiency of conversion of C to each other base in the second-generation sequencing results); the results are derived from 3 biological replicates. The results show that N-CGBE, C-CGBE and CE-CGBE-1 all produced base conversion at the C3 position of Dicer 1 to different degrees, mainly C to G, and that the editing efficiency of CE-CGBE-1 (33-38%) was much higher than that of N-CGBE (9.8-12%) and C-CGBE (1.1-1.4%).

The base conversion frequencies of N-CGBE, C-CGBE and CE-CGBE-1 at 3 endogenous sites in HeLa cells are shown in FIG. 2B (the abscissa shows the group and the ordinate shows the efficiency of C-to-G conversion in the second-generation sequencing results), and the frequencies of indels generated by N-CGBE, C-CGBE and CE-CGBE-1 at these endogenous sites are shown in FIG. 2C (the abscissa shows the group and the ordinate shows the indel frequency). The results show that, among N-CGBE, C-CGBE and CE-CGBE-1, CE-CGBE-1 produced higher C-to-G editing efficiency (18-30%) with a lower indel frequency (0.84-4.88%), whereas C-CGBE produced lower C-to-G editing efficiency (0.6-7%) with a higher indel frequency (11.85-67.2%).
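The quantities reported in these results can be derived from per-position read counts along the following lines. This is a minimal Python sketch with made-up counts: the text does not spell out its formulas, so the definitions used here (conversion frequency as a percentage of all reads at the position, product purity as the share of edited reads carrying the desired base) are assumptions based on common usage, and the function names are hypothetical. Indel-containing reads are not modeled.

```python
def conversion_frequencies(counts, ref="C"):
    """Percentage of reads carrying each non-reference base at one position."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads at this position")
    return {base: 100.0 * n / total for base, n in counts.items() if base != ref}


def product_purity(counts, product="G", ref="C"):
    """Share (in %) of edited reads that carry the desired product base."""
    edited = sum(n for base, n in counts.items() if base != ref)
    if edited == 0:
        return 0.0
    return 100.0 * counts.get(product, 0) / edited


# Hypothetical read counts at one target cytosine, not data from the patent.
example_counts = {"C": 6500, "G": 3000, "T": 400, "A": 100}

freqs = conversion_frequencies(example_counts)
purity = product_purity(example_counts)
print({base: round(value, 1) for base, value in freqs.items()})
print(round(purity, 1))
```

Keeping the two measures separate matters when comparing editors: a construct can convert many target cytosines overall yet still yield a mixed product if a large share of the edited reads are C-to-T or C-to-A.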

The base conversion frequencies of CE-CGBE-1, CE-CGBE-2 and CE-CGBE-3 at 3 endogenous sites in HeLa cells are shown in FIG. 2D, and the indel frequencies are shown in FIG. 2E. The results show that CE-CGBE-1 produced higher C-to-G editing efficiency than CE-CGBE-2 while producing fewer indels, and that CE-CGBE-3 showed essentially no editing. The hCDG sequence used in CE-CGBE-1 was therefore used in the subsequent experiments.

The base conversion frequencies of CE-spryCGBE at 2 endogenous sites in HeLa cells are shown in FIGS. 2F and 2G; the abscissa shows the position of the edited C base within the sgRNA target (the PAM occupies positions 21-23), and the ordinate shows the efficiency of conversion of C to each other base in the second-generation sequencing results. The results show that, at sites with non-NGG PAMs, CE-spryCGBE edited C bases effectively, mainly producing C-to-G mutations (3-21%).
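The position numbering used on the abscissa (protospacer positions 1-20 counted from the PAM-distal end, with the PAM at 21-23) can be sketched in Python as follows. The protospacer below is a made-up sequence for illustration only and is not one of the target sites used in the examples; the function name is likewise hypothetical.

```python
def cytosine_positions(protospacer):
    """Return the 1-based positions of every C in a 20-nt protospacer.

    Position 1 is the PAM-distal end; the PAM would follow at positions 21-23.
    """
    protospacer = protospacer.upper()
    if len(protospacer) != 20:
        raise ValueError("expected a 20-nt protospacer")
    return [i for i, base in enumerate(protospacer, start=1) if base == "C"]


# Hypothetical protospacer, written 5'->3' with the PAM-distal end first.
example = "GACCTGAAGCTTACGGATCA"
labels = [f"C{pos}" for pos in cytosine_positions(example)]
print(labels)
```

Labels of this form correspond to the naming used in the results, for example C3 of the Dicer 1 site or C6 and C9 of the VEGFA site.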

EXAMPLE 3 editing of fusion protein pTac-CE-CGBE at E.coli endogenous site

This example uses pTac-CE-CGBE for base editing in E.coli.

8 ng of site-specific sgRNA (as above) and 8 ng of pTac-CE-CGBE were added together to 50 μl of BW25113 competent cells. The mixture was kept on ice for 30 min, heat-shocked in a 42 °C water bath for 60 s, and immediately cooled on ice for 2-3 min. After adding 450 μl of LB medium, the cells were recovered at 37 °C and 220 rpm for 1 h. The culture was centrifuged at 2400 × g for 3 min, 400 μl of the supernatant was discarded, and the cells were resuspended in the remaining medium and plated on LB plates (containing 25 μg/ml chloramphenicol and 50 μg/ml kanamycin sulfate). The next day, single colonies were picked and inoculated into 2 ml of liquid LB medium (containing 25 μg/ml chloramphenicol and 50 μg/ml kanamycin sulfate). After 16 h of culture, the cells were pelleted by centrifugation at 1000 × g for 10 min. The pellet was resuspended in 100 μl of alkaline lysis solution (25 mM NaOH, 0.2 mM Na2-EDTA, pH 12), heated at 85 °C for 30 min, and neutralized with 100 μl of neutralization solution (40 mM Tris-HCl, pH 7.5). The supernatant obtained by centrifugation at 1000 × g for 10 min was used as the PCR template. The fragment containing the target site was amplified by PCR, and the overall base editing at the site was determined by second-generation sequencing.

The base conversion frequency of pTac-CE-CGBE at the E.coli endogenous site is shown in FIGS. 3A-3H (the abscissa indicates the group, and the ordinate indicates the efficiency of conversion of C to different bases in the second-generation sequencing result), and it can be seen that: pTac-CE-CGBE effectively generates base conversion at the E.coli endogenous site C base, wherein the base conversion from C to A and C to T is the main, and the total editing efficiency is 1% -80%.

Example 4 editing of fusion protein CGBE at endogenous sites in human primary T cells

Before electroporation, T cells were activated with human T-Activator CD3/CD28 beads (Thermo Fisher Scientific) for 2 days and cultured in T cell medium (containing 100 U/mL IL-2, 2 mM L-glutamine and 2 vol% human AB serum); the CD3/CD28 beads were removed from the cells 1 day before electroporation. For electroporation with the NEON Transfection System, 3.0×10⁵ cells per sample were pelleted by centrifugation at 300 × g for 5 minutes and washed once with DPBS. The T cells were then suspended in 10 μl of NEON buffer R, and 600 ng of CE-CGBE-1 mRNA and 600 ng of sgRNA (synthesized by GenScript) were added to the cell suspension. Control wells used NEON buffer R without any RNA. The electroporation parameters on the NEON Transfection System were 160 V, 10 ms, three pulses. After electroporation, the cells were placed in 500 μL of fresh T cell medium in a 48-well plate. The cells were cultured for 3 days after electroporation, genomic DNA was isolated, the fragment containing the target site was amplified by PCR, and Sanger sequencing and second-generation sequencing were performed.

The base conversion produced by CE-CGBE-1 at the endogenous site in human primary T cells is shown in FIGS. 4A-4C. FIG. 4A shows the Sanger sequencing result: compared with the control group, CE-CGBE-1 produced 30% C to G and 20% C to T at C6 of the VEGFA site. FIG. 4B shows the second-generation sequencing result: CE-CGBE-1 produced 21% C-to-G, 18% C-to-T and 6% C-to-A conversion at C6 of the VEGFA site and 8.8% C-to-T conversion at C9 of the VEGFA site. FIG. 4C shows the indel statistics: CE-CGBE-1 produced 2.5% indels at the VEGFA site.

Example 5 comparison of CGBE with an existing GBE editor

The previously reported GBEs used in this example (#163543, #140256, #163565 and #163546) were all purchased from Addgene; their structures are shown schematically in FIG. 5A. CE-CGBE-1 and the reported GBEs were used for base editing in HeLa cells.

CE-CGBE-1, #163543, #140256, #163565 or #163546 was co-transfected with the site-specific sgRNA (supra) into the HeLa cell line (1×10⁵ cells per group) at a mass ratio of 2:1 (700 ng:350 ng) using the transfection reagent PEI (2.5 μg). The negative control group received only the site-specific sgRNA, with the amount of the single vector matching that used in the experimental groups. The antibiotic puromycin (1 μg/mL) was added for selection 24 hours after transfection, and 5 days after transfection the cells were collected and lysed to extract the genome. Using the cell lysate as template, the target fragment containing the target site was amplified by PCR, and the overall base editing at the site was determined by second-generation sequencing.
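The quantification step described above (base conversion frequency at one target position, and the purity of one product) can be sketched as follows. This is only a minimal illustration: it assumes the reads have already been aligned to the amplicon and trimmed to the length of the reference, and it assumes that "purity" means the share of one product among all edited reads, which the text does not define explicitly. The function names and the toy reads are hypothetical and are not part of the original analysis pipeline.

```python
from collections import Counter


def base_conversion_frequency(reads, reference, position):
    """Percentage of each non-reference base at a 0-based position.

    Assumes every read is already aligned and has the same length as
    `reference` (a simplification; real amplicon data need an aligner).
    """
    ref_base = reference[position]
    usable = [read for read in reads if len(read) == len(reference)]
    counts = Counter(read[position] for read in usable)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        f"{ref_base}>{base}": 100.0 * n / total
        for base, n in counts.items()
        if base != ref_base and base in "ACGT"
    }


def product_purity(frequencies, product):
    """Share (%) of one product among all edited reads (assumed definition)."""
    edited = sum(frequencies.values())
    return 100.0 * frequencies.get(product, 0.0) / edited if edited else 0.0


# Toy data: 10 reads covering a hypothetical 10-nt amplicon, target C at index 5.
reference = "GACCCCCTCC"
reads = ["GACCCGCTCC"] * 3 + ["GACCCTCTCC"] * 2 + ["GACCCCCTCC"] * 5
freqs = base_conversion_frequency(reads, reference, 5)
purity_c_to_g = product_purity(freqs, "C>G")
print(freqs, purity_c_to_g)
```

In this toy set, 3 of 10 reads carry G and 2 carry T at the target position, so C-to-G accounts for 3 of the 5 edited reads.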

The base conversion frequencies of CE-CGBE-1, #163543, #140256, #163565 and #163546 at 8 endogenous sites of HeLa cells are shown in FIG. 5B, the purity of the C-to-G product in FIG. 5C, the indel frequency in FIG. 5D, and the purity of the C-to-T product in FIG. 5E. In each heat map, rows represent the editing sites and columns represent the groups; the darker the cell, the larger the value. The results show that, compared with the reported GBEs (#163543, #140256, #163565 and #163546), the editing efficiency of CE-CGBE-1 is higher at sites Dicer 1, #12 and #18 and equal at sites VEGFA, #1, #2 and #11; the purity of the C-to-G product generated by CE-CGBE-1 at each site is comparable to that of the reported GBEs; CE-CGBE-1 produces indels at multiple sites more frequently than the reported GBEs; and the purity of the C-to-T product generated by CE-CGBE-1 at sites Dicer 1 and #18 is lower than that of the reported GBEs.

Example 6 Detection of off-target effects of fusion protein CGBE

Cas-OFFinder (CRISPR RGEN Tools, rgenome.net) was used to predict potential off-target sites of Cas9 RNA-guided endonucleases, and the top 10 off-target sites were selected for validation. HeLa cells were transfected with CE-CGBE-1; the negative control group received only the site-specific sgRNA, with the amount of the single vector matching that used in the experimental group. The antibiotics puromycin (1 μg/mL) and blasticidin (20 μg/mL) were added for selection 24 hours after transfection, and the cells were collected and lysed 5 days after transfection. Using the cell lysate as template, the sequences of the 10 potential off-target sites for each target site were amplified by PCR, and the overall off-target editing at each site was determined by second-generation sequencing.
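The idea behind ranking candidate off-target sites can be illustrated with a deliberately simplified mismatch scan. This sketch is not Cas-OFFinder's implementation: it searches only the forward strand, requires an NGG PAM, ignores bulges, and ranks by mismatch count alone. The function name, the mismatch budget and the toy sequences are assumptions made for illustration.

```python
def find_candidate_off_targets(genome, protospacer, max_mismatches=3, top=10):
    """List NGG-flanked sites on the forward strand within a mismatch budget.

    Returns up to `top` tuples (start, site, pam, mismatches), sorted by
    mismatch count; a hit with 0 mismatches is the on-target site itself.
    """
    genome = genome.upper()
    protospacer = protospacer.upper()
    n = len(protospacer)
    hits = []
    for start in range(len(genome) - n - 2):
        site = genome[start:start + n]
        pam = genome[start + n:start + n + 3]
        if pam[1:] != "GG":  # simplified PAM check: N followed by GG
            continue
        mismatches = sum(a != b for a, b in zip(site, protospacer))
        if mismatches <= max_mismatches:
            hits.append((start, site, pam, mismatches))
    hits.sort(key=lambda hit: (hit[3], hit[0]))
    return hits[:top]


# Toy "genome": the on-target site plus one site with two mismatches.
protospacer = "ACGTACGTACGTACGTACGT"   # hypothetical 20-nt protospacer
variant = "ACGTACGTTCGTACGTACGA"       # differs at positions 9 and 20
genome = "TT" + protospacer + "CGG" + "AAAA" + variant + "TGG" + "CC"
hits = find_candidate_off_targets(genome, protospacer)
for hit in hits:
    print(hit)
```

Sites returned by such a scan would still need experimental validation by amplicon sequencing, as described in this example.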

The off-target effects of CE-CGBE-1 at HeLa cell endogenous sites #1, #2 and Dicer are shown in FIG. 6, with the abscissa representing each off-target site and the ordinate representing the percentage of base editing. The results show that the off-target editing of CE-CGBE-1 at HeLa cell endogenous sites #1, #2 and Dicer is very low, below 0.25%.

Example 7 Construction of fusion protein TSBE plasmids

As shown in FIG. 7A, the present embodiment provides fusion proteins CE-TSBE-1, CE-TSBE-2, CE-TSBE-3, CE-TSBE-V206I and CE-TSBE-R260K.

The fusion protein CE-TSBE-1 was prepared as follows: using pCMV-BE3 (#73021) as the base vector, a human-derived TDG2 sequence was inserted into the middle of SpCas9 (D10A), replacing the 16 amino acids at positions 1047-1062 of SpCas9 (D10A), and a nuclear localization signal (NLS) was attached to both the N-terminus and the C-terminus of SpCas9 (D10A), to construct the CE-TSBE-1 expression vector. The structure is shown schematically in FIG. 7A, and the amino acid sequence of CE-TSBE-1 is SEQ ID NO. 16.

The fusion protein CE-TSBE-2 was prepared as follows: using CE-CGBE-1 as the base vector, a human-derived TDG2 sequence was inserted into the middle of SpCas9 (D10A), replacing the 16 amino acids at positions 1047-1062; a 5-amino-acid (5AA) linker was attached to both the N-terminus and the C-terminus of TDG2, and an NLS was attached to both the N-terminus and the C-terminus of SpCas9 (D10A), to construct the CE-TSBE-2 expression vector. The structure is shown schematically in FIG. 7A, and the amino acid sequence of CE-TSBE-2 is SEQ ID NO. 21.

The fusion protein CE-TSBE-3 was prepared as follows: using pCMV-BE3 (#73021) as the base vector, a human-derived TDG2 sequence was inserted into the middle of SpCas9 (D10A), replacing the 16 amino acids at positions 1047-1062; a 10-amino-acid (10AA) linker was attached to both the N-terminus and the C-terminus of TDG2, and an NLS was attached to both the N-terminus and the C-terminus of SpCas9 (D10A), to construct the CE-TSBE-3 expression vector. The structure is shown schematically in FIG. 7A, and the amino acid sequence of CE-TSBE-3 is SEQ ID NO. 22.

The fusion protein CE-TSBE-V206I was prepared as follows: using pCMV-BE3 (#73021) as the base vector, a human-derived TDG2 (V206I) sequence was inserted into the middle of SpCas9 (D10A), replacing the 16 amino acids at positions 1047-1062; a 5AA linker was attached to both the N-terminus and the C-terminus of TDG2, and an NLS was attached to both the N-terminus and the C-terminus of SpCas9 (D10A), to construct the CE-TSBE-V206I expression vector. The structure is shown schematically in FIG. 7A, and the amino acid sequence of CE-TSBE-V206I is SEQ ID NO. 23.

The fusion protein CE-TSBE-R260K was prepared as follows: using pCMV-BE3 (#73021) as the base vector, a human-derived TDG2 (R260K) sequence was inserted into the middle of SpCas9 (D10A), replacing the 16 amino acids at positions 1047-1062; a 5AA linker was attached to both the N-terminus and the C-terminus of TDG2, and an NLS was attached to both the N-terminus and the C-terminus of SpCas9 (D10A), to construct the CE-TSBE-R260K expression vector. The structure is shown schematically in FIG. 7A, and the amino acid sequence of CE-TSBE-R260K is SEQ ID NO. 24.

The fusion proteins CE(1010)-TSBE-R260K, CE(1029)-TSBE-R260K and CE(1249)-TSBE-R260K were prepared in the same way, except that the human-derived TDG2 (R260K) sequence was inserted into the middle of SpCas9 (D10A) so as to replace amino acids at different positions; their amino acid sequences are shown in SEQ ID NO. 25-27.

The principle by which the fusion protein edits the genome is shown in FIG. 7B. The core element of the fusion protein, TDG2, is fused via a linker peptide to SpCas9 (D10A), SpRY Cas9 (D10A) or a similar enzyme that retains only single-stranded DNA cleavage activity, and the fusion protein forms a complex with the sgRNA in the edited cell. Under the guidance of the sgRNA, the fusion protein precisely recognizes and binds the genomic DNA complementary to the sgRNA sequence, unwinds the DNA double helix and cuts the DNA strand complementary to the sgRNA to form a nick. Meanwhile, TDG2 binds the single-stranded DNA of the R-loop formed by the unwound genomic DNA and the sgRNA, and excises thymine (T) located within the active editing window on that single strand, leaving an apurinic/apyrimidinic (AP) site and thereby initiating base excision repair. Two outcomes are then possible: (1) if the AP site is recognized and cleaved by an AP lyase, or breaks spontaneously, a nick forms on that DNA strand; together with the nick generated by nCas9 (D10A) at a nearby position on the complementary strand, this can give rise to a DNA double-strand break (DSB), which triggers non-homologous end joining, an error-prone repair of the broken ends that introduces random mutations; (2) the AP site can also be repaired by another error-prone pathway in which the missing base is replaced, with certain probabilities, by any of the four bases (A, T, C and G); because of the so-called "C rule", the AP site pairs preferentially with a C base, so conversion of the AP site to a G base is the most probable outcome.

Example 8 editing of fusion protein TSBE at the endogenous site of Hela cells

This example uses CE-TSBE-1, CE-TSBE-2 and CE-TSBE-3 for base editing in HeLa cells.

CE-TSBE-1, CE-TSBE-2 or CE-TSBE-3 was co-transfected with the site-specific sgRNA (supra) into the HeLa cell line (1×10⁵ cells per group) at a mass ratio of 2:1 (700 ng:350 ng) using the transfection reagent PEI (2.5 μg). The negative control group received only the site-specific sgRNA, with the amount of the single vector matching that used in the experimental groups. The antibiotics puromycin (1 μg/mL) and blasticidin (20 μg/mL) were added for selection 24 hours after transfection, and 5 days after transfection the cells were collected and lysed to extract the genome. Using the cell lysate as template, the target fragment containing the target site was amplified by PCR, and the overall base editing at the site was determined by second-generation sequencing.

The editing efficiency of CE-TSBE-1, CE-TSBE-2 and CE-TSBE-3 at 2 endogenous sites of HeLa cells is shown in FIG. 8A; the abscissa indicates the position of each edited T base in the sgRNA target sequence (with the PAM at positions 21-23), and the ordinate indicates the overall efficiency of T base mutation in the second-generation sequencing results. The purity of the T-to-G product generated by CE-TSBE-1, CE-TSBE-2 and CE-TSBE-3 at the 2 endogenous sites of HeLa cells is shown in FIG. 8B; the abscissa indicates the position of each edited T base (PAM at positions 21-23), and the ordinate indicates the purity of the T-to-G product. The experimental results are from 3 biological replicates. The results show that the editing efficiencies of CE-TSBE-2 at positions T4, T5, T6 and T8 of Dicer site 1 (T4: 3.8-5.8%; T5: 13-16.8%; T6: 14.2-17.4%; T8: 4.7-5.5%) are all higher than those of CE-TSBE-1 (T4: 1.4-2.1%; T5: 4.7-4.9%; T6: 2.2-2.6%; T8: 0.5-1.9%) and of CE-TSBE-3 (T4: 1-1.7%; T5: 4-7.9%; T6: 4.2-5.6%; T8: 2.8-4.7%), with no obvious difference in the purity of the T-to-G product among the three fusion proteins. Likewise, the editing efficiencies of CE-TSBE-2 at positions T-2, T3 and T5 of the VEGFA site (T-2: 7-9.4%; T3: 6.1-7.6%; T5: 22.9-28.3%) are all higher than those of CE-TSBE-1 (T-2: 5.9%; T3: 3.5-3.7%; T5: 4-4.5%) and of CE-TSBE-3 (T-2: 0.5-1.1%; T3: 2.1-3.4%; T5: 10.3-13.9%), again with no obvious difference in the purity of the T-to-G product among the three fusion proteins.
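The position labels used above (T4, T5, T6, T8 and so on) follow from numbering the 20-nt target sequence from its PAM-distal end, so that the PAM occupies positions 21-23. A minimal sketch of that labelling is given below; the function name and the example sequence are hypothetical (the sequence is not the actual Dicer 1 sgRNA), and labels for bases upstream of position 1, such as T-2, are left out because the text does not spell out how they are counted.

```python
def label_target_bases(protospacer, base="T"):
    """Label each occurrence of `base` by its 1-based protospacer position.

    Position 1 is the PAM-distal end of the 20-nt protospacer, so the PAM
    would occupy positions 21-23.
    """
    protospacer = protospacer.upper()
    if len(protospacer) != 20:
        raise ValueError("expected a 20-nt protospacer")
    return [
        f"{base}{index}"
        for index, nucleotide in enumerate(protospacer, start=1)
        if nucleotide == base
    ]


# Hypothetical protospacer chosen so that T occurs at positions 4, 5, 6 and 8.
example_protospacer = "GCATTTGTACCAGACGAACG"
labels = label_target_bases(example_protospacer)
print(labels)
```

The same helper can be called with `base="C"` to label candidate C positions for the CGBE examples.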

Example 9 Detection of off-target effects of fusion protein TSBE

Cas-OFFinder (CRISPR RGEN Tools, rgenome.net) was used to predict potential off-target sites of Cas9 RNA-guided endonucleases, and the top 10 off-target sites were selected for validation. HeLa cells were transfected with CE-TSBE-2; the negative control group received only the site-specific sgRNA, with the amount of the single vector matching that used in the experimental group. The antibiotics puromycin (1 μg/mL) and blasticidin (20 μg/mL) were added for selection 24 hours after transfection, and the cells were collected and lysed 5 days after transfection. Using the cell lysate as template, the sequences of the 10 potential off-target sites for each target site were amplified by PCR, and the overall off-target editing at each site was determined by second-generation sequencing.

The off-target effects of CE-TSBE-1 at HeLa cell endogenous sites #1, VEGFA and Dicer are shown in FIG. 9, with the abscissa representing each off-target site and the ordinate representing the percentage of base editing. The results show that the off-target editing of CE-TSBE-1 at HeLa cell endogenous sites #1, VEGFA and Dicer is very low and not obviously different from that of the negative control group.

Example 10 Artificial intelligence (AI)-assisted protein evolution and mutation validation of TSBE

Because of the broad similarity between human language and the "language" of proteins, several recent studies have applied large language models (LLMs) to protein research, including directed protein evolution, evolutionary dynamics and the evolution of human antibodies. This example systematically benchmarks 17 different pre-trained protein LLMs using 9 different ranking strategies. The goal is to identify the best combination by in silico evaluation and then apply it to the evolution of the actual TDG protein. The LLM-based TSBE protein evolution workflow is shown in FIG. 10A:

Stage 1: Pretraining of the large language models (LLMs) was based on ESM (evolutionary scale modeling) and covered multiple data sets, including UniProt, UniRef, UniRef100 and Swiss-Prot. Of the 17 LLM candidates initially selected, esm2_t33_650M_UR50D was finally chosen to evaluate mutation effects on the basis of its stability and high performance.

Stage 2: (1) The analysis began by identifying a range of potential mutation sites in the wild-type sequence. These sites were then input into the LLM through masked language modeling, and the output of this step was the amino acid distribution at each candidate site. (2) Nine different ranking strategies were used to evaluate the quality of each amino acid distribution. The effectiveness of these ranking strategies was assessed with a Top-N Rate index, which measures the proportion of variants with higher fitness than the wild type among the top N ranked variants. The wild-type marginal probability strategy was finally chosen because it performed better. (3) A regression model was fitted using the data obtained in stage 2. Every possible variant at the selected sites was generated by mutating the given wild-type sequence, and these variants were input into the regression model to obtain their respective ranks. The results were then evaluated with the Top-N Rate index and used to refine the regression model.
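The scoring and evaluation steps of stage 2 can be sketched in a few lines. This is an illustrative simplification rather than the workflow actually used: the per-site amino acid probabilities are toy numbers standing in for language-model output, the wild-type marginal score is assumed to be the log-ratio of mutant to wild-type probability at a site, and the fitness values are invented for the example. All function and variable names are hypothetical.

```python
import math


def wildtype_marginal_scores(wild_type, site_probs):
    """Score each substitution as log p(mutant) - log p(wild type).

    `site_probs` maps a 1-based position to the amino acid probabilities
    predicted for that position (toy numbers here; in practice they would
    come from a protein language model). Higher scores rank first.
    """
    scores = {}
    for position, probs in site_probs.items():
        wt_residue = wild_type[position - 1]
        for residue, prob in probs.items():
            if residue != wt_residue and prob > 0:
                name = f"{wt_residue}{position}{residue}"
                scores[name] = math.log(prob) - math.log(probs[wt_residue])
    return scores


def top_n_rate(ranked_variants, fitness, wild_type_fitness, n):
    """Fraction of the top-N ranked variants that outperform the wild type."""
    top = ranked_variants[:n]
    better = sum(1 for variant in top if fitness[variant] > wild_type_fitness)
    return better / len(top) if top else 0.0


# Toy example: a 4-residue "protein" with two candidate sites.
wild_type = "MKVR"
site_probs = {
    3: {"V": 0.5, "I": 0.3, "A": 0.2},
    4: {"R": 0.4, "K": 0.5, "Q": 0.1},
}
scores = wildtype_marginal_scores(wild_type, site_probs)
ranked = sorted(scores, key=scores.get, reverse=True)

# Invented fitness values (wild type = 1.0) used only to exercise the metric.
fitness = {"R4K": 1.5, "V3I": 1.2, "V3A": 0.8, "R4Q": 0.4}
rate_top2 = top_n_rate(ranked, fitness, 1.0, 2)
rate_top4 = top_n_rate(ranked, fitness, 1.0, 4)
print(ranked, rate_top2, rate_top4)
```

A ranking strategy with a higher Top-N Rate places more beneficial variants near the top of the list, which is the criterion the text uses to compare strategies.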

Stage 3: After the regression model was completed, ranking was carried out both on the full-length TDG sequence and on the semi-non-conserved sequence of the TDG catalytic region. From these, 38 mutants from the top 50 of the full-length ranking and 65 mutants from the top 100 of the catalytic-region ranking were selected (the single amino acid mutations are listed in Table 1) for plasmid construction and functional verification. The mutant plasmids were prepared as follows: using CE-CGBE-2 as the base vector, a human-derived TDG2 mutant sequence (carrying a single amino acid mutation, introduced as a point mutation by the PCR primers) was inserted into the middle of SpCas9 (D10A), replacing the 16 amino acids at positions 1047-1062; a 5AA linker was attached to both the N-terminus and the C-terminus of the TDG2 mutant sequence, and a nuclear localization signal (NLS) was attached to both the N-terminus and the C-terminus of SpCas9 (D10A), to construct the fusion protein mutant expression vector. The structure is shown schematically in FIG. 7A. For example, among the fusion protein mutants, the amino acid sequence of CE-TSBE-V206I is SEQ ID NO. 23, and the amino acid sequence of CE-TSBE-R260K is SEQ ID NO. 24.

TABLE 1

Variants from full-length ranking	Variants from CD domain ranking	Two-variant combinations
V274A,H92A,P43R,I103Q	P165H,P165K,P165S,P165F	V206I/R260K
P43K,I103K,D183G,H11A	P165V,P165N,P165A,P165Q	V206I/E219D
H92L,H11P,E182P,P122E	P165R,P165T,V206I,R260K	V206I/G107E
H11K,H92V,L74A,V185K	R260T,V206C,L233I,R260H	R260K/G107E
I103A,I37S,I103S,H283P	C132T,C132V,P165L,L201M	L74Q/H92E
H92Q,I37T,I103R,S55G	P165M,P165E,P165D,P165I	L74Q/I37T
E219D,G107E,V185Q,I103E	P165G,L201V,P165C,L201C	H92E/I37T
D63A,G107E,H11S,H283I	L201Y,S270T,S270C,R260Q	G107E/L74Q
N238S,K297E,H92E,L74Q	R260L,R260N,R260A,R260C	G107E/H92E
H11R,I103G	R260I,R260E,R260T,R260S	G107E/I37T
	R260M,R260D,R260Y,C132S	R260K/L74Q
	C132A,C132I,C132L,H154N	R260K/H92E
	H154Q,H154M,V206A,V206L	
	V206N,V206M,V206G,L233V	
	L233C,W245Y,P165Y,L201I	
	W245L,S270N,P271L,R260V	
	V206S	

Each TSBE mutant plasmid was co-transfected with the site-specific sgRNA (supra) into the HeLa cell line (1×10⁵ cells per group) at a mass ratio of 2:1 (700 ng:350 ng) using the transfection reagent PEI (2.5 μg). The control group received TSBE-2 and the site-specific sgRNA, with the amount of the single vector matching that used in the experimental groups. The antibiotics puromycin (1 μg/mL) and blasticidin (20 μg/mL) were added for selection 24 hours after transfection, and 5 days after transfection the cells were collected and lysed to extract the genome. Using the cell lysate as template, the target fragment containing the target site was amplified by PCR, and the overall base editing at the site was determined by second-generation sequencing.

FIG. 10B compares the number of mutants with higher editing efficiency than wild-type TDG obtained by the two ranking approaches: of the 38 mutants from the top 50 of the TDG full-length ranking, 33 had higher editing efficiency than wild-type TDG, whereas of the 65 mutants from the top 100 of the ranking based on the semi-non-conserved sequence of the TDG catalytic region, 17 had higher editing activity than wild-type TDG. FIG. 10C shows the functional verification results for the mutants (38+65); the abscissa indicates the total base editing efficiency and the ordinate indicates the T-to-G editing efficiency. Mutants with editing activity 1.5 times that of the wild type include R260K, G107E and L74Q.

The amino acid mutations that improved editing efficiency were combined pairwise (see Table 1), and double amino acid mutants were constructed and functionally verified. The mutant plasmids were prepared as follows: using CE-CGBE-2 as the base vector, a human-derived TDG2 mutant sequence (carrying two amino acid mutations, introduced as point mutations by the PCR primers) was inserted into the middle of SpCas9 (D10A), replacing the 16 amino acids at positions 1047-1062; a 5AA linker was attached to both the N-terminus and the C-terminus of the TDG2 mutant sequence, and a nuclear localization signal (NLS) was attached to both the N-terminus and the C-terminus of SpCas9 (D10A), to construct the fusion protein double mutant expression vector. Each TSBE double mutant plasmid was co-transfected with the site-specific sgRNA (supra) into the HeLa cell line (1×10⁵ cells per group) at a mass ratio of 2:1 (700 ng:350 ng) using the transfection reagent PEI (2.5 μg). The control group received TSBE-2 and the site-specific sgRNA, with the amount of the single vector matching that used in the experimental groups. The antibiotics puromycin (1 μg/mL) and blasticidin (20 μg/mL) were added for selection 24 hours after transfection, and 5 days after transfection the cells were collected and lysed to extract the genome. Using the cell lysate as template, the target fragment containing the target site was amplified by PCR, and the overall base editing at the site was determined by second-generation sequencing.
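The pairwise combination of beneficial single mutations, and the way a label such as V206I/R260K maps onto a protein sequence, can be sketched as follows. This is a small illustration under stated assumptions: mutation labels use the form wild-type residue, 1-based position, new residue; pairs at the same position are skipped; and the short test sequence is a toy string, not the TDG2 sequence. The helper names are hypothetical.

```python
import re
from itertools import combinations

MUTATION_PATTERN = re.compile(r"([A-Z])(\d+)([A-Z])")


def apply_mutations(sequence, label):
    """Apply substitutions such as 'V206I/R260K' (1-based) to a sequence.

    Raises ValueError if a label is malformed or if the stated wild-type
    residue does not match the sequence at that position.
    """
    residues = list(sequence)
    for mutation in label.split("/"):
        match = MUTATION_PATTERN.fullmatch(mutation)
        if match is None:
            raise ValueError(f"malformed mutation label: {mutation}")
        wt_residue, position, new_residue = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues) or residues[position - 1] != wt_residue:
            raise ValueError(f"{mutation} does not match the sequence")
        residues[position - 1] = new_residue
    return "".join(residues)


def pairwise_combinations(single_mutations):
    """All pairs of single mutations that affect different positions."""
    return [
        f"{first}/{second}"
        for first, second in combinations(single_mutations, 2)
        if first[1:-1] != second[1:-1]
    ]


# Three of the improved single mutations named in the text.
pairs = pairwise_combinations(["R260K", "G107E", "L74Q"])
print(pairs)

# Toy sequence only, to show how a double-mutation label is applied.
mutated = apply_mutations("MKVLR", "K2R/L4I")
print(mutated)
```

Checking the stated wild-type residue before substituting catches numbering mistakes, which are easy to make when primers for point mutations are designed from a label list.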

The functional verification results for the double amino acid mutants are shown in FIG. 10D; the abscissa represents the total base editing efficiency and the ordinate represents the T-to-G editing efficiency. The double amino acid mutants with further improved editing activity are G107E/R260K, L74Q/G107E, H92E/G107E and I37T/G107E; the editing activity of G107E/R260K is 2.3 times that of the wild type, and the editing activity of L74Q/G107E is 2.6 times that of the wild type.

Example 11 Effect of inserting TDG2 (R260K) at different positions of SpCas9 on TSBE editing efficiency

TSBE-2, CE(1010)-TSBE-R260K, CE(1029)-TSBE-R260K or CE(1249)-TSBE-R260K was co-transfected with the site-specific sgRNA (supra) into the HeLa cell line (1×10⁵ cells per group) at a mass ratio of 2:1 (700 ng:350 ng) using the transfection reagent PEI (2.5 μg). The negative control group received only the site-specific sgRNA, with the amount of the single vector matching that used in the experimental groups. The antibiotics puromycin (1 μg/mL) and blasticidin (20 μg/mL) were added for selection 24 hours after transfection, and 5 days after transfection the cells were collected and lysed to extract the genome. Using the cell lysate as template, the target fragment containing the target site was amplified by PCR, and the overall base editing at the site was determined by second-generation sequencing.

The base conversion frequencies of TSBE-2, CE(1010)-TSBE-R260K, CE(1029)-TSBE-R260K and CE(1249)-TSBE-R260K at the endogenous Dicer1 site of HeLa cells are shown in FIG. 11A; the abscissa indicates the position of the T base, and the ordinate indicates the percentage of base editing. The results show that CE(1010)-TSBE-R260K has an editing efficiency of 28.8% at T8, well above the 8.75% obtained with TSBE-2 at T8; the editing efficiency of CE(1029)-TSBE-R260K and CE(1249)-TSBE-R260K at T5 is lower than that of TSBE-2 but still higher than that of TSBE-2 at T6, and the editing windows of CE(1029)-TSBE-R260K and CE(1249)-TSBE-R260K are narrower, making them better suited to editing T5. The base conversion frequencies of TSBE-2, CE(1010)-TSBE-R260K, CE(1029)-TSBE-R260K and CE(1249)-TSBE-R260K at the endogenous VEGFA site of HeLa cells are shown in FIG. 11B: the editing efficiency of CE(1010)-TSBE-R260K at T-1, T5 and T7 is higher than that of TSBE-2; the editing efficiency of CE(1029)-TSBE-R260K and CE(1249)-TSBE-R260K at T5 is equal to that of TSBE-2, and their editing windows are narrower, again making them better suited to editing T5. In summary, inserting TDG2 (R260K) at different positions of SpCas9 affects both the editing efficiency and the editing window of TSBE.

Example 12 editing of fusion protein TSBE in db/db heterozygous mouse embryo

db/db mice and C57BL/6J mice were purchased from Southern Model Organisms and the Laboratory Animal Resources Center of Westlake University, respectively. The mice were kept in a specific-pathogen-free facility with adequate food and drinking water under a 12-hour light/12-hour dark cycle. All animal experiments complied with the regulations drafted by the Hangzhou laboratory animal evaluation and certification society and were approved by the Laboratory Animal Resources Center of Westlake University. db/db homozygous male mice and C57BL/6J female mice were used as embryo donors to obtain db/db heterozygous embryos. The chemically modified sgRNA was synthesized by GenScript, and its sequence is shown in SEQ ID NO. 28. Using the CE-TSBE-R260K plasmid as template and a primer carrying the T7 promoter sequence, the T7-CE-TSBE-R260K DNA fragment was obtained by PCR amplification and then purified by the phenol-chloroform method for use as the in vitro transcription template; the sequence of this DNA fragment is shown in SEQ ID NO. 29. CE-TSBE-R260K mRNA was transcribed with an in vitro RNA transcription kit (mMESSAGE mMACHINE T7 Ultra Kit, Ambion). A mixture containing the CE-TSBE-R260K mRNA (100 ng·μL⁻¹) and the sgRNA (60 ng·μL⁻¹) was injected into the cytoplasm of db/db heterozygous embryos. After 3.5 days of culture, the cells were lysed and the genome was extracted, the target fragment containing the target site was amplified by PCR, and the overall base editing at the site was determined by Sanger sequencing and second-generation sequencing.

FIG. 12A shows that TSBE can in theory edit and repair 64% of single-base pathogenic mutations. The editing of db/db heterozygous mouse embryos by CE-TSBE-R260K is shown in FIGS. 12B-12C: in unedited db/db heterozygous embryos, the proportion of G at the G-to-T mutation site is 47.8-49%, whereas in successfully edited embryos the proportion of G at this site is 60-78%, an increase of 12.2-29 percentage points, with no other by-products.

Claims

1. A fusion protein, characterized in that the fusion protein comprises a nuclease and a uracil-N-glycosylase mutant, wherein the uracil-N-glycosylase mutant is linked to the nuclease or inserted into the nuclease, and the nuclease is a SpCas9 protein with the D10A mutation, a SpRY Cas9 protein with the D10A mutation, or another Cas enzyme in which nuclease activity is deleted and helicase activity is retained; the uracil-N-glycosylase mutant is a cytosine-N-glycosylase or a thymine-N-glycosylase.

2. The fusion protein of claim 1, wherein the cytosine-N-glycosylase is a human-derived uracil-N-glycosylase mutant hCDG, an Escherichia coli-derived uracil-N-glycosylase mutant eCDG or a nematode-derived uracil-N-glycosylase mutant cCDG; the thymine-N-glycosylase is a human-derived uracil-N-glycosylase mutant hTDG, an Escherichia coli-derived uracil-N-glycosylase mutant eTDG or a nematode-derived uracil-N-glycosylase mutant cTDG.

3. The fusion protein of claim 1, wherein the amino acid sequence of hCDG is shown in SEQ ID No. 1; the amino acid sequence of hTDG is shown as SEQ ID NO.2, SEQ ID NO.7 or SEQ ID NO. 8; the amino acid sequence of eCDG is shown as SEQ ID NO. 3; the amino acid sequence of eTDG is shown as SEQ ID NO. 4; the amino acid sequence of cCDG is shown as SEQ ID NO. 5; the amino acid sequence of cTDG is shown in SEQ ID NO. 6.

4. The fusion protein of claim 1, wherein the fusion protein further comprises a nuclear localization signal; the nuclear localization signal is fused to the N-terminus and/or the C-terminus of the fusion protein.

5. The fusion protein of claim 1 or 4, further comprising a linker peptide for linking the nuclease, uracil-N-glycosylase mutants;

and/or a linker peptide for linking the fusion protein and the nuclear localization signal.

6. The fusion protein of any one of claims 1-5, wherein the fusion protein comprises:

N-CGBE, the amino acid sequence of which is shown in SEQ ID NO. 13;

C-CGBE, the amino acid sequence of which is shown in SEQ ID NO. 14;

CE-CGBE-1, the amino acid sequence of which is shown as SEQ ID NO. 15;

CE-CGBE-2, the amino acid sequence of which is shown as SEQ ID NO. 17;

CE-CGBE-3, the amino acid sequence of which is shown as SEQ ID NO. 18;

pTac-CE-CGBE, the amino acid sequence of which is shown in SEQ ID NO. 19;

CE-sprycbe with an amino acid sequence shown as SEQ ID NO. 20;

CE-TSBE-1, the amino acid sequence of which is shown as SEQ ID NO. 16;

CE-TSBE-2, the amino acid sequence of which is shown in SEQ ID NO. 21;

CE-TSBE-3, the amino acid sequence of which is shown as SEQ ID NO. 22;

CE-TSBE-V206I, the amino acid sequence of which is shown in SEQ ID NO. 23;

CE-TSBE-R260K, the amino acid sequence of which is shown in SEQ ID NO. 24;

CE(1010)-TSBE-R260K, the amino acid sequence of which is shown in SEQ ID NO. 25;

CE(1029)-TSBE-R260K, the amino acid sequence of which is shown in SEQ ID NO. 26;

CE(1249)-TSBE-R260K, the amino acid sequence of which is shown in SEQ ID NO. 27.

7. A nucleic acid molecule comprising a gene encoding the fusion protein of any one of claims 1-6.

8. A base editing system comprising a mutant uracil-N-glycosylase, wherein the base editing system comprises the fusion protein of any of claims 1-6 and an sgRNA.

9. Use of the fusion protein of any one of claims 1-6, the nucleic acid molecule of claim 7 or the base editing system of claim 8 in the preparation of a gene editing product.

10. A gene editing kit comprising the base editing system of claim 8.