WO2019222537A1

WO2019222537A1 - Catalytically hyperactive variants of human APOBEC3G protein

Info

Publication number: WO2019222537A1
Application number: PCT/US2019/032720
Authority: WO
Inventors: Hiroshi Matsuo; Atanu MAITI
Original assignee: The United States Of America, As Represented By The Secretary, Department Of Health And Human Services
Priority date: 2018-05-18
Filing date: 2019-05-16
Publication date: 2019-11-21

Abstract

Modified apolipoprotein B mRNA editing enzyme, catalytic polypeptide 3G (APOBEC3G) polypeptides having one or more amino acid substitutions that increase catalytic activity are described. The APOBEC3G polypeptides optionally include one or more substitutions that increase solubility of the protein. The variant APOBEC3G polypeptides and/or nucleic acid molecules encoding the polypeptides can be used, for example, to inhibit replication of human immunodeficiency virus (HIV) and in gene editing systems to induce nucleobase substitutions in a target nucleic acid.

Description

CATALYTICALLY HYPERACTIVE VARIANTS OF HUMAN APOBEC3G PROTEIN

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Application No. 62/673,591, filed May 18, 2018, which is herein incorporated by reference in its entirety.

FIELD

This disclosure concerns modified human APOBEC3 proteins that have increased enzymatic activity and enhanced affinity for single-stranded DNA. The disclosure further concerns use of the APOBEC3 protein variants, such as for inhibiting human immunodeficiency virus (HIV) replication and in gene editing systems.

ACKNOWLEDGMENT OF GOVERNMENT SUPPORT

This invention was made with government support under project number HHSN26120080001E awarded by the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND

Human APOBEC (“apolipoprotein B mRNA editing enzyme, catalytic polypeptide”) proteins are single-stranded DNA (ssDNA) cytidine deaminases that catalyze Zn-dependent deamination of a deoxy-cytidine, generating deoxy-uridine. The APOBEC family includes APOBEC1, APOBEC2, APOBEC3, APOBEC4 and AID (activation-induced cytidine deaminase) (Conticello et al., Mol Biol Evol 22, 367-77, 2005). Among them, APOBEC3 proteins are genetically expanded in humans in response to the evolution of pathogens (LaRue et al., BMC Mol Biol 9, 104, 2008; Zhang and Webb, Hum Mol Genet 13, 1785-91, 2004). As a result of this expansion, humans contain seven APOBEC3 proteins (APOBEC3A, 3B, 3C, 3D, 3F, 3G and 3H), which are all encoded on chromosome 22 (Jarmuz et al., Genomics 79, 285-96, 2002). APOBEC3G (A3G) restricts human immunodeficiency virus type 1 (HIV-1) (Sheehy et al., Nature 418, 646-50, 2002; Mangeat et al., Nature 424, 99-103, 2003; Zhang et al., Nature 424, 94-8, 2003; Liddament et al., Curr Biol 14, 1385-91, 2004; Harris et al., Cell 113, 803-9, 2003), a finding that prompted extensive studies of APOBEC3 protein restriction of retroviruses and retrotransposons. Of the seven human APOBEC3 proteins, A3D, A3F, A3G and A3H can restrict HIV-1, and hypermutation of the virus genomes by their deamination activity is the primary mechanism by which these A3 proteins restrict HIV-1 (Malim, Philos Trans R Soc Lond B Biol Sci 364, 675-87, 2009; Chiu and Greene, Annu Rev Immunol 26, 317-53, 2008; Goila-Gaur and Strebel, Retrovirology 5, 51, 2008; Feng et al., Front Microbiol 5, 450, 2014). The APOBEC3 proteins catalyze deamination of deoxy-cytidine, introducing C-to-U modifications in newly synthesized (-)DNA strands of the virus genome, which results in G-to-A mutations in (+)DNA because U is used as a template during (+)DNA strand synthesis (Yu et al., Nat Struct Mol Biol 11, 435-42, 2004). Although there are many reports of deaminase-independent HIV restriction by APOBEC3 proteins, the significance of deaminase-independent mechanisms is unknown (Okada and Iwatani, Front Microbiol 7, 2027, 2016).
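The strand logic described in the preceding paragraph (C-to-U on the (-) strand read out as G-to-A on the (+) strand) can be traced with a short sketch. This is a toy model only: the nine-nucleotide sequence is hypothetical, and the rule that every C preceded by another C on the (-) strand is deaminated is a simplifying assumption, not a description of the enzyme's actual site selection.

```python
# Toy model: C-to-U deamination on the (-) strand appears as G-to-A on the (+) strand.
# The sequence and the "deaminate every 5'-CC" rule are illustrative assumptions.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}  # U templates A


def reverse_complement(strand):
    """Return the complementary strand, written 5'->3'."""
    return "".join(PAIR[base] for base in reversed(strand))


def deaminate_cc(minus_strand):
    """Convert the 3' C of each 5'-CC dinucleotide to U."""
    bases = list(minus_strand)
    for i in range(1, len(minus_strand)):
        if minus_strand[i - 1] == "C" and minus_strand[i] == "C":
            bases[i] = "U"
    return "".join(bases)


plus_before = "ATGGGTAGA"                      # hypothetical (+) strand
minus = reverse_complement(plus_before)        # (-) strand copied from the (+) strand
minus_edited = deaminate_cc(minus)             # C-to-U on the (-) strand
plus_after = reverse_complement(minus_edited)  # (+) strand templated by the edited (-) strand

# 1-based positions at which the (+) strand changed.
mutations = [
    (i + 1, old, new)
    for i, (old, new) in enumerate(zip(plus_before, plus_after))
    if old != new
]
print(plus_before, "->", plus_after, mutations)
```

Every change reported in `mutations` is a G-to-A substitution on the (+) strand, which is the signature attributed to APOBEC3-mediated hypermutation in the text.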

HIV-1 has developed a defense mechanism against APOBEC3 proteins by using one of its accessory proteins, viral infectivity factor (Vif). Vif physically interacts with HIV-relevant APOBEC3 proteins, and assembles host cellular proteins including an E3 ubiquitin ligase to trigger degradation of the APOBEC3 proteins through the ubiquitin-proteasome pathway (Yu et al., Science 302, 1056-60, 2003). For A3D, A3F and A3G, which contain two Zn²⁺-binding motifs/domains, the catalytically inactive N-terminal domain (NTD) binds Vif as well as RNA, DNA and other viral proteins (Aydin et al., Structure 22, 668-84, 2014). The C-terminal domain (CTD) of these APOBEC3 proteins is catalytically active, containing the Zn²⁺-binding motif H-x-E-x₂₃₋₂₈-C-x₂₋₄-C. The catalytic mechanism of cytosine/cytidine deamination has been studied biochemically and structurally using deaminases from E. coli and yeast. Briefly, the hydroxide ion generated from a water molecule chelating Zn²⁺ attacks the C4 atom of cytosine, then the hydrogen is transferred to the carboxylate group of glutamic acid from the Zn²⁺-binding motif; this hydrogen is ultimately transferred to the product ammonia (Betts et al., J Mol Biol 235, 635-56, 1994; Xiang et al., Biochemistry 34, 4516-23, 1995; Xiang et al., Biochemistry 36, 4768-74, 1997). Although all APOBEC3 proteins deaminate cytidines in ssDNA, they show differences in preferred hotspot sequences: 5'-CC for A3G and 5'-TC for other A3s (A3A can deaminate 5'-CC, albeit to a lesser extent) (Desimmie et al., J Mol Biol 426, 1220-45, 2014). A3G’s deamination mechanism may be more complicated than that of other A3 proteins because several groups have reported that A3G deaminates 5'-CC hotspots processively from the 3'-end to the 5'-end of ssDNA (Chelico et al., Nat Struct Mol Biol, 2006).
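As an illustration of the Zn²⁺-binding motif named above, the following sketch writes H-x-E-x(23-28)-C-x(2-4)-C as a regular expression and locates it in an excerpt of the wild-type A3G-CTD sequence (SEQ ID NO: 1 of the sequence listing). The excerpt boundaries, the variable names and the regular-expression rendering of the motif are assumptions made for this example; residue numbers follow the full-length numbering used in this disclosure (CTD position 57 corresponds to full-length position 247).

```python
import re

# Excerpt of wild-type A3G-CTD covering full-length residues 251-300,
# transcribed from SEQ ID NO: 1; the excerpt boundaries are illustrative.
EXCERPT = "GFLEGRHAELCFLDVIPFWKLDLDQDYRVTCFTSWSPCFSCAQEMAKFIS"
FIRST_RESIDUE = 251

# H-x-E-x(23-28)-C-x(2-4)-C, with the four named residues captured as groups.
MOTIF = re.compile(r"(H).(E).{23,28}(C).{2,4}(C)")

match = MOTIF.search(EXCERPT)

# Report each captured residue as one-letter code plus full-length position.
motif_residues = [
    f"{EXCERPT[match.start(group)]}{FIRST_RESIDUE + match.start(group)}"
    for group in range(1, 5)
]
print(motif_residues)
```

On this excerpt the pattern reports H257, E259, C288 and C291; E259 is the glutamate that is replaced by alanine in the catalytically inactive CTD2* variant of SEQ ID NO: 3.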

Three dimensional structures of APOBEC proteins have emerged in the last 10 years as several laboratories (Chelico et al., Nat Struct Mol Biol, 2006; Chen et al., Nature 452, 116-9, 2008; Harjes et al., J Mol Biol 389, 819-32, 2009; Shandilya et al., Structure 18, 28-38, 2010; Bohn et al., Structure 21, 1042-50, 2013; Bohn et al., Structure 23, 903-911, 2015; Kouno et al., Nat Struct Mol Biol 22, 485-91, 2015; Kouno et al., Nat Commun 8, 15024, 2017; Holden et al., Nature 456, 121-124, 2008; Furukawa et al., Embo J 28, 440-51, 2009; Kitamura et al., Nat Struct Mol Biol 19, 1005-10, 2012; Siu et al., Nat Commun 4, 2593, 2013; Byeon et al., Nat Commun 4, 1890, 2013; Lu et al., J Biol Chem 290, 4010-21, 2015; Shi et al., J Biol Chem 290, 28120-30, 2015; Byeon et al., Biochemistry 55, 2944-59, 2016; Xiao et al., Nat Commun 7, 12193, 2016; Shi et al., Nat Struct Mol Biol 24, 131-139, 2017; Qiao et al., Mol Cell 67, 361-373 e4, 2017) have solved NMR and crystal structures of single domains of human APOBEC proteins. These structures are similar as they share the same secondary structure, including six helices and five β-strands, and one HAE-x₂₈-C-x₂₋₄-C zinc-binding motif. Several alternative ssDNA binding surfaces have been proposed for A3G-CTD based on NMR and crystal structures of apo-form CTDs (Chen et al., Nature 452, 116-9, 2008; Holden et al., Nature 456, 121-124, 2008; Furukawa et al., Embo J 28, 440-51, 2009; Lu et al., J Biol Chem 290, 4010-21, 2015), yet none of these models is convincing because they lack atomic-level information on interactions between ssDNA and protein. Most recently, the crystal structures of A3A in complex with ssDNA containing a 5'-TC deamination motif have been reported. These A3A-ssDNA co-crystal structures revealed the interactions between A3A and the 5'-TC motif, and structural similarity with the crystal structure of Staphylococcus aureus tRNA adenosine deaminase (TadA) in complex with RNA (Kouno et al., Nat Commun 8, 15024, 2017; Shi et al., Nat Struct Mol Biol 24, 131-139, 2017).

SUMMARY

Modified APOBEC3G polypeptides having at least one amino acid substitution that increases catalytic activity of the enzyme are disclosed herein. In some instances, the modified polypeptides also include one or more substitutions that increase solubility. Use of the disclosed APOBEC3G polypeptides and encoding nucleic acid molecules for inhibiting HIV-1 replication, and in gene editing systems, is described.

Provided herein is an isolated polypeptide comprising the amino acid sequence of the C-terminal domain (CTD) of human APOBEC3G (SEQ ID NO: 1), wherein the proline residue at position 57 of SEQ ID NO: 1 (corresponding to position 247 of full-length APOBEC3G of SEQ ID NO: 4) is substituted with lysine or arginine. In some embodiments, the polypeptide further includes a glutamine to lysine, arginine or glutamate substitution at position 128 (corresponding to position 318 of full-length APOBEC3G of SEQ ID NO: 4). In some embodiments, the polypeptide further includes one or more amino acid substitutions that increase solubility of the polypeptide. In particular non-limiting examples, the polypeptide has the amino acid sequence of SEQ ID NO: 2 or SEQ ID NO: 6.

Nucleic acid molecules encoding the disclosed polypeptides, and vectors that include the encoding nucleic acid molecules, are also provided. Further provided is a method of inhibiting HIV replication in a cell infected with HIV by contacting the cell with a vector that includes a nucleic acid molecule encoding a modified APOBEC3G polypeptide disclosed herein.

Also provided are fusion proteins that include a modified APOBEC3G polypeptide and a heterologous protein. In some embodiments, the heterologous protein is a CRISPR-associated protein 9 (Cas9) polypeptide, such as a catalytically inactive Cas9 polypeptide. Nucleic acid molecules and vectors encoding the fusion proteins, and isolated cells comprising the vector, are also provided.

Further provided is a method for editing a nucleobase of a target nucleic acid by contacting the target nucleic acid with a fusion protein disclosed herein and a guide sequence, such as a guide RNA (gRNA).

Kits that include any of the disclosed APOBEC3G polypeptides, nucleic acid molecules, vectors, fusion proteins and isolated cells are also provided.

The foregoing and other objects and features of the disclosure will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1D: Generation of a variant of A3G-CTD. (FIG. 1A) Amino acid residues of wild-type A3G-CTD (SEQ ID NO: 1) and the catalytically hyperactive variant CTD2 (SEQ ID NO: 2) are aligned. 2K3A substitutions (L234K, C243A, F310K, C321A and C356A, numbered with reference to SEQ ID NO: 4) are indicated by boxes, and the additional five substitutions (P200A, N236A, P247K, Q318K and Q322A, numbered with reference to SEQ ID NO: 4) are underlined. (FIG. 1B) Real-time NMR deamination assay. The product (5'-AATCCdeoxy-UAAA) concentration as a function of reaction time is plotted for CTD2 and wild-type A3G-CTD deamination at pH 7.5 with 200 nM protein and 200 μM 5'-AATCCCAAA substrate. For CTD2, the first reaction reached completion within 5 hours, and the second deamination (5'-AATCCdeoxy-UAAA to 5'-AATCdeoxy-Udeoxy-UAAA) started, thus decreasing the concentration of the initial product. (FIG. 1C) Electrophoretic mobility shift assay (EMSA) for binding of CTD2* (SEQ ID NO: 3) to the 9nt ssDNA (5'-AATCCCAAA-6-FAM). CTD2*-DNA indicates the position of the CTD2*-ssDNA complex, and free DNA indicates the position of protein-free ssDNA. Fluorescent-unprobed 9nt polyA (5'-AAAAAAAAA) or fluorescent-unprobed 9nt ssDNA (5'-AATCCCAAA) was added at incremental amounts in lanes 3 (25 nM), 4 (250 nM) and 5 (2500 nM); or 7 (25 nM), 8 (250 nM) and 9 (2500 nM), respectively. (FIG. 1D) Each dot represents a microscale thermophoresis (MST) measurement of a mixture containing fluorescently labeled CTD2* (50 nM) and the 9nt ssDNA at various concentrations including 0.12 μM, 0.24 μM, 0.48 μM, 0.97 μM, 1.95 μM, 3.90 μM, 7.81 μM, 15.62 μM, 31.25 μM, 62.5 μM, 125 μM, 250 μM, 500 μM, 1 mM, 2 mM and 4 mM. Three independent MST experiments were performed, and bars of data points represent standard error of n=3 measurements.

FIGS. 2A-2C: Antiviral restriction activity of FLAG-NTD-CTD2. (FIG. 2A) Representative western blot showing 293T cells co-transfected with increasing amounts of wild-type FLAG-A3G or FLAG-NTD-CTD2 (21, 42, 84, 170 ng; black triangles), HDV-EGFP and VSV-G in the presence or absence of Vif-HA. Percent A3G expression is shown for each respective lane. Single-cycle infectivity of Vif+ (FIG. 2B) and Vif- (FIG. 2C) HDV-EGFP virus prepared in the presence of increasing amounts of FLAG-A3G or FLAG-NTD-CTD2 assayed in TZM-bl target cells. Data reflect the average relative light units (RLU) normalized to the no A3G control. Error bars represent the standard deviation for three independent experiments.

FIGS. 3A-3G: Structure of CTD2* in complex with ssDNA. (FIG. 3A) The asymmetric unit contains one protein and one ssDNA molecule. A 2Fo-Fc electron density map contoured at 1 σ is shown around the ssDNA. Zn²⁺ ion is indicated by the sphere next to C₀. N and C indicate the N- and C-terminal ends of the protein, respectively. (FIG. 3B) An enlarged view shows interactions between the 5'-TCCCA target sequence and the protein. Amino acid sidechains interacting with DNA are shown as sticks. (FIGS. 3C-3F) Enlarged views show interactions between T₋₃, C₋₂, C₋₁, C₀, or A₊₁ and protein. Dotted lines indicate hydrogen bonds. In (FIG. 3C), the double arrow-headed line points to the neighboring backbone phosphorus atoms of C₋₁ and C₋₂. In (FIG. 3D), sidechains of F289 and Q318K are not shown. (FIG. 3G) Summary of the interactions between CTD2* and nucleotides in the 5'-TCCCA target sequence.

FIGS. 4A-4C: Comparison of CTD2* and A3A recognition of the nucleotide at the -1 position in target sequences. (FIG. 4A) Interaction of cytidine at the -1 position (C₋₁, blue) with residues of CTD2* in the CTD2*-ssDNA complex. (FIG. 4B) Interaction of thymidine at the -1 position (T₋₁) with residues of A3A in the A3A-ssDNA complex (PDB ID: 5KEG). (FIG. 4C) Superimposition of the CTD2*-ssDNA and A3A-ssDNA complexes showing nucleotides at the -1 position and their interacting residues from both structures.

FIGS. 5A-5C: Comparison of structures of CTD2* with apo-CTD-2K3A. Proteins are in the same orientation in all three figures. (FIG. 5A) Superimposed cartoons show structural features of CTD2* and CTD-2K3A (PDB ID: 3IR2), respectively. W211, R213, H216, Y315, D316 and D317 sidechains of CTD2* are shown in sticks. Zn²⁺ ion is shown as a sphere. ssDNA is not shown. (FIG. 5B) An enlarged view of (FIG. 5A) that shows the repositioning of the critical residues W211, R213 and H216 of loop 1, and Y315, D316 and D317 of loop 7. Double-headed arrows point to positions of Cα atoms of W211 and D317. (FIG. 5C) Surface representation of ssDNA-bound CTD2*. Locations of loop 1 and loop 7 residues are labeled except D317, because this residue is buried inside the molecule and not seen on the surface. The 5'-TCCCA target sequence is shown as sticks.

FIGS. 6A-6B: A variant of CTD2 possesses a different substrate specificity and enhanced deamination activity. CTD2 variant (CTD2-V, set forth as SEQ ID NO: 6) contains three amino acid changes relative to CTD2 (SEQ ID NO: 2). (FIG. 6A) Real-time NMR deamination assays. 1D ¹H spectra series of 200 μM 5'-AATGCTAAA mixed with 2.0 μM CTD2-V. The H5 signal of U from the product, 5'-AATGUTAAA, appears at 5.55 ppm. (FIG. 6B) The ¹H NMR signal of the deamination product was tracked with respect to reaction time. The initial reaction rate was 2.9±0.2 reactions/hour.
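The sixteen ssDNA concentrations listed for FIG. 1D are consistent with a two-fold serial dilution that starts at 4 mM and is halved fifteen times, ending near 0.12 μM. The sketch below regenerates such a series; the function name and the choice to express every value in micromolar are assumptions made for the example, and the dilution scheme itself is inferred from the listed values rather than stated in the legend.

```python
def twofold_series(top_concentration_um, n_points):
    """Concentrations in micromolar, from the top value down, halving each step."""
    return [top_concentration_um / 2 ** step for step in range(n_points)]


series_um = twofold_series(4000.0, 16)  # 4 mM expressed in micromolar

# Print from the most dilute point upward, as in the figure legend.
for value in reversed(series_um):
    print(f"{value:.2f} uM")
```

The printed values are rounded to two decimals, whereas the legend appears to truncate (for example 0.97 rather than 0.98 for 0.9765625 μM).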

SEQUENCE LISTING

The nucleic and amino acid sequences listed in the accompanying sequence listing are shown using standard letter abbreviations for nucleotide bases, and three letter code for amino acids, as defined in 37 C.F.R. 1.822. Only one strand of each nucleic acid sequence is shown, but the complementary strand is understood as included by any reference to the displayed strand. The Sequence Listing is submitted as an ASCII text file, created on May 10, 2019, 13.2 KB, which is incorporated by reference herein. In the accompanying sequence listing:

SEQ ID NO: 1 is the amino acid sequence of the C-terminal domain of human wild-type APOBEC3G (referred to herein as “A3G-CTD”).

EILRHSMDPPTFTFNFNNEPWVRGRHETYLCYEVERMHNDTWVLLNQRRGFLCNQAPHKHGFLEGRHAELCFLDVIPFWKLDLDQDYRVTCFTSWSPCFSCAQEMAKFISKNKHVSLCIFTARIYDDQGRCQEGLRTLAEAGAKISIMTYSEFKHCWDTFVDHQGCPFQPWDGLDEHSQDLSGRLRAILQNQEN

SEQ ID NO: 2 is the amino acid sequence of a variant C-terminal domain of human APOBEC3G (referred to herein as “CTD2”) having 10 amino acid substitutions that increase enzymatic activity or solubility. Substitutions are indicated by bold underline.

EILRHSMDPATFTFNFNNEPWVRGRHETYLCYEVERMHNDTWVKLAQRRGFLANQAKHKHGFLEGRHAELCFLDVIPFWKLDLDQDYRVTCFTSWSPCFSCAQEMAKFISKNKHVSLCIKTARIYDDKGRAAEGLRTLAEAGAKISIMTYSEFKHCWDTFVDHQGAPFQPWDGLDEHSQDLSGRLRAILQNQEN

SEQ ID NO: 3 is the amino acid sequence of a catalytically inactive variant of CTD2 (referred to herein as “CTD2*”). The single mutation relative to CTD2 is shown in bold underline.

EILRHSMDPATFTFNFNNEPWVRGRHETYLCYEVERMHNDTWVKLAQRRGFLANQAKHKHGFLEGRHAALCFLDVIPFWKLDLDQDYRVTCFTSWSPCFSCAQEMAKFISKNKHVSLCIKTARIYDDKGRAAEGLRTLAEAGAKISIMTYSEFKHCWDTFVDHQGAPFQPWDGLDEHSQDLSGRLRAILQNQEN

SEQ ID NO: 4 is the amino acid sequence of full-length human APOBEC3G, deposited under GenBank Accession No. NP_068594.1.

SEQ ID NO: 5 is the nucleotide sequence of full-length human APOBEC3G, deposited under GenBank Accession No. NM_021822. One skilled in the art will appreciate that this sequence can be altered to generate APOBEC3G coding sequences for the variant proteins provided herein.

SEQ ID NO: 6 is the amino acid sequence of a variant of CTD2 (CTD2-V). Amino acid substitutions relative to CTD2 are indicated in bold underline.

EILRHSMDPATFTFNFNNEPWVRGRHETYLCYEVERMHNDTWVKLAQRRGFLANQAKHKHGFLEGRHAELCFLDVIPFWKLDLDQDYRVTCFTSWSPCFSCAQEMAKFISKNKHVSLCIKTARIYFCEGRAAEGLRTLAEAGAKISIMTYSEFKHCWDTFVDHQGAPFQPWDGLDEHSQDLSGRLRAILQNQEN

SEQ ID NOs: 7 and 8 are peptide linkers.
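Because the bold underline marking of substitutions does not survive in plain text, the substitutions in CTD2 can be recovered by comparing SEQ ID NO: 2 with SEQ ID NO: 1 position by position. The sketch below is one way to do this and is offered as an illustration only: both sequences are transcribed here as contiguous one-letter strings (the transcription is an assumption of the example), and the offset of 190 follows from the statement that position 57 of SEQ ID NO: 1 corresponds to position 247 of full-length APOBEC3G.

```python
# SEQ ID NO: 1 (wild-type A3G-CTD) and SEQ ID NO: 2 (CTD2) as contiguous strings.
WT_CTD = (
    "EILRHSMDPPTFTFNFNNEPWVRGRHETYLCYEVERMHNDTWVLLNQRRG"
    "FLCNQAPHKHGFLEGRHAELCFLDVIPFWKLDLDQDYRVTCFTSWSPCFS"
    "CAQEMAKFISKNKHVSLCIFTARIYDDQGRCQEGLRTLAEAGAKISIMTY"
    "SEFKHCWDTFVDHQGCPFQPWDGLDEHSQDLSGRLRAILQNQEN"
)
CTD2 = (
    "EILRHSMDPATFTFNFNNEPWVRGRHETYLCYEVERMHNDTWVKLAQRRG"
    "FLANQAKHKHGFLEGRHAELCFLDVIPFWKLDLDQDYRVTCFTSWSPCFS"
    "CAQEMAKFISKNKHVSLCIKTARIYDDKGRAAEGLRTLAEAGAKISIMTY"
    "SEFKHCWDTFVDHQGAPFQPWDGLDEHSQDLSGRLRAILQNQEN"
)
OFFSET = 190  # CTD position 57 corresponds to full-length position 247


def substitutions(reference, variant, offset=OFFSET):
    """List substitutions such as 'P247K' in full-length numbering."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    return [
        f"{ref}{index + 1 + offset}{var}"
        for index, (ref, var) in enumerate(zip(reference, variant))
        if ref != var
    ]


ctd2_substitutions = substitutions(WT_CTD, CTD2)
print(ctd2_substitutions)
```

The ten substitutions returned match those named in the description of FIG. 1A: the five 2K3A substitutions (L234K, C243A, F310K, C321A, C356A) and the additional five (P200A, N236A, P247K, Q318K, Q322A).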

DETAILED DESCRIPTION

Unless otherwise noted, technical terms are used according to conventional usage.

Definitions of common terms in molecular biology can be found in Benjamin Lewin, Genes VII, published by Oxford University Press, 1999; Kendrew et al. (eds.), The Encyclopedia of Molecular Biology, published by Blackwell Science Ltd., 1994; and Robert A. Meyers (ed.), Molecular Biology and Biotechnology: a Comprehensive Desk Reference, published by VCH Publishers, Inc., 1995; and other similar references. As used herein, the singular forms “a,” “an,” and “the,” refer to both the singular as well as plural, unless the context clearly indicates otherwise. As used herein, the term “comprises” means “includes.” Thus, “comprising a nucleic acid molecule” means “including a nucleic acid molecule” without excluding other elements. It is further to be understood that any and all base sizes given for nucleic acids are approximate, and are provided for descriptive purposes, unless otherwise indicated. Although many methods and materials similar or equivalent to those described herein can be used, particular suitable methods and materials are described below. In case of conflict, the present specification, including explanations of terms, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting. All references, including patent applications and patents, and sequences associated with the GenBank® Accession Numbers listed (as of May 18, 2018) are herein incorporated by reference in their entireties.

In order to facilitate review of the various embodiments of the disclosure, the following explanations of specific terms are provided:

I. Terms

Administration: To provide or give a subject an agent, such as a modified APOBEC3G polypeptide or corresponding coding sequence disclosed herein, by any effective route. Exemplary routes of administration include, but are not limited to, injection (such as subcutaneous, intramuscular, intradermal, intraperitoneal, intratumoral, and intravenous), transdermal, intranasal, and inhalation routes.

APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide): A family of Zn²⁺-dependent single-stranded DNA (ssDNA) cytosine deaminases. This family of proteins includes APOBEC1, APOBEC2, APOBEC3, APOBEC4 and activation-induced cytidine deaminase (AID). Humans encode seven APOBEC3 proteins: APOBEC3A, APOBEC3B, APOBEC3C, APOBEC3D, APOBEC3F, APOBEC3G and APOBEC3H, four of which are known to inhibit HIV-1 replication (APOBEC3D, APOBEC3F, APOBEC3G and APOBEC3H).

APOBEC3 proteins include a catalytically inactive N-terminal domain (NTD) and a catalytically active C-terminal domain (CTD). Human APOBEC3G nucleic acid and protein sequences are publicly available, such as under NCBI Gene ID 60489. Exemplary APOBEC3G nucleotide and amino acid sequences are deposited under GenBank Accession Nos. NM_021822.3 (SEQ ID NO: 5) and NP_068594.1 (SEQ ID NO: 4), respectively. The CTD of human APOBEC3G is set forth herein as SEQ ID NO: 1. A modified version of the CTD (CTD2; with 10 amino acid substitutions relative to the wild-type sequence) is set forth herein as SEQ ID NO: 2. A variant of CTD2 (CTD2-V), having 3 amino acid substitutions relative to CTD2, is set forth herein as SEQ ID NO: 6.

Cas9 (CRISPR-associated protein 9): An RNA-guided endonuclease enzyme that can cut DNA. Cas9 has two active cutting sites (HNH and RuvC), one for each strand of the double helix. Cas9 sequences are publicly available. For example, nucleotides 796693..800799 of GenBank® Accession No. CP012045.1 and nucleotides 1100046..1104152 of GenBank® Accession No. CP014139.1 disclose Cas9 nucleic acids, and GenBank® Accession Nos. NP_269215.1, AMA70685.1 and AKP81606.1 disclose Cas9 proteins. In some examples, the Cas9 is a deactivated form of Cas9 (dCas9), such as one that is nuclease deficient (e.g., those shown in GenBank® Accession Nos. AKA60242.1 and KR011748.1). In some examples, the dCas9 includes one or more of the following point mutations: D10A, H840A and N863A. In some embodiments of the present disclosure, a Cas9 polypeptide is from a species of Streptococcus, such as Streptococcus pyogenes, Streptococcus thermophilus, Streptococcus dysgalactiae, Streptococcus canis, Streptococcus mutans or Streptococcus agalactiae, or from Staphylococcus aureus. In other embodiments, a Cas9 sequence is from another species of bacteria, such as Neisseria meningitidis, Treponema denticola or Campylobacter jejuni. In certain examples, Cas9 has at least 80% sequence identity, for example at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98% or at least 99% sequence identity to a wild-type Cas9 sequence.

Contacting: Placement in direct physical association; includes both in solid and liquid form.

CRISPR (clustered regularly interspaced short palindromic repeat): DNA loci containing short repetitions of base sequences. Each repetition is followed by short segments of "spacer DNA" from previous exposures to a virus. CRISPRs are found in approximately 40% of sequenced bacterial genomes and 90% of sequenced archaea. CRISPRs are often associated with cas genes that code for proteins related to CRISPRs. The CRISPR/Cas system is a prokaryotic immune system that confers resistance to foreign genetic elements such as plasmids and phages and provides a form of acquired immunity. CRISPR spacers recognize and cut these exogenous genetic elements in a manner analogous to RNAi in eukaryotic organisms. The CRISPR/Cas system can be used for gene editing (adding, disrupting or changing the sequence of specific genes) and gene regulation. By delivering the Cas9 protein and appropriate guide RNAs into a cell, the organism’s genome can be cut at any desired location.

Cytidine deaminase: An enzyme that catalyzes the deamination of cytidine and deoxycytidine to uridine and deoxyuridine, respectively.

Effective amount: The amount of an agent (such as an APOBEC3G polypeptide or nucleic acid molecule disclosed herein) that is sufficient to effect beneficial or desired results. A therapeutically effective amount may vary depending upon one or more of: the subject and disease condition being treated, the weight and age of the subject, the severity of the disease condition, the manner of administration and the like, which can readily be determined by one of ordinary skill in the art. The beneficial therapeutic effect can include enablement of diagnostic determinations; amelioration of a disease, symptom, disorder, or pathological condition; reducing or preventing the onset of a disease, symptom, disorder or condition; and generally counteracting a disease, symptom, disorder or pathological condition. In one embodiment, an “effective amount” is an amount sufficient to reduce symptoms of a disease, for example by at least 10%, at least 20%, at least 50%, at least 70%, or at least 90% (as compared to no administration of the therapeutic agent). In one example, an effective amount is an amount required to inhibit HIV replication, such as by at least 10%, at least 20%, at least 50%, at least 70%, or at least 90% (as compared to no administration of the therapeutic agent).

Fusion protein: A protein that includes at least a portion of two different (heterologous) proteins. The two different proteins may be joined directly, or via a linker (such as 1 to 30 amino acids, such as Gly, Ser, or combinations thereof (such as (GGGGS)n (SEQ ID NO: 7) or (G)n), (EAAAK)n (SEQ ID NO: 8), or a cleavable linker). In some examples, a fusion protein is generated chemically, or by expression of a nucleic acid sequence engineered from nucleic acid sequences encoding the fusion protein. To create a fusion protein from a nucleic acid molecule, the nucleic acid sequences must be in the same reading frame and contain no internal stop codons.
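The two requirements stated at the end of this definition (a shared reading frame and no internal stop codons) can be checked mechanically before a fusion construct is built. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the two partner sequences are short made-up placeholders rather than APOBEC3G or Cas9 sequences, and the linker is one arbitrary codon choice encoding a single GGGGS repeat.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(sequence):
    """Split a coding sequence into complete codons, ignoring trailing bases."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


def check_fusion(parts):
    """Return (in_frame, internal_stop_indices) for segments joined 5'->3'.

    in_frame is True when every segment is a whole number of codons, so each
    downstream segment stays in the reading frame of the first one.
    internal_stop_indices lists stop codons found anywhere before the last codon.
    """
    fused = "".join(parts).upper()
    in_frame = all(len(part) % 3 == 0 for part in parts)
    codons = split_codons(fused)
    internal_stops = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    return in_frame, internal_stops


upstream = "ATGGCTAAA"       # hypothetical first partner, no stop codon
linker = "GGTGGCGGAGGTTCT"   # one codon choice encoding GGGGS
downstream = "GAAATCCTGTAA"  # hypothetical second partner ending in a stop codon

good = check_fusion([upstream, linker, downstream])
bad_stop = check_fusion(["ATGGCTTAA", linker, downstream])   # stop left in first partner
bad_frame = check_fusion(["ATGGCTAA", linker, downstream])   # one base missing
print(good, bad_stop, bad_frame)
```

A construct passes only when the first element is `True` and the list of internal stop positions is empty; a stop codon retained at the end of the upstream partner is the most common reason the second condition fails.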

Guide sequence: A polynucleotide sequence having sufficient complementarity with a target polynucleotide sequence to hybridize with the target sequence and direct sequence-specific binding of a Cas9 polypeptide (such as a Cas9-variant APOBEC3G fusion protein disclosed herein) to the target sequence. In some examples, the guide sequence is RNA. In some examples, the guide sequence is DNA. The guide nucleic acid can include modified bases or chemical modifications (e.g., see Latorre et al., Angewandte Chemie 55:3548-50, 2016). In some embodiments, the degree of complementarity between a guide sequence and its corresponding target sequence, when optimally aligned using a suitable alignment algorithm, is about or more than about 50%, 60%, 75%, 80%, 85%, 90%, 95%, 97.5%, 99%, or more. Optimal alignment may be determined with the use of any suitable algorithm for aligning sequences, non-limiting examples of which include the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, algorithms based on the Burrows-Wheeler Transform (e.g., the Burrows Wheeler Aligner), ClustalW, Clustal X, BLAT, Novoalign (Novocraft Technologies), ELAND (Illumina, San Diego, Calif.), SOAP (available at soap.genomics.org.cn), and Maq (available at maq.sourceforge.net). In some embodiments, a guide sequence is about, or at least about 5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 75, or more nucleotides in length. In some embodiments, a guide sequence is less than about 75, 50, 45, 40, 35, 30, 25, 20, 15, 12, or fewer nucleotides in length. In some embodiments, a guide sequence is 15-25 nucleotides (such as 18-22 or 18 nucleotides).
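To make the "degree of complementarity" concrete, the sketch below computes an ungapped percent complementarity between a guide RNA and a target DNA strand of the same length. It is a simplification and not one of the alignment algorithms named in the definition: there is no gap handling or optimal alignment, only Watson-Crick pairs are counted, and the 20-nucleotide target and all function names are hypothetical.

```python
# Watson-Crick pairs written as (guide RNA base, target DNA base).
PAIRS = {("A", "T"), ("U", "A"), ("G", "C"), ("C", "G")}
DNA_TO_GUIDE = {"A": "U", "T": "A", "G": "C", "C": "G"}


def perfect_guide(target_dna):
    """Guide RNA (5'->3') fully complementary to a target DNA strand (5'->3')."""
    return "".join(DNA_TO_GUIDE[base] for base in reversed(target_dna))


def percent_complementarity(guide_rna, target_dna):
    """Ungapped percent complementarity for an antiparallel, equal-length duplex."""
    if len(guide_rna) != len(target_dna):
        raise ValueError("ungapped comparison needs equal lengths")
    target_3_to_5 = target_dna[::-1]  # pair guide 5'->3' against target 3'->5'
    paired = sum((g, t) in PAIRS for g, t in zip(guide_rna, target_3_to_5))
    return 100.0 * paired / len(guide_rna)


target = "GATTACAGATTACAGATTAC"     # hypothetical 20-nt target strand
guide = perfect_guide(target)
mismatched = guide[:-1] + "A"       # replace the final C with A: one mismatch

perfect_score = percent_complementarity(guide, target)
mismatch_score = percent_complementarity(mismatched, target)
print(guide, perfect_score, mismatch_score)
```

With a 20-nucleotide guide, each mismatch lowers the score by five percentage points, so the thresholds listed in the definition translate directly into a tolerated number of mismatches for a given guide length.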

Heterologous: Originating from a different genetic source or species.

Human immunodeficiency virus (HIV): A retrovirus that causes immunosuppression in humans and can lead to a disease complex known as the acquired immunodeficiency syndrome (AIDS). HIV includes HIV type 1 (HIV-1) and HIV type 2 (HIV-2).

Isolated: An “isolated” biological component (such as a variant APOBEC3G polypeptide, nucleic acid, or cell containing such) has been substantially separated, produced apart from, or purified away from other biological components in the cell or tissue of an organism in which the component occurs, such as other cells, chromosomal and extrachromosomal DNA and RNA, and proteins. Nucleic acids and proteins that have been “isolated” include nucleic acids and proteins purified by standard purification methods. The term also embraces nucleic acids and proteins prepared by recombinant expression in a host cell as well as chemically synthesized nucleic acids and proteins. Isolated proteins or nucleic acids, or cells containing such, in some examples are at least 50% pure, such as at least 75%, at least 80%, at least 90%, at least 95%, at least 98%, or at least 100% pure.

Operably linked: A first nucleic acid sequence is operably linked with a second nucleic acid sequence when the first nucleic acid sequence is placed in a functional relationship with the second nucleic acid sequence. For instance, a promoter is operably linked to a coding sequence if the promoter affects the transcription or expression of the coding sequence. Generally, operably linked DNA sequences are contiguous and, where necessary to join two protein-coding regions, in the same reading frame.

Pharmaceutically acceptable carriers: The pharmaceutically acceptable carriers useful in this invention are conventional. Remington’s Pharmaceutical Sciences, by E. W. Martin, Mack Publishing Co., Easton, PA, 15th Edition (1975), describes compositions and formulations suitable for pharmaceutical delivery of proteins and nucleic acid molecules.

In general, the nature of the carrier will depend on the particular mode of administration being employed. For instance, parenteral formulations usually comprise injectable fluids that include pharmaceutically and physiologically acceptable fluids such as water, physiological saline, balanced salt solutions, aqueous dextrose, glycerol or the like as a vehicle. In addition to biologically-neutral carriers, pharmaceutical compositions to be administered can contain minor amounts of non-toxic auxiliary substances, such as wetting or emulsifying agents, preservatives, and pH buffering agents and the like, for example sodium acetate or sorbitan monolaurate.

Polypeptide, peptide and protein: Refer to polymers of amino acids of any length. The polymer may be linear or branched, it may comprise modified amino acids, and it may be interrupted by non-amino acids. The terms also encompass an amino acid polymer that has been modified; for example, disulfide bond formation, glycosylation, lipidation, acetylation, phosphorylation, or any other manipulation, such as conjugation with a labeling component. As used herein the term "amino acid" includes natural and/or unnatural or synthetic amino acids, including glycine and both the D or L optical isomers, and amino acid analogs and peptidomimetics.

Promoter: An array of nucleic acid control sequences which direct transcription of a nucleic acid. A promoter includes necessary nucleic acid sequences near the start site of transcription. A promoter also optionally includes distal enhancer or repressor elements. A “constitutive promoter” is a promoter that is continuously active and is not subject to regulation by external signals or molecules. In contrast, the activity of an “inducible promoter” is regulated by an external signal or molecule (for example, a transcription factor). In one example, the promoter is a U6 promoter or a CMV promoter.

Sequence identity/similarity: The similarity between amino acid (or nucleotide) sequences is expressed in terms of the similarity between the sequences, otherwise referred to as sequence identity. Sequence identity is frequently measured in terms of percentage identity (or similarity or homology); the higher the percentage, the more similar the two sequences are.

Methods of alignment of sequences for comparison are well known in the art. Various programs and alignment algorithms are described in: Smith and Waterman, Adv. Appl. Math. 2:482, 1981; Needleman and Wunsch, J. Mol. Biol. 48:443, 1970; Pearson and Lipman, Proc. Natl. Acad. Sci. U.S.A. 85:2444, 1988; Higgins and Sharp, Gene 13:231, 1988; Higgins and Sharp, CABIOS 5:151, 1989; Corpet et al., Nucleic Acids Research 16:10881, 1988; and Pearson and Lipman, Proc. Natl. Acad. Sci. U.S.A. 85:2444, 1988. Altschul et al., Nature Genet. 6:119, 1994, presents a detailed consideration of sequence alignment methods and homology calculations.

The NCBI Basic Local Alignment Search Tool (BLAST) (Altschul et al., J. Mol. Biol. 215:403, 1990) is available from several sources, including the National Center for Biotechnology Information (NCBI, Bethesda, MD) and on the internet, for use in connection with the sequence analysis programs blastp, blastn, blastx, tblastn and tblastx. A description of how to determine sequence identity using this program is available on the NCBI website on the internet. Variants of protein and nucleic acid sequences known in the art and disclosed herein are typically characterized by possession of at least about 80%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98% or at least 99% sequence identity counted over the full length alignment with the amino acid sequence using the NCBI Blast 2.0, gapped blastp set to default parameters.

For comparisons of amino acid sequences of greater than about 30 amino acids, the Blast 2 sequences function is employed using the default BLOSUM62 matrix set to default parameters (gap existence cost of 11, and a per residue gap cost of 1). When aligning short peptides (fewer than around 30 amino acids), the alignment should be performed using the Blast 2 sequences function, employing the PAM30 matrix set to default parameters (open gap 9, extension gap 1 penalties). Proteins with even greater similarity to the reference sequences will show increasing percentage identities when assessed by this method, such as at least 95%, at least 98%, or at least 99% sequence identity. When less than the entire sequence is being compared for sequence identity, homologs and variants will typically possess at least 80% sequence identity over short windows of 10-20 amino acids, and may possess sequence identities of at least 85% or at least 90% or at least 95% depending on their similarity to the reference sequence. Methods for determining sequence identity over such short windows are available at the NCBI website on the internet. One of skill in the art will appreciate that these sequence identity ranges are provided for guidance only; it is entirely possible that strongly significant homologs could be obtained that fall outside of the ranges provided.
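As an illustration of the percentage figures discussed above, the following minimal Python sketch counts identical positions over the full length of an already aligned pair of sequences. It does not perform the alignment itself and is not the BLAST calculation; the function names and the convention that gap columns count against identity are assumptions made for illustration only.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the full alignment length.

    Assumption: both inputs are rows of one pairwise alignment, with '-'
    marking gaps; gap columns are counted as non-identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def meets_threshold(aligned_a: str, aligned_b: str, threshold: float = 80.0) -> bool:
    """True if the pair reaches the given identity threshold (in percent)."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For example, two aligned 10-residue sequences that differ at one position give 90% identity under this convention and would satisfy an "at least 80%" threshold but not an "at least 95%" threshold.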

Subject: A vertebrate, such as a mammal, for example a human. In some embodiments, the subject is a human. In one embodiment, the subject is a non-human mammalian subject, such as a monkey or other non-human primate, mouse, rat, rabbit, pig, goat, sheep, dog, cat, horse, or cow.

In some examples, the subject is infected with HIV. In other examples, the subject has a genetic disease that can be treated using nucleobase editing. In some examples, the subject is a laboratory animal/organism, such as a zebrafish, Xenopus, C. elegans, Drosophila, mouse, rabbit, or rat.

Tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro are also encompassed.

Transduced, Transformed and Transfected: A virus or vector “transduces” a cell when it transfers nucleic acid molecules into a cell. A cell is “transformed” or “transfected” by a nucleic acid introduced into the cell when the nucleic acid becomes stably replicated by the cell, either by incorporation of the nucleic acid into the cellular genome, or by episomal replication.

These terms encompass all techniques by which a nucleic acid molecule can be introduced into such a cell, including transfection with viral vectors, transformation with plasmid vectors, and introduction of naked DNA by electroporation, lipofection, particle gun acceleration and other methods in the art. In some examples, the method is a chemical method (e.g., calcium-phosphate transfection), physical method (e.g., electroporation, microinjection, particle bombardment), fusion (e.g., liposomes), receptor-mediated endocytosis (e.g., DNA-protein complexes, viral envelope/capsid-DNA complexes) or biological infection by viruses such as recombinant viruses (Wolff, J. A., ed, Gene Therapeutics, Birkhauser, Boston, USA, 1994).

Methods for the introduction of nucleic acid molecules into cells are known (e.g., see U.S. Patent No. 6,110,743).

Vector: A nucleic acid molecule into which a foreign nucleic acid molecule can be introduced without disrupting the ability of the vector to replicate and/or integrate in a host cell. Vectors include, but are not limited to, nucleic acid molecules that are single-stranded, double-stranded, or partially double-stranded; nucleic acid molecules that comprise one or more free ends, no free ends (e.g., circular); nucleic acid molecules that comprise DNA, RNA, or both; and other varieties of polynucleotides known in the art.

A vector can include nucleic acid sequences that permit it to replicate in a host cell, such as an origin of replication. A vector can also include one or more selectable marker genes and other genetic elements known in the art. An integrating vector is capable of integrating itself into a host nucleic acid. An expression vector is a vector that contains the necessary regulatory sequences to allow transcription and translation of inserted gene or genes.

One type of vector is a "plasmid," which refers to a circular double stranded DNA loop into which additional DNA segments can be inserted, such as by standard molecular cloning techniques. Another type of vector is a viral vector, wherein virally-derived DNA or RNA sequences are present in the vector for packaging into a virus (e.g., retroviruses, replication defective retroviruses, adenoviruses, replication defective adenoviruses, and adeno-associated viruses). Viral vectors also include polynucleotides carried by a virus for transfection into a host cell. In some embodiments, the vector is a lentiviral vector (such as a 3rd generation integration-deficient lentiviral vector) or an adeno-associated viral (AAV) vector.

Certain vectors are capable of autonomous replication in a host cell into which they are introduced (e.g., bacterial vectors having a bacterial origin of replication and episomal mammalian vectors). Other vectors (e.g., non-episomal mammalian vectors) are integrated into the genome of a host cell upon introduction into the host cell, and thereby are replicated along with the host genome.

Certain vectors are capable of directing the expression of genes to which they are operatively-linked. Such vectors are referred to herein as "expression vectors." Common expression vectors of utility in recombinant DNA techniques are often in the form of plasmids. Recombinant expression vectors can comprise a nucleic acid provided herein (such as a nucleic acid encoding an APOBEC3G polypeptide) in a form suitable for expression of the nucleic acid in a host cell, which means that the recombinant expression vectors include one or more regulatory elements, which may be selected on the basis of the host cells to be used for expression, that is operatively-linked to the nucleic acid sequence to be expressed. Within a recombinant expression vector, "operably linked" is intended to mean that the nucleotide sequence of interest is linked to the regulatory element(s) in a manner that allows for expression of the nucleotide sequence (e.g., in an in vitro transcription/translation system or in a host cell when the vector is introduced into the host cell). It will be appreciated by those skilled in the art that the design of the expression vector can depend on such factors as the choice of the host cell to be transformed, the level of expression desired, etc. A vector can be introduced into host cells to thereby produce transcripts, proteins, or peptides, including fusion proteins or peptides, encoded by nucleic acids as described herein (e.g., clustered regularly interspersed short palindromic repeats (CRISPR) transcripts, proteins, enzymes, mutant forms thereof, fusion proteins thereof, etc.).

II. Overview of Several Embodiments

The human APOBEC3G protein is a cytidine deaminase that generates cytidine to deoxy-uridine mutations in single-stranded DNA. Due to this enzymatic activity, APOBEC3G is capable of inhibiting replication of HIV-1 by generating mutations in the viral genome. Described herein are modified APOBEC3G polypeptides having at least one amino acid substitution that increases catalytic activity of the enzyme. In some cases, the modified polypeptides also include one or more substitutions that increase their solubility. The present disclosure also describes the use of the disclosed APOBEC3G polypeptides and/or the encoding nucleic acid molecules for inhibiting HIV-1 replication. The disclosed APOBEC3G polypeptides and nucleic acid molecules can also be used in gene editing systems, such as CRISPR/Cas9, to induce nucleobase substitutions in a target nucleic acid molecule.

Provided herein are isolated polypeptides that are at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98% or at least 99% identical to the amino acid sequence of the C-terminal domain (CTD) of human APOBEC3G (SEQ ID NO: 1), and include an amino acid substitution at one or more residues to increase catalytic activity of the enzyme, or increase its solubility (such as one or more substitutions shown in the table below, such as at least a P247K/P57K mutation). In some embodiments, the isolated polypeptide comprises or consists of the amino acid sequence of SEQ ID NO: 1 and has an amino acid substitution at one or more residues to increase catalytic activity of the enzyme, or increase its solubility. In specific embodiments, the proline residue at position 57 of SEQ ID NO: 1 (corresponding to position 247 of full-length APOBEC3G of SEQ ID NO: 4) is substituted with lysine or arginine. In some examples, the polypeptide further includes a glutamine to lysine, glutamine to arginine, or glutamine to glutamate substitution at position 128 (corresponding to position 318 of full-length APOBEC3G of SEQ ID NO: 4). In specific examples, the proline residue at position 57 and the glutamine residue at position 128 are both substituted with lysine. In other specific examples, the proline residue at position 57 is substituted with lysine and the glutamine residue at position 128 is substituted with glutamate.

In some examples, the polypeptide includes one or more additional mutations to increase enzymatic activity, such as a proline to alanine substitution at position 10, an asparagine to alanine substitution at position 46, a glutamine to alanine substitution at position 132, or any combination thereof (numbered with reference to SEQ ID NO: 1). In other examples, the one or more additional mutations to increase enzymatic activity include a proline to alanine substitution at position 10, an asparagine to alanine substitution at position 46, an aspartate to phenylalanine substitution at position 126, an aspartate to cysteine substitution at position 127, a glutamine to alanine substitution at position 132, or any combination thereof.

In some examples, the polypeptides include one or more mutations to increase solubility, such as a leucine to lysine substitution at position 44, a cysteine to alanine substitution at position 53, a phenylalanine to lysine substitution at position 120, a cysteine to alanine substitution at position 131, a cysteine to alanine substitution at position 166, or any combination thereof (numbered with reference to SEQ ID NO: 1).
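The two correspondences stated above (position 57 of SEQ ID NO: 1 to position 247 of full-length APOBEC3G, and position 128 to position 318) both differ by 190. The following minimal Python sketch converts between the two numbering schemes on the assumption that this offset is constant across the whole CTD; the constant, function names, and substitution-label format are illustrative and are not taken from the disclosure.

```python
# Assumption: a constant offset, inferred from 57 -> 247 and 128 -> 318.
CTD_OFFSET = 190


def ctd_to_full_length(position: int) -> int:
    """Convert a CTD position (SEQ ID NO: 1 numbering) to full-length numbering."""
    if position < 1:
        raise ValueError("CTD positions start at 1")
    return position + CTD_OFFSET


def full_length_to_ctd(position: int) -> int:
    """Convert a full-length APOBEC3G position to CTD numbering."""
    ctd_position = position - CTD_OFFSET
    if ctd_position < 1:
        raise ValueError("position lies before the start of the CTD")
    return ctd_position


def rename_substitution(label: str) -> str:
    """Rewrite a label such as 'P57K' (CTD numbering) in full-length numbering."""
    wild_type, position, replacement = label[0], int(label[1:-1]), label[-1]
    return f"{wild_type}{ctd_to_full_length(position)}{replacement}"
```

Under this assumption, `rename_substitution("P57K")` yields the full-length label `P247K`, matching the P247K/P57K pairing recited above.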

Except where indicated, the residue numbers recited herein are based on the CTD of SEQ ID NO: 1, SEQ ID NO: 2 or SEQ ID NO: 6. The corresponding positions in full-length APOBEC3G are provided in the table below.

In particular examples, the APOBEC3G polypeptide further includes up to 1, up to 2, up to 3, up to 4, up to 5, up to 6, up to 7, up to 8, up to 9, or up to 10 additional amino acid substitutions, deletions or additions so long as the polypeptide retains its catalytic activity and retains a P57K point mutation and the ability to bind ssDNA. In particular examples, the APOBEC3G polypeptide further includes 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 additional conservative amino acid substitutions.

In one non-limiting example, the amino acid sequence of the APOBEC3G polypeptide comprises or consists of SEQ ID NO: 2. In another non-limiting example, the amino acid sequence of the APOBEC3G polypeptide comprises or consists of SEQ ID NO: 6.

Also provided are nucleic acid molecules encoding the modified APOBEC3G polypeptides disclosed herein. For example, using the native APOBEC3G sequence shown in SEQ ID NO: 5, one skilled in the art can generate nucleic acid sequences encoding any modified APOBEC3G polypeptides disclosed herein. In some embodiments, the nucleic acid molecule is operably linked to a promoter, such as a heterologous (i.e., non-native) promoter. Vectors, such as a plasmid or viral vector, that include the nucleic acid molecules disclosed herein are further provided.

Also provided herein are fusion proteins that include a modified APOBEC3G polypeptide disclosed herein and a heterologous protein. In some embodiments, the heterologous protein is a protein tag, such as an affinity tag (for example, chitin binding protein, maltose binding protein, glutathione-S-transferase or poly-His), an epitope tag (for example, V5, c-myc, HA or FLAG) or a fluorescent tag (e.g., GFP or another fluorescent protein). In other embodiments, the heterologous protein is a CRISPR-associated protein 9 (Cas9) polypeptide. In some examples, the Cas9 polypeptide is catalytically inactive. In some examples, the Cas9 polypeptide is a Streptococcus species Cas9 polypeptide or catalytically inactive variant thereof. In particular examples, the Streptococcus species is Streptococcus pyogenes, Streptococcus thermophilus, Streptococcus dysgalactiae, Streptococcus canis, Streptococcus mutans, Streptococcus agalactiae or Staphylococcus aureus. In other particular examples, the Cas9 polypeptide is a Cas9 polypeptide from Neisseria meningitidis, Treponema denticola or Campylobacter jejuni.

In some embodiments, the fusion protein includes a linker separating the APOBEC3G polypeptide and the Cas9 polypeptide. In some examples, the linker is about 1 to about 30 amino acids in length, such as about 5 to about 20, or about 10 to about 15 amino acids in length. In specific examples, the linker includes Gly, Ser, or combinations thereof, such as (GGGGS)n (SEQ ID NO: 7) or (G)n. The linker may also include (EAAAK)n (SEQ ID NO: 8), or a cleavable linker. In some examples, the Cas9 polypeptide is C-terminal to the APOBEC3G polypeptide. In other examples, the Cas9 polypeptide is N-terminal to the APOBEC3G polypeptide.

Further provided are nucleic acid molecules encoding a fusion protein disclosed herein. In some embodiments, the nucleic acid encoding the fusion protein is operably linked to a promoter, such as a heterologous promoter. Also provided are vectors, such as a plasmid or viral vector, that include the nucleic acid molecules, and isolated cells that include the disclosed vectors.

Further provided is a method of inhibiting human immunodeficiency virus (HIV) replication in a cell infected with HIV, such as HIV-1. In some embodiments, the method includes contacting the cell with a nucleic acid molecule or vector encoding a modified APOBEC3G polypeptide disclosed herein. The cell can be, for example, a T lymphocyte. In some examples, the method is an in vitro method. In other examples, the method is an in vivo method and contacting the cell with the vector comprises administering the vector to a subject infected with HIV, such as HIV-1. In some examples, the vector is injected into the subject, such as i.v., s.c., or i.m. In specific examples, the method includes selecting a subject with an HIV infection, such as an HIV-1 infection. Thus, in some examples, the method can include screening a subject to determine if they are infected with HIV-1, for example by detecting HIV-1 nucleic acid molecules (such as HIV-I GAG, HIV-II GAG, HIV-env, or HIV-pol gene), proteins (such as p24), or antibodies in a sample (such as blood or a fraction thereof, saliva, or urine) from the subject.

Methods for editing a nucleobase of a target nucleic acid are also provided. In some embodiments, the method includes contacting the target nucleic acid with a fusion protein disclosed herein and a guide sequence, such as a guide RNA (gRNA). Also provided are kits that include a disclosed polypeptide, fusion protein, nucleic acid, vector, isolated cell, or any combination thereof. In some embodiments, the kit further includes a nucleic acid encoding a guide sequence, such as a gRNA.

Compositions are provided that include a disclosed polypeptide, fusion protein, nucleic acid, vector, or isolated cell, and a pharmaceutically acceptable carrier (such as water or physiological saline). In some examples, the composition is a liquid. In some examples, the composition is lyophilized.

III. Reducing Human Immunodeficiency Virus (HIV) Replication

The APOBEC3 proteins are single-stranded DNA cytidine deaminases that inhibit multiple retroelement substrates. Of the seven human APOBEC3 proteins, four (APOBEC3D, 3F, 3G and 3H) inhibit replication of HIV-1. APOBEC3G (A3G) catalyzes deamination in 5'-CC motifs, leading to G to A mutations in GG dinucleotide motifs. In the context of an HIV-1 infection, A3G is able to suppress virus replication by packaging into assembling viral particles, where A3G deaminates cytosine to uracil in newly transcribed viral DNA. Uracil residues of the viral cDNA pair with adenine during plus-strand synthesis and produce G-to-A hypermutation, which ultimately leads to inactivation of the viral genome (Sheehy et al., Nature 418:646-650, 2002; Malim, Philos Trans R Soc Lond B Biol Sci 364:675-687, 2009).
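The strand logic described above can be sketched in a few lines of Python: a cytosine in a 5'-CC context on the minus strand is converted to uracil, and the plus strand copied from that template carries an A where a G would otherwise have been. This is a simplified illustration only; it assumes that the 3' cytosine of every CC dinucleotide is edited (real editing is partial and context dependent), and all function names are hypothetical.

```python
# Base pairing used when the plus strand is copied from the minus strand.
# U pairs with A, which is what turns a C-to-U edit into a G-to-A change.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def deaminate_cc(minus_strand: str) -> str:
    """Convert the 3' C of every 5'-CC dinucleotide to U (assumed exhaustive)."""
    bases = list(minus_strand)
    for i in range(1, len(minus_strand)):
        # Context is read from the unedited input sequence.
        if minus_strand[i] == "C" and minus_strand[i - 1] == "C":
            bases[i] = "U"
    return "".join(bases)


def plus_strand(minus_strand: str) -> str:
    """Return the plus strand (5' to 3') templated by the given minus strand."""
    return "".join(COMPLEMENT[base] for base in reversed(minus_strand))


def g_to_a_positions(minus_strand: str) -> list:
    """Plus-strand indices where G becomes A after minus-strand deamination."""
    before = plus_strand(minus_strand)
    after = plus_strand(deaminate_cc(minus_strand))
    return [i for i, (x, y) in enumerate(zip(before, after)) if x == "G" and y == "A"]
```

In this sketch, each edited cytosine on the minus strand maps to the 5' G of a GG dinucleotide on the plus strand, consistent with the GG-context G-to-A pattern described above.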

Accordingly, the catalytically hyperactive variants of A3G disclosed herein can be used in methods to reduce or inhibit HIV-1 replication and/or in methods of treating an HIV-1 infection in a subject. Provided are in vitro methods of reducing or inhibiting HIV-1 replication in an infected cell by contacting the cell with a modified APOBEC3G polypeptide, a nucleic acid encoding a modified APOBEC3G polypeptide, or a vector comprising the APOBEC3G-encoding nucleic acid. In some examples, HIV-1 replication is reduced by at least 20%, at least 50%, at least 75%, at least 90%, at least 95%, at least 98%, at least 99%, or even 100%, for example as compared to no treatment with the disclosed methods. The method is performed under conditions that allow the APOBEC3G polypeptides to contact viral DNA, such as during reverse transcription of viral RNA. In some embodiments, the APOBEC3G polypeptides are directly contacted with infected cells. For example, the APOBEC3G polypeptide can be fused to a cell-penetrating peptide to permit entry into the infected cell. In other embodiments, the APOBEC3G polypeptides are expressed inside infected cells following transfection of a nucleic acid molecule (such as a vector) encoding the polypeptide.

Also provided are in vivo methods of inhibiting HIV-1 replication and/or treating an HIV-1 infection in a subject by administering to the subject a modified APOBEC3G polypeptide, a nucleic acid encoding a modified APOBEC3G polypeptide, or a vector comprising the APOBEC3G-encoding nucleic acid. In some embodiments, the subject is treated with (e.g., administered) one or more additional HIV-1 therapeutic agents, such as a nucleoside/nucleotide reverse transcriptase inhibitor (NRTI), a nonnucleoside reverse transcriptase inhibitor (NNRTI), a protease inhibitor, an integrase strand transfer inhibitor (INSTI), or any combination thereof.

Exemplary NRTIs that can be used in combination with the disclosed methods include, for example, Emtriva® (emtricitabine), Epivir® (3TC, lamivudine), Retrovir® (AZT, zidovudine), Videx-EC® (ddI, didanosine), Viread® (tenofovir DF), Zerit® (d4T, stavudine) and Ziagen® (abacavir). Exemplary NNRTIs include, but are not limited to, Edurant® (rilpivirine), Intelence® (etravirine), Rescriptor® (delavirdine), Sustiva® (efavirenz) and Viramune® (nevirapine). Examples of protease inhibitors for the treatment of HIV include amprenavir (Agenerase), atazanavir (Reyataz), darunavir (Prezista), fosamprenavir (Telzir, Lexiva), indinavir (Crixivan), lopinavir/ritonavir (Kaletra, Aluvia), nelfinavir (Viracept), ritonavir (Norvir), saquinavir (Invirase) and tipranavir (Aptivus). Exemplary INSTIs that can be used in combination with the disclosed methods include, but are not limited to, Isentress® (raltegravir), Tivicay® (dolutegravir) and Vitekta® (elvitegravir).

IV. Nucleobase Editing

The catalytically hyperactive variants of the human APOBEC3G protein disclosed herein (e.g., CTD2 or CTD2-V) can be used as a tool to edit target nucleic acid sequences, such as human genes, in combination with a gene editing system, such as the CRISPR/Cas9 system. Another member of the APOBEC protein family, APOBEC1, has been tested in the CRISPR/Cas9 system and was shown to convert a cytidine to a uridine in a target sequence without the need for inducing double-stranded DNA cleavage (see Komor et al., Nature 533(7603): 420-424, 2016; and U.S. Application Publication No. 2017/0121693). CTD2 is selective to a specific target DNA sequence, soluble, and catalytically hyperactive, which allows this APOBEC3G variant to be used in gene editing systems. Furthermore, the DNA sequence specificity of CTD2 can be modulated by designing specific substitutions, such as the substitutions found in CTD2-V (SEQ ID NO: 6). Thus, the APOBEC3G variants disclosed herein can be used to correct point mutations that are relevant to human or veterinary (e.g., dog, cat, or horse) disease.

Provided herein is a method for editing a nucleobase of a target nucleic acid by contacting the target nucleic acid with a fusion protein that includes a variant APOBEC3G disclosed herein and a Cas9 polypeptide. The fusion protein is administered with a guide sequence, such as a gRNA. In some embodiments, the target nucleic acid sequence, such as a target DNA sequence, is associated with a disease or disorder. For example, the target nucleic acid sequence can include a point mutation associated with a disease or disorder. In some embodiments, the activity of the fusion protein results in a correction of the point mutation. In some embodiments, the target nucleic acid sequence comprises a T to C point mutation associated with a disease or disorder, and the deamination of the mutant C base results in a sequence that is not associated with a disease or disorder. In some embodiments, the target nucleic acid sequence encodes a protein and the point mutation is in a codon and results in a change in the amino acid encoded by the mutant codon as compared to the wild-type codon. In some embodiments, the deamination of the mutant C results in a change of the amino acid encoded by the mutant codon. In some embodiments, the deamination of the mutant C results in the codon encoding the wild-type amino acid.

In some embodiments, the disclosed method is an in vivo method and contacting the target nucleic acid comprises administering the fusion protein (or the encoding nucleic acid) to a subject. In some embodiments, the subject has a disease or disorder, such as, but not limited to, cystic fibrosis, phenylketonuria, epidermolytic hyperkeratosis (EHK), Charcot-Marie-Tooth disease type 4J, neuroblastoma (NB), von Willebrand disease (vWD), myotonia congenita, hereditary renal amyloidosis, dilated cardiomyopathy (DCM), hereditary lymphedema, familial Alzheimer's disease, HIV, prion disease, chronic infantile neurologic cutaneous articular syndrome (CINCA), desmin-related myopathy (DRM), or a neoplastic disease associated with a mutant PI3KCA protein, a mutant CTNNB1 protein, a mutant HRAS protein, or a mutant p53 protein.

In some embodiments, the fusion protein is used to introduce a point mutation into a nucleic acid by deaminating a target nucleobase, such as a C residue. In some examples, the deamination of the target nucleobase results in the correction of a genetic defect, such as by correction of a point mutation that leads to a loss of function in a gene product. In some embodiments, the genetic defect is associated with a disease or disorder, for example, a lysosomal storage disease or a metabolic disorder, such as type I diabetes. In some embodiments, the methods provided herein are used to introduce a deactivating point mutation into a gene or allele that encodes a gene product that is associated with a disease or disorder. For example, in some embodiments, methods are provided herein that employ a fusion protein to introduce a deactivating point mutation into an oncogene, for example, in the treatment of a proliferative disease. A deactivating mutation may, in some embodiments, generate a premature stop codon in a coding sequence, which results in the expression of a truncated gene product, for example a truncated protein lacking the function of the full-length protein. In some embodiments, the guide sequence is RNA. In other embodiments, the guide sequence is DNA. In some examples, the guide nucleic acid includes modified bases or chemical modifications (see, for example, Latorre et al., Angewandte Chemie 55:3548-50, 2016). In some embodiments, the degree of complementarity between a guide sequence and its corresponding target sequence is about or more than about 50%, 60%, 75%, 80%, 85%, 90%, 95%, 97.5%, 99%, or more. In some embodiments, the guide sequence is about, or at least about 5, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 75, or more nucleotides in length. In some embodiments, a guide sequence is less than about 75, 50, 45, 40, 35, 30, 25, 20, 15, 12, or fewer nucleotides in length. In some embodiments, a guide sequence is 15-25 nucleotides, such as 18-22 nucleotides, or 18 nucleotides.
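The premature stop codons mentioned above can be illustrated with a short Python sketch that enumerates which sense codons become stop codons when a single C on the coding strand is read as T after deamination. It uses the standard genetic code stop codons (TAA, TAG, TGA); restricting the search to coding-strand edits is an assumption made to keep the example small, and the function name is hypothetical.

```python
from itertools import product

# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_creating_edits() -> dict:
    """Map each sense codon to the stop codon produced by one C-to-T change.

    Only single C-to-T changes on the coding strand are considered; edits on
    the opposite strand (seen as G-to-A on the coding strand) are not covered.
    """
    edits = {}
    for codon in ("".join(bases) for bases in product("ACGT", repeat=3)):
        if codon in STOP_CODONS:
            continue
        for i, base in enumerate(codon):
            if base != "C":
                continue
            edited = codon[:i] + "T" + codon[i + 1:]
            if edited in STOP_CODONS:
                edits[codon] = edited
    return edits
```

Under these assumptions the enumeration returns three codons: CAA and CAG (glutamine) and CGA (arginine), which become TAA, TAG and TGA, respectively.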

The following examples are provided to illustrate certain particular features and/or embodiments. These examples should not be construed to limit the disclosure to the particular features or embodiments described.

EXAMPLES

The Examples below describe the co-crystal structure of the A3G-CTD and ssDNA at 1.86 Å resolution. To overcome the weak DNA-binding affinity of A3G-CTD, a catalytically enhanced variant of A3G-CTD that binds ssDNA more strongly than wild-type was generated. This A3G-CTD variant was co-crystallized with a 9-nucleotide ssDNA containing a 5'-TCCCA target sequence, with all nine nucleotides well resolved in the structure. The nucleotides within the 5'-TCCCA target sequence show numerous interactions with the protein, explaining the nucleotide specificity preferences.

Furthermore, the backbone architecture of the protein changed upon ssDNA binding, enabling the target sequence to fit. These results provide mechanisms by which APOBEC3s recognize their specific substrate sequences.

Example 1: Methods

This example describes the materials and experimental procedures used for the studies described in Example 2.

Protein expression and purification

The CTD2 variant of human APOBEC3G C-terminal domain (residues 191-384) and its inactive variant CTD2* were expressed in Escherichia coli BL21(DE3) cells (Invitrogen), either from the pGEX6P-1 expression plasmid with a glutathione S-transferase (GST) tag for GST purification (crystallography, EMSA, and real-time NMR deamination assay) or from the pET-28a plasmid with a poly-histidine tag for Ni-NTA purification (Microscale Thermophoresis assay). Cells were grown in LB media at 37°C until reaching an optical density of 0.5-0.6 at 600 nm. The temperature was then reduced to 17°C and protein expression was induced for 18 hours with 0.2 mM isopropyl β-D-1-thiogalactopyranoside (IPTG).

All steps for protein purification were performed at 4°C. E. coli cells were harvested by centrifugation and re-suspended in lysis buffer (either 50 mM sodium phosphate pH 7.3, 150 mM NaCl, 25 mM ZnCl2, 2 mM DTT and 0.002% Tween-20 for GST purification or 50 mM sodium phosphate pH 7.3, 150 mM NaCl, 50 mM ZnCl2, 1 mM DTT, and 0.002% Tween-20 for Ni-NTA purification) and EDTA-free protease inhibitor cocktail (Roche, Basel, Switzerland). The suspended cells were disrupted by sonication and cell debris was then separated by centrifugation at 48,384 g for 30 min. Supernatant containing the desired protein was applied to either Glutathione-Sepharose® resin (GE Healthcare Life Science) for GST purification or Ni-NTA Agarose resin (QIAGEN) for Ni-NTA purification, equilibrated with lysis buffer, and agitated for about 2 hours.

For GST purification, protein-bound resin was washed with Pre-Scission Protease cleavage buffer (50 mM sodium phosphate, pH 7.5, 100 mM NaCl, 10 µM ZnCl2, 2 mM DTT and 0.002% Tween-20) and incubated with Pre-Scission protease (GE Healthcare Life Science) for 18 hours. The supernatant containing the cleaved protein was separated from the resin by centrifugation and loaded onto a HiLoad® 16/600 Superdex® 75 gel filtration column (GE Healthcare Life Science) equilibrated with 20 mM Bis-Tris (pH 6.5), 100 mM NaCl, 1 mM DTT, 0.01 mM ZnCl2 and 0.002% Tween-20.

For Ni-NTA purification, protein-bound resin was washed with 50 mM sodium phosphate, pH 7.3, 1 M NaCl, 25 µM ZnCl2, 1 or 2 mM DTT and 0.002% Tween-20. Protein was eluted from the resin in buffer containing 400 mM imidazole, 50 mM sodium phosphate, pH 7.3, 100 mM NaCl, 1 mM DTT, and 0.002% Tween-20. Eluted protein was loaded onto a HiLoad® 16/600 Superdex® 75 gel filtration column equilibrated with 20 mM Bis-Tris pH 6.5, 100 mM NaCl, 1 mM DTT, 0.002% Tween-20, and 20 mM ZnCl2. For both GST and Ni-NTA purification, protein purity was analyzed by SDS-PAGE.

Crystal Growth and data collection

Samples used for crystallization contained about 9.5 mg/ml (415 µM) CTD2* and a 50% molar excess of ssDNA in 20 mM Bis-Tris pH 6.5, 100 mM NaCl, 1 mM DTT, 10 µM ZnCl2 and 0.002% Tween-20. The 9-nucleotide ssDNA, 5'-AATCCCAAA, was obtained from Integrated DNA Technologies (IDT; Coralville, IA). The initial crystallization condition was identified using JBScreen Nuc-Pro from MiTeGen. Crystals were grown at 4°C by the sitting drop vapor diffusion method over a 65 µl reservoir of 20% w/v PEG 6000, 50 mM di-sodium L-malate, pH 5.0, and 30 mM CaCl2 in a sitting drop 2-well crystallization plate from Molecular Dimension. Drops were set up by mixing 0.3 µl of CTD2*-ssDNA complex and 0.3 µl of reservoir solution using a robot, Mosquito Crystal from TTP Labtech. Crystals appeared after one week. Crystals grown at 4°C melted at room temperature, and an otherwise identical crystallization setup at 20°C did not produce any crystals.
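As a quick consistency check on the sample concentrations given above, the following Python sketch converts between mass and molar concentration. It reads the stated value of 415 as micromolar, which is what 9.5 mg/ml implies for a protein of roughly 23 kDa, and it interprets a "50% molar excess" of ssDNA as 1.5 times the protein concentration; both readings, and the function names, are assumptions made for illustration.

```python
def implied_molecular_weight(mg_per_ml: float, micromolar: float) -> float:
    """Molecular weight (g/mol) implied by a mass and a molar concentration.

    mg/ml is numerically equal to g/L, so MW = (g/L) / (mol/L).
    """
    if micromolar <= 0:
        raise ValueError("molar concentration must be positive")
    return mg_per_ml / (micromolar * 1e-6)


def micromolar_from_mass(mg_per_ml: float, molecular_weight: float) -> float:
    """Molar concentration in micromolar for a given mass concentration."""
    if molecular_weight <= 0:
        raise ValueError("molecular weight must be positive")
    return mg_per_ml / molecular_weight * 1e6


def with_molar_excess(micromolar: float, excess_fraction: float) -> float:
    """Concentration of a partner present at the given fractional molar excess."""
    return micromolar * (1.0 + excess_fraction)
```

With these assumptions, 9.5 mg/ml at 415 µM corresponds to a molecular weight of about 22.9 kDa, and a 50% molar excess of ssDNA corresponds to about 622.5 µM.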

Crystals were cryoprotected using reservoir solution containing 20% v/v glycerol and flash frozen in liquid nitrogen. X-ray diffraction data were collected at the Southeast Regional Collaborative Access Team (SER-CAT) 22-ID beamline at the Advanced Photon Source, Argonne National Laboratory. The crystals belong to the space group P2₁. The collected intensities were indexed, integrated and scaled using HKL2000 (Otwinowski and Minor, Methods Enzymol 276, 307-26, 1997).

Structure determination and analysis

The structure was solved at 1.86 Å resolution by molecular replacement using the program Phaser (Bunkoczi et al., Acta Crystallogr D Biol Crystallogr 69, 2276-86, 2013) and a previously determined structure of A3G-2K3A (PDB ID code 3IR2, chain B was removed) as search model (Shandilya et al., Structure 18, 28-38, 2010). Model building of the protein and bound DNA and refinements were manually performed using the programs Coot (Emsley and Cowtan, Acta Crystallogr D Biol Crystallogr 60, 2126-32, 2004) and Phenix (Adams et al., Methods 55, 94-106, 2011; Echols et al., J Appl Crystallogr 45, 581-586, 2012), respectively. The first 3 residues (Glu-Ile-Leu) and the last residue (Asn) were not modeled due to lack of electron density. Due to the presence of extra positive density, Ser-368 and Ser-372 were modeled in two alternative conformations. The final model was refined to Rwork/Rfree values of 0.18/0.21 and was validated with the PDB validation tool and Molprobity (Chen et al., Acta Crystallogr D Biol Crystallogr 66, 12-21, 2010). Ramachandran analysis placed 97.87% of the residues in the favored regions and 2.13% in the allowed regions. None of the residues were found in disallowed regions. Pairwise root mean square (rms) deviations between CTD-2K3A and ssDNA-bound CTD2* and between ssDNA-bound A3A and ssDNA-bound CTD2* were calculated using Dali (Labarga et al., Nucleic Acids Research 35, W6-W11, 2007). Figures of structure models were generated by Pymol (Schrodinger, LLC. The PyMOL Molecular Graphics System, Version 1.8., 2015).

Real-time NMR deamination assay

Initial rates of the deamination reaction were determined using ¹H nuclear magnetic resonance (NMR) spectra as previously reported (Harjes et al., J Virol 87, 7008-14, 2013; Furukawa et al., Embo J 28, 440-51, 2009). A 9nt ssDNA substrate (IDT), 5'-AATCCCAAA, was used to determine the reaction rate. NMR spectra were acquired at 25°C on Bruker NMR spectrometers operating at ¹H Larmor frequencies of 600 MHz and 800 MHz. NMR samples contained 5% deuterium oxide with 200 nM protein, 200 µM ssDNA substrate, 100 mM NaCl, 0.002% Tween20, 1 mM DTT, 10 µM ZnCl₂ and 50 mM sodium phosphate adjusted to pH 7.5. The concentration of the deamination product (5'-AATCCdeoxy-UAAA) was determined from integration of the H5 uracil proton peaks at 5.60 ppm as described previously (Harjes et al., J Virol 87, 7008-14, 2013). A series of ¹H spectra was measured, and the product concentrations as a function of reaction time were used to determine the initial rate via linear regression.
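The regression step described above can be sketched as follows. This is a minimal illustration, not the analysis actually used: the function names and the example time course are hypothetical, and dividing the slope by the enzyme concentration to obtain a per-enzyme rate is an assumption about how the reported rates are normalized.

```python
def initial_rate(times_min, product_uM):
    """Return the least-squares slope of product concentration vs. time."""
    n = len(times_min)
    if n != len(product_uM) or n < 2:
        raise ValueError("need at least two paired time/concentration points")
    mean_t = sum(times_min) / n
    mean_p = sum(product_uM) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (p - mean_p)
              for t, p in zip(times_min, product_uM))
    return sxy / sxx  # uM of product formed per minute


def per_enzyme_rate(rate_uM_per_min, enzyme_uM):
    """Normalize the slope by enzyme concentration (assumed convention)."""
    return rate_uM_per_min / enzyme_uM


# Hypothetical early time points, chosen only to illustrate the calculation.
times = [0.0, 5.0, 10.0, 15.0]      # minutes
product = [0.0, 7.0, 14.0, 21.0]    # uM of deaminated product
slope = initial_rate(times, product)
print(slope, per_enzyme_rate(slope, 0.2))  # 0.2 uM corresponds to 200 nM protein
```

Only the early, approximately linear part of the time course should be passed to such a fit, since the slope underestimates the initial rate once substrate depletion becomes noticeable.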

EMSA

For EMSA, a 9nt ssDNA (IDT) with 6-FAM at the 3' end (5'-AATCCCAAA-FAM), in binding buffer (20 mM Bis-Tris pH 6.5, 100 mM NaCl, 1 mM DTT, 10 µM ZnCl₂ and 0.002% Tween 20 detergent), was used. To confirm specific DNA binding, competition assays were performed with fluorescently unlabeled ssDNAs, including 5'-AAAAAAAAA and 5'-AATCCCAAA as a non-specific and a specific ssDNA, respectively. Binding reactions were performed in 50 µL by mixing 10 nM 6-FAM-labeled ssDNA (5'-AATCCCAAA-FAM) with 15 µM CTD2*. A competitor ssDNA was added in incremental amounts: 0, 25, 250 and 2500 nM. Reaction mixtures were incubated for 1 hour at room temperature. Samples (10 µl) were mixed with Novex Hi-Density TBE sample buffer (5X loading dye from Invitrogen), loaded onto a 4-12% precast TBE gel (Invitrogen) and run with 0.5X TBE buffer for 60 minutes at 100 V at 4°C. Gels were imaged using a Typhoon imager (GE Healthcare Life Sciences) in the blue-excitation (488 nm) fluorescence mode.

Microscale Thermophoresis assay (MST)

The binding affinity of purified CTD2* (SEQ ID NO: 3) for the 9nt ssDNA (IDT), 5'-AATCCCAAA, was measured using a Monolith NT.115 (NanoTemper Technologies GmbH, Munich, Germany) (Wienken et al., Nat Commun 1, 100, 2010). RED-tris-NTA fluorescent dye solution was prepared at 100 nM in the MST buffer (20 mM Bis-Tris pH 6.5, 100 mM NaCl, 1 mM DTT, 0.002% Tween20, 20 µM ZnCl₂). CTD2* was mixed with dye at a final concentration of 100 nM and incubated for 30 min at room temperature, followed by centrifugation at 15,000 g for 10 min. The ssDNA was prepared to a stock concentration of 8 mM in the MST buffer. To determine the binding affinity, 10 µl of ssDNA solution at 16 different concentrations, ranging from 8 mM to 0.244 µM, were prepared in LOBIND™ centrifuge tubes (Fisher Scientific), then 10 µl of fluorescently labeled CTD2* solution (100 nM) was added to each tube. The mixtures were incubated at 4°C to reach equilibrium. Each incubated solution was loaded into a NanoTemper MST premium coated capillary. The measurement was performed at room temperature using 40% LED power and 20% MST power. The experiment was repeated three times and data analysis was carried out using NanoTemper analysis software (MO affinity).
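The ligand titration can be illustrated with a short calculation. The text does not state the dilution factor; the sketch below assumes a two-fold serial dilution, under which sixteen points starting from the 8 mM stock end at roughly 0.244 µM. The function name is hypothetical.

```python
def twofold_series(top_concentration, n_points):
    """Concentrations obtained by halving the top concentration repeatedly."""
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    return [top_concentration / (2 ** i) for i in range(n_points)]


stock_mM = 8.0
series_mM = twofold_series(stock_mM, 16)   # assumed two-fold steps
lowest_uM = series_mM[-1] * 1000.0         # convert mM to uM

# Mixing 10 ul of ssDNA with 10 ul of protein halves every ssDNA concentration.
final_mM = [c / 2.0 for c in series_mM]

print(round(lowest_uM, 3), final_mM[0])
```

Under the same equal-volume mixing, the labeled protein is also diluted two-fold, from 100 nM to 50 nM in each capillary.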

HIV-1 restriction assay

Plasmid construction: The plasmids were designated with a "p" while the names of viruses and proviruses generated from these plasmids are not. pHCMV-G expresses the G glycoprotein of vesicular stomatitis virus (VSV-G) (Yee et al., Proc Natl Acad Sci U S A 91, 9564-8, 1994). pHDV-EGFP is an HIV-1 derived vector that expresses HIV-1 Gag-Pol and enhanced green fluorescent protein (EGFP) but does not express Env, Vif, Vpr, Vpu, or Nef (Unutmaz et al., J Exp Med 189, 1735-46, 1999). pVif-HA is a codon-optimized HIV-1 Vif expressing a C-terminal HA epitope tag (Smith et al., J Virol 88, 9893-908, 2014). pFLAG-A3G expresses wild-type A3G with an N-terminal FLAG epitope tag (Russell and Pathak, J Virol 81, 8201-10, 2007). To create pFLAG-NTD-CTD2, unique EcoRI and XbaI cloning sites from pFLAG-A3G were used to subclone a codon-optimized A3G containing solubility mutations 2K3A (L234K+C243A+F310K+C321A+C356A) (Chen et al., Nature 452, 116-9, 2008) and mutations P199A, P200A, N236A, P247K, Q318K, and Q322A. To create pFLAG-A3G-E259A and pFLAG-NTD-CTD2-E259A, plasmids pFLAG-A3G and pFLAG-NTD-CTD2 were subjected to site-directed mutagenesis to introduce E259A (QuikChange Lightning Site-Directed Mutagenesis Kit, Agilent Technologies). All final plasmids were confirmed by sequencing (Macrogen).

Tissue culture and cell lines: Human embryonic kidney 293T cells (American Type Culture Collection) and TZM-bl cells (obtained through the NIH AIDS Reagent Program [Cat. No. 8129], Division of AIDS, NIAID, NIH; Wei et al., Antimicrob Agents Chemother 46, 1896-905, 2002) were grown in Dulbecco's modified Eagle's medium (DMEM) supplemented with 10% fetal calf serum (HyClone) and 1% penicillin-streptomycin stock (penicillin 50 U/ml and streptomycin 50 µg/ml, final concentration; Gibco). TZM-bl cells contain an HIV-1 tat-inducible luciferase reporter gene whose expression correlates with HIV-1 infectivity. All cells were maintained in humidified 37°C incubators with 5% CO2.

Transfection, virus production and single-cycle infection assays: All transfections were performed using LT1 reagent (Mirus) according to the manufacturer's instructions. To generate virus for infection, in brief, 293T cells were transfected with pHDV-EGFP (1 µg), with or without pVif-HA (2.5 µg), pHCMV-G (0.25 µg), and variable amounts of pFLAG-A3G or pFLAG-NTD-CTD2 (21, 42, 84, 170 or 340 ng). To maintain equivalent DNA amounts in the transfection mix, pcDNA3.1 was substituted as needed. Forty-eight hours post transfection, virus was harvested, filtered with 0.45-µm filters, and stored at -80°C. Capsid p24 was measured using the HIV-1 p24 ELISA Kit (XpressBio) according to the manufacturer's instructions. Virus normalized by p24 was used to infect 4000 TZM-bl cells in a 96-well plate, and 48 hours post infection, luciferase activity was measured using a 96-well luminometer (LUMIstar Galaxy, BMG LABTECH). Data were plotted as the percent inhibition of luciferase activity versus the no APOBEC3G control. For some experiments, portions of the viral supernatant were spun through a 20% sucrose cushion (15,000 rpm, 2 hours, 4°C, in a Sorvall WX80+ ultracentrifuge), concentrated 10-fold and used to determine virion encapsidation of FLAG-A3G and FLAG-NTD-CTD2 by western blotting analysis.
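The percent-inhibition readout described for the single-cycle infection assay can be written out explicitly. The sketch below is illustrative only: the function name and the luciferase readings are hypothetical, and it assumes that any background signal has already been subtracted.

```python
def percent_inhibition(luciferase_sample, luciferase_no_a3g):
    """Percent loss of luciferase signal relative to the no-APOBEC3G control."""
    if luciferase_no_a3g <= 0:
        raise ValueError("control signal must be positive")
    return 100.0 * (1.0 - luciferase_sample / luciferase_no_a3g)


# Hypothetical relative light units for virus made with increasing plasmid amounts.
control_rlu = 100000.0
sample_rlu = {21: 60000.0, 84: 25000.0, 340: 5000.0}  # ng plasmid -> signal

for nanograms, rlu in sorted(sample_rlu.items()):
    print(nanograms, round(percent_inhibition(rlu, control_rlu), 1))
```

A value of 0 corresponds to infectivity equal to the control, and a value of 100 corresponds to complete loss of reporter signal.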

Western blot detection: Cell lysates were made using CelLytic M (Sigma) solution containing Protease Inhibitor Cocktail (Roche), followed by a 10 min, 10,000 x g spin to remove cellular debris. The cell lysates were mixed with NuPAGE LDS sample buffer (Invitrogen) containing β-mercaptoethanol and heated for 5 min at 95°C. Samples were analyzed on 4-20% Tris-Glycine Gels (Invitrogen) using standard western blotting techniques. Proteins were detected with primary antibodies as follows: FLAG-A3G or FLAG-NTD-CTD2 (rabbit anti-FLAG polyclonal antibody, 1:5,000 dilution, Sigma); Vif-HA (mouse anti-HA monoclonal antibody, 1:5,000 dilution, Sigma); α-tubulin (mouse anti-α-tubulin antibody, Sigma, 1:10,000 dilution). Antibody against HIV-1 p24 (monoclonal, 1:10,000 dilution) was obtained through the NIH AIDS Reagent Program, Division of AIDS, NIAID, NIH: HIV-1 p24 Gag Monoclonal (#24-3) (Simon et al., J Virol 71, 5259-67, 1997). An IRDye 800CW-labeled goat anti-rabbit secondary antibody (LI-COR) was used at a 1:10,000 dilution to detect rabbit primary antibodies and an IRDye 680-labeled goat anti-mouse secondary antibody (LI-COR) was used at a 1:10,000 dilution to detect mouse primary antibodies. Protein bands were visualized and quantified using an Odyssey Infrared Imaging System (LI-COR).

Data availability

Atomic coordinates and structure factors for the reported crystal structure have been deposited in the Protein Data Bank under the accession number 6BUX.

Example 2: Crystal structure of the catalytic domain of APOBEC3G in complex with ssDNA

This example describes the crystal structure of a catalytically enhanced variant of the APOBEC3G C-terminal domain (A3G-CTD) that binds ssDNA more strongly than wild-type APOBEC3G.

Generation of a hyperactive A3G-CTD variant with higher ssDNA affinity

Design of amino-acid substitutions

Although the C-terminal domain of A3G (SEQ ID NO: 1) is catalytically active in vitro (Furukawa et al., Embo J 28, 440-51, 2009; Harjes et al., J Virol 87, 7008-14, 2013), detecting strong ssDNA-binding or purifying a stable A3G-CTD-ssDNA complex has been challenging. To overcome this challenge, variants that are catalytically more active than the wild-type protein were designed and generated by introducing amino acid substitutions. The rationale was that catalytically hyperactive variants may have increased affinity for the substrate, while retaining the intact structure and catalytic mechanism. A similar strategy worked well to generate a soluble A3G-CTD variant, namely CTD-2K3A, enabling determination of the solution NMR structure (Chen et al., Nature 452, 116-9, 2008). CTD-2K3A contains five amino acid substitutions, L234K, C243A, F310K, C321A and C356A (numbered with reference to SEQ ID NO: 4), and these substitutions do not alter catalytic activity, structure or HIV-1 restriction function, but do increase solubility (Chen et al., Nature 452, 116-9, 2008; Shandilya et al., Structure 18, 28-38, 2010). Starting from CTD-2K3A, five additional substitutions were made: P200A, N236A, P247K, Q318K and Q322A (numbered with reference to SEQ ID NO: 4). To increase the basicity of the region near the active site, lysine residues were introduced in loop 3 and loop 7; P247K and Q318K were chosen as they enhanced catalytic activity. Those lysine substitutions were combined with three other substitutions previously shown to increase catalytic activity (Chen et al., Nature 452, 116-9, 2008). This variant, hereafter called CTD2, spans residues 191-384 of A3G, and contains P200A, L234K, N236A, C243A, P247K, F310K, Q318K, C321A, Q322A and C356A substitutions (SEQ ID NO: 2; FIG. 1A).

CTD2 is catalytically highly active

The initial reaction speed of deamination by CTD2 was compared with that of wild-type A3G-CTD using a real-time nuclear magnetic resonance (NMR) deamination assay previously used for enzyme kinetics analysis of A3G-CTD (Furukawa et al., Embo J 28, 440-51, 2009). The 9-nucleotide ssDNA 5'-AATCCCAAA, which contains the 5'-TCCCA target sequence, was used for the deamination assay. 5'-TCCCA is an optimized target sequence for A3G-CTD (Harjes et al., J Virol 87, 7008-14, 2013), and Yu et al. reported that 5'-CCCA is the preferential deamination sequence found in the minus strand of the HIV-1 genome (Yu et al., Nat Struct Mol Biol 11, 435-42, 2004). Representative NMR spectra illustrate that CTD2 deaminates the 3' C in the 5'-TCCCA target sequence first, then the middle C, whereas the 5' TC was not acted upon by CTD2, as expected based on the preference of wild-type A3G (Harjes et al., J Virol 87, 7008-14, 2013; Furukawa et al., Embo J 28, 440-51, 2009). Indeed, the initial reaction speed was 20 times faster at pH 7.5, with 6.9 ± 0.2 reactions min⁻¹ for CTD2 (red) and 0.3 ± 0.1 reactions min⁻¹ for wild-type A3G-CTD (black) (FIG. 1B). This catalytic activity also increased at the lower pH of 6.5, as was previously observed for A3G-CTD (Harjes et al., J Virol 87, 7008-14, 2013), with 10.8 ± 0.3 reactions min⁻¹ and 0.5 ± 0.1 reactions min⁻¹ for CTD2 and A3G-CTD, respectively, suggesting that CTD2 retains the wild-type catalytic mechanism while increasing the reaction speed.
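The approximately 20-fold rate enhancement can be checked from the reported rates. The sketch below is a worked calculation, not part of the original analysis: it assumes that the stated uncertainties are independent and uses first-order error propagation for a ratio. The function name is hypothetical.

```python
import math


def fold_change(value, value_err, reference, reference_err):
    """Ratio of two rates with first-order propagated uncertainty."""
    ratio = value / reference
    relative_err = math.sqrt((value_err / value) ** 2
                             + (reference_err / reference) ** 2)
    return ratio, ratio * relative_err


# Rates (per minute) as reported in the text for CTD2 and wild-type A3G-CTD.
reported = {
    "pH 7.5": ((6.9, 0.2), (0.3, 0.1)),
    "pH 6.5": ((10.8, 0.3), (0.5, 0.1)),
}

for label, (ctd2, wild_type) in reported.items():
    ratio, err = fold_change(*ctd2, *wild_type)
    print(f"{label}: {ratio:.1f} +/- {err:.1f}-fold")
```

Because the wild-type rates are small relative to their uncertainties, most of the uncertainty in the fold change comes from the denominator.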

CTD2 specifically binds the target sequence with high affinity

Since CTD2 exhibited greater catalytic activity than A3G-CTD (FIG. 1B), tests were performed to determine whether CTD2 also exhibited increased binding affinity for ssDNA, using the catalytically inactive E259A variant of CTD2; this construct is referred to as CTD2*. The 9nt substrate ssDNA 5'-AATCCCAAA was labeled with a 6-FAM modification at the 3'-end for the electrophoretic mobility shift assay (EMSA). Competition with a fluorescently unlabeled non-specific (5'-AAAAAAAAA) or specific (5'-AATCCCAAA) ssDNA showed that CTD2* specifically binds the substrate ssDNA (FIG. 1C). The affinity of CTD2* for the 9nt substrate ssDNA was further investigated using microscale thermophoresis (MST) (Wienken et al., Nat Commun 1, 100, 2010). The dissociation constant, K_d, for CTD2* was determined to be 55 ± 12 µM (FIG. 1D). An EMSA experiment testing A3G-CTD for binding the same substrate ssDNA showed significantly weaker affinity than CTD2*, as only a faint shifted band was observed with 160 µM of A3G-CTD, which is consistent with previously reported observations (Holden et al., Nature 456, 121-124, 2008; Furukawa et al., Embo J 28, 440-51, 2009). CTD2* had significantly less affinity for 5'-TCCdeoxy-UA (the first product) and 5'-TCdeoxy-Udeoxy-UA (final product), as K_d values for 9nt ssDNAs containing these product sequences were determined to be 150 ± 30 µM and 5.2 ± 0.8 mM, respectively. Furthermore, CTD2* was tested for binding a 9nt RNA, 5'-rArArUrCrCrCrArArA, and K_d was determined to be 1.5 ± 0.5 mM. Collectively, EMSA and MST experiments indicated that CTD2* specifically binds the 5'-TCCCA target sequence with high affinity, making this enzyme amenable to structural studies.

CTD2 restricts HIV-1 replication

To test whether full-length A3G containing a wild-type N-terminal domain (NTD) and CTD2 (NTD-CTD2) retained antiviral activity in vivo, HIV virus was prepared with increasing amounts of wild-type FLAG-A3G or FLAG-NTD-CTD2 in the presence or absence of HIV-1 Vif. Both FLAG-A3G and FLAG-NTD-CTD2 were functionally recognized by Vif and degraded, as shown by the absence of A3G protein in the Vif+ lanes (FIG. 2A). As expected, neither FLAG-A3G nor FLAG-NTD-CTD2 blocked the infectivity of virus prepared in the presence of Vif in a single-cycle replication assay (FIG. 2B). However, FLAG-A3G or FLAG-NTD-CTD2 potently inhibited viral infectivity when the HIV virus was prepared in the absence of Vif (FIG. 2C), and to a similar extent when comparing samples with comparable FLAG-A3G or FLAG-NTD-CTD2 protein expression levels in the producer cells (FIG. 2A) (z-test, p > 0.3). Introduction of the E259A mutation into FLAG-NTD-CTD2 (and FLAG-A3G), which removes a critical catalytic glutamate needed for deaminase activity, restored infectivity in the absence of Vif, indicating that inhibition of HIV-1 replication by FLAG-NTD-CTD2 is largely deaminase-dependent. Furthermore, no significant differences in viral infectivity were observed for virions encapsidating similar levels of FLAG-A3G or FLAG-NTD-CTD2 (z-test, p > 0.3). Overall, these data confirm that the full-length NTD-CTD2 blocks HIV-1 replication as potently as wild-type A3G.

Co-crystal structure of CTD2* and ssDNA

The catalytically inactive CTD2 (CTD2*) was co-crystallized with the 9nt ssDNA containing a 5'-TCCCA target sequence (5'-AATCCCAAA). The co-crystal structure of CTD2* and ssDNA was determined to 1.86 Å resolution in the P2₁ space group by molecular replacement using the previously determined structure of apo-CTD-2K3A (PDB ID: 3IR2) (FIG. 3A). The final refinement of the structure resulted in R-work/R-free of 0.18/0.21, respectively (Table 1). There was a single CTD2*-ssDNA complex in the asymmetric unit. The overall protein backbone structure did not change significantly from the backbone structures of apo-CTD-2K3A and A3A bound to ssDNA, as indicated by the pairwise root mean square (rms) deviation, which is 1.6 Å for both pairs: ssDNA-bound CTD2* with apo-CTD-2K3A, and ssDNA-bound CTD2* with ssDNA-bound A3A.
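For reference, a pairwise rms deviation over matched atoms is computed as shown in the sketch below. It is a minimal illustration only: it assumes the two structures have already been superposed and that atoms are paired in the same order, it does not reproduce the alignment program used for the reported values, and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally sized sets of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Hypothetical backbone positions (in angstroms) for two superposed models.
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model_b = [(0.0, 0.0, 1.6), (3.8, 0.0, 1.6), (7.6, 0.0, 1.6)]
print(round(rmsd(model_a, model_b), 2))
```

In practice the superposition itself (for example, a least-squares fit of equivalent backbone atoms) must be performed first, because the value depends on how the two models are aligned.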

All nine nucleotides of ssDNA were well-ordered in the electron density (FIG. 3A). The interface between protein and ssDNA involves all five nucleotides of the 5'-TCCCA target sequence, and approximately 800 Å² of surface area on ssDNA is buried in the interface with CTD2*. This is a significantly larger area than that found in an A3A-ssDNA complex, where approximately 620 Å² of surface area on ssDNA was buried in the protein-DNA interface (Kouno et al., Nat Commun 8, 15024, 2017). Although more extended at both 5'- and 3'-ends, the phosphate backbone of the ssDNA adopted a curved shape that is similar to the shape of ssDNA observed in the co-crystal structures of A3A-ssDNA (Kouno et al., Nat Commun 8, 15024, 2017; Shi et al., Nat Struct Mol Biol 24, 131-139, 2017) and TadA-RNA (Losey et al., Nat Struct Mol Biol 13, 153-9, 2006).

Table 1. Crystal data collection and refinement statistics

*Values in parentheses are for highest-resolution shell.

Interaction with the 5'-TCCCA target sequence

Remarkably, all five nucleotides in the target sequence interact with protein in the co-crystal structure (FIG. 3B). In the target sequence, nucleotides are numbered with the target cytidine at position 0, such as 5'-T₋₃C₋₂C₋₁C₀A₊₁. Protein-DNA interactions for each nucleotide are described in the following sections.

T₋₃ and C₋₂ recognition:

The most remarkable interaction involving T₋₃ is the π-π stacking with W211 (FIG. 3C). T₋₃ also interacts with the following nucleotide C₋₂ by forming a hydrogen bond between the 5'-phosphate group of T₋₃ and the pyrimidine amino group of C₋₂. The Watson-Crick face of T₋₃ does not interact with protein; instead, it forms a base pair with A₊₃ of the ssDNA in a neighboring asymmetric unit.

The nucleobase type of C₋₂ is recognized by the protein, as the Watson-Crick face of C₋₂ forms water-mediated hydrogen bonds between the pyrimidine carbonyl group and the mainchain amino proton of D316 as well as the guanidino group of R374 (FIG. 3C). In addition, the pyrimidine N3 atom of C₋₂ forms a hydrogen bond through an ordered water with the carboxyl group of D316 (FIG. 3C). Furthermore, the C₋₂ pyrimidine ring has a hydrophobic interaction with the indole ring of W211 (FIG. 3C), which creates a spatial restraint favoring a pyrimidine nucleotide in this position.

Sugar puckers of nucleotides play key roles in shaping the structure of DNA and RNA strands. The deoxy-ribose of C₋₂ has a C3'-endo conformation, whereas the other eight nucleotides of the CTD2*-bound ssDNA have the C2'-endo conformation. This is significant because DNA prefers the C2'-endo conformation in aqueous solution, whereas RNA is predominantly in the C3'-endo conformation. The C3'-endo conformation of C₋₂ brings two neighboring backbone phosphorus atoms (which belong to C₋₁ and C₋₂) into close contact (5.8 Å), which is a typical distance for double-stranded RNAs in A-form. This spatial arrangement enables the 5'-phosphate group of C₋₂ to form a hydrogen bond with the guanidino group of R213. Interestingly, R213 is not conserved in other APOBEC3 proteins; however, in AID, which is also a member of the APOBEC deaminase family, the charge is conserved with a lysine, as further described in the Discussion.

C₋₁ recognition:

The Watson-Crick face of C₋₁ has three direct interactions with the protein. The C2 carbonyl group forms a hydrogen bond with the mainchain amino proton of D317, the N3 atom forms a hydrogen bond with the mainchain amino proton of Q318K, and the amino group forms a hydrogen bond with the sidechain carboxyl group of D316 (FIG. 3D). The sidechain of D317 is also coordinated by a hydrogen bond formed between the carboxyl group and the mainchain amino proton of F289. This hydrogen bond stabilizes the helix 3 structure by forming an "N-cap" (Richardson and Richardson, Science 240, 1648-52, 1988) since F289 is located at the N-terminus of the helix. In addition, Q318K may provide further support in orienting D317 by interacting electrostatically, as the ε-ammonium group of Q318K is located within 3.7 Å of the carboxyl group of D317. Furthermore, the 5'-phosphate group of C₋₁ is supported by two water-mediated hydrogen bonds with the NE2 atom of H216 and the mainchain amino proton of R215 (FIG. 3D).

C₀ recognition:

Electron density was observed that fits a zinc ion (Zn²⁺) chelated by H257, C288 and C291, together with additional density that fits a water molecule. The target cytosine (C₀) is tightly packed under the Zn²⁺ ion by stacking its aromatic ring with the Zn²⁺-chelating residue H257 and forming a T-shaped π-π interaction with Y315 (FIG. 3E). In addition to these π-π interactions, many hydrogen bonds support the position of the target cytosine, including aromatic ring O2 to the mainchain amino proton of A258, and aromatic ring N3 to the mainchain amino proton of E259A through an ordered water molecule. Furthermore, the deoxy-ribose O3' and O4' atoms form hydrogen bonds with the sidechain amino group of N244 and the hydroxyl group of T218, respectively, which supports the C2'-endo conformation of the deoxy-ribose of C₀. The 5'-phosphate group of C₀ is well-coordinated by interactions with the protein, as it forms hydrogen bonds with the hydroxyl group of Y315 and the ND1 atom of H216. Two hydrogen bonds provide key recognition of the amino group of C₀, including one formed with the mainchain carbonyl group of S286, and another formed with the water molecule coordinated by Zn²⁺. This Zn²⁺-bound water molecule is the key molecule that triggers deamination by attacking the C4 position of cytosine (Betts et al., J Mol Biol 235, 635-56, 1994; Xiang et al., Biochemistry 34, 4516-23, 1995; Xiang et al., Biochemistry 36, 4768-74, 1997). These C₀-interacting residues are conserved in all APOBEC3 proteins (except that A3F has a serine instead of T218), and similar interactions have been observed in A3A-ssDNA complexes (Kouno et al., Nat Commun 8, 15024, 2017; Shi et al., Nat Struct Mol Biol 24, 131-139, 2017). Although the CTD2*-ssDNA complex showed that the C2'-endo sugar conformation of the target cytidine is required to fit in the catalytic pocket, the mechanism by which A3G excludes RNA from deamination is not fully understood, because the 2' hydrogen of the target cytidine can be replaced with a hydroxyl group without significant steric hindrance in the CTD2*-ssDNA complex structure.

A₊₁ recognition:

The purine ring of A₊₁ stacks against the H216 imidazole ring (FIG. 3F). Since histidine forms stronger π-π stacking with a purine ring than with a pyrimidine ring (Churchill and Wetmore, J Phys Chem B 113, 16046-58, 2009), this interaction selects purines in the +1 position rather than pyrimidines, providing an explanation for why 5'-CCCA is the preferential deamination sequence (Yu et al., Nat Struct Mol Biol 11, 435-42, 2004; Harjes et al., J Virol 87, 7008-14, 2013). In addition, the 5'-phosphate group forms water-mediated hydrogen bonds with the mainchain carbonyl group of H216 and the sidechain amino group of N244. The Watson-Crick face of A₊₁ does not interact with protein, as it forms a base pair with A₋₅ of a neighboring asymmetric unit.

Overall, CTD2* recognizes 5'-C₋₂C₋₁C₀ through hydrogen bonds formed with their Watson-Crick faces, and T₋₃ and A₊₁ through strong π-π interactions (FIG. 3G). The unusual sugar pucker of C₋₂ contributes to shaping the phosphate backbone to fit the ssDNA-binding site of CTD2*.

Discussion

Since the first structure of the catalytic domain of APOBEC3G was reported (Chen et al., Nature 452, 116-9, 2008), structural studies of the A3G-CTD-ssDNA complex have been hampered by apparently weak binding of A3G-CTD to ssDNA (Chen et al., Nature 452, 116-9, 2008; Furukawa et al., Embo J 28, 440-51, 2009). As described herein, this problem has been overcome by generating an A3G-CTD variant that binds ssDNA with higher affinity than wild-type. P247K appears to be a key substitution that contributes to stabilizing the ssDNA-binding of CTD2* by providing an additional hydrogen bond to a backbone phosphate group located outside of the target sequence. Furthermore, non-Watson-Crick base pairs formed between neighboring asymmetric units in the CTD2*-ssDNA co-crystal may stabilize crystallization of the complex. This highly active variant, in the context of full-length A3G containing wild-type NTD, restricted HIV-1 infection as potently as wild-type A3G in a Vif-dependent manner (FIG. 2).

A3G strongly prefers a cytidine at the -1 position, whereas A3A prefers a thymidine at that position. Recognition of the Watson-Crick face of the nucleobase is the mechanism that provides this nucleotide specificity. For the CTD2*-ssDNA complex, the Watson-Crick face of C₋₁ forms three hydrogen bonds: amino group (NH₂) to the carboxyl group of D316, carbonyl group to the mainchain amino proton of D317, and N3 atom to the mainchain amino proton of Q318K. If this cytidine is replaced by a thymidine, N3 and NH₂ would be replaced by NH and CO, respectively, resulting in loss of two hydrogen bonds with the protein, which explains why A3G prefers cytidine over thymidine at the -1 position. Previously, Rausch et al. showed that N3 and NH₂ of the cytosine ring of C₋₁ and C₋₂ are key for deamination, whereas 5-methyl deoxy-cytidine at the C₋₁ or C₋₂ position is tolerated by A3G (Rausch et al., J Biol Chem 284, 7047-58, 2009). This finding is consistent with the CTD2*-ssDNA structure showing that the Watson-Crick faces of C₋₁ and C₋₂ interact with CTD2*, while the C5 positions do not contact the protein (FIG. 3B). In addition to the recognition of the Watson-Crick face, spatial coordination plays an important role in the recognition of C₋₁. FIGS. 4A and 4B show a striking difference between the position of C₋₁ in the CTD2*-ssDNA complex and that of T₋₁ in the A3A-ssDNA complex. Interactions of T₋₃ and C₋₂ with CTD2* are important to position C₋₁ (FIGS. 3B and 3C), whereas the A3A-ssDNA structures did not show interactions with nucleotides at the -2 and -3 positions (Kouno et al., Nat Commun 8, 15024, 2017; Shi et al., Nat Struct Mol Biol 24, 131-139, 2017). A3A has a tyrosine (Y132) at the position corresponding to D317, and FIG. 4C shows that C₋₁ clashes with Y132 when the CTD2*-ssDNA structure is overlaid onto the A3A-ssDNA structure. The significance of D316 and D317 in the substrate specificity of A3G was originally reported by Holden and co-workers, who showed that D316R, D317R double substitutions enabled A3G to deaminate the middle C and 3'-C of a 5'-CCC motif at the same reaction speed (Holden et al., Nature 456, 121-124, 2008), whereas wild-type A3G-CTD prefers the 3'-C to the middle C by 45-fold (Harjes et al., J Virol 87, 7008-14, 2013).

Another group showed that a D317Y substitution changed the A3G substrate preference to 5'-TC, which is the A3A preferred target sequence (Rathore et al., J Mol Biol 425, 4442-54, 2013). The co-crystal structure of CTD2* and ssDNA showed how D316 and D317 are involved in the recognition of C₋₁. Overall, spatial coordination as well as the interaction with the Watson-Crick face are critical for specific recognition of C₋₁.

Structures of wild-type A3G-CTD (Holden et al., Nature 456, 121-124, 2008; Furukawa et al., Embo J 28, 440-51, 2009; Lu et al., J Biol Chem 290, 4010-21, 2015) and a soluble variant, namely CTD-2K3A (Chen et al., Nature 452, 116-9, 2008; Shandilya et al., Structure 18, 28-38, 2010), have been reported previously. However, none of these structures were complexed with ssDNA. The previous crystal structure of CTD-2K3A, solved at 2.25 Å resolution (PDB ID# 3IR2; Shandilya et al., Structure 18, 28-38, 2010), was chosen as a representative of apo-form CTD for structural comparison with ssDNA-bound CTD2*. Superimposition of the structures of ssDNA-bound CTD2* (yellow) and apo-CTD-2K3A (gray) (FIG. 5A) reveals that, even with the additional substitutions in CTD2*, the backbone structures are essentially unchanged. Regions that are changed include loops 1, 3 and 7, which are intrinsically dynamic in solution as shown in NMR structures of CTD-2K3A (Chen et al., Nature 452, 116-9, 2008; Harjes et al., J Mol Biol 389, 819-32, 2009). Loops 1 and 7 contain amino acid residues that form numerous interactions with ssDNA; therefore, the structural changes were likely induced upon ssDNA binding.

Loop 1 migrates toward loop 7 upon ssDNA binding, with W211 demonstrating the biggest change, as its Cα atom moved by 3.9 Å from the position found in the apo-form CTD-2K3A (FIG. 5B). This backbone change enables W211 to have a π-π stacking interaction with T₋₃. The sidechain of H216 showed a large rotamer change, enabling π-π stacking with A₊₁ (FIG. 5B). These π-π interactions set both the 5'- and 3'-ends of the 5'-TCCCA target sequence at the rim of the DNA-binding groove formed by loop 1 of CTD2*. This rim is clearly visible from the sidechains of W211, R213 and H216 in the surface representation of CTD2* (FIG. 5C). This loop 1 rim interacts with the phosphate backbone of ssDNA, whereas loop 7 faces the nucleobases, as depicted by the sidechains of Y315 and D316 (FIG. 5C). Residues in loop 7, including Y315, D316 and D317, also alter mainchain atom positions and sidechain rotamers upon ssDNA-binding (FIG. 5B). D317 significantly changed its mainchain position, as its Cα atom moved 3.0 Å from the position found in the apo-form CTD-2K3A structure (FIG. 5B). These rearrangements of backbone atoms of the loop 7 residues are particularly important for recognition of C₋₁, as mainchain amino protons form two hydrogen bonds with the Watson-Crick face of C₋₁ (FIG. 3D). This dynamic property of CTD2* is a remarkable difference compared to A3A, which did not show significant changes in mainchain atom positions upon ssDNA-binding. For A3A, only the sidechains of R28 and H29 (R215 and H216 in CTD2, respectively) in loop 1, and Y132 (Y315 in CTD2) in loop 7 change their rotamers (Kouno et al., Nat Commun 8, 15024, 2017; Shi et al., Nat Struct Mol Biol 24, 131-139, 2017).

Ziegler et al. published a structure of A3G-CTD bound to an adenine nucleotide (Ziegler et al., PLoS One 13, e0195048, 2018). In the A3G-CTD-adenine complex structure, the adenine nucleotide binds in a space that is similar to the T₋₁ position found in the A3A-ssDNA complexes (Kouno et al., Nat Commun 8, 15024, 2017; Shi et al., Nat Struct Mol Biol 24, 131-139, 2017). Since C₋₁ in the CTD2*-ssDNA complex does not occupy the T₋₁ position (FIG. 4C), the protein-DNA interaction found in the A3G-CTD-adenine complex is different from the enzyme-substrate interaction revealed in the present study. Ziegler et al. suggested that the A3G-CTD-adenine structure shows a non-specific interaction by which A3G-CTD scans the ssDNA sequence. Interestingly, W211 rearranged its position to interact with the bound adenine (Ziegler et al., PLoS One 13, e0195048, 2018), which may imply that W211 is key in the interaction with non-specific DNA as well as with the deamination target sequence.

P247K is the only residue in loop 3 that interacts with ssDNA, and the interaction likely contributes to changing the loop 3 position. In addition, other loop 3 residues, including E254 and R256, form hydrogen bonds with Q293 (helix 3) and E323 (helix 4) of a neighboring asymmetric unit, respectively, which support the position of loop 3 and concurrently aid crystal formation. The substitutions introduced into CTD2 did not change the structure of A3G-CTD, as shown in the superimposed structures of CTD2* (this study) and wild-type A3G-CTD (PDB ID: 4ROV) (Lu et al., J Biol Chem 290, 4010-21, 2015). The CTD2* and 4ROV structures superimpose well, as indicated by the pairwise root mean square (rms) deviation, which is 0.9 Å.

A previous study found that A3G-CTD increases its catalytic activity at lower pH, and H216 plays a key role in this pH dependence. Alanine mutation of this residue abolished catalytic activity, whereas arginine mutation retained catalytic activity but lost the pH dependency, suggesting that positive charge and/or formation of hydrogen bonds by this residue are important for substrate binding (Harjes et al., J Virol 87, 7008-14, 2013). The CTD2*-ssDNA co-crystal structure indicates that protonation of the imidazole ring enables formation of two hydrogen bonds with the 5'-phosphate groups of C₀ and C₋₁. H216 is conserved only in A3G and A3A among human APOBEC3 proteins, and similar interactions between the histidine and nucleotides at the -1 (T₋₁) and 0 (C₀) positions have been observed in co-crystal structures of A3A and ssDNA (Kouno et al., Nat Commun 8, 15024, 2017; Shi et al., Nat Struct Mol Biol 24, 131-139, 2017). These conserved interactions involving histidines provide an explanation for the similar pH dependency of the catalytic speeds of A3A and A3G (Pham et al., J Biol Chem 288, 29294-304, 2013; Harjes et al., J Virol 87, 7008-14, 2013).

Until the results disclosed herein, it was unknown how W211 contributes to catalysis, as this residue is spatially far from the catalytic Zn²⁺ ion yet alanine substitution of W211 results in nearly complete loss of catalytic activity. Moreover, large chemical shift changes were observed for the indole amino proton of W211 upon mixing with ssDNAs (Chen et al., Nature 452, 116-9, 2008; Shindo et al., Biology 1, 260-276, 2012). While W211 is not conserved in other members of the human APOBEC3 proteins, the tryptophan is conserved in AID. AID is a member of the human APOBEC family that plays important roles in antibody diversification and triggers both class switch recombination and somatic hypermutation (Muramatsu et al., J Biol Chem 274, 18470-6, 1999; Revy et al., Cell 102, 565-75, 2000; Bransteitter et al., Proc Natl Acad Sci U S A 100, 4102-7, 2003). Qiao et al. published structures of the human AID protein co-crystallized with dCMP, cacodylic acid or cytidine (PDB ID# 5W0U, 5W0R, 5W1C, respectively), and proposed a “substrate channel” composed of loop1 and loop7 of AID (Qiao et al., Mol Cell 67, 361-373 e4, 2017). AID appears to recognize long target sequences, similar to A3G recognition of a five nucleotide target sequence, and AID has nucleotide preferences in the -2, -1, 0 and +1 positions (Larijani et al., Immunogenetics 56, 840-5, 2005; Rogozin and Diaz, J Immunol 172, 3382-4, 2004). Based on the complex described herein, AID likely uses the tryptophan residue corresponding to W211 for π-π stacking with the nucleotide at the -3 position, which supports specific recognition of the nucleotides at the -2 and -1 positions in a manner similar to CTD2* use of W211. In addition, conservation of R213 of CTD2* as a lysine in AID suggests a similar use of the lysine for interaction with the 5'-phosphate group of the nucleotide at the -2 position.

The novel structure of the CTD2*-ssDNA complex reveals the mechanism at atomic-level resolution by which the catalytic domain of APOBEC3G uniquely binds substrate ssDNA.

Fundamental knowledge of this complex can guide the design of molecular-based therapeutics for HIV infection by modulating A3G catalytic function. Komor et al. showed that an APOBEC deaminase tethered to catalytically dead Cas9 can mutate DNA in a programmable manner, offering a new strategy for “gene editing” (Komor et al., Nature 533, 420-4, 2016). The disclosed CTD2 variants can be a better tool for “base editing” because they are more soluble, bind ssDNA more strongly, and catalyze deamination faster than wild-type A3G-CTD.

Example 3: Variant of CTD2 with a different substrate specificity

A variant of CTD2 (CTD2-V) was generated that contains three amino acid changes relative to CTD2 of SEQ ID NO: 2. Specifically, CTD2-V (set forth as SEQ ID NO: 6) contains Phe, Cys and Gly residues at positions 126-128, respectively (corresponding to residues 316-318, respectively, of full-length APOBEC3G).
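The correspondence stated above (CTD2 positions 126-128 matching residues 316-318 of full-length APOBEC3G) implies an offset of 190 between the two numbering schemes. The following is a minimal sketch of that conversion, assuming the offset is constant across the whole domain; the function and constant names are hypothetical and are not part of the disclosure.

```python
# Offset inferred from the text: CTD2 positions 126-128 correspond to
# full-length APOBEC3G residues 316-318. Assumed constant over the domain.
CTD_TO_FULL_LENGTH_OFFSET = 316 - 126  # 190


def ctd_to_full_length(position: int) -> int:
    """Convert a CTD2 (SEQ ID NO: 2 style) position to full-length numbering."""
    if position < 1:
        raise ValueError("CTD2 positions are 1-based")
    return position + CTD_TO_FULL_LENGTH_OFFSET


def full_length_to_ctd(position: int) -> int:
    """Convert a full-length APOBEC3G residue number to CTD2 numbering."""
    ctd_position = position - CTD_TO_FULL_LENGTH_OFFSET
    if ctd_position < 1:
        raise ValueError("residue lies N-terminal to the CTD2 construct")
    return ctd_position


if __name__ == "__main__":
    for ctd_pos in (126, 127, 128):
        print(ctd_pos, "->", ctd_to_full_length(ctd_pos))
```

Under this assumption, loop3 residue 247 of the full-length protein corresponds to position 57 of the CTD2 numbering.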

CTD2-V possesses a different substrate specificity relative to CTD2. CTD2-V was capable of deaminating 5’-GC, whereas CTD2 specifically deaminates 5’-CCC. A real-time NMR deamination assay demonstrated that CTD2-V deaminates 5’-GC to 5’-GU (FIG. 6A). The initial reaction rate of CTD2-V was determined to be 2.9 ± 0.2 reactions/hour (FIG. 6B).
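The difference in substrate motifs can be illustrated with a short string-based sketch that marks candidate target cytidines in an ssDNA sequence. It assumes that the deaminated base is the 3'-terminal C of the motif (as in 5’-GC to 5’-GU) and that every matching site is converted. It is an idealization that ignores kinetics and flanking-sequence effects, and the function names are hypothetical.

```python
from typing import List


def find_target_cytidines(ssdna: str, motif: str) -> List[int]:
    """Return 0-based indices of the 3'-terminal C of each motif occurrence.

    Overlapping occurrences are reported (e.g. 'CCCC' holds two 'CCC' sites).
    """
    seq = ssdna.upper()
    motif = motif.upper()
    if not motif.endswith("C"):
        raise ValueError("motif must end with the target cytidine")
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append(start + len(motif) - 1)
        start = seq.find(motif, start + 1)
    return hits


def deaminate(ssdna: str, motif: str) -> str:
    """Replace each target C with U (idealized: all matching sites react)."""
    seq = list(ssdna.upper())
    for index in find_target_cytidines(ssdna, motif):
        seq[index] = "U"
    return "".join(seq)


if __name__ == "__main__":
    print(deaminate("ATTCCCAATT", "CCC"))  # CTD2-like motif
    print(deaminate("AAGCTT", "GC"))       # CTD2-V-like motif
```

The same helper can be called with "AC" to mark the 5’-AC sites discussed for CTD2-V.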

These results demonstrate that CTD2-V has an enhanced activity for the deamination of specific deoxy-nucleotide sequences including 5’-GC and 5’-AC. G and A have a purine ring in their nucleobase structure, and none of the APOBEC3 proteins (including CTD2) efficiently deaminate deoxy-cytidines following a purine nucleotide. Activation-induced deaminase (AID), which is a member of the APOBEC deaminase family, favors 5’-GC and 5’-AC as substrate sequences, but its deamination activity is very low in vitro. CTD2-V achieves both the 5’-GC and 5’-AC specificity and better catalytic activity.

In view of the many possible embodiments to which the principles of the disclosed invention may be applied, it should be recognized that the illustrated embodiments are only examples of the invention and should not be taken as limiting the scope of the invention. Rather, the scope of the invention is defined by the following claims. We therefore claim as our invention all that comes within the scope and spirit of these claims.

Claims

1. An isolated polypeptide comprising at least 90% sequence identity to the amino acid sequence of SEQ ID NO: 1, wherein the proline residue at position 57 is substituted with lysine or arginine.

2. The isolated polypeptide of claim 1, comprising at least 95% sequence identity to the amino acid sequence of SEQ ID NO: 1, wherein the proline residue at position 57 is substituted with lysine or arginine.

3. The isolated polypeptide of claim 1, comprising the amino acid sequence of SEQ ID NO: 1, wherein the proline residue at position 57 is substituted with lysine or arginine.

4. The isolated polypeptide of any one of claims 1-3, wherein the glutamine residue at position 128 is substituted with lysine, arginine or glutamate.

5. The isolated polypeptide of any one of claims 1-4, wherein the proline residue at position 57 and the glutamine residue at position 128 are both substituted with lysine.

6. The isolated polypeptide of any one of claims 1-5, further comprising a proline to alanine substitution at position 10, an asparagine to alanine substitution at position 46, a glutamine to alanine substitution at position 132, or any combination thereof.

7. The isolated polypeptide of any one of claims 1-4, wherein the proline residue at position 57 is substituted with lysine and the glutamine residue at position 128 is substituted with glutamate.

8. The isolated polypeptide of any one of claims 1-4 and 7, further comprising a proline to alanine substitution at position 10, an asparagine to alanine substitution at position 46, an aspartate to phenylalanine substitution at position 126, an aspartate to cysteine substitution at position 127, a glutamine to alanine substitution at position 132, or any combination thereof.

9. The isolated polypeptide of any one of claims 1-8, further comprising a leucine to lysine substitution at position 44, a cysteine to alanine substitution at position 53, a phenylalanine to lysine substitution at position 120, a cysteine to alanine substitution at position 131, a cysteine to alanine substitution at position 166, or any combination thereof.

10. The isolated polypeptide of any one of claims 1-9, wherein the amino acid sequence of the polypeptide comprises or consists of SEQ ID NO: 2 or SEQ ID NO: 6.

11. A nucleic acid molecule encoding the polypeptide of any one of claims 1-10.

12. The nucleic acid molecule of claim 11, operably linked to a heterologous promoter.

13. A vector comprising the nucleic acid molecule of claim 11 or claim 12.

14. A method of reducing human immunodeficiency virus (HIV) replication in a cell infected with HIV, comprising contacting the cell with the vector of claim 13.

15. The method of claim 14, wherein the cell is a T lymphocyte.

16. The method of claim 14 or claim 15, wherein the method is an in vitro method.

17. The method of claim 14 or claim 15, wherein the method is an in vivo method and contacting the cell with the vector comprises administering the vector to a subject infected with HIV.

18. A fusion protein comprising the polypeptide of any one of claims 1-10 and a heterologous protein.

19. The fusion protein of claim 18, wherein the heterologous protein is a CRISPR-associated protein 9 (Cas9) polypeptide.

20. The fusion protein of claim 19, wherein the Cas9 polypeptide is catalytically inactive.

21. The fusion protein of claim 19 or claim 20, wherein the Cas9 polypeptide is a Streptococcus species Cas9 polypeptide or catalytically inactive variant thereof.

22. The fusion protein of claim 21, wherein the Streptococcus species is Streptococcus pyogenes.

23. A nucleic acid molecule encoding the fusion protein of any one of claims 18-22.

24. The nucleic acid molecule of claim 23, operably linked to a heterologous promoter.

25. A vector comprising the nucleic acid molecule of claim 23 or claim 24.

26. An isolated cell comprising the vector of claim 25.

27. A method for editing a nucleobase of a target nucleic acid, comprising contacting the target nucleic acid with the fusion protein of any one of claims 19-22 and a guide RNA (gRNA).

28. A kit comprising:

(a) the polypeptide of any one of claims 1-10;

(b) the fusion protein of any one of claims 18-22;

(c) the nucleic acid molecule of any one of claims 11, 12, 23 and 24;

(d) the vector of claim 13 or claim 25;

(e) the isolated cell of claim 26; or

(f) any combination of two or more of (a), (b), (c), (d) and (e).

29. The kit of claim 28, further comprising a nucleic acid encoding a gRNA.

30. A composition, comprising:

(a) the polypeptide of any one of claims 1-10;

(b) the fusion protein of any one of claims 18-22;

(c) the nucleic acid molecule of any one of claims 11, 12, 23 and 24;

(d) the vector of claim 13 or claim 25; or

(e) the isolated cell of claim 26;

and a pharmaceutically acceptable carrier.