CN116978444A

CN116978444A - Efficiency and outcome prediction system and method for base editing gene scissors using deep learning

Info

Publication number: CN116978444A
Application number: CN202310489065.6A
Authority: CN
Inventors: 金炯凡; 金那惠
Original assignee: Industry Academic Cooperation Foundation of Yonsei University
Current assignee: Industry Academic Cooperation Foundation of Yonsei University
Priority date: 2022-04-29
Filing date: 2023-05-04
Publication date: 2023-10-31

Abstract

The present application relates to an efficiency and result prediction system and the like for base editing gene scissors using deep learning. According to the application, base editing gene scissors and a single-guide RNA (sgRNA) for efficient base editing can be selected, without performing excessive experiments, from among 63 base editing gene scissors having various protospacer adjacent motif (PAM) compatibilities. Thus, the system can be effectively used in all fields where gene editing is applied, such as disease treatment by gene editing.

Description

Efficiency and outcome prediction system and method for base editing gene scissors using deep learning

Cross Reference to Related Applications

The present application claims priority to and the benefit of Korean Patent Application No. 10-2022-00053742, filed on April 29, 2022, which is incorporated herein by reference for all purposes as if fully set forth herein.

Technical Field

The present application relates to an efficiency and result prediction system for base editing gene scissors using deep learning.

Background

Base editing allows one base pair to be converted to another without the need for donor deoxyribonucleic acid (donor DNA) or the creation of double-strand breaks. Base editing gene scissors (base editors) consist of a base editing protein (base editor protein), which is essentially a fusion of a Cas9 nickase and a base-modifying enzyme such as a cytidine or adenosine deaminase, and a single-guide RNA (sgRNA). Cytosine base editing gene scissors (cytosine base editors: CBEs) can convert C·G into T·A, and adenine base editing gene scissors (adenine base editors: ABEs) can convert A·T into G·C. In the case of CBEs, a uracil glycosylase inhibitor (UGI) is often added to improve base editing efficiency and purity. In addition to these two main types, there are C·G-to-G·C base editing gene scissors (CGBEs), which are derived from CBEs and convert C·G into G·C by removing UGI and/or adding uracil DNA N-glycosylase (UNG). To improve the efficiency and accuracy of such base editing, base editing gene scissors having improved base conversion regions (i.e., a deaminase with or without cofactors such as UGI or UNG), such as YE1-BE4max, SsAPOBEC3B, ABE8e (V106W), ABE8.17-m+V106W, CGBE1, miniCGBE1, and APOBEC-nCas9-Ung, have been developed. However, these variants have not been extensively compared, so it remains difficult to choose which base conversion region a base editing gene scissors should include.

Another variable in base editing is the Cas9 nickase, which recognizes the corresponding target sequence through a protospacer adjacent motif (PAM) located about 15 ± 2 nucleotides from the target nucleotide. For example, the canonical PAM sequence of SpCas9 is NGG, which is often unavailable at the desired position, so that efficient base editing with minimal bystander editing is not possible. In addition, this PAM requirement restricts the use of Cas9 in other types of genome editing (e.g., tiling screens, specific targeted deletions, and efficient homology-directed changes) as well as in base editing. To overcome these limitations, Cas9 variants with various PAM compatibilities have been developed. Early Cas9 versions that recognize PAMs other than NGG include xCas9 and SpCas9-NG, and the VQR, VRER, VRQR and QQR1 variants; more variants have since been reported, such as SpCas9-NRRH, SpCas9-NRTH, SpCas9-NRCH, SpG, SpRY, and Sc++. However, since these variants have also not been extensively compared, especially for base editing, it is still difficult to select which Cas9 to use for a given target sequence.
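The PAM names above use standard IUPAC nucleotide codes (N = any base, R = A or G, H = A, C or T). As a minimal illustration of how PAM compatibility constrains the choice of Cas9, the following Python sketch checks a 4-nt PAM against a few patterns. Treating the pattern in each variant's name as its complete PAM preference is a simplifying assumption of this sketch, and the names `PAM_PATTERNS`, `pam_matches` and `compatible_variants` are hypothetical; actual activity at a given PAM has to be measured or predicted.

```python
# Standard IUPAC nucleotide codes needed for the PAM patterns below.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "H": "ACT",   # any base except G
    "N": "ACGT",  # any base
}

# Assumption: the nominal PAM pattern is read off the variant name.
PAM_PATTERNS = {
    "SpCas9": "NGG",
    "SpCas9-NG": "NG",
    "SpCas9-NRRH": "NRRH",
    "SpCas9-NRTH": "NRTH",
    "SpCas9-NRCH": "NRCH",
}


def pam_matches(pam, pattern):
    """Return True if the leading bases of `pam` fit the IUPAC `pattern`."""
    if len(pam) < len(pattern):
        return False
    return all(base in IUPAC[code] for base, code in zip(pam.upper(), pattern))


def compatible_variants(pam):
    """List the variants whose nominal pattern accepts the given PAM."""
    return [name for name, pat in PAM_PATTERNS.items() if pam_matches(pam, pat)]
```

For example, `compatible_variants("TACA")` keeps only the NRCH pattern, which shows how a target without an NGG or NG PAM narrows the set of usable nickase domains.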

Thus, the inventors of the present invention performed an extensive comparison of 7 base editing gene scissor variants (YE1-BE4max, SsAPOBEC3B, ABE8e (V106W), ABE8.17-m+V106W, CGBE1, miniCGBE1 and APOBEC-nCas9-Ung) with different base conversion regions (fig. 1A), 10 Cas9 variants (VRQR variant, xCas9, SpCas9-NG, SpCas9-NRRH, SpCas9-NRTH, SpCas9-NRCH, SpG, SpRY and Sc++) with mutually different or modified PAM compatibility (fig. 1B), and 10 base editing gene scissor variants (SpCas9-YE1-BE4max, SpCas9-NRCH-YE1-BE4max, SpCas9-NRCH-SsAPOBEC3B, SpCas9-ABE8e (V106W) and others) with different nickase regions (fig. 1C). As a result, DeepCas9variants, a deep learning-based computational model that predicts the efficiency of 9 Cas9 variants, was developed on the basis of the resulting large-scale dataset on the activity of Cas9 and base editing gene scissor variants, and DeepBE was developed to predict the efficiency and editing result frequencies of 63 base editing gene scissors generated by combining these 9 Cas9 variants and 7 deaminase domains.

Disclosure of Invention

In one aspect, an efficiency and outcome prediction system for base editing gene scissors utilizing deep learning is provided.

In another aspect, an efficiency and result prediction method for base editing gene scissors using deep learning is provided.

Another aspect provides a computer-readable recording medium having recorded thereon a program for executing, by a computer, a method of predicting efficiency and outcome of base editing gene scissors using deep learning.

Specifically, an aspect provides an efficiency and outcome prediction system of base editing gene scissors using deep learning, comprising: a target sequence input unit that receives target sequence data of the base editing gene scissors; and a result prediction unit that applies the target sequence data received by the target sequence input unit to a prediction model of base editing efficiency and result ratio, respectively, to obtain output values of base editing efficiency and result ratio, and multiplies the output values of base editing efficiency and result ratio to generate a base editing prediction score.
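The final scoring step described above can be sketched in a few lines of Python. The sketch assumes that both model outputs are expressed as fractions between 0 and 1 (the text does not state the scale), and the function names `base_editing_score` and `rank_outcomes` are hypothetical.

```python
def base_editing_score(efficiency, result_ratio):
    """Multiply the two model outputs to obtain the prediction score.

    Assumption: both inputs are fractions in [0, 1].
    """
    for name, value in (("efficiency", efficiency), ("result_ratio", result_ratio)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction in [0, 1], got {value}")
    return efficiency * result_ratio


def rank_outcomes(efficiency, result_ratios):
    """Score every candidate editing result of one target, best first.

    `result_ratios` maps an editing result (e.g. an edited sequence) to
    its predicted ratio among all edited products.
    """
    scored = {outcome: base_editing_score(efficiency, ratio)
              for outcome, ratio in result_ratios.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

Under these assumptions, a target with a predicted efficiency of 0.5 whose desired product makes up 0.25 of the edited products receives a score of 0.125, which can be read as the expected fraction of all alleles carrying exactly the desired edit.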

In the present specification, "base editing gene scissors (BE)" means a novel type of gene scissors derived from CRISPR gene scissors, also referred to as a fourth-generation gene editing technology, which function by replacing a single base. Specifically, base editing gene scissors consist of a base editing protein, which is a fusion of a Cas9 nickase and a base-modifying enzyme such as a cytidine deaminase or adenosine deaminase, and a single-guide ribonucleic acid (sgRNA). Representative examples include adenine base editing gene scissors (Adenine Base Editor: ABE), in which an adenine deaminase is bound to dCas9 ("dead" Cas9) or nCas9, from which the double-stranded DNA cleavage function of CRISPR/Cas9 has been removed, so that A·T can be replaced with G·C; cytosine base editing gene scissors (Cytosine Base Editor: CBE), in which a cytosine deaminase is bound so that C·G can be replaced with T·A; and C·G-to-G·C base editing gene scissors (CGBEs), which convert C·G into G·C. For example, in the case of a CBE, the deaminase converts cytosine (C) in one DNA strand at the site targeted by nCas9 or dCas9 into uracil (U), and the uracil (U) is then changed to thymine (T) by the DNA repair process. Using these base editing gene scissors, a gene can be knocked out or given a desired property by correcting or replacing a specific sequence.

Specifically, the base editing gene scissors may be any one or more selected from the group consisting of YE1-BE4max, SsAPOBEC3B, ABE8e (V106W), ABE8.17-m+V106W, CGBE1, miniCGBE1, and APOBEC-nCas9-Ung, but are not particularly limited thereto.

In the present specification, the term "guide RNA" means an RNA specific to a target DNA sequence, which binds complementarily to all or part of the target sequence so that the adenine deaminase or cytosine deaminase of the base editing gene scissors finds adenine (A) and replaces it with guanine (G), or finds cytosine (C) and replaces it with thymine (T), in the target sequence.

In general, guide RNA refers to a single-guide RNA (sgRNA), a dual RNA comprising CRISPR RNA (crRNA) and trans-activating crRNA (tracrRNA) as components, or a form comprising a first site having a sequence that is fully or partially complementary to a sequence in the target DNA and a second site having a sequence that interacts with an RNA-guided nuclease. However, any form may be included within the scope of the invention without limitation as long as it allows the RNA-guided nuclease of the base editing gene scissors to be active at the target sequence. In addition, the guide RNA may include a scaffold sequence that facilitates binding of the RNA-guided nuclease.

In the present specification, the term "Cas9 protein" refers to the major protein component of the CRISPR/Cas9 system, which forms a complex with crRNA and tracrRNA to form an activated endonuclease or nickase. Cas9 protein or gene information may be obtained from known databases such as GenBank of the National Center for Biotechnology Information (NCBI), but any substance that can have target-specific nuclease activity together with a guide RNA may be included within the scope of the invention. In addition, the Cas9 protein may be linked to a protein transduction domain. The protein transduction domain may be, but is not limited to, polyarginine or the trans-activator of transcription (TAT) protein derived from human immunodeficiency virus (HIV). Furthermore, one skilled in the art can appropriately attach additional domains to the Cas9 protein according to the purpose.

The Cas9 protein may include wild-type Cas9, inactivated Cas9 (dCas9), and all variants of Cas9 such as Cas9 nickase. The inactivated Cas9 may be an RNA-guided FokI nuclease (RFN), in which a FokI nuclease domain is linked to dCas9, or one in which a transcription activator or repressor domain is linked to dCas9, and the Cas9 nickase may be D10A Cas9 or H840A Cas9, but these are not particularly limited thereto. Specifically, the Cas9 may be any one or more selected from the group consisting of SpCas9, VRQR variant, SpCas9-NG, SpCas9-NRRH, SpCas9-NRTH, SpCas9-NRCH, SpG, SpRY, and Sc++.

The source of the Cas9 protein is also not limited. For example, the Cas9 protein may be derived from Streptococcus pyogenes, Francisella novicida, Streptococcus thermophilus, Legionella pneumophila, Listeria innocua, or Streptococcus mutans.

In the present specification, the term "target sequence" refers to a base sequence intended to be targeted by base editing gene scissors. Specifically, the target sequence is a sequence that the base editing gene scissors are expected to target by means of the guide RNA; a sequence on which the base editing gene scissors are known to be active may be used, or a sequence arbitrarily designed by a person skilled in the art for analysis with the system of the present invention may be used. Any sequence on which the base editing gene scissors have, or are expected to have, activity and which is to be analyzed may be included within the scope of the present invention without limitation.

In the present specification, the base conversion activity data of the base editing gene scissors may be obtained by introducing the base editing gene scissors into a cell library including oligonucleotides including a nucleotide sequence encoding a guide RNA and a target nucleotide sequence targeted by the guide RNA, but is not particularly limited thereto.

In the present specification, the "target sequence input unit" refers to the configuration of the efficiency and result prediction system of the base editor using deep learning, which is used to receive the target sequence.

In the present specification, the term "activity" or "efficiency" of base editing gene scissors refers to the activity of single-base substitution by the base editing gene scissors, i.e., the activity by which an RNA-guided nuclease, particularly Cas9, cleaves the gene at the target sequence and the deaminase converts adenine (A) to guanine (G) or cytosine (C) to thymine (T). In the present specification, the term "activity data" refers to data from which the relationship between a specific input sequence or target sequence and the activity of the base editing gene scissors can be extracted and learned, and the system of the present invention can generate a predictive model of base editing efficiency using the activity data.

Specifically, the activity data of the base editing gene scissors can be obtained by sequencing the bases of the target sequence. For example, such data can be obtained by performing deep sequencing, RNA-seq or the like, but the specific method is not particularly limited as long as the activity data of the base editing gene scissors can be obtained by detecting the bases after editing. The activity data of the base editing gene scissors may be any known activity data or may be activity data obtained directly by a person skilled in the art by any suitable method, and the method for obtaining the data is not particularly limited as long as the data can be used to generate an activity prediction model capable of predicting the activity of the target base editing gene scissors.

In the present specification, the term "base editing result" of base editing gene scissors means an editing product produced as a result of the activity of the base editing gene scissors on a target sequence. When a plurality of editable target nucleotides exist within the base editing window (editable window), unintended bases can also be edited. In this specification, the term "base editing frequency" means the frequency of each product resulting from the activity of the base editing gene scissors.
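To make the notion of multiple editing results concrete, the following Python sketch enumerates every product that can arise when several editable nucleotides fall inside the editing window, and turns read counts per product into frequencies. The window of protospacer positions 4 to 8 and the function names are assumptions made for illustration only; the window of a real base editing gene scissors has to be determined experimentally.

```python
from itertools import product


def enumerate_outcomes(target, window=(4, 8), source="C", product_base="T"):
    """List every sequence reachable by converting any subset of `source`
    bases inside the 1-based, inclusive `window` of `target`.

    The first element is always the unedited sequence.
    """
    start, end = window
    editable = [i for i in range(start - 1, min(end, len(target)))
                if target[i] == source]
    outcomes = []
    for choice in product((False, True), repeat=len(editable)):
        seq = list(target)
        for idx, edited in zip(editable, choice):
            if edited:
                seq[idx] = product_base
        outcomes.append("".join(seq))
    return outcomes


def outcome_frequencies(read_counts):
    """Convert read counts per editing product into relative frequencies."""
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {seq: count / total for seq, count in read_counts.items()}
```

A target with n editable bases in the window has 2^n possible products under this simplified model, which is why a separate model for the ratio of each result is useful alongside the overall efficiency.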

The predictive model of base editing efficiency is generated by: a step of receiving base conversion activity data of the base editing gene scissors through the information input unit; and a step of performing deep learning based on a convolutional neural network (convolutional neural network: CNN) on the data received by the information input unit to generate a predictive model of base editing efficiency.

In the present specification, the "information input unit" is a component that receives base conversion activity data or base editing result data of base editing gene scissors, and the information input unit may receive data on base editing gene scissors directly from a user of the prediction system according to a specific embodiment or pre-stored data, but is not particularly limited thereto.

The output value of the base editing efficiency can be calculated by the following expression 1.

[Mathematical expression 1]

The output value of the base editing result ratio can be calculated by the following expression 2.

[Mathematical expression 2]

In this specification, the term "deep learning" refers to an artificial intelligence (AI) technology that allows a computer to think and learn like a human, in which a machine autonomously learns and solves complex nonlinear problems based on artificial neural network theory. With deep learning, a computer can recognize, infer and judge by itself even if a human does not set all the judgment criteria, and the technology is widely applicable to voice and image recognition, photo analysis, and the like. In other words, deep learning may be defined as a set of machine learning algorithms that attempt high-level abstraction (summarizing key content or functions in a large amount of data or complex material) by combining several nonlinear transformation methods.

In this specification, the term "convolutional neural network (convolutional neural networks: CNN)" refers to a technique that extracts features (features) representing a part of provided information, and implements generalization through information layering.

The step of performing deep learning based on the convolutional neural network to generate a predictive model of base editing efficiency may further include a step of concatenating CRISPR-associated protein 9 (Cas9) activity data, and the Cas9 activity data may be concatenated at the flatten layer of the efficiency and outcome prediction system for base editing gene scissors.
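The data flow described above (sequence, convolution, flatten layer, concatenation of a Cas9 activity value, output) can be illustrated with a dependency-free forward pass in Python. This is only a sketch under stated assumptions: the one-hot encoding, single convolution layer, filter count, kernel size and random untrained weights are all hypothetical choices, not the architecture or parameters of the disclosed models, so the returned number has no biological meaning.

```python
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]


def conv1d_relu(x, kernels, biases):
    """Valid 1D convolution along the sequence, followed by ReLU.

    x is L x 4; each kernel is K x 4. Returns (L - K + 1) rows of F values.
    """
    k = len(kernels[0])
    out = []
    for start in range(len(x) - k + 1):
        row = []
        for kernel, bias in zip(kernels, biases):
            s = bias + sum(kernel[i][c] * x[start + i][c]
                           for i in range(k) for c in range(4))
            row.append(max(0.0, s))
        out.append(row)
    return out


def flatten(feature_map):
    """Flatten a 2D feature map into one vector (the flatten layer)."""
    return [v for row in feature_map for v in row]


def random_params(seq_len, n_filters=4, kernel_size=3, seed=0):
    """Hypothetical, untrained weights sized for one sequence length."""
    rng = random.Random(seed)
    n_flat = (seq_len - kernel_size + 1) * n_filters + 1  # +1: Cas9 score
    return {
        "kernels": [[[rng.uniform(-1, 1) for _ in range(4)]
                     for _ in range(kernel_size)] for _ in range(n_filters)],
        "conv_bias": [0.0] * n_filters,
        "dense_w": [rng.uniform(-1, 1) for _ in range(n_flat)],
        "dense_b": 0.0,
    }


def predict_efficiency(seq, cas9_score, params):
    """Forward pass: conv -> flatten -> concatenate Cas9 score -> dense."""
    features = flatten(conv1d_relu(one_hot(seq), params["kernels"],
                                   params["conv_bias"]))
    features.append(cas9_score)  # concatenation at the flatten layer
    return params["dense_b"] + sum(w * v for w, v in
                                   zip(params["dense_w"], features))
```

Concatenating at the flatten layer lets the dense part of the network weigh a scalar Cas9 activity value alongside the sequence features learned by the convolution, rather than forcing that value through the convolutional filters.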

In addition, the Cas9 activity data may be obtained by a method comprising: a step of introducing Cas9 into a cell library comprising oligonucleotides, each comprising a nucleotide sequence encoding an sgRNA and the target nucleotide sequence targeted by that sgRNA; a step of performing deep sequencing using deoxyribonucleic acid (DNA) obtained from the cell library into which the Cas9 has been introduced; and a step of analyzing the efficiency of Cas9 from the data obtained by the deep sequencing. The Cas9 activity data may be generated or output in the form of a predictive score.
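The analysis step can be illustrated by computing an indel frequency per target from deep sequencing read counts. The text does not give the formula used, so the sketch below assumes the plain definition (reads with an indel divided by all reads for that target, as a percentage) and a hypothetical minimum read depth of 100; any background correction or read filtering used in practice is not shown.

```python
def indel_frequency(indel_reads, total_reads):
    """Percentage of reads for one target that carry an indel."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= indel_reads <= total_reads:
        raise ValueError("indel_reads must lie between 0 and total_reads")
    return 100.0 * indel_reads / total_reads


def cas9_activity_table(counts, min_reads=100):
    """Map target ID -> indel frequency (%), skipping shallow targets.

    `counts` maps a target ID to (indel_reads, total_reads); `min_reads`
    is a hypothetical depth threshold chosen for this sketch.
    """
    return {target: indel_frequency(indel, total)
            for target, (indel, total) in counts.items()
            if total >= min_reads}
```

A table of this kind, one value per target sequence, is the sort of data from which a Cas9 activity predictor could be trained.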

In this specification, the term "library" means a pool or population comprising two or more substances of the same kind that have different properties. Thus, an oligonucleotide library may be a population comprising two or more oligonucleotides having different base sequences, e.g. oligonucleotides that differ in guide RNA and/or target sequence, and a cell library may be a population of two or more cells having different properties; in particular, for the purposes of the present invention, it is a population of cells that differ in the oligonucleotide contained in each cell, e.g. in the kind of guide RNA and/or target sequence introduced.

In the present specification, the term "vector" refers to a genetic construct capable of delivering an oligonucleotide into a cell, and the vector in the present application may include an oligonucleotide comprising a guide RNA-encoding base sequence and a target base sequence. The vector may be a viral vector or a plasmid vector; specifically, a lentiviral vector, a retroviral vector, or the like may be used, but the vector is not limited thereto, and any known vector may be freely used by those skilled in the art as long as the object of the present application can be achieved. In particular, the vector may comprise the essential regulatory elements operably linked to the insert, i.e. the oligonucleotide, so that the insert can be expressed when the vector is present in a cell of an individual.

The vectors can be made and purified using standard recombinant DNA techniques. The type of the vector is not particularly limited as long as it can function in target cells such as prokaryotic and eukaryotic cells. In addition, the vector may include a promoter, a start codon and a stop codon, and also, as appropriate, a DNA encoding a signal peptide and/or an enhancer sequence and/or untranslated regions on the 5 'and 3' sides of a desired gene and/or a selectable marker region and/or a replicable unit, etc.

The delivery of the vector to the cells used to prepare the library can be accomplished using a variety of methods known in the art, such as calcium phosphate-DNA coprecipitation, DEAE-dextran-mediated transfection, polybrene-mediated transfection, electroporation, microinjection, liposome fusion, and the like. When a viral vector is used, the vector of interest may be delivered into cells by infection with viral particles. The vector may also be introduced into the cell by gene bombardment or the like. The introduced vector may exist in the cell as the vector itself or may be integrated into a chromosome, but is not particularly limited thereto.

The step of analyzing Cas9 efficiency may predict the activity of Cas9 from its relationship with Cas9-induced indel frequencies at specific target sequences by performing deep learning based on a convolutional neural network.

The predictive model of the base editing result ratio is generated by: a step of receiving base editing result data of the base editing gene scissors through the information input unit; and a step of performing deep learning based on a convolutional neural network (CNN) on the data received by the information input unit to generate a predictive model of the base editing result ratio. The description given above applies equally to the predictive model of the base editing result ratio. The term "result data" refers to data from which the relationship between a specific input sequence or target sequence and the editing results of the base editing gene scissors can be extracted and learned, and the system of the present invention can use the result data to generate a base editing result ratio model.

In the present specification, the term "result prediction unit" refers to a component that predicts the base editing efficiency and result of the base editing gene scissors by applying the target sequence input through the target sequence input unit to the prediction models of base editing efficiency and result ratio. In one embodiment, the result prediction unit may predict the base editing efficiency and the result ratio of the base editing gene scissors based on the target sequence information.

The efficiency and result prediction system of base editing gene scissors may further include an output unit that outputs the efficiency and result ratio of base editing gene scissors predicted by the result prediction unit. In addition, the prediction system of the present invention may further include a storage unit in which previously obtained data on the base-editing gene scissors or known data on the base-editing gene scissors are stored, and when the storage unit is included, the information input unit of the prediction system of the present invention receives data of a set size or range from the storage unit for predicting the base-editing efficiency and the editing result ratio of the base-editing gene scissors.

In another aspect, an efficiency and result prediction method for base editing gene scissors using deep learning is provided. Specifically, provided is a method for predicting the efficiency and outcome of base editing gene scissors using deep learning, comprising: a step of designing a target sequence of the base editing gene scissors; and a step of applying the designed target sequence to the efficiency and result prediction system for base editing gene scissors. The description given above applies equally to the method for predicting the efficiency and result of base editing gene scissors.

The program may be one in which the system for predicting the efficiency and result of base editing gene scissors, or the method for predicting the efficiency and result of base editing gene scissors, is implemented in a computer programming language.
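The two steps of the method (designing target sequences, then applying them to the prediction system) can be sketched as follows. The sketch scans only the forward strand for 20-nt protospacers followed by an NG PAM that place the nucleotide to be edited at protospacer positions 4 to 8. The window, the PAM choice and the `predictor` callable standing in for the prediction system are all assumptions made for illustration.

```python
def find_candidate_targets(dna, edit_index, window=(4, 8), spacer_len=20):
    """Return (protospacer, pam, position) tuples for NG-PAM targets.

    `edit_index` is the 0-based index in `dna` of the base to edit;
    `position` is its 1-based position within the protospacer.
    """
    dna = dna.upper()
    hits = []
    for start in range(len(dna) - spacer_len - 1):
        position = edit_index - start + 1
        if not window[0] <= position <= window[1]:
            continue
        pam = dna[start + spacer_len:start + spacer_len + 2]
        if pam[1] == "G":  # NG PAM: any base followed by G
            hits.append((dna[start:start + spacer_len], pam, position))
    return hits


def rank_targets(hits, predictor):
    """Apply a caller-supplied scoring function and sort best first.

    `predictor` is a hypothetical stand-in for the prediction system; it
    receives (protospacer, pam) and returns a numeric score.
    """
    scored = [(predictor(spacer, pam), spacer, pam, pos)
              for spacer, pam, pos in hits]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

In practice the reverse strand and other PAM patterns would also be scanned, and the ranking would use the predicted base editing score rather than an arbitrary callable.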

Computer programming languages in which the program of the present invention can be implemented include, but are not limited to, Python, C, C++, Java, Fortran, Visual Basic, and the like. The program may be stored in a recording medium such as a USB memory, a compact disc read-only memory (CD-ROM), a hard disk, a magnetic disk, or a similar medium or device, and may be connected to an internal or external network system. For example, the computer system may access a sequence database such as GenBank (http://www.ncbi.nlm.nih.gov/nucleic) via HTTP, HTTPS or XML protocols to retrieve the target gene and the regulatory nucleic acid sequences of the gene.
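As one possible way to retrieve a sequence over HTTPS, the Python sketch below builds a request to the NCBI E-utilities `efetch` endpoint and parses the returned FASTA text. The choice of E-utilities is an assumption of this sketch (the text only says that a database such as GenBank is accessed), and the helper names are hypothetical. `fetch_sequence` needs network access; the URL builder and parser can be used offline.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (an assumed access route, see lead-in).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build an efetch URL that requests a nucleotide record as FASTA."""
    query = urlencode({"db": "nuccore", "id": accession,
                       "rettype": "fasta", "retmode": "text"})
    return f"{EFETCH}?{query}"


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:]).upper()


def fetch_sequence(accession, timeout=30):
    """Download one record and return (header, sequence). Needs network."""
    with urlopen(efetch_url(accession), timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

The retrieved sequence could then be scanned for candidate target sequences before being passed to the prediction system.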

The program may be provided on-line or off-line, and may be provided in the form of a computer program stored in a recording medium to implement a prediction system of efficiency and result of base editing gene scissors in combination with a computer.

Drawings

FIGS. 1A to 1C show the base editing gene scissors and Cas9 variants evaluated in the assays of the invention, in which red arrows represent introduced mutations. FIG. 1A shows base editing gene scissor variants comprising a SpCas9-NG nickase region and several deaminase regions, FIG. 1B shows Cas9 variants with different PAM compatibility, and FIG. 1C shows base editing gene scissor variants comprising Cas9 domains with different PAM compatibility.

FIG. 2 shows a comparison of indel frequencies measured using two slightly different high-throughput evaluation methods, and the correlation of indel frequencies between two biological replicates. a of FIG. 2 shows a comparison of the indel frequencies measured before (x-axis) and after (y-axis) removal of sequencing reads comprising erroneous or shuffled sequences; the numbers of target sequences for SpCas9, the VRQR variant, xCas9 and SpCas9-NG are n = 11,668, 11,678, 11,661 and 11,659, respectively. b of FIG. 2 shows the correlation of indel frequencies between two biological replicates, in which two independent transductions of lentiviral library A were performed on two independently generated SpCas9-expressing cell populations or on cell populations expressing SpCas9-NRCH on different days. The numbers of target sequences for SpCas9 and SpCas9-NRCH are n = 11,680 and 11,590, respectively.

FIGS. 3A and 3B are scatter plots showing the indel frequencies induced by Cas9 variants evaluated using libraries A and B, and FIGS. 3C, 3D, 3E, 3F and 3G are scatter plots showing base conversion by base editing gene scissors, namely the base conversion efficiencies in libraries A and B mediated by SpCas9-YE1-BE4max (FIG. 3C) and SpCas9-ABE8e (V106W) (FIG. 3D), or by CBE variants (FIG. 3E), ABE variants (FIG. 3F) and CGBE variants (FIG. 3G), respectively.

FIGS. 4A to 4G show the base conversion efficiency at each position of the target sequences for SpCas9-NG-YE1-BE4max (FIG. 4A), SpCas9-NG-SsAPOBEC3B (FIG. 4B), SpCas9-NG-ABE8e (V106W) (FIG. 4C), SpCas9-NG-ABE8.17-m+V106W (FIG. 4D), SpCas9-NG-CGBE1 (FIG. 4E), SpCas9-NG-miniCGBE1 (FIG. 4F) and SpCas9-NG-APOBEC-nCas9-Ung (FIG. 4G), respectively. Target sequences with NG PAMs were analyzed.

FIG. 5 shows the preferred motifs for the intended base conversions induced by SpCas9-NG-YE1-BE4max (a of FIG. 5) and SpCas9-NG-SsAPOBEC3B (b of FIG. 5), respectively; the heat maps show the dependence of the average efficiency of the desired base editing on adjacent nucleotides. Target sequences with NG PAMs were used.

FIG. 6 shows the preferred motif for the SpCas9-NG-induced indel frequency at position 6; the heat map shows the dependence of the SpCas9-NG-induced indel frequency on the nucleotides adjacent to the target nucleotide.

FIG. 7 shows the preferred motifs for the intended base conversions induced by SpCas9-NG-ABE8e (V106W) (a of FIG. 7) and SpCas9-NG-ABE8.17-m+V106W (b of FIG. 7), respectively; the heat maps show the dependence of the average efficiency of the desired base editing on adjacent nucleotides. Target sequences with NG PAMs were used.

FIG. 8 shows the preferred motifs for the intended base conversions induced by SpCas9-NG-CGBE1 (a of FIG. 8), SpCas9-NG-miniCGBE1 (b of FIG. 8) and SpCas9-NG-APOBEC-nCas9-Ung (c of FIG. 8), respectively; the heat maps show the dependence of the average efficiency of the desired base editing on adjacent nucleotides. Target sequences with NG PAMs were used.

FIG. 9 shows the conversion efficiency from C·G to G·C or from C·G to T·A (a to c) and the dependence of the purity of the C·G-to-G·C conversion product at position 6 (d to f) for SpCas9-NG-CGBE1 (a and d of FIG. 9), SpCas9-NG-miniCGBE1 (b and e of FIG. 9) and SpCas9-NG-APOBEC-nCas9-Ung (c and f of FIG. 9), respectively. Target sequences with NG PAMs were used.

FIGS. 10A to 10K show the correlation between the average indel frequency induced by each Cas9 variant in target sequences comprising each 4-nt PAM sequence and the average base conversion efficiency mediated by base editing gene scissor variants comprising the corresponding Cas9 variant, and show the rank order of both.

FIG. 11 shows the efficiency of base editing induced by base editing gene scissors comprising Cas9 variants as nickase regions at each position of the target sequence.

FIG. 12 shows the effect of nucleotides adjacent to the target nucleotide on base editing gene scissors comprising Cas9 variants; the heat maps show the dependence of the average efficiency of the intended base conversion on adjacent nucleotides for ABE8e (V106W) (a of FIG. 12), ABE8.17-m+V106W (b of FIG. 12), miniCGBE1 (c of FIG. 12) and APOBEC-nCas9-Ung (d of FIG. 12) based on Cas9 variants such as SpCas9, SpCas9-NG and SpRY.

FIG. 13 shows the activity of SpCas9 PAM variants. a of FIG. 13 shows the indel frequencies at 23 perfectly matched target sequences with an NGG PAM used for calculating specificity (number of target sequences n = 23); different target sequences are distinguished by different colors. b of FIG. 13 shows the relative indel frequencies induced by each Cas9 variant according to the type of single-base mismatch.

FIG. 14 shows the relative indel frequencies induced by SpCas9 PAM variants at target sequences with an NGG PAM that contain two consecutive transition mismatches.

FIG. 15 shows the relative indel frequencies induced by SpCas9 PAM variants at target sequences with an NGG PAM that contain three consecutive transition mismatches.

FIG. 16 shows the algorithms used in the development of DeepCas9variants (a of FIG. 16) and DeepNG-BE (b of FIG. 16).

FIG. 17 shows the editing window, product purity and preferred motifs of CBE and ABE variants; base editing gene scissors comprising SpCas9-NG were evaluated at target sequences with NG PAMs.

FIG. 18 shows the editing window, product purity and preferred motifs of CGBE variants. a of FIG. 18 shows the conversion efficiency from cytosine to guanine at each position of the target sequence, b of FIG. 18 shows the product purity of the base editing variants, and c of FIG. 18 shows the preferred motifs for base editing, in which the heat maps show the dependence of the average efficiency of the desired base editing on adjacent nucleotides. d to f of FIG. 18 show comparisons of the base editing efficiencies induced by several base editing gene scissors; red triangles mark target sequences at which the editing efficiency of one base editing gene scissors is at least 30% higher than that of the other.

FIG. 19 shows a comparison of Cas9 variants with different PAM compatibility and the incorporation of these variants as the nickase domain of base editing gene scissors. a and b of FIG. 19 show the maximum average indel frequency generated by any one of the 9 Cas9 variants (a of FIG. 19) and the corresponding Cas9 variant showing the highest average activity (b of FIG. 19) in target sequences comprising each 4-nt PAM sequence. c of FIG. 19 shows the correlation between the indel frequencies induced by Cas9 variants in target sequences comprising a 4-nt PAM sequence, d of FIG. 19 shows the percentage of guide sequences for which the preferred Cas9 variant for a given PAM sequence exhibits at least 1.3-fold lower activity than one of the remaining Cas9 variants at that PAM, e of FIG. 19 is a comparison of the average indel frequency and base editing efficiency in target sequences with the indicated shared 2-nt PAM sequence, and f of FIG. 19 shows the correlation between the average indel frequency induced by SpCas9-NRCH at targets with each 4-nt PAM sequence and the average base editing efficiency mediated by base editing variants comprising SpCas9-NRCH. g and h of FIG. 19 show the effect of nucleotides adjacent to the target nucleotide on Cas9 variant-based base editing gene scissors; the heat maps show the dependence of the average efficiency of the intended base conversion on adjacent nucleotides for YE1-BE4max (g of FIG. 19) and SsAPOBEC3B (h of FIG. 19) based on Cas9 variants including SpCas9, SpCas9-NG and SpCas9-NRCH.

FIG. 20 shows the specificity of SpCas9-YE1-BE4max, SpCas9-ABE8e (V106W) and SpCas9 variants with different PAM compatibility. FIG. 20 a is a heat map showing the specificity of SpCas9 variants in target sequences with single-base mismatches compared with perfectly matched target sequences, and FIGS. 20 b and c show the relative indel frequencies at sites comprising two (b of FIG. 20) or three (c of FIG. 20) consecutive mismatched bases at positions 1 to 20. FIGS. 20 d and e show heat maps of the specificity of SpCas9-YE1-BE4max (d of FIG. 20) and SpCas9-ABE8e (V106W) (e of FIG. 20), respectively, FIGS. 20 f and g show the dependence of the relative base editing efficiency of SpCas9-YE1-BE4max (f of FIG. 20) and SpCas9-ABE8e (V106W) (g of FIG. 20) on the single-base mismatch type, respectively, and FIGS. 20 h and i show the effect of two (h of FIG. 20) and three (i of FIG. 20) consecutive mismatched bases on base editing.

FIG. 21 shows deep learning-based predictions of the activity of Cas9 nuclease variants and base editing variants. FIG. 21 a shows a schematic of the algorithm used to develop the computational models, and FIGS. 21 b to d show the correlation between predicted and measured values in the target sequences of the separate test datasets, for the indel frequencies induced by Cas9 variants (b of FIG. 21) and for the base editing frequencies induced by base editing gene scissor variants comprising Cas9 variants (c to d of FIG. 21). FIG. 21 e shows the results of a performance evaluation of DeepBE carried out using a dataset that was generated with base editing gene scissor variants and was never used for training.

FIG. 22 shows the correction of pathogenic or likely pathogenic single-nucleotide polymorphisms (SNPs) using base editing gene scissors, and shows the overall editing efficiency of SNP correction and the expected bystander-free editing efficiency induced by CBE (a of FIG. 22), ABE (b of FIG. 22) and CGBE (c of FIG. 22), respectively.

FIG. 23 shows the generation and evaluation of variant expressing cell lines. Figure 23 a shows a schematic of library experiments and figure 23 b shows western blot analysis of Cas9 protein levels performed in cells expressing Cas9 variants, base editing gene scissor variants with different deaminase and base editing gene scissor variants comprising Cas9 variants.

FIG. 24 shows a comparison of base editing efficiency induced by different CBEs (a of FIG. 24) and ABEs (b of FIG. 24), red triangles representing target sequences, wherein the editing efficiency of one base editing gene scissors is at least 30% or more higher than the editing efficiency of the other base editing gene scissors.

FIG. 25 shows a comparison of base editing efficiencies induced by different CGBE, red triangles representing target sequences, wherein the editing efficiency of one base editing gene scissors is at least 30% or more higher than the editing efficiency of another base editing gene scissors.

Figure 26 shows the average frequency of insertion deletions induced by Cas9 variants in target sequences each comprising a 4-nt PAM sequence.

FIG. 27 shows a comparison of Cas9 variants with different PAM compatibility. The left heat maps show the maximum average indel frequency generated by any of the 10 Cas9 variants, and the right heat maps show the corresponding Cas9 variant with the highest average activity, in target sequences comprising each 4-nt PAM sequence. A candidate PAM is marked in white when the maximum average indel frequency is less than 5% (a of FIG. 27) or less than 20% (b of FIG. 27).

Fig. 28 shows the correlation between indel frequencies induced by Cas9 variants in targets with four exemplary PAMs.

FIG. 29 shows the effect of two consecutive base conversion mismatches on base editing efficiency. The frequency of indels induced by SpCas9 was analyzed in the same target sequences as those used to determine the base editing efficiency induced by SpCas9-YE1-BE4max (a of FIG. 29) and SpCas9-ABE8e (V106W) (c of FIG. 29), and is shown as a control.

FIG. 30 shows the effect of three consecutive base conversion mismatches on base editing efficiency. The frequency of indels induced by SpCas9 was analyzed in the same target sequences as those used to determine the base editing efficiency induced by SpCas9-YE1-BE4max (a of FIG. 30) and SpCas9-ABE8e (V106W) (c of FIG. 30), and is shown as a control.

FIG. 31 shows the correlation between the predicted DeepBE score and the measured base editing efficiency (a of FIG. 31) and ratio (b of FIG. 31).

FIG. 32 shows the structure of DeepBE, in which the prediction score of DeepCas9variants is concatenated with data obtained from base editing gene scissor variants including the corresponding Cas9 variants.

Detailed Description

Hereinafter, a more detailed description will be made by way of examples. However, these examples are for illustrative purposes only, and the scope of the present invention is not limited to these examples.

Example 1 preparation of materials

Example 1-1 plasmid preparation

To generate the backbone plasmid, the lentiCas9-Blast plasmid (Addgene, 52962) was first digested with XbaI and BamHI restriction enzymes (New England Biolabs, Ipswich, MA) and then treated with 1 μl of quick calf intestinal alkaline phosphatase (QCIAP, New England Biolabs) for 30 min at 37 ℃. The linearized fragment was then gel-purified using a MEGAquick-spin Total Fragment DNA purification kit (iNtRON Biotechnology, Seongnam, South Korea) according to the manufacturer's protocol.

PCR was performed using primers containing the desired mutations and Phusion High-Fidelity DNA polymerase. To achieve high protein expression levels, for Cas9 variants recognizing different PAMs (Cas9 PAM variants), the codons at the mutated sites were selected according to the suggestions of GenScript. For the deaminase region of the base editing gene scissors, the codons used in the original studies were adopted.

The resulting amplicon was gel purified and cloned into the digested lentiCas9-Blast plasmid using NEBuilder Hifi DNA Assembly Master Mix (New England Biolabs) and reacted at 50 ℃ for 1 hour. In the case of VRQR variants, xCas9, and SpCas9-NG, plasmids described in previous studies were used, which are available from Addgene (Addgene, 138562, 138565, and 138566).

Examples 1-2 library design and preparation

Library C included 11,994 pairs of guide RNA coding sequences and corresponding target sequences and was used to evaluate the activity of base editing gene scissors including Cas9 that recognizes NGG and non-NGG PAMs. The library included 515 endogenous target sequences previously evaluated with 179 or 180 guide RNA target pairs for each NNN PAM, and 36 different PAMs with 5 separate barcodes.

The oligonucleotides of library C were synthesized by Twist Bioscience (San Francisco, CA), PCR-amplified using Phusion High-Fidelity DNA polymerase (New England Biolabs), and assembled using NEBuilder HiFi DNA Assembly Master Mix (New England Biolabs) into a Lenti-gRNA-Puro vector (Addgene, 84752) digested with BsmBI (Enzynomics, Korea). After PCR purification using the MEGAquick-spin Total Fragment DNA purification kit (iNtRON Biotechnology), the product was transformed into Endura electrocompetent cells (Lucigen, Middleton, WI) to construct the first plasmid library. The plasmid library was then cut with BsmBI restriction enzyme (Enzynomics), treated with quick calf intestinal alkaline phosphatase (QCIAP, New England Biolabs), ligated with the optimized sgRNA scaffold and transformed into Endura electrocompetent cells (Lucigen). Plasmids were extracted using the Plasmid Maxi Kit (Qiagen, Hilden, Germany).

Examples 1-3 cell culture and lentiviral production

HEK293T cells (American Type Culture Collection) were maintained in Dulbecco's modified Eagle's medium (DMEM; Gibco, Waltham, MA) supplemented with 10% fetal bovine serum (FBS; Gibco). HEK293T cells were seeded one day prior to transfection and treated with chloroquine diphosphate for up to 5 hours on the day of transfection. Opti-MEM reduced serum medium (Gibco) was mixed with 120 μl of polyethylenimine reagent, 20 μg of lentiviral vector, 15 μg of psPAX2 and 5 μg of pMD2.G to a final volume of 1 ml; the solution was incubated at room temperature for 15 to 20 minutes and then added to the cell culture medium. The next day, the lentivirus-containing medium was removed and replaced with fresh DMEM (Gibco) supplemented with 10% FBS (Gibco). At 48 hours after transfection, the supernatant containing the variant virus was harvested directly; for the lentiviral plasmid library, residual library plasmids were removed by adding universal nuclease (Enzynomics) and universal nuclease buffer before the supernatant was harvested. The harvested supernatant was then stored at -80 ℃.

Examples 1-4 Generation of stable cell lines and transduction of lentiviral plasmid libraries

To generate cell lines expressing the variants, cells infected with lentivirus at a multiplicity of infection (MOI) of 0.15 were selected and continuously maintained with 20 μg ml⁻¹ blasticidin S (InvivoGen, San Diego, CA). The variant-expressing cells were seeded one day prior to transduction of the lentiviral plasmid library and then infected at an MOI of 0.4 in the presence of 10 μg ml⁻¹ polybrene. At 18 to 19 hours after transduction, the medium was replaced with fresh medium supplemented with 2 μg ml⁻¹ puromycin (Invitrogen, Waltham, MA) and 20 μg ml⁻¹ blasticidin S (InvivoGen). The stable cell lines and cell numbers used for each library are summarized below.

(1) Library A (8 × 10⁷ cells per cell line; 2 × 10⁷ cells were seeded into each of four 15-cm culture dishes)

Cas9 variant (day 4 harvest)

SpCas9, VRQR variant, xCas9, SpCas9-NG, SpCas9-NRRH, SpCas9-NRTH, SpCas9-NRCH, SpG, SpRY, and Sc++.

Base editing gene scissor variants based on SpCas9 (day 10 harvest)

YE1-BE4max and ABE8e (V106W).

(2) Library B (2 × 10⁸ cells per cell line; 2.5 × 10⁷ cells were seeded into each of eight 15-cm culture dishes)

Cas9 variant (day 4 harvest)

Base editing gene scissor variants based on SpCas9 (day 6 harvest)

YE1-BE4max and ABE8e (V106W).

CBE variants based on SpCas9-NG (day 6 harvest)

YE1-BE4max and SsAPOBEC3B.

ABE variants based on SpCas9-NG (harvest day 6)

ABE8e (V106W) and ABE8.17m-V106W.

C to G base editing Gene scissor variants based on SpCas9-NG (day 6 harvest)

CGBE1, miniCGBE1 and APOBEC-Cas9n-Ung.

(3) Library C (8 × 10⁷ cells per cell line; 2 × 10⁷ cells were seeded into each of four 15-cm culture dishes)

CBE variants (day 6 harvest)

SpCas9-NRCH-YE1-BE4max, SpRY-YE1-BE4max, and SpCas9-NRCH-SsAPOBEC3B.

ABE variant (day 6 harvest)

SpRY-ABE8e (V106W), SpCas9-NRCH-ABE8.17m-V106W and SpRY-ABE8.17m-V106W.

CGBE variants (harvested on day 6)

SpCas9-miniCGBE1 and SpCas9-NRCH-APOBEC-Cas9n-Ung.

Example 2 test methods and results measurements

Example 2-1 Deep sequencing (Deep sequencing)

Genomic DNA was isolated using the Wizard Genomic DNA Purification Kit (Promega, Fitchburg, WI) according to the manufacturer's instructions. The integrated sequences, including the sgRNA coding sequence, barcode and target sequence, were PCR-amplified using 2x Taq PCR Smart Mix (Solgent) in 48 individual 50 μl reactions with 5 μg of genomic DNA each (libraries A and C; a total of 240 μg of genomic DNA per technical replicate) or in 96 individual 50 μl reactions with 10 μg of genomic DNA each (library B; a total of 480 μg of genomic DNA per technical replicate). After library preparation, the PCR products were purified using the MEGAquick-spin Total Fragment DNA purification kit (iNtRON Biotechnology) according to the manufacturer's protocol. The amplicons were then sequenced on the NovaSeq 6000 system (Illumina) or the NextSeq 2000 system (Illumina).

Example 2-2 evaluation of variant Activity of nucleases and base editing Gene scissors

After deep sequencing, the data were analyzed using in-house Python scripts. To improve data accuracy, pairs containing i) errors in the guide RNA, scaffold or barcode that arose during oligonucleotide synthesis, PCR amplification or sequencing, or ii) shuffling between the barcode and guide RNA sequences were removed.

For the activity analysis of Cas9 variants, data with a background indel frequency greater than 8% and a total read count of less than 100 (library A) or 200 (library B) were filtered out. For the analysis of the intended base conversion by base editing gene scissor variants, data with a total read count of less than 100 were excluded. For the total base editing analysis, sequences with a total read count of less than 100 and a background base editing efficiency greater than 8% were excluded.
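The filtering thresholds above can be illustrated with a short sketch. This is not the in-house script itself: the record fields, the function name `passes_filter` and the toy values are hypothetical, and the sketch assumes that a target is discarded if it fails either criterion (too few reads, or too high a background frequency).

```python
from dataclasses import dataclass


@dataclass
class TargetRecord:
    target_id: str
    total_reads: int        # total read count for this target
    background_freq: float  # background indel (or editing) frequency, in percent


# Minimum read counts stated in the text for the Cas9 variant analysis.
MIN_READS = {"A": 100, "B": 200}


def passes_filter(rec, min_reads, max_background=8.0):
    """Keep a record only if it has enough reads and a low background.

    Assumption: failing either criterion removes the record.
    """
    return rec.total_reads >= min_reads and rec.background_freq <= max_background


records = [
    TargetRecord("t1", 350, 0.4),   # kept
    TargetRecord("t2", 80, 0.1),    # too few reads
    TargetRecord("t3", 500, 12.0),  # background above 8%
]
kept = [r.target_id for r in records if passes_filter(r, MIN_READS["A"])]
print(kept)
```

With the toy records above, only `t1` survives the library A thresholds.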

Examples 2-3 Western blot analysis

The harvested cells were lysed in a buffer containing 20 mM HEPES, 150 mM NaCl, 1% NP-40, 0.25% sodium deoxycholate and 10% glycerol, to which a 1:100 dilution of protease inhibitor cocktail (Cell Signaling Technology) was added. The mixture was placed on ice for 20 minutes, and the resulting cell lysate was centrifuged at 13,000 g for 15 minutes at 4 ℃. The total protein concentration of the supernatant was measured using a Bradford protein assay kit (Pierce). Proteins (30 μg per well) were loaded on 4 to 12% Bis-Tris gels, and the gels were run in 1× NuPAGE MOPS SDS running buffer (Invitrogen) at 120 V for 2 hours. The proteins were then transferred to 0.45 μm Invitrolon polyvinylidene fluoride (PVDF) membranes (Invitrogen) using an XCell II blotting module (Invitrogen); the transfer was performed on ice in 1× NuPAGE transfer buffer containing 10% (vol/vol) methanol. After blocking with 5% BSA for 1 hour, the membranes were probed overnight at 4 ℃ in 5% BSA with primary antibodies recognizing SpCas9 (catalog No. 844301, BioLegend) and β-actin (catalog No. sc-47778, Santa Cruz Biotechnology), diluted at ratios of 1:1,000 and 1:2,000, respectively. The membranes were then washed and incubated for 1 hour at room temperature with horseradish peroxidase (HRP)-conjugated goat anti-mouse IgG secondary antibody (catalog No. sc-516102, Santa Cruz Biotechnology) diluted at a ratio of 1:3,000. Antibodies were visualized with West-Q Pico ECL solution (GenDEPOT) using an ImageQuant LAS-4000 digital imaging system (GE Healthcare).

EXAMPLES 2-4 deep learning model

The data were randomly split into training and test datasets, and 5-fold cross-validation was performed for training. To predict the indel frequencies generated by Cas9 variants, 26,960 to 27,342 and 1,003 to 3,529 target sequences from libraries A and B were used as the training and test datasets, respectively. For the SpCas9-NG-based base editing gene scissors, 12,553 to 16,624 and 8,507 to 86,822 target sequences from library B were used for efficiency and outcome pattern training, respectively. From library C, 2,378 to 8,287 target sequences were used for efficiency training of base editing gene scissor variants including Cas9 variants.
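A random train/test split followed by 5-fold cross-validation can be sketched as follows. The function names, the 10% test fraction, the seed and the dataset size of 1,000 are illustrative assumptions only; the source does not state how the split was implemented.

```python
import numpy as np


def split_train_test(n_items, test_fraction, seed=0):
    """Randomly split item indices into disjoint training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_test = int(round(n_items * test_fraction))
    return order[n_test:], order[:n_test]


def k_fold_indices(train_idx, k=5):
    """Yield (fit, validation) index arrays for k-fold cross-validation."""
    folds = np.array_split(train_idx, k)
    for i in range(k):
        val = folds[i]
        fit = np.concatenate([folds[j] for j in range(k) if j != i])
        yield fit, val


train_idx, test_idx = split_train_test(n_items=1000, test_fraction=0.1)
fold_sizes = [(len(fit), len(val)) for fit, val in k_fold_indices(train_idx)]
print(len(train_idx), len(test_idx), fold_sizes)
```

Each of the five folds serves once as the validation set, while the held-out test indices are never used during training.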

The input sequences were converted into numerical values by one-hot encoding, and zero padding was applied to keep the size of the input constant. To extract features from the input sequence, one convolution layer consisting of 1,000 or 2,000 filters with a length of 10 nt was used for DeepCas9variants, one convolution layer consisting of 1,024 filters with a length of 30 nt was used for the efficiency model of DeepNG-BE, and one convolution layer consisting of 256 or 1,024 filters with a length of 3 nt was used for the ratio model of DeepNG-BE. As in the deep reinforcement learning algorithm, the pooling layer was omitted in order to preserve local information. A flatten layer was used to generate a one-dimensional input, and all models consisted of two or three dense layers. For the first or second dense layer, DeepCas9variants used 1,000 or 1,500 nodes, the efficiency model of DeepNG-BE used 1,500 or 2,000 nodes, and the ratio model of DeepNG-BE used 2,500 or 5,000 nodes. In the last dense layer, DeepCas9variants used 100 nodes, and the efficiency model of DeepNG-BE and the ratio models of YE1-BE4max, SsAPOBEC3B, ABE and CGBE used 31, 127, 255 and 31 nodes, respectively. The output layer of DeepCas9variants generates the DeepCas9variants prediction score, and the DeepNG-BE prediction score is generated by multiplying the output of the efficiency model of DeepNG-BE by the output of the ratio model.
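One-hot encoding with zero padding, as described above, can be sketched in a few lines. The base order `ACGT`, the function name and the fixed length of 6 are assumptions chosen for illustration; the actual input lengths and channel order of the models are not given here.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order


def one_hot_encode(seq, length):
    """Return a (length, 4) binary matrix for `seq`.

    Positions beyond len(seq) are left as all-zero rows (zero padding),
    and any character outside A/C/G/T is also encoded as an all-zero row.
    """
    if len(seq) > length:
        raise ValueError("sequence is longer than the fixed input length")
    mat = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            mat[i, BASES.index(base)] = 1.0
    return mat


x = one_hot_encode("ACGT", length=6)
print(x.shape)        # (6, 4)
print(x.sum(axis=1))  # one active channel per real base, zero for padding
```

Padding every sequence to the same length lets sequences of different lengths be stacked into a single input tensor for the convolution layer.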

Because the outcome of base editing is determined by the deaminase, the ratio model of DeepNG-BE was used for DeepBE. To develop the efficiency model, data obtained with base editing gene scissors comprising SpCas9-NG or other Cas9 variants were used to generate input sequences with lengths of 7, 9 or 10 nt. The input sequence was converted into a binary matrix by one-hot encoding, and zero padding was used. The convolution layer used 256, 512 or 1,024 filters, and the extracted features were flattened. To take the guide sequence preference of Cas9 variants into account, the DeepCas9variants prediction score was concatenated to the flatten layer. The outputs of the efficiency and ratio models were multiplied to generate the prediction score for base editing gene scissors comprising the Cas9 variant.
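The way the two model outputs are combined can be sketched with plain NumPy. This is only an illustration of the arithmetic, not the trained models: the function names, the feature vector, the efficiency value of 40 and the three ratio logits are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax; the result sums to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()


def concat_cas9_score(flat_features, cas9_score):
    """Append a DeepCas9variants-style score to the flattened features."""
    return np.concatenate([np.asarray(flat_features, dtype=float), [cas9_score]])


def predict_outcome_scores(efficiency, ratio_logits):
    """Multiply the overall efficiency by the outcome proportions."""
    return efficiency * softmax(ratio_logits)


features = concat_cas9_score([0.2, 0.7, 0.1], cas9_score=55.0)
scores = predict_outcome_scores(efficiency=40.0, ratio_logits=[2.0, 1.0, 0.0])
print(features.shape, scores, scores.sum())
```

Because the ratio output is a set of proportions summing to 1, the per-outcome scores add up to the predicted overall efficiency.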

Dropout layers with a rate of 0.3 were used to avoid overfitting, and the rectified linear unit (ReLU) was used as the activation function for all layers. The outputs of DeepCas9variants and of the efficiency models of DeepNG-BE and DeepBE were obtained by linear transformation. For the output layers of the ratio models of DeepNG-BE and DeepBE, the softmax function was applied as the activation function. The mean absolute error was used as the loss function, and the Adam optimizer with a learning rate of 10⁻⁴ was used. TensorFlow was used to develop the models.
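For readers unfamiliar with these building blocks, the following NumPy sketch shows ReLU, dropout at a rate of 0.3 and the mean absolute error. It is a didactic re-implementation rather than the TensorFlow code of the models; the "inverted dropout" rescaling is a common formulation assumed here, and all input values are made up.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(x, 0.0)


def dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`,
    then rescale the surviving units by 1 / (1 - rate)."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


rng = np.random.default_rng(0)
hidden = relu(np.array([-1.0, 0.5, 2.0, -0.2]))
dropped = dropout(hidden, rate=0.3, rng=rng)
loss = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(hidden, dropped, loss)
```

Dropout is applied only during training; at prediction time the layer passes its input through unchanged.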

Examples 2-5 statistical significance

The Wilcoxon rank-sum test was used for g and h of FIG. 19. Statistical significance was analyzed using SPSS Statistics (version 25, IBM).

Test example 1 Large-scale assessment of the activity of Cas9 and base editing gene scissor variants

To compare the base editing and nuclease activities of the variants, cell lines expressing these variants at similar levels were first generated. Because codon usage affects protein expression levels, the same codons as those in the widely used SpCas9 coding sequence were used for the Cas9 variants, except at the mutated sites, where codons were selected based on the suggestions of GenScript, which resulted in high expression levels of the SpCas9 base editing gene scissors. HEK293T cells were transduced with individual lentiviral vectors encoding Cas9 or base editing gene scissor variants at a multiplicity of infection (MOI) of 0.15 so that the transduced cells would carry only one copy of the sequence encoding the Cas9 or base editing gene scissor variant, and untransduced cells were removed by blasticidin S selection. Western blotting showed that the levels of most Cas9 and base editing gene scissor variant proteins were similar; as an exception, NG-ABE8e (V106W) showed statistically significantly higher protein levels than the three YE1-BE4max variants (NG-YE1-BE4max, SpRY-YE1-BE4max and NRCH-YE1-BE4max) and the two APOBEC-nCas9-Ung variants (b of FIG. 23).
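The rationale for the low MOI can be made concrete with a small calculation. The sketch below assumes that the number of integrations per cell follows a Poisson distribution with mean equal to the MOI; this is a common simplification and is not stated in the source.

```python
import math


def single_copy_fraction(moi):
    """P(exactly one integration | at least one), under a Poisson model."""
    p_one = moi * math.exp(-moi)
    p_at_least_one = 1.0 - math.exp(-moi)
    return p_one / p_at_least_one


frac_low = single_copy_fraction(0.15)
frac_high = single_copy_fraction(1.0)
print(round(frac_low, 3), round(frac_high, 3))
```

Under this assumption, roughly 93% of the transduced cells carry a single copy at an MOI of 0.15, whereas at an MOI of 1.0 the single-copy fraction is considerably lower, which is why a low MOI followed by antibiotic selection is used.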

To measure the activity of base editing gene scissors and Cas9 in multiple target sequences, a high-throughput approach using libraries of paired sgRNA-encoding and target sequences was used. Libraries A and B, which had been prepared previously, and library C, which was generated for the present invention, were used; they included 11,802, 23,679 and 11,994 pairs of sgRNA-encoding and target sequences, respectively. Library A included 8,130 pairs for PAM compatibility assessment and 2,940 and 732 pairs for assessing mismatch tolerance with NGG and non-NGG PAM sequences, respectively; library B included 8,744, 12,093 and 2,660 pairs with NGG, NGH and non-NG PAM sequences, respectively, to measure the activity of Cas9 and base editing gene scissor variants with various PAM compatibility in multiple target sequences; and library C included 179 or 180 pairs for each NNN PAM sequence to determine the activity of base editing gene scissors including Cas9 variants that recognize NGG and non-NGG PAMs.

To increase the accuracy of the data, i) sequences containing technical errors in the sgRNA, scaffold or barcode that arose during oligonucleotide synthesis, PCR amplification or sequencing and ii) sequences in which the barcode and sgRNA regions were shuffled during lentivirus production were removed. When the indel frequencies before and after removal of the sequence reads containing errors or shuffling were compared, the frequencies obtained without such reads were higher and, as expected, a high correlation was observed between the two values (a of FIG. 2). Since the indel frequencies showed a high correlation between two biological replicates (b of FIG. 2) and between technical replicates (FIGS. 3A to 3G), the results obtained from the two replicates were combined to reach more generalized and accurate conclusions.
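The replicate comparison can be sketched as a Pearson correlation followed by combining the replicates. The indel frequencies below are invented, and averaging is only one possible way of combining replicates; the source does not specify whether values were averaged or reads were pooled.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical indel frequencies (%) for the same five targets in two replicates.
rep1 = [5.0, 20.0, 41.0, 63.0, 80.0]
rep2 = [6.0, 18.0, 45.0, 60.0, 83.0]

r = pearson_r(rep1, rep2)
combined = np.mean([rep1, rep2], axis=0)  # assumption: replicates are averaged
print(round(r, 3), combined)
```

A correlation close to 1 between replicates supports combining them to reduce measurement noise.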

Test example 2 Editing windows, product purity and preferred motifs of CBE and ABE variants

Deaminase is an essential component of base editing gene scissors, and the application of base editing is often limited by insufficient editing activity or by off-target effects on DNA and RNA, especially Cas9-independent effects. CBEs and ABEs with advanced base conversion regions, including YE1-BE4max and SsAPOBEC3B as CBEs and ABE8e (V106W) and ABE8.17-m+V106W as ABEs, have been reported to have high on-target activity and minimal off-target effects, but the activities of these base editing gene scissors have not been widely compared in a large number of target sequences, and it has therefore been difficult to select the most suitable base editing gene scissors. Thus, to compare and evaluate the activity, editing window and specificity of base editing gene scissors comprising these advanced base conversion regions, the base conversion regions were combined with SpCas9-NG, a SpCas9 variant with broad PAM compatibility.

First, the window for the intended base conversion was determined. Both SpCas9-NG-YE1-BE4max and SpCas9-NG-SsAPOBEC3B showed the highest base editing activity at position 6 (position 20 of the guide sequence is adjacent to the PAM, and position 1 is the position 20 base pairs away from the PAM). The editing window of SpCas9-NG-SsAPOBEC3B spanned positions 2 to 13, which is wider than that of SpCas9-NG-YE1-BE4max, which spanned positions 4 to 8 (a of FIG. 17). These two CBEs were rarely observed to induce conversions other than C·G to T·A (FIGS. 4A and 4B). The purity of the C-to-T conversion, relative to C-to-A or C-to-G conversion, induced by SpCas9-NG-YE1-BE4max and SpCas9-NG-SsAPOBEC3B ranged from 98.1% to 98.9% and from 98.7% to 99.5%, respectively (b of FIG. 17). Next, the preferred motifs of the two CBEs were analyzed and found to differ. In other words, SpCas9-NG-YE1-BE4max preferred (A/C/T)cN motifs (c is the target nucleotide), whereas SpCas9-NG-SsAPOBEC3B showed high activity at TcC and Gc(A/C/T) motifs (c of FIG. 17). Similar motif effects were observed at other positions within the base editing window (FIG. 5). This effect was not observed for indel generation by the corresponding SpCas9-NG (FIG. 6), which supports the conclusion that the effect is due to the base conversion region.
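The position numbering and motif notation used above can be illustrated with a short helper. The sketch assumes that the guide sequence is written 5' to 3' with position 20 next to the PAM, and that a motif such as (A/C/T)cN lists the 5' neighbor, the target base (lowercase) and the 3' neighbor. The guide sequence, function names and regular expression are hypothetical.

```python
import re


def motif_at(guide, position):
    """Trinucleotide centred on the 1-based `position` of a 20-nt guide,
    with the central (target) base written in lowercase."""
    if len(guide) != 20 or not 2 <= position <= 19:
        raise ValueError("need a 20-nt guide and a position from 2 to 19")
    i = position - 1
    tri = guide[i - 1:i + 2].upper()
    return tri[0] + tri[1].lower() + tri[2]


def in_window(position, start, end):
    """True if `position` lies inside the editing window [start, end]."""
    return start <= position <= end


# (A/C/T)cN written as a regular expression.
HCN = re.compile(r"[ACT]c[ACGT]")

guide = "GATTACAGGCTTAACCGGTA"  # made-up 20-nt guide, PAM-proximal end on the right
motif = motif_at(guide, 6)
print(motif, bool(HCN.fullmatch(motif)))
print(in_window(10, 4, 8), in_window(10, 2, 13))
```

Here position 10 falls outside a window spanning positions 4 to 8 but inside one spanning positions 2 to 13, mirroring the difference between the two CBE windows described in the text.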

The overall base editing activity of SpCas9-NG-SsAPOBEC3B was higher than that of SpCas9-NG-YE1-BE4max. In other words, the median editing activities of SpCas9-NG-SsAPOBEC3B and SpCas9-NG-YE1-BE4max at position 6 were 26% and 13%, respectively. However, in some target sequences, the base editing activity of SpCas9-NG-YE1-BE4max was higher than that of SpCas9-NG-SsAPOBEC3B. When position 6 was analyzed, the base editing activities of SpCas9-NG-YE1-BE4max and SpCas9-NG-SsAPOBEC3B were at least 30% higher than those of SpCas9-NG-SsAPOBEC3B and SpCas9-NG-YE1-BE4max in 19% and 64% of the target sequences, respectively (d of FIG. 17). In the 19% of target sequences in which SpCas9-NG-YE1-BE4max showed higher activity than SpCas9-NG-SsAPOBEC3B, the AcG motif was highly enriched and the (A/C/T)cN motif was slightly enriched. In contrast, in the 64% of target sequences in which SpCas9-NG-SsAPOBEC3B showed higher activity than SpCas9-NG-YE1-BE4max, the GcN motif was highly enriched and the AcG motif was depleted. Similar motif effects were found at other positions within the base editing window (a of FIG. 24). Therefore, for efficient base editing, preferred base editing gene scissors having an appropriate base conversion region can be selected according to the motif surrounding the target nucleotide C.
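A pairwise comparison of this kind can be sketched as follows. The sketch interprets "at least 30% higher" as a relative difference (editor A is at least 1.3 times editor B); this interpretation, the function name and the efficiencies are assumptions for illustration.

```python
def fraction_favoring(eff_a, eff_b, fold=1.3):
    """Fraction of targets where editor A is at least `fold` times editor B."""
    if len(eff_a) != len(eff_b) or not eff_a:
        raise ValueError("need two non-empty lists of equal length")
    wins = sum(1 for a, b in zip(eff_a, eff_b) if a > 0 and a >= fold * b)
    return wins / len(eff_a)


# Hypothetical position-6 editing efficiencies (%) for the same five targets.
ye1 = [30.0, 5.0, 12.0, 8.0, 20.0]
ss3b = [10.0, 25.0, 30.0, 9.0, 21.0]

frac_ye1 = fraction_favoring(ye1, ss3b)
frac_ss3b = fraction_favoring(ss3b, ye1)
print(frac_ye1, frac_ss3b)
```

Targets on which neither editor exceeds the other by the chosen margin count for neither side, so the two fractions need not sum to 1.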

The editing windows of SpCas9-NG-ABE8e (V106W) and SpCas9-NG-ABE8.17-m+V106W were similar. Both spanned positions 4 to 8, and the highest activity was observed at position 6 (e of FIG. 17). The two ABEs were rarely observed to induce base conversions other than A·T to G·C, with the exception that editing from C·G to G·C and from C·G to T·A was observed at low levels, with the highest activity at position 6 (FIGS. 4C and 4D). The purity of the A-to-G conversion, relative to A-to-C or A-to-T conversion, induced by SpCas9-NG-ABE8e (V106W) and SpCas9-NG-ABE8.17-m+V106W differed by position, ranging from 98.1% to 98.8% and from 97.8% to 98.5%, respectively (f of FIG. 17). Analysis of the motifs preferred by the two ABEs showed that both similarly preferred Ca(C/T) and TaB motifs (g of FIG. 17). As observed in the above experiments using CBEs, similar motif effects were found at other positions within the base editing window (FIG. 7).

The overall base editing activity of SpCas9-NG-ABE8e (V106W) was slightly higher than that of SpCas9-NG-ABE8.17-m+V106W. In other words, the median editing activities of SpCas9-NG-ABE8e (V106W) and SpCas9-NG-ABE8.17-m+V106W at position 6 were 7% and 5%, respectively. However, in some target sequences, the base editing activity of SpCas9-NG-ABE8.17-m+V106W was higher than that of SpCas9-NG-ABE8e (V106W). When position 6 was analyzed, the base editing activities of SpCas9-NG-ABE8e (V106W) and SpCas9-NG-ABE8.17-m+V106W were at least 30% higher than those of SpCas9-NG-ABE8.17-m+V106W and SpCas9-NG-ABE8e (V106W) in 56% and 12% of the target sequences, respectively. In the 56% of target sequences in which SpCas9-NG-ABE8e (V106W) showed higher activity than SpCas9-NG-ABE8.17-m+V106W, CaA motifs were most enriched, and Ca(C/G/T)B and Ta(A/C) were slightly enriched. In contrast, in the 12% of target sequences in which SpCas9-NG-ABE8.17-m+V106W showed higher activity than SpCas9-NG-ABE8e (V106W), AaT and AaV were highly and slightly enriched, respectively (h of FIG. 17). Similar motif effects were found at other positions within the base editing window (b of FIG. 24). Therefore, for efficient base editing, preferred base editing gene scissors having an appropriate base conversion region can be selected according to the motif surrounding the target nucleotide A.

Test example 3 Editing window, product purity and preferred motifs of CGBE variants

The activities of three CGBE variants based on the SpCas9-NG nickase were compared. The window of C·G-to-G·C editing of these variants spanned positions 5 to 7, and the maximum activity occurred at position 6 (a of FIG. 18). SpCas9-NG-CGBE1, SpCas9-NG-miniCGBE1 and SpCas9-NG-APOBEC-nCas9-Ung showed overall C·G-to-G·C editing activity in decreasing order as listed, with median activities of 3.0%, 2.6% and 1.4%. In addition to C·G-to-G·C editing, conversions of C·G to T·A and of C·G to A·T were frequently observed with all three CGBE variants (for SpCas9-NG-CGBE1, SpCas9-NG-miniCGBE1 and SpCas9-NG-APOBEC-nCas9-Ung, the median C·G-to-T·A conversion activities were 2.4%, 3.5% and 4.5%, respectively, and the median C·G-to-A·T conversion efficiencies were 0.8% and 0.9%) (FIGS. 4E, 4F and 4G); the purities of C·G-to-G·C editing by SpCas9-NG-CGBE1, SpCas9-NG-miniCGBE1 and SpCas9-NG-APOBEC-nCas9-Ung were only 51%, 44% and 30% at position 6, respectively. A similar trend was found for the purity of C·G-to-G·C editing at positions 5 and 7 (highest for SpCas9-NG-CGBE1 and lowest for SpCas9-NG-APOBEC-nCas9-Ung) (b of FIG. 18). The motifs preferred by the CGBEs were analyzed and found to be similar. In other words, at position 6, (A/T)cT was the most preferred and (A/T)c(A/C/G) was the second most preferred (c of FIG. 18). Similar motif effects were found at positions 5 and 7 of the editing window (FIG. 8), as observed in the experiments using CBEs and ABEs described above. Importantly, because the preferred motifs differ between C·G-to-G·C editing and C·G-to-T·A editing (a to c of FIG. 9), the purity of C·G-to-G·C editing also differs significantly depending on the nucleotides adjacent to the target cytosine (d to f of FIG. 9). Specifically, the purity of C·G-to-G·C editing was higher at the (A/G/T)cT motif for SpCas9-NG-miniCGBE1 and SpCas9-NG-APOBEC-nCas9-Ung and at the NcT motif for SpCas9-NG-CGBE1.
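Product purity as used above can be expressed as a simple ratio. The sketch assumes that purity is the frequency of the desired conversion divided by the summed frequency of all conversions of the target base; the function name and the example frequencies are hypothetical and are not the measured values.

```python
def product_purity(desired, byproducts):
    """Share of the desired conversion among all conversions of the target base."""
    total = desired + sum(byproducts)
    if total <= 0:
        raise ValueError("no editing observed; purity is undefined")
    return desired / total


# Hypothetical frequencies (%) at one target cytosine:
# C-to-G (desired), C-to-T and C-to-A (by-products).
purity = product_purity(desired=3.0, byproducts=[2.0, 1.0])
print(purity)
```

A purity of 0.5 means that half of the edited alleles at that position carry the intended C-to-G conversion.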

The overall C·G-to-G·C editing activities of SpCas9-NG-CGBE1, SpCas9-NG-miniCGBE1 and SpCas9-NG-APOBEC-nCas9-Ung decreased in the order listed, but the relative editing efficiency varied depending on the target sequence. In some target sequences, the base editing activities of SpCas9-NG-miniCGBE1 and SpCas9-NG-APOBEC-nCas9-Ung were higher than those of SpCas9-NG-CGBE1 and SpCas9-NG-miniCGBE1, respectively. When analyzed at position 6, the base editing activity from C·G to G·C of SpCas9-NG-miniCGBE1 is at least 30% higher than SpCas9-NG-CGBE1, and the activity of SpCas9-NG is at least 30% higher than the activity of SpCas9-NG-CGBE1 under SpCas9-NG-miniCGBE1 in 17% and 16% of the target sequences, respectively (d to f of FIG. 18). These target sequence-dependent differences in the relative efficiency of the CGBE variants are related to the differences in preferred motifs between the variants. Similar motif effects were found at other positions within the base editing window (FIG. 25). Thus, as with CBEs and ABEs, a preferred CGBE variant can be selected based on the motif surrounding the target nucleotide C for efficient base editing.

Test example 4 determination of Cas9 variants with highest Activity on a given PAM sequence

In previous studies, it was found that the broadest PAM compatibility and the highest nuclease activity can be obtained when an appropriate choice is made among four major SpCas9 variants, namely SpCas9-NG, SpCas9, the VRQR variant and xCas9. However, if a PAM is defined as a sequence for which the average indel frequency in the corresponding target sequences exceeds 10% or 5% at 4 days after transduction of library A, the combination of these four variants covers only 131 (51%) or 156 (61%) of the 256 possible NNNN PAM sequences. Thus, no effective Cas9 nuclease is available for the remaining 49% or 39% of the possible PAM sequences, and SpCas9 variants with diverse PAM compatibility need to be developed, in particular for PAM sequences that cannot be targeted with the existing four SpCas9 variants.

To overcome some of these limitations of PAM compatibility, five additional SpCas9 variants with broad or different PAM compatibility, including SpCas9-NRRH, SpCas9-NRTH, SpCas9-NRCH, SpG and SpRY, were developed through extensive comparative studies. In addition, Sc++, a Cas9 variant derived from Streptococcus canis, has recently been proposed and is known to have broad PAM compatibility, high on-target activity and low off-target effects. Selecting a Cas9 variant for a given target sequence can be confusing, particularly because the PAM compatibilities of some of these variants partially overlap.

To determine the most potent Cas9 variant for a given PAM sequence, the activities of the four SpCas9 variants above (SpCas9-NG, SpCas9, the VRQR variant, and xCas9), the five recently developed SpCas9 variants (SpCas9-NRRH, SpCas9-NRTH, SpCas9-NRCH, SpG, and SpRY), and Sc++ were assessed using library A, which was previously used to determine the PAM compatibility of SpCas9 variants and contains 7,680 target sequences (30 sgRNAs × 256 NNNN PAM sequences).

As a result, when a PAM was defined as a sequence for which the average indel frequency of the corresponding target sequences exceeded 10% 4 days after transduction of library A, 215 (84%) of the 256 4-nt (NNNN) sequences were confirmed to be usable as a PAM by at least one of the 10 (= 4 + 5 + 1) variants tested (a and b of fig. 19, and fig. 26). When 5% or 20% was used as the threshold instead of 10%, 234 (91%) or 167 (65%) of the 256 4-nt (NNNN) sequences were determined to be PAMs, respectively (fig. 27). In determining which variant showed the highest activity (with an indel frequency of at least 10%) on targets with each NNNN PAM sequence, SpRY, SpCas9-NG, SpG and Sc++ showed the highest activity on the targets with 57, 41, 28, 25, 22, 20, 13, 7 and 2 PAM sequences, respectively. Among the 84 newly added PAM sequences (= 215 - 131), SpCas9-NRCH, SpCas9-NRRH, SpRY, SpCas9-NRTH, SpG and Sc++ were the most potent nucleases at 35, 24, 13, 10, 1 and 1 PAM sequences, respectively. Taken together, these results demonstrate that adding the newly developed variants, particularly SpCas9-NRCH, SpCas9-NRRH, SpRY, and SpCas9-NRTH, significantly expands the range of sequences that can be effectively targeted.
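The PAM tally described above can be illustrated with a minimal sketch. The function names, the dictionary layout, and the toy frequencies are assumptions made for illustration and are not taken from the application; only the rule itself (a 4-nt sequence counts as a PAM when the average indel frequency of at least one variant exceeds the threshold, and the variant with the highest frequency is recorded) follows the text.

```python
from itertools import product


def all_pams():
    """Enumerate the 256 possible 4-nt (NNNN) PAM sequences."""
    return ["".join(p) for p in product("ACGT", repeat=4)]


def covered_pams(mean_indel, threshold=10.0):
    """Map each covered PAM to the variant with the highest mean indel frequency.

    mean_indel: {variant: {pam: mean indel frequency in %}} (hypothetical layout).
    A PAM counts as covered when the best variant exceeds `threshold`.
    """
    best = {}
    for pam in all_pams():
        scores = {v: freqs.get(pam, 0.0) for v, freqs in mean_indel.items()}
        if not scores:
            continue
        top = max(scores, key=scores.get)
        if scores[top] > threshold:
            best[pam] = top
    return best


# Toy values for illustration only, not measured data.
toy = {
    "SpCas9": {"AGGA": 62.0, "TGGC": 55.0},
    "SpRY": {"AGGA": 30.0, "TACA": 18.0, "CCCC": 4.0},
}
best = covered_pams(toy, threshold=10.0)
print(len(best), "of", len(all_pams()), "PAMs covered:", best)
```

Changing `threshold` to 5.0 or 20.0 reproduces the alternative criteria mentioned in the text on whatever data are supplied.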

Next, it was investigated whether the relative activity of Cas9 variants with different PAM compatibility is affected by the composition of the guide sequence in target sequences sharing a given 4-nt PAM. The correlations between the indel frequencies induced by the 9 Cas9 variants were found to vary widely, with median Pearson correlation coefficients ranging from -0.20 to 0.88 (c of fig. 19 and fig. 28). The median correlation between the indel frequencies of two biological replicates of the same Cas9 variant was 0.95 (for SpCas9) and 0.94 (for SpCas9-NRCH), which is higher than the median of every correlation between the frequencies of two different Cas9 variants, indicating that the relative activity of Cas9 variants at sites with a given PAM sequence differs depending on the guide sequence composition. The correlations among the activities of SpG, the VRQR variant, SpCas9-NRCH, xCas9 and SpCas9 are relatively high, whereas the correlation between the activity of SpRY and that of any other variant is low, and sometimes even negative. The percentage of guide sequences for which the activity of the preferred Cas9 variant at a given PAM sequence was at least 1.3-fold lower than that of one of the remaining Cas9 variants at that PAM ranged from 0% to 47% (average 9.7%, median 6.7%) (d of fig. 19), indicating that not only the PAM sequence but also the guide sequence should be considered when selecting the preferred Cas9 variant.
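The percentage in the last sentence can be computed as in the following minimal sketch. The function name, the argument layout, and the numbers are illustrative assumptions; the sketch only encodes the stated criterion that another variant is at least 1.3-fold more active than the variant preferred for that PAM.

```python
def fraction_outperformed(preferred, others, fold=1.3):
    """Fraction of guides for which another variant beats the preferred one.

    preferred: indel frequencies (%) of the PAM-preferred variant, one per guide.
    others: {variant: indel frequencies (%)} for the remaining variants,
            listed in the same guide order (hypothetical layout).
    A guide is counted when some other variant is at least `fold` times as
    active as the preferred variant.
    """
    if not preferred:
        raise ValueError("at least one guide is required")
    count = 0
    for i, pref in enumerate(preferred):
        best_other = max(freqs[i] for freqs in others.values())
        if best_other > 0 and best_other >= fold * pref:
            count += 1
    return count / len(preferred)


# Toy values for illustration only.
preferred = [50.0, 20.0, 10.0]
others = {"variant_A": [40.0, 30.0, 5.0], "variant_B": [10.0, 25.0, 12.0]}
print(fraction_outperformed(preferred, others))
```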

Test example 5. Relationship between the activities of Cas9 nucleases and base editing gene scissors

As shown above, the windows showing the highest base editing gene scissor activity are very narrow and located at a set distance from the PAM. However, wild-type SpCas9 requires an NGG PAM sequence, which can theoretically be found every 16 base pairs. Thus, efficient base editing with minimal bystander effects for a given desired edit is often prevented by the lack of an NGG PAM at the appropriate position. When base editing gene scissors are used, this problem can be solved using Cas9 variants with different PAM compatibility. The above test examples provide guidelines for selecting an appropriate Cas9 nuclease for a given target sequence, but whether these conclusions, drawn from nuclease activity assessments, can be directly extrapolated to base editing had not yet been assessed, particularly given that base editing gene scissors include Cas9 nickases instead of Cas9 nucleases.

Thus, using SpCas9-NRCH and SpRY as example variants, the average efficiencies of base editing gene scissors and Cas9 nucleases at sites with different PAM sequences were compared. As expected, the relative average efficiencies of the nucleases and of base editing gene scissors including CBE, ABE and CGBE at sites with a given PAM sequence showed a high correlation (e and f of fig. 19 and fig. 10), indicating that Cas9 nickase variants can be selected for the three types of base editing gene scissors based on the activity of the corresponding Cas9 nuclease variants at sites with the target PAM sequence. In addition, using Cas9 nickase variants with different PAM compatibility affects the overall base editing efficiency of CBE, ABE, and CGBE, but does not affect the relative editing window or the motifs preferred for base editing (fig. 11 and 12).

Test example 6. Activity of SpCas9 variants recognizing various PAMs, ABE8e(V106W), and YE1-BE4max on mismatched target sequences

To test the specificity of the SpCas9 variants, the indel frequencies they induced at mismatched target sequences were normalized to the indel frequencies at the matched targets 4 days after transduction of lentiviral library A. For this analysis, library A included 2,940 sgRNA-target pairs (30 sgRNAs × 98 targets: 1 target with no mismatch + 60 targets with a 1-base mismatch + 19 targets with a 2-base mismatch + 18 targets with a 3-base mismatch) with an NGG PAM (a of fig. 20), which is compatible with all tested Cas9 variants except Sc++. However, the SpCas9 variants combined with these 30 sgRNAs induced different indel frequencies at the matched target sequences. When activities at matched target sequences vary greatly between the groups being compared, the comparison of activities at mismatched target sequences may be biased. Therefore, 23 sgRNAs were selected for which the SpCas9 variants induced relatively similar indel frequencies at the matched target sequences 4 days after transduction, although even after this selection the average activities of SpCas9 and SpRY were higher and lower, respectively, than those of the other Cas9 variants (a of fig. 13). When specificity is defined as "1 - (the indel frequency at the mismatched target sequence divided by the indel frequency at the perfectly matched target)", the general specificity of the variants is similar except for SpCas9 and SpRY (a of fig. 20). In other words, the specificity of SpCas9 and SpRY may be relatively underestimated and overestimated, respectively, because of the high and low indel frequencies measured at the matched target sequences. The mismatch intolerance of all tested SpCas9 variants was highest at position 15 and gradually decreased as the position approached 1 or 20, and intolerance was higher in the PAM-proximal region (positions 11 to 20) than in the PAM-distal region (positions 1 to 10). This result contrasts with the results for variants such as eSpCas9(1.1), SpCas9-HF1, HypaCas9, and evoCas9, which show two major peaks of mismatch intolerance around positions 15 and 16.
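The specificity definition given above can be written out as a minimal sketch. The function names, the record layout, and the toy values are assumptions for illustration; only the formula, 1 minus the ratio of the mismatched-target indel frequency to the matched-target indel frequency, comes from the text.

```python
def specificity(mismatch_indel, matched_indel):
    """Return 1 - (mismatched-target indel frequency / matched-target indel frequency)."""
    if matched_indel <= 0:
        raise ValueError("matched-target indel frequency must be positive")
    return 1.0 - mismatch_indel / matched_indel


def mean_specificity_by_position(records):
    """Average specificity per mismatch position.

    records: iterable of (position, mismatch_indel, matched_indel) tuples,
             with position 1 farthest from the PAM and position 20 nearest
             (hypothetical layout).
    """
    sums, counts = {}, {}
    for position, mismatch_indel, matched_indel in records:
        value = specificity(mismatch_indel, matched_indel)
        sums[position] = sums.get(position, 0.0) + value
        counts[position] = counts.get(position, 0) + 1
    return {pos: sums[pos] / counts[pos] for pos in sorted(sums)}


# Toy values for illustration only, not measured data.
toy_records = [(15, 2.0, 40.0), (15, 6.0, 60.0), (1, 30.0, 40.0)]
profile = mean_specificity_by_position(toy_records)
print(profile)
```

Because the ratio is taken per sgRNA against its own matched target, differences in absolute activity between variants are partly, but not fully, normalized, which is the bias the paragraph above discusses for SpCas9 and SpRY.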

When the effect of mismatch type on mismatch tolerance was investigated, all tested variants showed the highest tolerance for wobble transitions and the lowest tolerance for base transitions (b of fig. 13), consistent with previous results obtained from experiments using Cas12a (or Cpf1) and SpCas9. In addition, the number of consecutive mismatched bases was found to have a large impact on the relative activity at mismatched targets: tolerance decreased sharply as the number of consecutive mismatched bases increased to two or three (b of fig. 20, fig. 14 and fig. 15).

In particular, the sgRNA-dependent activity of base editing gene scissors at mismatched target sequences has never been systematically investigated, in contrast to that of Cas9 nucleases. Thus, the specificity of two base editing gene scissors, SpCas9-YE1-BE4max and SpCas9-ABE8e(V106W), was next assessed using the 2,940 sgRNA-target pairs (30 sgRNAs × 98 matched and mismatched targets). The general specificity of the two base editing gene scissors tested was similar to that of SpCas9 nuclease (d and e of fig. 20). In other words, as observed for Cas9 nucleases, the specificity was lower in the PAM-distal region (positions 1 to 10) and higher in the PAM-proximal region (positions 11 to 20), peaking at position 15 and gradually decreasing as the position approached 1 or 20. As in the case of Cas9 nucleases, both base editing gene scissors showed the highest tolerance for wobble transitions and the lowest tolerance for base transitions among mismatched targets (f and g of fig. 20). As a further similarity to Cas9 nucleases, the tolerance of base editing gene scissors at mismatched targets decreased as the number of mismatches increased to two or three bases (h and i of fig. 20, fig. 29, and fig. 30).

Test example 7. DeepCas9variants and DeepNG-BE: deep learning-based models for predicting the activity of Cas9 variants and of base editing gene scissors comprising SpCas9-NG

Because of the large number of Cas9 and base editing gene scissor variants, it is currently difficult to choose among them when editing a genome at a particular sequence of interest. The ability to predict the activity of each variant at the target sequence of interest would be very useful for selecting suitable and efficient variants for a particular application. Therefore, a computational model was first developed that predicts the activity of nine Cas9 variants with different PAM compatibility, namely SpCas9, the VRQR variant, SpCas9-NG, SpCas9-NRRH, SpCas9-NRTH, SpCas9-NRCH, SpG, SpRY and Sc++.

The indel frequency data obtained for the Cas9 variants at matched target sequences with all types of PAM sequences were first randomly partitioned into training and test data sets such that no target sequence was shared between the two. Then, using the training data set, deep learning-based computational models were developed for predicting the activity of the nine Cas9 variants at a specific target sequence (a of fig. 21 and a of fig. 16). Next, these computational models, collectively called DeepCas9variants, were evaluated using the test data set, which was never used for training. The Pearson correlation coefficients ranged from 0.82 to 0.95 (average, 0.90), and the Spearman correlation coefficients ranged from 0.80 to 0.94 (average, 0.89) (b of fig. 21), indicating that these models performed robustly.
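The evaluation procedure described above (a split that shares no target sequence between training and test sets, followed by Pearson and Spearman correlation on the held-out set) can be sketched as follows. This is an illustration using only the Python standard library; the function names, the 10% test fraction, and the toy numbers are assumptions and do not reflect the actual pipeline or data.

```python
import math
import random


def split_by_target(targets, test_fraction=0.1, seed=0):
    """Randomly split unique target sequences into disjoint train and test sets.

    The test fraction is a hypothetical choice; the text does not state one.
    """
    unique = sorted(set(targets))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return set(unique[n_test:]), set(unique[:n_test])


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy measured vs. predicted values for illustration only.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]
print(round(pearson(measured, predicted), 3), round(spearman(measured, predicted), 3))
```

Splitting on unique target sequences rather than on individual measurements is what keeps a target from appearing on both sides of the split.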

Thereafter, as was done previously for earlier versions of ABE and CBE, computational models were developed that predict the editing efficiency and outcomes of seven base editing gene scissors comprising SpCas9-NG, namely YE1-BE4max, SsAPOBEC3B, ABE8e(V106W), ABE8.17-m+V106W, CGBE1, miniCGBE1 and APOBEC-nCas9-Ung. As was done for Cas9, the base editing efficiency and outcome data of library B were randomly partitioned into training and test data sets, and the training data were used to generate deep learning-based computational models, collectively referred to as DeepNG-BE_efficiency and DeepNG-BE_ratio. These models predict the efficiency of base editing and the ratio of base editing outcome sequences, respectively. Further analysis showed that these models performed robustly (b of fig. 16 and fig. 31). By combining DeepNG-BE_efficiency and DeepNG-BE_ratio, computational models that predict the absolute frequency of base editing outcomes, collectively referred to as DeepNG-BE, were generated. When DeepNG-BE was evaluated using a test data set that was never used for training, the Pearson correlation coefficients ranged from 0.88 to 0.91 (average, 0.89), and the Spearman correlation coefficients ranged from 0.83 to 0.93 (average, 0.88) (c of fig. 21), indicating that these models performed strongly.

Test example 8. DeepBE: deep learning-based model for predicting the activity of 63 base editing gene scissors

When base editing gene scissors comprising Cas9 variants with different PAM compatibility as the nickase region are used, the desired editing efficiency can in many cases be maximized, and bystander editing minimized, by placing the desired editing position at or near the position in the base editing window where maximum editing occurs. Further, as described above, an appropriate base conversion region can be selected for a given base editing operation according to the target sequence composition and the desired edit. Thus, by combining nine Cas9 variants with various PAM compatibilities, serving as the nickase region, with seven different base conversion regions, 63 (= 9 × 7) base editing gene scissors with various PAM compatibilities were generated. However, selecting the most appropriate base editing gene scissors for a desired edit in a given target sequence is particularly difficult when there are so many choices, and therefore an attempt was made to develop a computational model that predicts the base editing efficiency and outcomes of these 63 base editing gene scissors in a given target sequence. Measuring the efficiency of all 63 base editing gene scissors in a large number of target sequences, however, would be very time consuming and costly.

Considering that base editing efficiency is affected by both the base conversion activity at the target and the Cas9 nickase activity, it was hypothesized that base editing efficiency can be predicted by deep learning using factors affecting base conversion activity and Cas9 activity as input information. Sequence motifs around the target nucleotide affect base editing, and different deaminases typically have different preferred motifs. Thus, to reflect the base conversion activity of the seven base editing gene scissors with different base conversion regions, the sequence of the editing window ±1 nucleotide, which mainly affects base conversion activity rather than Cas9 nickase activity, was used as input information (fig. 32). Furthermore, the DeepCas9variants score, which reflects Cas9 activity, was used as additional input information. As the training data set, base editing efficiency data generated by the seven base editing gene scissors comprising SpCas9-NG (YE1-BE4max, SsAPOBEC3B, ABE8e(V106W), ABE8.17-m+V106W, CGBE1, miniCGBE1, and APOBEC-nCas9-Ung) and by seven base editing gene scissors comprising other Cas9 nickase variants (i.e., SpCas9-NRCH-YE1-BE4max, SpCas9-NRCH-SsAPOBEC3B, SpCas9-ABE8e(V106W), SpRY-ABE8e(V106W), SpCas9-NRCH-ABE8.17-m+V106W, SpCas9-miniCGBE1, and SpCas9-NRCH-APOBEC-nCas9-Ung) were used.
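One way to picture the input described above is as a one-hot encoding of the editing window ±1 nucleotide with the Cas9 activity score appended. The sketch below is an assumption-laden illustration: the function names, the one-hot layout, the default window of positions 4 to 8, and the example score are hypothetical, and the encoding actually used by the models is not specified in this passage.

```python
NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Flattened one-hot encoding: 4 values per nucleotide, in A, C, G, T order."""
    encoded = []
    for nt in seq.upper():
        if nt not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {nt!r}")
        encoded.extend(1.0 if nt == ref else 0.0 for ref in NUCLEOTIDES)
    return encoded


def build_features(protospacer, cas9_score, window=(4, 8), flank=1):
    """Encode the editing window +/- `flank` nt and append a Cas9 activity score.

    protospacer: 20-nt guide-matching sequence; position 1 is PAM-distal.
    cas9_score: predicted Cas9 activity for this target (hypothetical value
                standing in for a DeepCas9variants score).
    window: 1-based inclusive editing window (assumed default of positions 4-8).
    """
    if len(protospacer) != 20:
        raise ValueError("protospacer must be 20 nt long")
    start = window[0] - flank          # 1-based, inclusive
    end = window[1] + flank            # 1-based, inclusive
    if start < 1 or end > 20:
        raise ValueError("window with flanks must lie within positions 1-20")
    return one_hot(protospacer[start - 1:end]) + [float(cas9_score)]


features = build_features("ACGTACGTACGTACGTACGT", cas9_score=42.5)
print(len(features), features[:4], features[-1])
```

For the default window, positions 3 to 9 are encoded, giving 7 × 4 sequence values plus one score.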

As a result, DeepBE_efficiency, which covers a total of 63 base editing gene scissors, was developed. To predict the relative ratio of base editing outcomes, DeepNG-BE_ratio was used, considering that the relative ratio of base editing outcomes is determined by the base conversion activity and the guide sequence, not by the PAM sequence. By combining the predictions of DeepBE_efficiency and DeepNG-BE_ratio, DeepBE, which predicts the absolute base editing frequencies of the 63 base editing gene scissors, was developed. When DeepBE was tested on the seven base editing gene scissors comprising various Cas9 nickase variants (those used to generate the training data set) with test target sequences that were never used for training, the Pearson correlation coefficients ranged from 0.72 to 0.84 (average, 0.78), and the Spearman correlation coefficients ranged from 0.63 to 0.86 (average, 0.79) (d of fig. 5), indicating that the models performed very well. In addition, when DeepBE was tested on three base editing gene scissors comprising various Cas9 nickase variants that were never used for training, again with test target sequences that were never used for training, the Pearson correlation coefficients ranged from 0.69 to 0.86 (average, 0.78), and the Spearman correlation coefficients ranged from 0.66 to 0.93 (average, 0.81) (e of fig. 21), indicating that DeepBE generalizes well.
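The combination step described above amounts to multiplying a predicted overall editing efficiency by the predicted proportion of each outcome sequence. The following minimal sketch illustrates this; the function name, the renormalization of the proportions, and the example values are assumptions for illustration, and the two inputs simply stand in for the outputs of the efficiency and ratio models.

```python
def absolute_outcome_frequencies(efficiency, outcome_ratios):
    """Combine an efficiency prediction with outcome proportions.

    efficiency: predicted overall base editing efficiency (%).
    outcome_ratios: {outcome sequence: predicted proportion among edited reads}.
    The proportions are renormalized to sum to 1 (an assumption of this sketch)
    before being multiplied by the efficiency.
    """
    total = sum(outcome_ratios.values())
    if total <= 0:
        raise ValueError("outcome proportions must sum to a positive value")
    return {seq: efficiency * ratio / total for seq, ratio in outcome_ratios.items()}


# Hypothetical predictions for one target: 40% overall editing, two outcomes.
frequencies = absolute_outcome_frequencies(
    40.0, {"desired edit only": 0.75, "desired edit + bystander": 0.25}
)
print(frequencies)
```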

The inventors provide these models as web tools at http:// deepcrispr. By using the web tools to select the most appropriate pair of base editing gene scissor variant and sgRNA, the desired edit in the target sequence of interest can be obtained efficiently.

Test example 9. Selection of the most efficient pairs of base editing gene scissor variant and guide sequence for correcting pathogenic or potentially pathogenic mutations

Of the 75,104 pathogenic or potentially pathogenic mutations reported in ClinVar, 5,475 (7.3%), 15,040 (20%) and 4,492 (6.0%) can be corrected by C·G-to-T·A, A·T-to-G·C and C·G-to-G·C editing, respectively. C·G-to-T·A editing can be induced using 18 CBE variants (= 2 base conversion regions × 9 Cas9 nickase variants with different PAM compatibility); similarly, A·T-to-G·C editing and C·G-to-G·C editing can be induced using 18 ABE variants and 27 CGBE variants, respectively. However, it is not easy to select the best pair of base editing gene scissor variant and sgRNA for achieving the maximum frequency of the desired edit. The base editing window of CBE and ABE is 5 bp wide (although SsAPOBEC3B has a wider (7 bp) editing window, only the 5 bp window spanning positions 4 to 8 is considered for fair comparison), and that of CGBE is 3 bp wide. Therefore, there are theoretically 18 × 5 = 90 pairs of guide sequence and base editing gene scissors for C·G-to-T·A editing and for A·T-to-G·C editing, and 27 × 3 = 81 pairs for C·G-to-G·C editing.
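The pair counts above can be checked by enumerating the candidates, as in the following minimal sketch. The variable and function names are illustrative, and the naming of each editor as "Cas9 variant-base conversion region" is only a convention of this sketch; the variant lists and window positions (4 to 8 for CBE and ABE, 5 to 7 for CGBE) follow the text.

```python
from itertools import product

CAS9_NICKASE_VARIANTS = [
    "SpCas9", "VRQR", "SpCas9-NG", "SpCas9-NRRH", "SpCas9-NRTH",
    "SpCas9-NRCH", "SpG", "SpRY", "Sc++",
]
BASE_CONVERSION_REGIONS = {
    "CBE": ["YE1-BE4max", "SsAPOBEC3B"],
    "ABE": ["ABE8e(V106W)", "ABE8.17-m+V106W"],
    "CGBE": ["CGBE1", "miniCGBE1", "APOBEC-nCas9-Ung"],
}
# Positions of the target base within the protospacer that are considered.
EDIT_POSITIONS = {"CBE": range(4, 9), "ABE": range(4, 9), "CGBE": range(5, 8)}


def candidate_pairs(edit_type):
    """List (editor name, edit position) candidates for one type of edit.

    Each edit position implies a different guide sequence, so every tuple
    corresponds to one pair of guide sequence and base editing gene scissors.
    """
    editors = [
        f"{cas9}-{region}"
        for cas9, region in product(CAS9_NICKASE_VARIANTS, BASE_CONVERSION_REGIONS[edit_type])
    ]
    return list(product(editors, EDIT_POSITIONS[edit_type]))


counts = {edit_type: len(candidate_pairs(edit_type)) for edit_type in BASE_CONVERSION_REGIONS}
total_editors = len(CAS9_NICKASE_VARIANTS) * sum(len(r) for r in BASE_CONVERSION_REGIONS.values())
print(counts, total_editors)  # {'CBE': 90, 'ABE': 90, 'CGBE': 81} 63
```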

An effective pair can also be selected by rational design. First, in SpCas9-based rational design, SpCas9 is selected as the Cas9 nickase region. In this approach, the guide sequence is designed such that the editing position is at position 6, 7, 5, 4 or 8 (in order of preference) for CBE, at position 6, 5, 7, 4 or 8 for ABE, and at position 6, 5 or 7 for CGBE, in each case checking whether an NGG PAM sequence is at the correct position. If this process does not identify a position compatible with an NGG PAM, the desired edit is placed at position 6 and SpCas9 is selected regardless of the PAM sequence. In another form of rational design, referred to as Cas9 variant-based design, the desired edit is first placed at position 6, and then a Cas9 variant that recognizes the PAM at the corresponding location is selected using the information shown in b of fig. 19. If none of the available Cas9 variants can recognize the PAM, the editing position is moved to position 7, 5, 4 or 8 for CBE, to position 5, 7, 4 or 8 for ABE, and to position 5 or 7 for CGBE, until a PAM recognized by a Cas9 variant is found. If this process does not identify an appropriate Cas9 variant, the desired edit is placed at position 6 and SpCas9-NRCH, the variant that shows the broadest PAM compatibility, is selected. Once the guide sequence and Cas9 region are determined, the base conversion region is either selected randomly or set to the one with relatively high overall editing efficiency (i.e., SsAPOBEC3B for CBE, ABE8e(V106W) for ABE, and CGBE1 for CGBE). Alternatively, the SpCas9 variant and the base conversion region can both be selected randomly with the editing position at position 6 (random design). In addition, by using DeepBE, the editing efficiency and outcomes of the 90 or 81 pairs can be predicted, and the pair of guide sequence and base editing gene scissors with the highest predicted efficiency can be selected (DeepBE-based design).
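The SpCas9-based rational design described above can be sketched as a short search over preferred editing positions. This is a simplified illustration under stated assumptions: the function names and return format are hypothetical, only the strand given as `seq` is examined, the protospacer is taken as 20 nt followed directly by a 3-nt PAM, and the preference orders are those listed in the text.

```python
PREFERRED_POSITIONS = {
    "CBE": (6, 7, 5, 4, 8),
    "ABE": (6, 5, 7, 4, 8),
    "CGBE": (6, 5, 7),
}


def guide_at(seq, edit_index, position):
    """Return (protospacer, PAM) placing seq[edit_index] at `position` (1-based).

    Returns None when the 20-nt protospacer plus 3-nt PAM would run off `seq`.
    """
    start = edit_index - (position - 1)
    end = start + 20
    if start < 0 or end + 3 > len(seq):
        return None
    return seq[start:end], seq[end:end + 3]


def spcas9_rational_design(seq, edit_index, editor_type):
    """Pick a guide for SpCas9 by trying edit positions in order of preference.

    If no position has an NGG PAM, fall back to position 6 regardless of PAM,
    as in the text. `seq` is the strand containing the base to edit, 5'->3'.
    """
    for position in PREFERRED_POSITIONS[editor_type]:
        hit = guide_at(seq, edit_index, position)
        if hit is not None and hit[1][1:] == "GG":
            return {"position": position, "guide": hit[0], "pam": hit[1], "ngg": True}
    fallback = guide_at(seq, edit_index, 6)
    if fallback is None:
        return None
    return {"position": 6, "guide": fallback[0], "pam": fallback[1], "ngg": False}


# Toy sequence: target C at index 10; an AGG PAM fits only when C sits at position 7.
toy_seq = "A" * 10 + "C" + "A" * 14 + "GG" + "A" * 13
design = spcas9_rational_design(toy_seq, edit_index=10, editor_type="CBE")
print(design)
```

A Cas9 variant-based design would follow the same loop but replace the NGG check with a lookup of which variants are active on the observed PAM; a DeepBE-based design would instead score every candidate and keep the best one.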

When the predicted efficiencies of total desired base editing and of desired editing without bystander editing were compared among the two forms of SpCas9-based rational design, the two forms of Cas9 variant-based rational design, random design, and DeepBE-based design, the DeepBE-based design showed substantially higher desired editing efficiency than the other methods for all three types of editing (i.e., C·G to T·A, A·T to G·C, and C·G to G·C), both for total desired base editing and for desired editing without bystander editing (a to c of fig. 22). When the DeepBE-based design is set aside, the best design methods are as follows. For C·G-to-T·A editing, the Cas9 variant-based rational design with a randomly selected base conversion region showed the highest median efficiencies for both total and bystander-free editing. For A·T-to-G·C editing, the Cas9 variant-based rational design using ABE8e(V106W) showed the highest median efficiencies for both total and bystander-free editing. For C·G-to-G·C editing, the Cas9 variant-based rational design using the CGBE1 base conversion region showed the highest median efficiencies for both total and bystander-free editing. When the median efficiencies of these Cas9 variant-based designs were compared with those of the corresponding SpCas9-based designs, total editing from C·G to T·A, A·T to G·C, and C·G to G·C increased 3.5-fold, 2.2-fold, and 3.5-fold, respectively, and bystander-free editing from C·G to T·A, A·T to G·C, and C·G to G·C increased 4.8-fold, 2.8-fold, and 4.4-fold, respectively. When the median efficiencies of the DeepBE-based design were compared with those of the SpCas9-based designs, total editing from C·G to T·A, A·T to G·C, and C·G to G·C increased 4.4-fold, 3.0-fold, and 5.8-fold, respectively, and bystander-free editing from C·G to T·A, A·T to G·C, and C·G to G·C increased 12-fold, 4.4-fold, and 9.9-fold, respectively. Taken together, these results indicate that both base editing gene scissors comprising Cas9 variants and DeepBE can greatly improve the efficiency of the desired base editing.

According to the efficiency and outcome prediction system for base editing gene scissors using deep learning according to one aspect, base editing gene scissors and sgRNAs for efficient base editing can be selected without performing excessive experiments on 63 base editing gene scissors having various PAM compatibilities. Thus, the system can be effectively used in all fields where gene editing is applied, such as disease treatment by gene editing.

Although certain exemplary embodiments and implementations have been described herein, other embodiments and modifications will be apparent from this description. Accordingly, the present inventive concept is not limited to these embodiments, but rather extends to the broader scope of the appended claims and to various obvious modifications and equivalent arrangements, as would be apparent to those skilled in the art.

Claims

1. An efficiency and outcome prediction system for base editing gene scissors using deep learning, comprising:

a target sequence input unit that receives target sequence data of the base editing gene scissors; and

a result prediction unit that applies the target sequence data received by the target sequence input unit to a prediction model of base editing efficiency and to a prediction model of base editing result ratio, respectively, to obtain output values of base editing efficiency and result ratio, and multiplies the output values of base editing efficiency and result ratio to generate a base editing prediction score.

2. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 1, wherein the prediction model of base editing efficiency is generated by: a step of receiving base conversion activity data of the base editing gene scissors through an information input unit; and a step of performing deep learning based on a convolutional neural network on the data received by the information input unit to generate a predictive model of base editing efficiency.

3. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 2, wherein the step of performing deep learning based on a convolutional neural network to generate a predictive model of base editing efficiency further comprises a step of concatenating CRISPR-associated protein 9 (Cas9) activity data.

4. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 3, wherein the Cas9 activity data is obtained by performing a method comprising: a step of introducing Cas9 into a cell library comprising oligonucleotides comprising a nucleotide sequence encoding a single-guide ribonucleic acid (sgRNA) and a target nucleotide sequence targeted by the sgRNA;

a step of performing deep sequencing using deoxyribonucleic acid (DNA) obtained from the cell library into which the Cas9 is introduced; and

a step of analyzing the efficiency of Cas9 from the data obtained from the deep sequencing.

5. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 4, wherein the step of analyzing the efficiency of Cas9 predicts the activity of Cas9 from the correlation of Cas9 indel frequency in a specific target sequence by performing deep learning based on convolutional neural network.

6. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 1, wherein the prediction model of base editing result ratio is generated by: a step of receiving base editing result data of the base editing gene scissors through an information input unit; and

a step of performing deep learning based on a convolutional neural network on the data received by the information input unit to generate a predictive model of base editing result ratio.

7. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 1, further comprising: an output unit that outputs the efficiency and the result ratio of the base editing gene scissors predicted by the result prediction unit.

8. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 3, wherein the Cas9 is any one or more selected from the group consisting of SpCas9, a VRQR variant, SpCas9-NG, SpCas9-NRRH, SpCas9-NRTH, SpCas9-NRCH, SpG, SpRY, and Sc++.

9. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 1, wherein the base editing gene scissors are any one or more selected from the group consisting of YE1-BE4max, SsAPOBEC3B, ABE8e(V106W), ABE8.17-m+V106W, CGBE1, miniCGBE1, and APOBEC-nCas9-Ung.

10. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 1, wherein the output value of the base editing efficiency is calculated by the following mathematical formula 1:

[Mathematical formula 1]

11. The efficiency and outcome prediction system of base editing gene scissors using deep learning according to claim 1, wherein the output value of the base editing result ratio is calculated by the following mathematical formula 2:

[Mathematical formula 2]

12. A method for predicting efficiency and outcome of base editing gene scissors using deep learning, comprising:

a step of designing a target sequence of the base editing gene scissors; and

a step of applying the designed target sequence to the efficiency and result prediction system of the base editing gene scissors according to claim 1.

13. A computer-readable recording medium having recorded thereon a program for executing the method according to claim 12 by a computer.