WO2018217730A1

WO2018217730A1 - Systems and methods for optimizing chemical protein synthesis design

Info

Publication number: WO2018217730A1
Application number: PCT/US2018/033865
Authority: WO
Inventors: Michael T. Jacobsen; Patrick W. Erickson; Michael S. Kay
Original assignee: University Of Utah Research Foundation
Priority date: 2017-05-22
Filing date: 2018-05-22
Publication date: 2018-11-29

Abstract

Systems and methods for automatically analyzing and designing efficient strategies for constructing large protein targets. An automated ligator system is provided that systematically scores and ranks feasible synthetic strategies for a particular chemical protein synthesis (CPS) target. The automated ligator system methodically evaluates potential peptide segments for a target using a scoring function that includes solubility, ligation site quality, segment lengths, and number of ligations to provide a ranked list of potential synthetic strategies.

Description

SYSTEMS AND METHODS FOR OPTIMIZING CHEMICAL PROTEIN SYNTHESIS DESIGN

STATEMENT REGARDING SEQUENCE LISTING

The Sequence Listing associated with this application is provided in text format in lieu of a paper copy, and is hereby incorporated by reference into the specification. The name of the text file containing the Sequence Listing is 690181_408WO_SEQUENCE_LISTING.txt. The text file is 34 KB, was created on May 21, 2018, and is being submitted electronically via EFS-Web.

BACKGROUND

Technical Field

The present disclosure generally relates to automatically optimizing chemical protein synthesis designs.

Description of the Related Art

Computers and computing systems have affected nearly every aspect of modern living. Computers are generally involved in work, recreation, healthcare, transportation, entertainment, household management, etc. Further, computing system functionality can be enhanced by a computing system's ability to be interconnected to other computing systems via network connections. Network connections may include, but are not limited to, connections via wired or wireless Ethernet, cellular connections, or even computer-to-computer connections through serial, parallel, USB, or other connections. The connections allow a computing system to access services at other computing systems and to quickly and efficiently receive application data from other computing systems.

Interconnection of computing systems has facilitated distributed computing systems, such as so-called "cloud" computing systems. In this description, "cloud computing" may be systems or resources for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, services, etc.) that can be provisioned and released with reduced management effort or service provider interaction. A cloud model can be composed of various characteristics (e.g., on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, etc.), service models (e.g., Software as a Service ("SaaS"), Platform as a Service ("PaaS"), Infrastructure as a Service ("IaaS")), and deployment models (e.g., private cloud, community cloud, public cloud, hybrid cloud, etc.).

Cloud and remote based service applications are prevalent. Such applications are hosted on public and private remote systems such as clouds and usually offer a set of web based services for communicating back and forth with clients.

Many computers are intended to be used by direct user interaction with the computer. As such, computers have input hardware and software user interfaces to facilitate user interaction. For example, a modern general purpose computer may include a keyboard, mouse, touchpad, camera, etc. for allowing a user to input data into the computer. In addition, various software user interfaces may be available.

Examples of software user interfaces include graphical user interfaces, text command-line-based user interfaces, function key or hot key user interfaces, and the like. Software user interfaces can be implemented on a computing system to assist with or perform many functions that are otherwise impractical or too complex in their absence. For example, designing a chemical protein synthesis strategy is an arduous task, and it is often difficult to identify each of the plethora of possibilities, let alone determine which will result in the most efficient ligation strategy. However, there are no systems available to assist with identifying and characterizing possible chemical protein synthesis designs and/or for determining a most efficient synthesis strategy for a given protein/peptide sequence.

The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.

BRIEF SUMMARY

One embodiment disclosed herein includes a method that may be practiced on a computing system for determining a chemical protein synthesis design for a protein sequence. For example, the method can include receiving a data structure comprising the protein sequence, evaluating one or more viable segments from the list of viable segments, identifying one or more potential ligation strategies for the protein sequence, and determining an ideal ligation strategy based on the evaluated one or more viable segments within the one or more potential ligation strategies.

Another method for determining a chemical protein synthesis design for a protein sequence can include receiving a data structure comprising the protein sequence, identifying a list of one or more possible segments from the protein sequence, generating a list of viable segments from the one or more possible segments, evaluating one or more viable segments from the list of viable segments, identifying one or more potential ligation strategies for the protein sequence, determining an average segment length for the one or more potential ligation strategies, and identifying an ideal ligation strategy based on the evaluated one or more viable segments within the one or more potential ligation strategies and the average segment length of the one or more potential ligation strategies.

In one or more embodiments, a computer system is provided for determining a chemical protein synthesis design for a protein sequence. The computer system may include processors and computer readable hardware storage devices having computer executable instructions executable by the processors to cause the computer system to receive a data structure comprising the protein sequence, identify a list of one or more possible segments from the protein sequence, generate a list of viable segments from the one or more possible segments, evaluate one or more viable segments from the list of viable segments, identify one or more potential ligation strategies for the protein sequence, determine an average segment length for the one or more potential ligation strategies, and identify an ideal ligation strategy based on the evaluated one or more viable segments within the one or more potential ligation strategies and the average segment length of the one or more potential ligation strategies. In some embodiments, identifying the list of one or more possible segments includes identifying one or more of each Cys and Ala junction site in the protein sequence.
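By way of non-limiting illustration, the identification of possible segments from Cys and Ala junction sites might be sketched as follows. The function names, the half-open `(start, end)` span representation, and the rule that a segment may start at the protein N-terminus or at a junction residue and end immediately before a junction residue or at the protein C-terminus are assumptions made for this sketch; they are not taken from the disclosure.

```python
def junction_sites(sequence, junction_residues="CA"):
    """Return 0-based indices of Cys (C) and Ala (A) residues in the sequence."""
    return [i for i, aa in enumerate(sequence) if aa in junction_residues]


def possible_segments(sequence, junction_residues="CA"):
    """Enumerate half-open (start, end) spans bounded by junction sites.

    Assumption for this sketch: a segment starts at index 0 or at a
    junction residue, and ends just before a junction residue or at the
    end of the sequence.
    """
    sites = junction_sites(sequence, junction_residues)
    starts = sorted({0, *sites})
    # An end of 0 would give an empty segment, so it is excluded.
    ends = sorted({*sites, len(sequence)} - {0})
    return [(s, e) for s in starts for e in ends if s < e]


# Hypothetical 8-residue example with one Cys and one Ala.
example = "MKCGGAEL"
spans = possible_segments(example)
segments = [example[s:e] for s, e in spans]
```

Every pairing of an allowed start with a later allowed end is kept at this stage; length and C-terminal residue constraints would be applied afterwards to reduce the list to viable segments.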

In one or more embodiments, viable segments include possible segments having an amino acid length of at least 10 residues and no longer than about 80 residues. Additionally, or alternatively, viable segments may include segments having a C-terminal residue that is not a forbidden residue. The forbidden residue, in an embodiment, is selected from an amino acid residue in the group consisting of: Asp, Glu, Asn, Pro, and Gln.
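As a minimal sketch of the viability criteria just described, the following filter applies the 10 to 80 residue length window and rejects segments whose C-terminal residue is Asp, Glu, Asn, Pro, or Gln. The constant and function names are hypothetical. The `is_last_segment` flag, which skips the C-terminal residue check for a segment that ends the protein and therefore would not need a thioester, is an assumption of this sketch and is off by default.

```python
FORBIDDEN_C_TERMINAL = frozenset("DENPQ")  # Asp, Glu, Asn, Pro, Gln
MIN_LENGTH = 10
MAX_LENGTH = 80


def is_viable(segment, is_last_segment=False):
    """Return True if the segment meets the length and C-terminal residue rules."""
    if not MIN_LENGTH <= len(segment) <= MAX_LENGTH:
        return False
    if is_last_segment:
        # Assumption: the protein's final segment carries no thioester.
        return True
    return segment[-1] not in FORBIDDEN_C_TERMINAL


# Hypothetical candidates: too short, forbidden C-terminus (Pro), and acceptable.
candidates = ["MKTAYG", "MKTAYIAKQRP", "MKTAYIAKQRG"]
viable = [seg for seg in candidates if is_viable(seg)]
```

Keeping the thresholds as module-level constants reflects the statement elsewhere in this disclosure that parameters can be adjusted for a particular protein or synthesis method.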

In at least some embodiments, evaluating the viable segments includes calculating one or more segment metrics. Calculating one or more segment metrics may include calculating a first score based on the presence of a preferred thioester or an acceptable thioester, calculating a second score based on an average solubility, calculating a third score based on a segment length, and/or calculating a penalty for each viable segment comprising an Ala as an N-terminal ligation junction. In some embodiments, the preferred thioester is selected from the group consisting of: Ala, Arg, Cys, His, Phe, Gly, Met, Ser, Trp, and Tyr, and the acceptable thioester is selected from the group consisting of: Ile, Lys, Leu, Thr, and Val. In some embodiments, the average solubility score comprises a solubility score divided by the number of residues in a given segment, the solubility score comprising a sum of a first value for each His, Lys, and Arg in the given segment (e.g., +1 for each) and a second value for each Asp, Glu, Val, Ile, and Leu in the given segment (e.g., -1 for each).
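The thioester classification and the average solubility calculation described above might be sketched as follows. The residue groups and the example values of +1 and -1 come from the preceding paragraph; the function names and the string labels are hypothetical. The numeric scores assigned to each thioester class, the segment length score, and the Ala junction penalty are not given in this passage, so the sketch does not assign values to them.

```python
PREFERRED_THIOESTERS = frozenset("ARCHFGMSWY")  # Ala Arg Cys His Phe Gly Met Ser Trp Tyr
ACCEPTABLE_THIOESTERS = frozenset("IKLTV")      # Ile Lys Leu Thr Val
POSITIVE_RESIDUES = frozenset("HKR")            # His Lys Arg
PROBLEMATIC_RESIDUES = frozenset("DEVIL")       # Asp Glu Val Ile Leu


def thioester_class(segment):
    """Classify the segment's C-terminal residue as a thioester site."""
    c_terminal = segment[-1]
    if c_terminal in PREFERRED_THIOESTERS:
        return "preferred"
    if c_terminal in ACCEPTABLE_THIOESTERS:
        return "acceptable"
    return "forbidden"


def average_solubility(segment, positive_value=1.0, problematic_value=-1.0):
    """Sum per-residue solubility values and divide by the segment length."""
    total = 0.0
    for aa in segment:
        if aa in POSITIVE_RESIDUES:
            total += positive_value
        elif aa in PROBLEMATIC_RESIDUES:
            total += problematic_value
    return total / len(segment)


# Hypothetical segment: two Lys (+2), no problematic residues, length 4.
score = average_solubility("KKAG")
```

Dividing by the segment length keeps the solubility measure comparable between short and long segments, which matters because viable segments can span roughly 10 to 80 residues.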

In an embodiment, identifying potential ligation strategies for the protein sequence is based on a combination of viable segments that yields the protein sequence.
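One way to find every combination of viable segments that yields the full protein sequence is a depth-first search over segment boundaries, sketched below. This is an illustrative assumption about how such an enumeration could be done, not a statement of the disclosed implementation. Segments are represented as half-open `(start, end)` spans with `start < end`, and the function name is hypothetical.

```python
def ligation_strategies(sequence_length, segments):
    """Return every ordered list of spans that tiles [0, sequence_length)."""
    # Index spans by start position so each step only sees spans that
    # begin exactly where the previous span ended.
    by_start = {}
    for start, end in segments:
        by_start.setdefault(start, []).append((start, end))

    strategies = []

    def extend(position, chosen):
        if position == sequence_length:
            strategies.append(list(chosen))
            return
        for span in by_start.get(position, []):
            chosen.append(span)
            extend(span[1], chosen)
            chosen.pop()

    extend(0, [])
    return strategies


# Hypothetical viable spans for a 30-residue sequence.
viable_spans = [(0, 10), (10, 30), (0, 15), (15, 30), (10, 20), (20, 30)]
found = ligation_strategies(30, viable_spans)
```

Each returned strategy lists its segments in N-terminal to C-terminal order, and a strategy with k segments implies k - 1 ligations.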

In some embodiments, the system additionally includes computer executable instructions executable by the processors to cause the computer system to associate a penalty with a potential ligation strategy comprising an average segment length of less than or equal to 40 residues.

In some embodiments, the system additionally includes computer executable instructions executable by the processors to cause the computer system to display the ideal ligation strategy. Additionally, or alternatively, the ideal ligation strategy can be displayed subsequent to determining a total score for each of the potential ligation strategies and rank-ordering the potential ligation strategies based on the total scores, the ideal ligation strategy being, in some embodiments, the potential ligation strategy having the highest total score.
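The strategy-level penalty and the rank-ordering described above might be sketched as follows. The 40-residue threshold comes from the text; the magnitude of the penalty is not stated here, so `short_average_penalty` is a hypothetical parameter, as are the function name and the assumption that a strategy's total score is the sum of its segment scores plus that penalty. Other strategy-level terms, such as one based on the number of ligations, are not modeled.

```python
def rank_strategies(strategies, segment_scores,
                    short_average_penalty=-1.0, threshold=40):
    """Return (total_score, strategy) pairs sorted from best to worst."""
    ranked = []
    for strategy in strategies:
        lengths = [end - start for start, end in strategy]
        average_length = sum(lengths) / len(lengths)
        total = sum(segment_scores[span] for span in strategy)
        if average_length <= threshold:
            # Hypothetical penalty value; only the threshold is from the text.
            total += short_average_penalty
        ranked.append((total, strategy))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Hypothetical 100-residue example with made-up segment scores.
two_segments = [(0, 50), (50, 100)]             # average length 50
three_segments = [(0, 30), (30, 60), (60, 100)]  # average length about 33.3
scores = {span: 1.0 for span in two_segments + three_segments}
ranking = rank_strategies([three_segments, two_segments], scores,
                          short_average_penalty=-2.0)
ideal = ranking[0][1]
```

In this made-up example the three-segment strategy has the larger raw sum (3.0 versus 2.0) but falls below the two-segment strategy once the short-average penalty is applied.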

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the present disclosure, nor is it intended to be used as an aid in determining the scope of the present disclosure.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

In the drawings, identical reference numbers identify similar elements or acts. The sizes and relative positions of elements in the drawings are not necessarily drawn to scale. For example, the shapes of various elements and angles are not necessarily drawn to scale, and some of these elements may be arbitrarily enlarged and positioned to improve drawing legibility. Further, the particular shapes of the elements as drawn, are not necessarily intended to convey any information regarding the actual shape of the particular elements, and may have been solely selected for ease of recognition in the drawings.

Figure 1 illustrates a computing system according to one or more embodiments of the present disclosure.

Figure 2 illustrates a schematic diagram of an implementation for determining a chemical protein synthesis design for a protein sequence.

Figure 3 is a detailed schematic diagram illustrating operation of an automated ligator system to generate predictions of optimal ligation strategies for a protein, according to one non-limiting illustrated implementation.

Figure 4 is a table that provides an example of data that is shown within an "automated ligator analysis" output file, according to one non-limiting illustrated implementation.

Figure 5 is a table that depicts a list of potential thioesters and the scores given to each in the automated ligator system, according to one non-limiting illustrated implementation.

Figure 6 is a table that depicts a list of residues considered to be positively-charged or problematic, and the corresponding value given to a segment when the automated ligator system finds one in the sequence, according to one non-limiting illustrated implementation.

Figure 7 is a table that shows equations used to determine a final solubility score, depending on the average solubility value, according to one non-limiting illustrated implementation.

Figure 8 shows box-and-whisker plots for the average solubility scores calculated on viable segments within three different E. coli ribosomal protein classes, as well as the total, according to one non-limiting illustrated implementation.

Figure 9 shows calculated values for each of the variables used in the final solubility segment calculation described in Figure 7, according to one non-limiting illustrated implementation.

Figure 10 is a diagram that shows all proteins within the 30S and 50S E. coli ribosome, as well as important accessory factors, that were compiled in a test data set (structures shown from the following PDB codes: 4V6D, 4V90, 1EFC, 2B3T, 1EK8), according to one non-limiting illustrated implementation.

Figure 11 is a histogram showing the protein length distribution for the ribosomal test set, according to one non-limiting illustrated implementation.

Figure 12 is a histogram displaying the number of acceptable Cys and Ala NCL sites available in the ribosomal test set, according to one non-limiting illustrated implementation.

Figure 13 illustrates a process for determining and ranking all viable ligation strategies using the automated ligator system, wherein the automated ligator system first divides the protein sequence at Cys and Ala ligation sites, then generates all viable 10-80 aa segments that have acceptable thioesters, and then scores the segments by summing the four scoring functions shown (dotted box), according to one non-limiting illustrated implementation.

Figure 14 illustrates a process for determining and ranking all viable ligation strategies using the automated ligator system, wherein after segments are scored, the automated ligator system identifies viable ligation assemblies by looping through the entire list of segments to find those that can be connected to create the entire protein, according to one non-limiting illustrated implementation.

Figure 15 illustrates a top five list of TNF-α ligation strategies calculated by the automated ligator system, according to one non-limiting illustrated implementation.

DETAILED DESCRIPTION

In the following description, certain specific details are set forth in order to provide a thorough understanding of various disclosed implementations. However, one skilled in the relevant art will recognize that implementations may be practiced without one or more of these specific details, or with other methods, components, materials, etc. In other instances, well-known structures associated with computer systems, server computers, and/or communications networks have not been shown or described in detail to avoid unnecessarily obscuring descriptions of the implementations.

Unless the context requires otherwise, throughout the specification and claims that follow, the word "comprising" is synonymous with "including," and is inclusive or open-ended (i.e., does not exclude additional, unrecited elements or method acts).

Reference throughout this specification to "one implementation" or "an implementation" means that a particular feature, structure or characteristic described in connection with the implementation is included in at least one implementation. Thus, the appearances of the phrases "in one implementation" or "in an implementation" in various places throughout this specification are not necessarily all referring to the same implementation. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more implementations.

As used in this specification and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. It should also be noted that the term "or" is generally employed in its sense including "and/or" unless the context clearly dictates otherwise.

The headings and Abstract of the Disclosure provided herein are for convenience only and do not interpret the scope or meaning of the implementations.

Embodiments of the present disclosure provide functionality for automatically determining an optimal peptide synthesis strategy, which can be particularly useful when synthesizing large peptide sequences and/or large D-peptides that cannot be produced using traditional recombinant expression techniques. For a myriad of reasons, it can be difficult to synthesize large peptides using traditional solid-phase synthesis methods. Some alternative synthesis strategies have emerged, including synthesizing smaller portions of the larger peptide followed by ligation of the peptide fragments to form the desired peptide. This, too, can be difficult, given the overwhelming number of possible peptide fragments and ligation strategies that can be derived from a single large peptide sequence and the complexities and technical nuances of peptide synthesis and ligation. For example, the choice of N- and C-termini as well as the overall solubility of the peptide can influence the efficiency of a ligation reaction. These factors, among others, complicate the determination of an optimal synthesis strategy. It is, therefore, difficult to determine an optimal synthesis strategy based on the sequence of the peptide alone. Embodiments of the present disclosure utilize a specific, data-driven scoring system and threshold requirements to identify and rank-order synthesis strategies and to determine an optimal synthesis strategy therefrom.

It should be appreciated that in one or more embodiments, one or more parameters of the disclosed chemical protein synthesis design algorithms can be amended to suit a particular purpose or honed for specific use with a given protein sequence of interest. For example, the parameters may be adjusted for longer protein sequences, but the parameters may also, in some embodiments, be adjusted to accommodate shorter sequences or sequences having particularly problematic sequence identities. It should also be appreciated that in some embodiments, one or more parameters of the disclosed chemical protein synthesis design algorithms can be tailored for a given method of protein synthesis, which may be different from the preferred hydrazide method.

Chemical protein synthesis (CPS) allows the precise, atomic-level preparation of proteins and employs two key technologies: (1) solid-phase peptide synthesis (SPPS) to produce peptide segments and (2) a chemoselective ligation strategy to assemble peptide segments into longer synthetic products. The enabling advance in this field was the discovery of Native Chemical Ligation (NCL) in 1994, inspired by the pioneering selective chemical ligation concept. In NCL, a peptide containing a C-terminal thioester chemoselectively reacts with another peptide containing an N-terminal cysteine (or other thiolated amino acid) to form a native amide bond.

CPS possesses two major advantages over recombinant protein expression. First, mirror-image (D-) peptides and proteins can be directly produced. D-peptides and proteins are attractive therapeutics due to their resistance to natural L-proteases. The present inventors have used mirror-image phage display, which requires total chemical synthesis of the mirror-image protein target, to develop D-peptide inhibitors of HIV and Ebola viral entry. This same approach has been used by others to develop mirror-image therapeutics. Another major application of mirror-image peptides/proteins is racemic protein crystallography, due to extended space group accessibility, particularly P1̄, extending even to quasi-racemic protein crystallography. Several examples from the Kent lab have demonstrated this benefit for protein crystallization. Besides CPS, there is currently no other method for producing D-proteins, as only a few D-residues can currently be incorporated into proteins using the ribosome. Second, CPS offers the ability to site-specifically modify proteins for mechanistic studies. Semisynthetic proteins can be prepared by ligation of recombinantly expressed proteins with synthetic segments. Some recent examples include ubiquitin, alpha-synuclein, histones, and membrane proteins. Additional examples of interesting synthetic targets include studies on fundamental ubiquitin biology, proteins with selective isotopic labeling, site-specific installation of fluors (e.g., FRET pairs), and interesting scaffold approaches.

Using CPS methods, proteins of approximately 100 residues can be routinely prepared in most cases, but production of greater than 300-residue proteins remains very difficult. Challenges in the field include peptide thioester preparation (by Fmoc SPPS), access to reactive ligation junctions, poor SPPS synthesis quality, efficient purification of segments and assembly intermediates, and low yield of purified product. There are two particularly thorny challenges that hinder CPS projects: poor solubility and inefficient/sub-optimal synthetic design.

The first challenge, peptide solubility, is commonly attributed to so-called "difficult" peptides that are poorly soluble even in highly denaturing buffers and/or hard to resolve by HPLC for analysis and purification. Several groups have addressed this challenge by designing clever chemical methods for temporarily improving handling properties. The solubility of initial peptide segments can be improved by incorporation of pH-sensitive isoacyl dipeptide building blocks (at Ser/Thr residues) or application of the thioester Arg_n tag strategy. Others have employed custom Glu and Lys building blocks equipped with allylic ester and allylic carbamate linkers containing solubilizing guanidine groups. Recently, others have also devised an Alloc-Phacm Cys variant for introducing poly-Arg sequences to improve peptide solubility. Others have also used picolyl protection of Glu residues to improve peptide solubility and HPLC purification. Photosensitive linkers have also been employed to improve segment solubility at Gln residues. A very promising approach is termed the RBM (removable backbone modification) strategy for temporary solubilization, which was originally limited to Gly, but has since been expanded to other residues.

With the Aucagne group, another approach dubbed the "Helping Hand" has been recently introduced for temporarily solubilizing difficult peptides. In this strategy, a heterobifunctional linker, Fmoc-Ddae-OH, can be used to specifically attach solubilizing sequences onto Lys side chains. Using this approach, the solubilizing sequence is easy to install and then selectively cleave using dilute aqueous hydrazine to restore the native Lys side chain. Its use was demonstrated in one-pot applications following NCL and free-radical based desulfurization.

A second major challenge to producing large proteins is the selection of the most efficient (and high-yielding) synthesis strategy. Synthesis of large targets is laborious and may require tremendous material and human resources to identify an acceptable strategy. To illustrate and address this challenge, the automated ligator system of the present disclosure is provided, which systematically scores all plausible ligation strategies to generate a ranked list of the predicted most efficient assemblies. As discussed further below, the utility of the automated ligator system is demonstrated in the context of three CPS projects: TNFα (157 aa), GroES (97 aa), and DapA (312 aa), followed by analysis of a ribosomal protein set that previews the challenges associated with this ambitious synthetic target.

Initially, with reference to Figure 1, examples of various computing environments in which the implementations of the present disclosure may be practiced are provided. Then, with reference to Figures 2-15, various features of the present disclosure are discussed.

Example Computing Systems

It should be appreciated that one or more method acts or steps disclosed herein, in addition to the chemical protein synthesis design systems and algorithms described, can be implemented on one or more computing systems.

Computing systems are increasingly taking a wide variety of forms. Computing systems may, for example, be handheld devices, appliances, laptop computers, desktop computers, mainframes, distributed computing systems, datacenters, or even devices that have not conventionally been considered a computing system, such as wearables (e.g., glasses, watches). In this description and in the claims, the term "computing system" is defined broadly as including any device or system (or combination thereof) that includes at least one physical and tangible processor and a physical and tangible memory capable of having thereon computer-executable instructions that may be executed by a processor. The memory may take any form and may depend on the nature and form of the computing system. A computing system may be distributed over a network environment and may include multiple constituent computing systems.

As illustrated in Figure 1, in its most basic configuration, a computing system 100 typically includes at least one hardware processing unit 102 and memory 104. The memory 104 may be physical system memory, which may be volatile, nonvolatile, or some combination of the two. The term "memory" may also be used herein to refer to non-volatile mass storage such as physical storage media. If the computing system is distributed, the processing, memory, and/or storage capability may be distributed as well.

The computing system 100 also has thereon multiple structures often referred to as an "executable component." For instance, the memory 104 of the computing system 100 is illustrated as including executable component 106. The term "executable component" is the name for a structure that is well understood to one of ordinary skill in the art in the field of computing as being a structure that can be software, hardware, or a combination thereof. For instance, when implemented in software, one of ordinary skill in the art would understand that the structure of an executable component may include software objects, routines, methods, and so forth, that may be executed on the computing system, whether such an executable component exists in the heap of a computing system, or whether the executable component exists on computer-readable storage media.

In such a case, one of ordinary skill in the art will recognize that the structure of the executable component exists on a computer-readable medium such that, when interpreted by one or more processors of a computing system (e.g., by a processor thread), the computing system is caused to perform a function. Such structure may be computer-readable directly by the processors, as is the case if the executable component were binary. Alternatively, the structure may be structured to be interpretable and/or compiled, whether in a single stage or in multiple stages, so as to generate such binary that is directly interpretable by the processors. Such an understanding of exemplary structures of an executable component is well within the understanding of one of ordinary skill in the art of computing when using the term "executable component."

The term "executable component" is also well understood by one of ordinary skill as including structures that are implemented exclusively or near-exclusively in hardware, such as within a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), or any other specialized circuit. Accordingly, the term "executable component" is a term for a structure that is well understood by those of ordinary skill in the art of computing, whether implemented in software, hardware, or a combination. In this description, the terms "component," "service," "engine," "module," "control," "generator," or the like may also be used. As used in this description and in this case, these terms, whether expressed with or without a modifying clause, are also intended to be synonymous with the term "executable component," and thus also have a structure that is well understood by those of ordinary skill in the art of computing.

In the description that follows, embodiments are described with reference to acts that are performed by one or more computing systems. If such acts are implemented in software, one or more processors (of the associated computing system that performs the act) direct the operation of the computing system in response to having executed computer-executable instructions that constitute an executable component. For example, such computer-executable instructions may be embodied on one or more computer-readable media that form a computer program product. An example of such an operation involves the manipulation of data.

The computer-executable instructions (and the manipulated data) may be stored in the memory 104 of the computing system 100. Computing system 100 may also contain communication channels 108 that allow the computing system 100 to communicate with other computing systems over, for example, network 110.

While not all computing systems require a user interface, in some embodiments the computing system 100 includes a user interface 112 for use in interfacing with a user. The user interface 112 may include output mechanisms 112A as well as input mechanisms 112B. The principles described herein are not limited to the precise output mechanisms 112A or input mechanisms 112B as such will depend on the nature of the device. However, output mechanisms 112A might include, for instance, speakers, displays, tactile output, holograms and so forth. Examples of input mechanisms 112B might include, for instance, microphones, touchscreens, holograms, cameras, keyboards, mouse or other pointer input, sensors of any type, and so forth.

Embodiments described herein may comprise or utilize a special purpose or general-purpose computing system including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments described herein also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computing system. Computer-readable media that store computer-executable instructions are physical storage media. Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example and not limitation, embodiments of the present subject matter can comprise at least two distinctly different kinds of computer-readable media: storage media and transmission media.

Computer-readable storage media include RAM, ROM, EEPROM, solid state drives ("SSDs"), flash memory, phase-change memory ("PCM"), CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other physical and tangible storage medium which can be used to store desired program code in the form of computer-executable instructions or data structures and which can be accessed and executed by a general purpose or special purpose computing system to implement the disclosed functionality of the present disclosure.

A "network" is defined as one or more data links that enable the transport of electronic data between computing systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computing system, the computing system properly views the connection as a transmission medium. Networks may be "private" or they may be "public," or networks may share qualities of both private and public networks. A private network may be any network that has restricted access such that only the computer systems and/or modules and/or other electronic devices that are provided and/or permitted access to the private network may transport electronic data through the one or more data links that comprise the private network. A public network may, on the other hand, not restrict access and allow any computer systems and/or modules and/or other electronic devices capable of connecting to the network to use the one or more data links comprising the network to transport electronic data.

For example, a private network found within an organization, such as a private business, restricts transport of electronic data between only those computer systems and/or modules and/or other electronic devices within the organization. Conversely, the Internet is an example of a public network where access to the network is, generally, not restricted. Computer systems and/or modules and/or other electronic devices may often be connected simultaneously or serially to multiple networks, some of which may be private, some of which may be public, and some of which may be varying degrees of public and private. For example, a laptop computer may be permitted access to a closed network, such as a network for a private business that enables transport of electronic data between the computing systems of permitted business employees, and the same laptop computer may also access an open network, such as the Internet, at the same time or at a different time as it accesses the exemplary closed network.

Transmission media can include a network and/or data links which can be used to carry desired program code in the form of computer-executable instructions or data structures and which can be accessed and executed by a general purpose or special purpose computing system. Combinations of the above should also be included within the scope of computer-readable media.

Further, upon reaching various computing system components, program code in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to storage media (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a "NIC") and then eventually transferred to computing system RAM and/or to less volatile storage media at a computing system. Thus, it should be understood that storage media can be included in computing system components that also (or even primarily) utilize transmission media.

Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computing system, special purpose computing system, or special purpose processing device to perform a certain function or group of functions. Alternatively, or additionally, the computer-executable instructions may configure the computing system to perform a certain function or group of functions. The computer executable instructions may be, for example, binaries or even instructions that undergo some translation (such as compilation) before direct execution by the processors, such as intermediate format instructions like assembly language, or even source code.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

Those skilled in the art will appreciate that the subject matter disclosed herein may be practiced in network computing environments with many types of computing system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, tablets, mobile telephones, PDAs, pagers, routers, switches, datacenters, wearables (e.g., glasses) and the like. The subject matter disclosed herein may also be practiced in distributed system environments where local and remote computing systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

Those skilled in the art will also appreciate that the implementations discussed herein may be practiced in a cloud computing environment. Cloud computing environments may be distributed, although this is not required. When distributed, cloud computing environments may be distributed internationally within an organization and/or have components possessed across multiple organizations. In this description and the following claims, "cloud computing" is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services). The definition of "cloud computing" is not limited to any of the other numerous advantages that can be obtained from such a model when properly deployed.

Automated Ligator System

Referring now to the diagram 200 of Figure 2, using individual text files listing one or more protein sequences 202 (e.g., one protein sequence) in single and/or triple lettered amino acid sequence (e.g., FASTA text files) as input, an automated ligator system 204 of the present disclosure automatically generates a list 206 (e.g., rank-ordered) of predicted potential ligation strategies based on scoring algorithms that evaluate junction sites, segment solubility and length, and number of ligations. An overall schematic for the automated ligator system 204 is also shown in the diagram 300 of Figure 3. An exemplary input 302 (Figure 3) for the automated ligator system 204 is individual FASTA text files 306 within a single folder, for example. The automated ligator system 204 (e.g., a script running thereon) finds all the Cys and Ala junction sites within a protein, and a list 308 of viable segments is created. An exemplary viable segment has a C-terminal residue that would not result in a "forbidden" thioester (e.g., Asp, Glu, Asn, Pro, or Gln). Asp and Glu can undergo thioester migration to the side chain, and Pro thioesters have extremely slow native chemical ligation kinetics. Asn, Gln, and Asp thioesters cannot be prepared, or are difficult to prepare, via the hydrazide method, which is an exemplary, and in some embodiments a preferred, method. It may be advantageous for viable segments to be at least 10 residues in length, but no longer than 80 residues. It should be appreciated that the program may be edited to change these thioester restrictions for use with other methods. In at least some embodiments, these thioester restrictions may be modified by the user via a modification file 304 (e.g., "Custom Thioester Input.xlsx" Excel file), such as an optional modification file that can be edited and placed into the FASTA text files folder.
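The segment-finding step described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `viable_segments`, the half-open `(start, end)` index convention, and the choice to treat each Cys or Ala as the first residue of the segment it begins are assumptions made for the example. Only the forbidden thioester residues and the 10 to 80 residue length window are taken from the text.

```python
# Hypothetical sketch; names and indexing conventions are assumptions.
FORBIDDEN_THIOESTERS = {"D", "E", "N", "P", "Q"}  # Asp, Glu, Asn, Pro, Gln
JUNCTION_RESIDUES = {"C", "A"}                    # Cys and Ala junction sites
MIN_LEN, MAX_LEN = 10, 80                         # default length window


def viable_segments(sequence):
    """Return (start, end) pairs (end exclusive) for every viable segment."""
    n = len(sequence)
    # A segment starts at the protein N-terminus or at any Cys/Ala residue.
    starts = [0] + [i for i in range(1, n) if sequence[i] in JUNCTION_RESIDUES]
    # A segment ends just before a junction residue or at the protein C-terminus.
    ends = starts[1:] + [n]
    segments = []
    for start in starts:
        for end in ends:
            length = end - start
            if length < MIN_LEN or length > MAX_LEN:
                continue
            # Only internal segments need a C-terminal thioester, so the
            # segment ending at the protein C-terminus is never filtered here.
            if end != n and sequence[end - 1] in FORBIDDEN_THIOESTERS:
                continue
            segments.append((start, end))
    return segments
```

Returning index pairs instead of substrings keeps each segment's position in the protein, which a later assembly step needs in order to join segments end to end.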

Figure 3 describes how the automated ligator system 204 can generate predictions of optimal ligation strategies for a protein based on a FASTA text file. The text to the right of the downward directed arrows describes components by which the automated ligator system 204 predicts optimal ligation strategies. The curved arrows correspond to functions within the automated ligator system's 204 default "restriction mode," which helps to overcome computational costs for large proteins. Restriction mode can be turned off by the user in order to enter the automated ligator system's "safe mode."

Once the viable segment list 308 is prepared, the automated ligator system 204 can evaluate each of the segments through one or more of four (or more) different scoring functions to generate a scored segment list 310. A first function scores the segments based on the presence of a preferred thioester (+2) or an acceptable thioester (0). Optionally, these thioester characterizations may be modified through the aforementioned modification file 304. Another function scores the segment based on its average solubility, and the score ranges between 0 and -3, for example. This solubility function can also contain an optional "helping hand" lysine linker reward component, if activated by the user. A third function assigns each segment with a score based on length, ranging from +2 to -2. The fourth function penalizes any segments containing Ala as an N-terminal ligation junction with a score of -2 (vs. no score for Cys). It should be appreciated that scores provided herein are exemplary in nature and different scoring metrics are considered within the scope of this disclosure, including those that are scalar multiples of the foregoing or differentially weighted. After scoring all of the viable segments, the automated ligator system 204 finds one or more, or in some implementations all, potential ligation strategies 318 for the protein of interest. While possible assemblies are being compiled, the automated ligator system 204 can apply a final scoring component: a penalty for strategies that have more than (length of the protein / 40) total segments. In other words, strategies that have an average segment length less than 40 residues are penalized for each additional ligation over this threshold. This function gives a score of -2 for each "excessive" ligation in these assemblies. 
The output 312 of the automated ligator system 204 may include all segments with respective solubility scores 314, an automated ligator analysis output file 316 (e.g., Excel file), and the list of total ligation strategies 318.

Compiling one or more (or all) of the possible ligation strategies 318 is a computationally expensive component of the automated ligator system 204. To reduce the number of strategies that the automated ligator system 204 computes, in at least some implementations, the automated ligator system may only consider strategies that have no more than (length of the protein / 35) total ligations, for example. In at least some implementations, the number 35 is chosen based on the fact that the final scoring function considers an average segment length of 40 amino acids to be "ideal." As a result, in such implementations, strategies having greater than (length of the protein / 35) total ligations would be scored very poorly, thus allowing these strategies to be disregarded. It should be appreciated that other boundaries could be selected, including, for example, greater than (length of protein / 39), greater than (length of protein / 38), greater than (length of protein / 37), greater than (length of protein / 36), greater than (length of protein / 33), greater than (length of protein / 30), greater than (length of protein / 25), less than (length of protein / 15), less than (length of protein / 20), less than (length of protein / 25), less than (length of protein / 30), less than (length of protein / 35), or a range selected from the foregoing.
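One way to picture the capped strategy search is a depth-first walk that tiles the protein with viable segments and abandons any branch that would exceed the ligation cutoff. The sketch below is illustrative only: `enumerate_strategies`, the `(start, end)` segment representation, and the use of integer division for the cutoff are assumptions, since the text does not state how the quotient is rounded.

```python
def enumerate_strategies(viable, protein_length, ligation_divisor=35):
    """Return every tiling of the protein by viable segments within the cap.

    viable: iterable of (start, end) pairs, end exclusive (hypothetical format).
    """
    # Assumption: the cutoff is rounded down to a whole number of ligations.
    max_ligations = protein_length // ligation_divisor
    by_start = {}
    for start, end in viable:
        by_start.setdefault(start, []).append(end)

    strategies = []

    def extend(position, chosen):
        if position == protein_length:
            strategies.append(tuple(chosen))
            return
        # A strategy with k segments needs k - 1 ligations, so adding one
        # more segment is only allowed while len(chosen) <= max_ligations.
        if len(chosen) > max_ligations:
            return
        for end in sorted(by_start.get(position, [])):
            chosen.append((position, end))
            extend(end, chosen)
            chosen.pop()

    extend(0, [])
    return strategies
```

Pruning inside the recursion, rather than filtering complete strategies afterwards, is what saves work: over-long branches are never expanded.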

As described further below, the automated ligator system 204 may be used to analyze past chemical protein synthesis projects that were completed (e.g., T F, GroES, and DapA), as well as the proteins making up the 30S, 50S, and Accessory/Translation Factors of the E. coli ribosome. For some of the largest proteins in an E. coli ribosomal test set (e.g., S1, EF-G, and IF2), it was discovered that the automated ligator system 204 may be too computationally expensive for widespread use. Thus, the "restriction mode" was developed to help reduce the computational costs associated with larger proteins. As Figure 3 illustrates (see curved arrows), the first function of restriction mode can be triggered if the number of segments is larger than 200 and will sequentially trim the smallest and largest segments in the viable segment list until 200 segments remain. When this mode is triggered, the automated ligator system 204 alerts users to the new cutoffs for the minimum and maximum segment lengths.
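The first restriction-mode component can be sketched as below. The text does not specify the exact trimming order, so this example assumes the shortest and longest segments are dropped alternately, one at a time, and that the new length cutoffs reported to the user are simply the shortest and longest lengths that survive. The name `trim_segments` and the `(start, end)` format are hypothetical.

```python
def trim_segments(segments, limit=200):
    """Trim segments from both length extremes until at most `limit` remain.

    Returns (kept_segments, new_min_length, new_max_length).
    """
    # Sort by segment length so both extremes sit at the ends of the list.
    ordered = sorted(segments, key=lambda seg: seg[1] - seg[0])
    drop_shortest = True
    while len(ordered) > limit:
        if drop_shortest:
            ordered.pop(0)   # remove the current shortest segment
        else:
            ordered.pop()    # remove the current longest segment
        drop_shortest = not drop_shortest
    lengths = [end - start for start, end in ordered]
    new_min = min(lengths) if lengths else None
    new_max = max(lengths) if lengths else None
    return ordered, new_min, new_max
```

As the discussion of IF2 indicates, trimming of this kind can remove a segment that every complete assembly requires, which is why the mode can be switched off.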

A second component of restriction mode may be activated if the number of segments is greater than 150 and the length of the protein is greater than 400 residues. This component will add, for example, 15 to both the ligation and segment cutoffs used when finding potential strategies. For example, the automated ligator system 204 may only consider strategies that have no more than (length of the protein / 50) ligations and will penalize strategies that have greater than (length of the protein / 55) segments. The foregoing requirements and restrictions were chosen based on trial-and-error when running S1, EF-G, and IF2. It should be appreciated that other thresholding requirements can be implemented and can depend on many factors, including protein length.

The thresholding numbers in the foregoing example allowed the automated ligator system 204 to produce viable ligation strategies for all but one protein in the data set. The one protein that did not generate a viable ligation strategy due to the automated ligator system's 204 restriction mode was IF2. This was caused by a required (unique) ligation segment being cut during the segment trimming process. To account for situations where restriction mode eliminates all viable strategies, users may turn off restriction mode. In this case, the automated ligator system 204 will analyze the protein using the normal ligation and segment cutoffs (e.g., 35 and 40) without trimming any sequences from the viable segment list 308. To limit the computational cost and avoid memory errors, the automated ligator system 204 can enter "safe mode" in this or similar situations. In "safe mode," the automated ligator system 204 can stop finding strategies when a certain threshold is reached. For example, if the operating system is Mac OS X, then the automated ligator system 204 may stop after a certain amount of memory is consumed (e.g., 100 MB of volatile or non-volatile storage), whereas the automated ligator system 204 may stop after a set period of time (e.g., 10 minutes) of searching for other operating systems. It should be appreciated that any threshold can be used such as, for example, a total number of CPU cycles performed, a ratio of memory used per processor, time running the automated ligator system 204, or similar.
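A time-based version of the safe-mode cutoff might look like the following sketch. It assumes strategies arrive from some iterator and stops collecting once a time budget is spent; the memory-based cutoff mentioned for Mac OS X is not shown. The function name `collect_with_budget` and the injectable `clock` parameter (used so the behavior can be checked without waiting) are assumptions for illustration.

```python
import time


def collect_with_budget(strategy_iter, max_seconds=600.0, clock=time.monotonic):
    """Collect strategies until the iterator ends or the time budget is spent.

    Returns (collected, stopped_early).
    """
    start = clock()
    collected = []
    for strategy in strategy_iter:
        collected.append(strategy)
        # Check the budget after each strategy so partial results are kept.
        if clock() - start >= max_seconds:
            return collected, True
    return collected, False
```

The default of 600 seconds mirrors the 10 minute example given above; any other threshold could be substituted.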

An output file 316 of the automated ligator system 204 can be a text file, a CSV file, or an "Automated Ligator System Analysis Excel file," which shows the top number (e.g., 1000) of ligation strategies for each protein. The "best" strategies are those that have the highest score, which is compiled by adding the overall score for each individual segment while considering any penalties imposed for strategies containing too many segments. The automated ligator system 204 can also show some or all of the potential strategies for a protein within the "All Strategies" file 318 (e.g., text file), which can be manually or automatically saved/located in a "Total Ligation Strategies" subfolder of the same or different folder (e.g., storage location) that included the sequence file 306 that was analyzed. Finally, all or some of the segments used to find potential ligation assemblies for some or every protein in the directory are stored in the "All Segments with Solubility Scores" file 314 (e.g., text file), which may be stored in a subfolder under the same name at the same or a different location as the "Total Ligation Strategies" subfolder and/or the sequence file 306 analyzed.
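The ranking described above, in which segment scores are summed, any strategy-level penalty is added, and the top hits are kept, can be sketched as follows. The names `rank_strategies`, `segment_scores`, and `strategy_penalty` are hypothetical, and the per-segment scores are assumed to have been computed beforehand.

```python
import heapq


def rank_strategies(strategies, segment_scores, strategy_penalty, top_n=1000):
    """Return the top_n (total_score, strategy) pairs, highest score first.

    segment_scores: dict mapping each segment to its combined segment score.
    strategy_penalty: callable returning a value <= 0 for a whole strategy.
    """
    scored = []
    for strategy in strategies:
        total = sum(segment_scores[seg] for seg in strategy)
        total += strategy_penalty(strategy)
        scored.append((total, strategy))
    # nlargest avoids fully sorting very long strategy lists.
    return heapq.nlargest(top_n, scored, key=lambda item: item[0])
```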

In at least some implementations, the executable instructions that implement the automated ligator system 204 may be provided as an executable that can run on one or more operating systems, or may be provided as a script or in any other form. Further, in some implementations the output of the automated ligator system 204 may include the summary information for a run. For example, the summary information may include user inputs, alerts for when restriction mode more stringently cuts segments for finding ligation strategies, and/or the number of viable segments and strategies found for each protein. In at least some implementations, instead of having viable segments for all proteins within a single text file, the automated ligator system 204 may output all viable segments within a "viable segment lists" file (e.g., Excel file). In such implementations, all viable segments for each protein may be shown in separate sheets of an Excel spreadsheet, and all segments have an ID number, the average solubility score, and the final solubility score. Further, in at least some implementations, to allow users to quickly run the automated ligator multiple times on the same set of data, output files may be stored within a timestamped folder, for example.

In at least some implementations, the automated ligator system 204 may first provide an explanation of the default thioester characterizations built into the automated ligator system (e.g., based mainly on Fmoc hydrazide SPPS). The user may then have the option to use the default thioester characterizations or to customize them using the modification file 304 discussed above. If the user would like to customize the thioesters, the user may edit the modification file 304 to classify all 20 possible thioesters. The automated ligator system 204 may then next ask the user to enter the maximum length of peptide segments that can be considered in making strategy predictions. Since the minimum length of segments is set to 10 residues in at least some implementations, the number the user enters needs to be larger than this minimum value.

In some embodiments, the automated ligator system 204 may provide an explanation of the restriction mode and will ask if the user would like to turn this mode off. In some embodiments, the automated ligator system 204 then provides an explanation of the helping hand Lys linker reward function to the user and asks if the user would like to keep the reward function. The automated ligator system 204 then starts, and shows the user the status of the analysis on a display as each file (e.g., FASTA file) is analyzed. If the user is working with large proteins, any restriction mode component that is triggered by one of the files may be shown to the user.

In some embodiments, when the automated ligator system 204 is finished, the amount of time (e.g., in seconds) taken to run the automated ligator system may be displayed. As noted above, one or more of three new components may be present, e.g., the "All Segments with Solubility Scores" folder, the "Total Ligation Strategies Text Files" folder, and the output Excel document. It should be appreciated that the file and/or folder names can be changed to any desirable naming scheme and/or prompted to a user for selection/input.

As described above, the automated ligator system 204 can predict optimal ligation strategies by using one or more (e.g., 5) different scoring functions. As shown in the table 400 of Figure 4, the automated ligator analysis file 316 can automatically show the user the total strategy score for the top (e.g., 1000) hits, along with the total score from each individual function. In an actual output file 316, the peptide segments for each row's proposed synthesis strategy may be shown to the right of the final column in the table 400.

The following describes how each of the functions can score potential ligation strategies, as well as provide rationale for assigning scores in each function. Each scoring function was designed to try to give scores that were on the same scale, meaning that certain functions would not dominate the overall score for ligation strategies. In some embodiments, the scores are weighted to reduce dominating functions or, alternatively, weighted to encourage one or more functions to dominate or be more heavily weighted within the overall score.

Thioester Score

Description of Scoring Function

Except for the segments corresponding to the C-terminus of the full-length protein, the thioester scoring function assigns segments a score of 2 or 0 (or in some embodiments a weighting thereof), depending on the residue that would need to be converted to a thioester for native chemical ligation. While forbidden thioesters are not technically included in this scoring function, as these segments will be filtered out before being scored, it is still worth mentioning again that, in at least some implementations, the automated ligator system 204 does not allow segments with poor thioesters to be available for creating ligation assemblies. The table 500 in Figure 5 shows all 20 possible thioesters and their assigned scores. It should be appreciated that one or more of the possible thioesters can be further grouped and scored accordingly. For example, a third group can be formed from the preferred thioesters and scored as a +1 (e.g., based on a comparatively lower thioester kinetic rate) or +3 (e.g., based on a relatively higher kinetic rate).

Rationale

Preferred vs. accepted thioester list - Preferred thioesters can be selected primarily for their enhanced thioester kinetic rates compared to the kinetic rates of accepted thioesters. In at least some implementations, Lys may not be selected as a preferred thioester due to reported lactamization.

Forbidden thioester list - Asp and Glu have been shown to undergo thioester migration to the side chain. Pro thioesters have extremely slow ligation kinetics. Asp, Asn, and Gln thioesters cannot be prepared via the hydrazide method, which is an exemplary, and in some embodiments a preferred, method of choice.

Solubility Score

Description of Scoring Function

The automated ligator system 204 can calculate a solubility value for each segment by simply scoring the segment for the presence of positive residues (+1 for each occurrence) or problematic residues (-1 for each occurrence). Table 600 of Figure 6 lists the residues that are categorized as positively-charged or problematic, and the corresponding value given to a segment when the automated ligator system 204 finds one in the sequence. The automated ligator system 204 can use this solubility value to determine an average "per residue" solubility value for the segment, based on, for example, Equation 1 (below):

$$\overline{Solub} = \frac{Solub}{n} \qquad (1)$$

Equation 1 (above) shows a calculation of average solubility, where $\overline{Solub}$ is equal to the average solubility score, $Solub$ represents the solubility score calculated for a segment, and $n$ equals the number of residues in a segment.

The average solubility score is used to assign a segment's final solubility score, which can be anywhere between 0 and -3 in the example provided. The table 700 of Figure 7 shows the equations used, depending on the value of the average solubility score calculated by Equation 1, to assign the final solubility score (see rationale section for details on the values used for the constants). In table 700, $\overline{Solub}$ corresponds to the segment's average solubility score; SolubTestSet is equal to the mean average solubility score for all viable segments in the ribosomal test data set; OneStdDev, TwoStdDev, and 3StdDev refer to one, two, and three standard deviations, respectively, away from the mean average solubility score for viable segments in the test data set; and FinalSolub is the final solubility score given to a segment.

If the user decides to turn the helping hand Lys linker reward function on, then the presence of a Lys in a segment will result in the final solubility score being divided by 2. For example, if a segment receives a final solubility score of -2 from table 700 of Figure 7 and has a Lys in its sequence, then the final solubility score becomes -1 (i.e., -2 divided by 2) for the segment.
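The average solubility calculation of Equation 1 and the helping hand adjustment can be sketched as below. The problematic residues (Asp, Glu, Val, Ile, Leu) follow the rationale section; the positively-charged set is an assumption (Lys and Arg), because the contents of table 600 are not reproduced in the text. The mapping from average to final solubility score (table 700) is likewise not reproduced, so the final score is treated here as an input.

```python
# Assumption: table 600 is not reproduced in the text, so Lys and Arg are
# used here as the positively-charged residues for illustration only.
POSITIVE = {"K", "R"}
PROBLEMATIC = {"D", "E", "V", "I", "L"}  # the "DEVIL" residues


def average_solubility(segment):
    """Equation 1: raw solubility value divided by segment length."""
    raw = sum((aa in POSITIVE) - (aa in PROBLEMATIC) for aa in segment)
    return raw / len(segment)


def apply_helping_hand(final_solub, segment, enabled=True):
    """Halve the final solubility score when the segment contains a Lys."""
    if enabled and "K" in segment:
        return final_solub / 2
    return final_solub
```

For instance, a 10-residue segment with two Lys and one each of Asp, Glu, Val, Ile, and Leu has a raw value of -3 and an average of -0.3 under these assumptions.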

Rationale

Calculations of initial solubility values (table 600 of Figure 6): The scoring system used to calculate an initial solubility value of a segment was based on experience working with peptides of varying solubility. Generally, it was observed that segments rich in positively-charged residues have increased solubility. On the other hand, peptides that have many Asp, Glu, Val, Ile, or Leu (DEVIL) residues have been known to cause segment insolubility.

Calculation for average solubility score (Equation 1): In order to obtain unique solubility scores for each segment, the automated ligator system 204 needs to calculate an average solubility score for all segments. This score can be considered in some embodiments as the mean solubility value per residue within a segment. Without this calculation, the solubility values of individual segments tend to be heavily dependent on length, and the average solubility score helps to normalize this value.

Calculations for final solubility score (table 700 of Figure 7): Instead of having the average solubility score being the final score given to a segment, it may be desirable to get an idea of what the "mean" average solubility value is for a random peptide segment in the viable segments list for segments in the E. coli ribosomal data set. For example, the scoring function may result in most segments having negative average solubility scores, meaning that the negative values would not always correspond to poorly soluble segments, thus requiring readjustment of the zero point for the solubility score.

To get a sense of the mean average solubility score for segments, the average solubility scores were examined for all segments within the proteins of the E. coli ribosomal 30S subunit, 50S subunit, and Accessory/Translation Factors. All of these segments satisfied the requirements of the initial viable segment list (between 10 and 80 residues long without forbidden thioesters). Figure 8 shows box-and-whisker plots 800 for the average solubility scores calculated on viable segments within each of the three E. coli ribosomal protein classes, as well as the total. The following statistical data about the average solubility scores was calculated for each of the protein classes: 30S Subunit: Mean = -0.1041, Standard deviation = 0.1565; 50S Subunit: Mean = -0.1430, Standard deviation = 0.1585; Accessory & Translation Factors: Mean = -0.2203, Standard deviation = 0.1246; Total: Mean = -0.1581, Standard deviation = 0.1547.

As indicated in Figure 8, the mean of the average solubility scores for all three of the ribosomal protein classes was slightly negative, meaning that setting the mean average solubility score at 0 for the final solubility scoring function is likely inappropriate. To set the "true" mean, the mean average solubility score for all segments in all three protein classes combined was used (Total or "All Segments" in Figure 8). The three standard-deviation cutoffs in table 700 were calculated by using the standard deviation for all the segments. For example, the OneStdDev value is equal to the standard deviation subtracted from the mean average solubility value. The table 900 of Figure 9 shows the calculated values used in the automated ligator system 204 to define these four variables.
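As a worked illustration of the subtraction described above, the cutoffs can be computed from the Total statistics quoted earlier (mean -0.1581, standard deviation 0.1547). Table 900 itself is not reproduced in the text, so the results below are derived values that may differ from the table in rounding. The function name `std_dev_cutoffs` is hypothetical.

```python
MEAN_TOTAL = -0.1581  # Total mean average solubility score quoted in the text
STD_TOTAL = 0.1547    # Total standard deviation quoted in the text


def std_dev_cutoffs(mean, std):
    """Return the mean minus one, two, and three standard deviations."""
    return {k: round(mean - k * std, 4) for k in (1, 2, 3)}


cutoffs = std_dev_cutoffs(MEAN_TOTAL, STD_TOTAL)
# One standard deviation below the mean:    -0.3128
# Two standard deviations below the mean:   -0.4675
# Three standard deviations below the mean: -0.6222
```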

Helping hand reward: The development of the traceless helping hand lysine linker advantageously helped to increase the solubility of problematic peptide segments. Since the helping hand can dramatically increase solubility of previously insoluble segments, in at least some implementations, it was decided to have a function that rewarded segments containing a Lys residue. If at least one Lys is observed within a segment, the helping hand reward function simply divides the segment's final solubility value (table 700) by 2, representing a 50% increase in solubility when the helping hand is incorporated into the segment. If the user does not wish to use the helping hand strategy, they may turn this function off via the prompts given to the user after the automated ligator system 204 is initialized. It should be appreciated that a more conservative helping hand weighting can be used, such as, for example, multiplying the segment's final solubility value by a constant greater than 1/2 (e.g., 3/5, 2/3, 3/4, 4/5, etc.), or it can, alternatively, be weighted more heavily by, for example, multiplying the segment's final solubility value by a constant less than 1/2 (e.g., 2/5, 1/3, 1/4, 1/5, etc.).

Segment Length Score

Description of Scoring Function

In at least some implementations, the automated ligator system 204 employs a simple function for scoring segments on the basis of length. The program has 40 amino acids set as the "ideal" segment length, which results in the maximal score of +2 given to segments 40 amino acids in length. For segments that are smaller or larger than 40 residues, a penalty of 0.1 is given for each amino acid length difference. For example, a segment that is 57 residues long will have a length score of 0.3 (i.e., 2.0 - |0.1 x(57 - 40)|), whereas a segment 23 residues long will also get a score of 0.3 (i.e., 2.0 - |0.1 x(23 - 40)|).
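The length function reduces to a single expression. The sketch below uses the default values stated above (ideal length 40, maximum score +2, penalty 0.1 per residue of difference); the function and parameter names are assumptions for illustration.

```python
def length_score(segment_length, ideal=40, max_score=2.0, penalty_per_residue=0.1):
    """Score a segment by how far its length is from the ideal length."""
    return max_score - penalty_per_residue * abs(segment_length - ideal)
```

With these defaults, lengths of 57 and 23 both score approximately 0.3, matching the example above, and the longest default segment of 80 residues scores -2.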

Rationale

In experiences with chemical protein synthesis of large proteins, an ideal length for each of the segments is approximately 40 amino acids. While segments between 20 and 60 amino acids are typically routine for solid-phase peptide synthesis, segments below 20 amino acids are usually not desired, as having many small segments requires more ligations to obtain the final protein. Additional ligations result in reduced yields due to the need for more RP-HPLC purifications. Segments above 60 residues can be synthesized, but these are much more likely to suffer from side reactions and aggregation due to the larger number of reactions needed to prepare such segments. 40 amino acids thus seemed like a feasible length to refer to as "ideal" for this scoring function.

Alanine Junction Penalty

Description of Scoring Function

In some embodiments, as discussed above, except for segments that represent the N-terminus of the desired protein, any segments that have an Ala at the N-terminus receive a penalty of -2, and segments containing a Cys at the N-terminus do not receive a penalty.

Rationale

While both Cys and Ala sites can be used as native chemical ligation junctions, Ala junction sites require (in the method used herein) an additional desulfurization step to convert the temporary Cys back to Ala after ligation has occurred. Additionally, native Cys in the protein must be protected when desulfurization steps are required. These additional steps, and potential yield losses from them, justify penalizing segments containing Ala junction sites compared to segments having Cys. The automated ligator system 204 may also be modified to use other ligation junctions (e.g., thiolated residues).
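The junction penalty can be sketched as a small check on a segment's first residue. The name `junction_penalty` and the boolean flag marking the protein's own N-terminal segment are hypothetical; the -2 value is the exemplary penalty given above.

```python
def junction_penalty(segment, is_protein_n_terminus, ala_penalty=-2):
    """Penalize segments whose N-terminal ligation junction is Ala."""
    # The first segment of the protein has no N-terminal ligation junction.
    if is_protein_n_terminus:
        return 0
    return ala_penalty if segment[0] == "A" else 0
```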

Excessive Number of Ligations Penalty

Description of Scoring Function

When a ligation strategy contains more than the ideal number of segments, this function penalizes the strategy's score by giving a -2 for each segment that is over the threshold segment number. For example, if the threshold segment number for a protein was calculated to be 10, then a ligation strategy containing 12 segments would be penalized with -4. Equation 2 below shows the calculation of the threshold segment number for assigning penalties.

$$Thresh = \frac{n}{40} \qquad (2)$$

wherein $Thresh$ corresponds to the threshold segment number, and $n$ equals the number of amino acids in the protein of interest. As discussed above, the automated ligator system 204 uses an equation similar to Equation 2 for discarding ligation strategies that contain too many ligations. Instead of 40 being the denominator for this function, 35 is the number placed in the denominator, and the threshold number refers to ligation number, not segment number.
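Equation 2 and the associated penalty can be sketched as follows. The text does not say how a fractional threshold is handled, so this example assumes the quotient is rounded down; the function name and parameters are likewise illustrative.

```python
def excess_segment_penalty(num_segments, protein_length, divisor=40, penalty=-2):
    """Apply `penalty` for each segment above the Equation 2 threshold."""
    # Assumption: the threshold n / divisor is rounded down to a whole number.
    threshold = protein_length // divisor
    excess = max(0, num_segments - threshold)
    return penalty * excess
```

For a 400-residue protein the threshold is 10, so a 12-segment strategy is penalized by -4, as in the example above. Passing `divisor=55` corresponds to the restriction-mode variant discussed in the rationale below.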

Rationale

As discussed in the segment length scoring function section above, an ideal length for peptide segments used in large chemical protein syntheses is 40 residues. As a result, Equation 2 uses 40 residues to calculate the penalty threshold segment number for ligation strategies. In other words, assemblies that have an average segment length that is less than 40 residues would be penalized by this function (-2 for every additional ligation). It is important to note that, if a protein contains more than 150 segments in the viable segment list and is longer than 400 residues, Equation 2 changes so that the denominator is 55 instead of 40, though it could be selected as any reasonable number larger than 40. In this case, the number is being scaled up to match the increase in the threshold ligation number used to calculate the number of ligations that can be used in a potential strategy (usually 35, but increases to 50 when the restriction mode gets activated; see Figure 3). The smaller number may be used to calculate the ligation number cutoff for strategies that will even be considered in the automated ligator system 204, whereas the larger number may be used to assess which strategies have an acceptable, but non-ideal, number of segments in order to penalize such strategies.

As noted above, if running a particular protein with the automated ligator system 204 causes an error to occur, then the output or information stored in memory (e.g., volatile memory) that is being generated by the automated ligator system may be too large for the resources available on the computing system. In this case, the problematic protein sequence may be analyzed by the automated ligator system 204 in safe mode (restriction mode turned off), which may, in some embodiments, be able to get an idea of the ligation strategies available for the protein.

Selection of Ribosomal Protein Set

As an ideal test set for the automated ligator system 204, the E. coli ribosomal proteins (30S and 50S subunits plus key accessory factors) were selected for analysis. The diagram 1000 of Figure 10 shows the ribosomal protein test set, which includes all proteins within the 30S and 50S E. coli ribosome, as well as important accessory factors, that are compiled in the test data set (structures shown from the following PDB codes: 4V6D, 4V90, 1EFC, 2B3T, 1EK8).

Synthesis of a mirror-image ribosome has been a longtime dream for mirror-image synthetic biology and would enable production of large mirror-image proteins via in vitro translation. A mirror-image ribosome is also a key stepping stone towards building a fully mirror-image cell ("D. coli"). The E. coli ribosome is ideal for this project because it has been extensively characterized (including detailed protocols for its efficient in vitro assembly), and it is active without rRNA modifications (which would be difficult to produce in mirror-image). These 65 proteins represent an ideal set with lengths from 38 to 890 residues (21 30S subunits, 33 50S subunits, and 11 key translation accessory factors). As shown in the histogram 1100 of Figure 11, 57 of the proteins are within reach of current CPS techniques (less than 300 aa), although proteins longer than 200 aa would likely require multiple synthesis attempts with current manual synthetic designs. The remaining eight proteins would be very challenging to prepare with current CPS methods, as the largest protein synthesized to date is the 352-aa Dpo4 DNA polymerase. These lengths, combined with the large number of total subunits, illustrate the need to enhance the efficiency of current CPS strategies to achieve this ambitious goal.

The histogram 1200 of Figure 12 compares the number of Cys and Ala ligation sites available in the ribosomal data set. This analysis demonstrates the importance of including non-Cys ligation sites via the ligation-desulfurization approach into ligation strategy prediction tools, as the ribosomal protein set is highly Cys-deficient. Here Ala is included as an alternate ligation junction since it is the most common amino acid in the test set and the most commonly used alternate ligation site.

The data for the histograms 1100 and 1200 were generated using a protein amino acid composition analysis program. In order to analyze interesting amino acid composition trends between subgroups of the E. coli ribosome (e.g., 30S, 50S, and the accessory/translation factors), a script was developed to quickly analyze an entire set of individual FASTA files. This script is called "Protein amino acid composition analysis," or Paacman. Paacman is most similar to the "ProtParam" tool that is available on the ExPASy bioinformatics resource portal. Paacman is able to analyze multiple protein FASTA files at once to show the amino acid composition of an entire set of proteins. Paacman also lists all possible di-amino acid sequences, as well as highlights di-amino acid sequences important in chemical protein synthesis, such as potential ligation junctions, aspartimide-prone sequences, and pseudoproline sites.

In operation, Paacman first counts the number of each amino acid within the FASTA files located in the user's folder of interest. The script then generates an Excel output file listing the number of individual amino acids, as well as total amino acid numbers, for the entire list of proteins within the folder. In addition, Paacman generates a heat map for amino acid composition for the set of proteins. This heat map is found within the output Excel file. See e.g., Jacobsen MT, Erickson PW, Kay MS. Aligator: A computational tool for optimizing total chemical synthesis of large proteins. Bioorg Med Chem. 2017; 25: 4946-4952, the contents of which including supplemental material are incorporated herein by reference as if set forth in its entirety.
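By way of non-limiting illustration only, the counting step described above might be sketched in Python as follows. This is not the Paacman script itself; the function names are hypothetical, the FASTA text is passed in as a string rather than read from a folder, and the Excel and heat map outputs are omitted.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def composition(records):
    """Count amino acids per protein and across the whole set."""
    per_protein = {name: Counter(seq) for name, seq in records.items()}
    total = Counter()
    for counts in per_protein.values():
        total.update(counts)
    return per_protein, total


example = ">p1 toy sequence\nMKC\nA\n>p2\nAAK\n"
per_protein, total = composition(parse_fasta(example))
print(total.most_common())
```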

Paacman also performs a di-amino acid composition analysis on the user's FASTA files. The script currently performs two types of di-amino acid searches: a search for di-amino acid sequences important to complete protein synthesis, and a total di-amino acid search (e.g., all 400 possible di-amino acid sequences). Potential cysteine ligation sites (XC), alanine ligation sites (XA), aspartimide-prone sites (DX), and serine/threonine pseudoprolines (S/T)X are specifically highlighted in the "CPS Di-AA Composition" search. However, the script simply lists the number of these sites present in each protein of interest; Paacman may not currently remove di-amino acid sequences that are not compatible with total chemical synthesis strategies. The script lists the results for each search within the same Excel output file as the amino acid composition, but each result is in a separate sheet.
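By way of non-limiting illustration only, the two di-amino acid searches might be sketched in Python as follows. The function names are hypothetical, and the site patterns follow the notation exactly as written above (XC, XA, DX, and (S/T)X, where X is any residue); the sketch only counts sites and does not filter them.

```python
from collections import Counter


def dipeptide_counts(seq):
    """Count every overlapping di-amino acid sequence in a protein."""
    seq = seq.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def cps_site_counts(seq):
    """Count the di-amino acid classes highlighted for chemical protein synthesis."""
    pairs = dipeptide_counts(seq)
    return {
        "XC": sum(n for p, n in pairs.items() if p[1] == "C"),
        "XA": sum(n for p, n in pairs.items() if p[1] == "A"),
        "DX": sum(n for p, n in pairs.items() if p[0] == "D"),
        "(S/T)X": sum(n for p, n in pairs.items() if p[0] in "ST"),
    }


print(cps_site_counts("MACDGSTA"))
```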

When Paacman has run, the user will see an output Excel document within a folder. This Excel file shows the single amino acid composition analysis in the first sheet, the important di-amino acid sequences for complete protein synthesis in the second sheet, and the total di-amino acid sequence analysis in the third sheet.

Design of the automated ligator system

As discussed above, the automated ligator system 204 first divides the protein sequence based on the presence of Cys or Ala ligation sites to generate a list of potential peptide segments (see the diagram 1300 of Figure 13). One version of the automated ligator system 204 may be designed to work with thioesters prepared using the hydrazide method, so segments containing "incompatible" C-terminal residues (Asp, Glu, Asn, Gln, and Pro) are not included in the segment list. Specifically, Asp/Glu may be excluded because of their potential for thioester migration to the side chain, although recent work has suggested that this is a pH-dependent reaction more prevalent in Asp thioesters. Asp, Asn, and Gln may be excluded because they cannot be directly prepared via the peptide hydrazide method, although recent work has addressed this challenge using a two-step cleavage protocol. Pro may be eliminated due to its slow ligation kinetics and propensity for diketopiperazine formation. The modular design of the automated ligator system 204 allows for changes to these restrictions as new tools are developed (or based on the ligation tools available to each group) and already includes a thioester scoring customization option, as discussed above. In at least some implementations, only segments between 10 residues and the maximum length chosen (e.g., 80 residues) are allowed, since shorter or longer segments will likely result in inefficient strategies or unacceptable segment purity, respectively. This upper limit can be customized to reflect a user's preference.
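By way of non-limiting illustration only, the segment-generation rules above might be sketched in Python as follows. The function name and the half-open (start, end) index convention are assumptions of this sketch. It also assumes that a segment begins either at the protein N-terminus or at a Cys/Ala junction residue, and that the C-terminal residue restriction applies only to segments that would carry a thioester (i.e., not to the segment ending at the protein C-terminus).

```python
INCOMPATIBLE_C_TERM = set("DENQP")  # Asp, Glu, Asn, Gln, Pro
JUNCTION_RESIDUES = set("CA")       # Cys and Ala ligation sites


def viable_segments(seq, min_len=10, max_len=80):
    """Return (start, end) half-open index pairs for all viable segments."""
    n = len(seq)
    junctions = [i for i in range(1, n) if seq[i] in JUNCTION_RESIDUES]
    starts = [0] + junctions   # a segment begins at residue 0 or at a junction residue
    ends = junctions + [n]     # and ends just before a junction or at the C-terminus
    segments = []
    for s in starts:
        for e in ends:
            length = e - s
            if length < min_len or length > max_len:
                continue
            # Internal segments need an acceptable C-terminal (thioester) residue.
            if e < n and seq[e - 1] in INCOMPATIBLE_C_TERM:
                continue
            segments.append((s, e))
    return segments


toy = "M" + "G" * 11 + "C" + "G" * 10 + "DA" + "G" * 10
print(viable_segments(toy))
```

In the toy sequence, the candidate segments ending before the Ala junction are rejected because their C-terminal residue would be Asp.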

As shown in Figure 13, in at least some implementations, the automated ligator system 204 first divides the protein sequence at Cys and Ala ligation sites. The automated ligator system 204 then generates all viable 10-80 aa segments that have acceptable thioesters. These segments are then scored by summing the four scoring functions shown (dotted box), as described above. Referring to Figure 14, after segments are scored, the automated ligator system 204 identifies viable ligation assemblies by looping through the entire list of segments to find those that can be connected to create the entire protein. Each box corresponds to segments of different lengths within the viable segment list, and the branches after the hydrazides indicate the available number of segments for the next ligation. Generally, a branch terminates if: 1) the number of ligations exceeds n/35 (not shown); 2) no valid segments can be ligated to the C-terminus (shown as an X); or 3) a viable ligation assembly is completed. All complete ligation strategies are scored by summing the scores of their segments plus a penalty for "excessive" ligations, as described above. Finally, the automated ligator system 204 sorts the final viable ligation strategy list by overall score.

As noted above, in at least some implementations, each segment is evaluated by summing the four components of the scoring function (Figure 13): 1) Thioester (ligation) kinetics: Segments containing a preferred C-terminal thioester receive +2 points (e.g., Ala, Arg, Cys, Gly, His, Met, Phe, Ser, Trp, or Tyr), while other acceptable thioesters (e.g., Ile, Leu, Lys, Thr, or Val) result in 0 points. 2) Solubility: The handling properties of peptides under NCL and acidic RP-HPLC conditions have been found to correlate positively with the density of positively charged residues (His, Lys, Arg) and negatively with the density of negatively charged or branched hydrophobic residues (Asp, Glu, Val, Ile, and Leu). Therefore, a score of +1 is assigned for each HKR residue and -1 for each DEVIL residue. A segment's cumulative solubility score is then divided by the number of residues to produce a "per residue" solubility score. This number is then compared to the distribution of solubility scores observed for all viable segments within our entire ribosomal protein test set. Segments with average or better predicted solubility receive a score of zero, while below average segments receive a penalty score corresponding to their number of standard deviations below the mean (capped at -3.0). 3) Length: An ideal segment length may be about 40 residues, as these are typically straightforward to prepare by SPPS and to purify by RP-HPLC. The length scoring function assigns a score of +2 for an ideal segment (40 residues) and subtracts 0.1 per residue deviation from this ideal value (score = 2 - |40 - x| × 0.1) for segments of length x (e.g., a 30-residue segment receives 1 point). 4) Ligation junction: Each Ala junction site receives a score of -2 to reflect the added complexity (and potential yield loss) of post-NCL desulfurization.
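By way of non-limiting illustration only, the four scoring components might be sketched in Python as follows. The function names are hypothetical. The mean and standard deviation of the reference solubility distribution are passed in as parameters because their values for the ribosomal test set are not stated here, and the sketch assumes that the Ala junction penalty is charged to the segment whose N-terminal residue is the junction Ala.

```python
PREFERRED_THIOESTERS = set("ARCGHMFSWY")  # +2 points
ACCEPTABLE_THIOESTERS = set("ILKTV")      # 0 points
POSITIVE = set("HKR")                     # +1 each
DEVIL = set("DEVIL")                      # -1 each


def thioester_score(c_term_residue):
    if c_term_residue in PREFERRED_THIOESTERS:
        return 2.0
    if c_term_residue in ACCEPTABLE_THIOESTERS:
        return 0.0
    raise ValueError("incompatible C-terminal residue: " + c_term_residue)


def per_residue_solubility(segment):
    raw = sum((r in POSITIVE) - (r in DEVIL) for r in segment)
    return raw / len(segment)


def solubility_score(segment, mean, sd):
    """Zero if average or better; otherwise standard deviations below the mean, capped at -3.0."""
    z = (per_residue_solubility(segment) - mean) / sd
    return 0.0 if z >= 0 else max(z, -3.0)


def length_score(length, ideal=40):
    return 2.0 - 0.1 * abs(ideal - length)


def junction_score(n_term_residue):
    return -2.0 if n_term_residue == "A" else 0.0


def segment_score(segment, mean, sd, has_thioester=True, starts_at_junction=True):
    """Sum the four components for one segment."""
    total = solubility_score(segment, mean, sd) + length_score(len(segment))
    if has_thioester:
        total += thioester_score(segment[-1])
    if starts_at_junction:
        total += junction_score(segment[0])
    return total


print(length_score(30))  # a 30-residue segment receives 1 point
```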

All viable segments are then assembled into potential ligation strategies, and the automated ligator system 204 outputs optimal strategies in a rank-ordered list (see Figure 14). In addition to summing the individual scores from each segment, ligation strategies receive an additional score based on the number of ligations. For traditional NCL (excluding various one-pot and solid-phase approaches), each ligation step adds an HPLC purification step that ultimately reduces yield. To discourage strategies that have more than an ideal number of ligations, this scoring function penalizes strategies having an average segment length less than 40 residues (-2 for each "excessive" ligation beyond this limit). The list of potential ligation strategies may be arranged by the overall automated ligator system 204 scores. Compiling all of the segments into rank-ordered ligation strategies represents by far the most computationally demanding step of the automated ligator system 204. As a target protein's length and number of segments increase, the number of potential strategies explodes, quickly exceeding typical memory and storage limits (e.g., the 529-aa RF3 protein has greater than 6 million potential ligation strategies). As described above, the "ideal" segment length is set to 40 residues, so strategies are penalized for having more than n/40 segments (wherein n equals protein length). As a result, strategies having more than n/35 ligations would score very poorly in our parameters, and thus can be excluded from the analysis, greatly reducing the number of possible strategies (e.g., a 350-aa protein is limited to less than or equal to 10 ligations). The modular automated ligator system may allow users to easily adjust these cutoff variables to their preferences.
For very long proteins (e.g., greater than 400 residues or with greater than 200 potential segments), the program may enter the "restriction mode" that more aggressively trims unlikely ligation strategies, as discussed above. Additionally, if no valid strategies are identified, the user can elect to enter "safe mode," which provides a sampling of valid ligation strategies by limiting the total number of strategies evaluated (only required for the 890-residue IF2 in the test set).
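By way of non-limiting illustration only, the assembly of scored segments into ranked ligation strategies, with pruning at the n/35 ligation cutoff, might be sketched in Python as a depth-first search. The function name, the (start, end, score) tuple layout with half-open indices, and the rounding down of n/40 are assumptions of this sketch; restriction mode and safe mode are not modeled.

```python
def enumerate_strategies(n_residues, segments, cutoff_denominator=35, ideal_length=40):
    """Return (total_score, [segments]) for every complete assembly, best first.

    segments: iterable of (start, end, score) tuples with half-open indices.
    """
    by_start = {}
    for seg in segments:
        by_start.setdefault(seg[0], []).append(seg)
    max_ligations = n_residues / cutoff_denominator
    threshold = n_residues // ideal_length  # Equation 2, rounded down (assumption)
    results = []

    def extend(position, chain):
        if position == n_residues:
            # Complete assembly: sum segment scores plus the excess-segment penalty.
            excess = max(0, len(chain) - threshold)
            total = sum(s[2] for s in chain) - 2.0 * excess
            results.append((total, list(chain)))
            return
        # Adding one more segment would bring the ligation count to len(chain).
        if len(chain) > max_ligations:
            return
        for seg in by_start.get(position, []):
            chain.append(seg)
            extend(seg[1], chain)
            chain.pop()

    extend(0, [])
    results.sort(key=lambda r: r[0], reverse=True)
    return results


toy_segments = [
    (0, 40, 2.0), (40, 80, 1.5), (0, 80, -2.0),
    (0, 20, 0.5), (20, 40, 0.5), (0, 10, 1.0), (10, 20, 1.0),
]
for total, chain in enumerate_strategies(80, toy_segments):
    print(total, chain)
```

For this 80-residue toy input, the four-segment path is pruned because it would need three ligations, which exceeds 80/35, and the three-segment path is penalized by -2 for one segment over the threshold of two.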

As a demonstration of how new synthetic tools can be incorporated into the automated ligator system, an optional function ("HH mode") may be included that predicts the solubilizing impact of a helping hand (HH) in the strategy. As described above, the traceless helping hand linker, installed at a Lys residue, can be used to increase the solubility of peptide segments. If the user wishes to incorporate helping hands in a synthesis strategy, then the automated ligator system 204 will reward segments containing at least one Lys residue by dividing their solubility scores by 2, which reduces the penalty associated with negative (poor) solubility scores. Interestingly, approximately 93% of all valid segments in the ribosome test set are HH-compatible (i.e., contain at least one Lys), demonstrating the general applicability of this tool. The automated ligator system 204 analyses described below do not use HH mode unless otherwise specified.
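By way of non-limiting illustration only, the HH mode adjustment might be sketched in Python as follows. The function name is hypothetical, and the solubility score is assumed to have been computed already (zero or negative, as described above).

```python
def hh_adjusted_solubility(segment, solubility_score):
    """Halve the solubility score of any segment containing at least one Lys."""
    if "K" in segment.upper():
        return solubility_score / 2.0
    return solubility_score


print(hh_adjusted_solubility("GGKGG", -2.0))  # Lys present: the penalty is halved
print(hh_adjusted_solubility("GGGGG", -2.0))  # no Lys: the penalty is unchanged
```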

Case Studies for the Automated Ligator System

To evaluate the predictive power of the automated ligator system 204 for ranking synthesis strategies, the automated ligator system was applied to three CPS projects: TNFα (157 aa), GroES (97 aa), and DapA (312 aa). The diagram 1500 of Figure 15 shows the top five TNF-α ligation strategies calculated by the automated ligator system 204. Displayed results may be provided in a format similar to the Excel output of the automated ligator system 204 discussed above, in which the total and component scores are shown before the segments in the predicted strategy. In the case of TNF-α, the 2nd-ranked assembly reflects the actual strategy used to prepare TNFα.

In the first case study, the synthesis of human Tumor Necrosis Factor alpha (TNF-α) was analyzed, which is an inflammatory cytokine overexpressed in chronic inflammatory diseases. This 157-residue target (residues 77-233, comprising the soluble domain) forms a 52 kDa trimer. The synthesis was initially attempted using a three-segment approach centered on the Pro176-Cys177 junction. However, no measurable product could be obtained due to slow ligation kinetics and diketopiperazine formation at the Pro ligation site. An alternative three-segment strategy was selected that circumvented this ligation site, resulting in successful L- and D-syntheses of the target. Indeed, analysis of TNFα by the automated ligator system 204 (see Figure 15; see Table 1: Analysis of TNFα by Automated Ligator System, which is provided below at the end of this specification with the row and column indicators repeated on each page) revealed that the published strategy (segment strategy number 2 in Figure 15; see highlighted strategy in Table 1: Analysis of TNFα by Automated Ligator System) scored 2nd out of 226 possible strategies with a score of +2.9 (scores ranged from +3.6 to -13.9). Figure 15 diagrams the peptide segments and scores associated with the top five strategies. Lower scoring strategies were penalized by multiple factors: thioester NCL reactivity, segment solubility, segment lengths, and ligation-associated penalties. Compared to the published synthesis, the top-ranked strategy overcame modestly poor solubility and an extra Ala ligation site with good length and thioester scores. The 3rd-ranked strategy (+1.7) is very similar to the top-ranked one, differing only in the junction between the 1st and 2nd segments (replacing an optimal Arg thioester with a sluggish Leu, resulting in a 2-point penalty). The 4th strategy is very similar to the 1st, but was penalized by a longer 62-residue C-terminal segment.
The 5th strategy had strong length and thioester scores that were offset by penalties due to an additional ligation (-2) and an additional Ala ligation junction (-2). Analyzing TNF-α with the automated ligator system 204 in HH mode (see Table 2: Analysis of TNFα by Automated Ligator System with Helping Hand, which is provided below at the end of this specification with the row and column indicators repeated on each page) predicts a 0.7-point HH benefit. Overall, the automated ligator system 204 analysis of TNF-α reinforced synthetic experience.

In the second case study, the synthesis of E. coli GroES was highlighted, which was made using a ligation-desulfurization approach. During this synthesis, significant challenges were encountered with the hydrophobic C-terminal region, which were overcome using a poly-Lys helping hand. Ultimately, both L- and D-GroES were synthesized using a two-segment approach centered on the Leu41-Ala42 junction. Analysis of GroES by the automated ligator system 204 produced ten possible strategies with scores ranging from -1.0 to -6.0. A full-length, one-segment strategy was not considered due to elimination of peptides greater than 80 aa. The published strategy scored -1.7, fourth on the list. Examination of higher scoring strategies suggested that increasing the length of the C-terminal segment would have produced a better strategy due to improved solubility. This possibility had not been considered, showing that the automated ligator system 204 can advantageously provide unexpected potential insights into synthetic strategies. Importantly, the synthesis required the assistance of a helping hand to overcome solubility problems in the C-terminal segment (which has a solubility score of -2.0). Analyzing GroES with the automated ligator system 204 in HH mode predicts a significant HH benefit.

In the final case, the synthesis of the E. coli GroEL/ES chaperone-dependent protein DapA (312 residues) was analyzed. In this work, several different synthetic strategies were pursued before initially settling on an eight-segment approach. The automated ligator system 204 analysis of DapA produced an astonishing 357,721 possible strategies, which were culled to the top 1000 hits (scores ranging from +17.2 to +7.8). Satisfactorily, the top-ranked strategy (+17.2) produced the same eight-segment strategy settled upon through experience. However, after further trial-and-error, a successful seven-segment strategy was identified that combined the last two segments. This approach ranked in the top 0.02%. This dramatic score difference is almost entirely due to a nearly 5-point segment length penalty (75 vs. 37 + 38 aa), reflecting the surprisingly high quality and yield of this long segment, which is difficult to predict. Interestingly, re-analysis in HH-mode predicts that the helping hand will have much less impact in the context of this protein. Thus, the automated ligator system 204 was effective at suggesting optimal synthesis strategies, but further tuning of the various design factors will be important in future versions of the program, based on user input.

After validating that the automated ligator system 204 provides an informative rank ordering of synthetic strategies for GroES, TNFα, and DapA, the entire ribosomal protein set was analyzed, with and without HH mode. Potential ligation strategies were found for every protein except IF2, and the automated ligator system 204 required less than 12 min to analyze the full protein set (on a 2016 MacBook Pro®). A ligation strategy was found for the very long IF2 protein only using safe mode (less than 8 min).

Analysis of the highest-scoring ligation strategies across this test set reveals how the tension between the different scoring functions affects overall ranking. For example, the top S12 ligation strategies take different approaches to achieve their high scores. The winning strategy benefits from a small number of ligations and ideal segment lengths. The two similar strategies tied for 2nd-place have one fewer ligation, but this benefit is offset by an additional Ala junction or a less optimal thioester. The 4th-place strategy makes optimal use of natural Cys junctions and minimizes ligations, but contains a suboptimal long segment (72 aa).

Perhaps the most interesting aspect of this automated ligator system analysis is the impact of the HH reward function on the overall strategy scores. For example, S1's top score is dramatically reduced (2.2, no-HH) due to its poor solubility (-10.7). Its score improves dramatically in HH-mode (by approximately 5 points), indicating that the poorly soluble segments contain HH installation sites (Lys), though the order of the top-scoring strategies is largely unchanged (all benefit similarly from the HH reward). A striking example of the HH's potential impact is RF2, which is predicted to be among the most difficult proteins in the test set (-2.0, no-HH). Application of HH-mode greatly improves the top ranking score to 0.8, but also deposes the original top-ranked strategy (now #2). Compared to the top non-HH strategy, the best HH-assisted strategy uses better thioesters and has a better length distribution, though these benefits are partly offset by its use of one additional Ala junction. HH-mode reveals this ligation strategy by diminishing strong solubility penalties that otherwise discouraged the use of more optimal segments. As expected, proteins with good overall solubility scores are predicted to benefit much less from helping hands (e.g., L1). Experimental data from the ribosomal test set will provide valuable training to improve the automated ligator system 204, particularly for balancing the relative weights of each of the scoring functions.

In this perspective, two of the central challenges in CPS projects were highlighted: peptide solubility and the selection of an efficient synthetic strategy. Chemists have been continually developing new tools to handle difficult peptides, in particular the Helping Hand tool for temporarily introducing solubilizing groups at Lys residues. In certain cases, there may not be an accessible Lys in a difficult segment, so it would be wise to employ the suite of solubilizing tools (e.g., Gly, Glu, or Cys-based strategies described in the introduction). In order to address strategic challenges in CPS projects, the automated ligator system of the present disclosure has been provided, which is a new computational tool to analyze and then predict the most efficient synthetic strategies for CPS projects.

The automated ligator system 204 effectively ranked the synthesis strategies for three targets: TNFα, GroES, and DapA. The automated ligator system was then applied to the entire ribosomal protein set. The automated ligator system may incorporate additional scoring functions to reflect other challenges in CPS. For example, aspartimide formation reduces synthetic quality and yields, so the automated ligator system may reward placement of aspartimide-prone dipeptide sequences near the N-termini of segments (to minimize base-catalyzed formation). As another example, the automated ligator system may take into account proline and pseudoproline sites that introduce kinks into the peptide chain, which can help prevent on-resin aggregation during SPPS, significantly improving quality and yield. This type of kink analysis may account for the distribution of such sites within segments in order to predict synthetic difficulty. For mirror-image projects, this function may help predict which D-pseudoproline dipeptides would have the most impact on improving overall quality.

In at least some implementations, the automated ligator system may consider the order of segment assembly. While one-pot and solid-phase ligation strategies can be used, convergent synthesis is generally the most efficient approach. Consideration of the optimal ligation order may depend on the specific terminal protecting groups used, as well as the arrangement of residues requiring desulfurization. To predict optimal convergent synthesis strategies, the automated ligator system may also consider the solubility of selected assembly intermediates to avoid intractable peptides that are difficult to ligate or purify.

In at least some implementations, the automated ligator system may be centered on NCL using the Fmoc-compatible hydrazide method for generating peptide thioesters, but may be generally applicable to other chemoselective ligation reactions (e.g., KAHA and Ser/Thr ligation).

The foregoing detailed description has set forth various implementations of the devices and/or processes via the use of block diagrams, schematics, and examples. Insofar as such block diagrams, schematics, and examples contain one or more functions and/or operations, it will be understood by those skilled in the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof. In one implementation, the present subject matter may be implemented via Application Specific Integrated Circuits (ASICs). However, those skilled in the art will recognize that the implementations disclosed herein, in whole or in part, can be equivalently implemented in standard integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computer systems), as one or more programs running on one or more controllers (e.g., microcontrollers), as one or more programs running on one or more processors (e.g., microprocessors), as firmware, or as virtually any combination thereof, and that designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one of ordinary skill in the art in light of this disclosure.

Those of skill in the art will recognize that many of the methods or algorithms set out herein may employ additional acts, may omit some acts, and/or may execute acts in a different order than specified.

In addition, those skilled in the art will appreciate that the mechanisms taught herein are capable of being distributed as a program product in a variety of forms, and that an illustrative implementation applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include, but are not limited to, the following: recordable type media such as floppy disks, hard disk drives, CD ROMs, digital tape, and computer memory.

The various implementations described above can be combined to provide further implementations. To the extent that they are not inconsistent with the specific teachings and definitions herein, all of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification, including U.S. Provisional Patent Application Serial No. 62/509,645, filed May 22, 2017, are incorporated herein by reference, in their entirety. Aspects of the implementations can be modified, if necessary, to employ systems, circuits and concepts of the various patents, applications and publications to provide yet further implementations.

These and other changes can be made to the implementations in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific implementations disclosed in the specification and the claims, but should be construed to include all possible implementations along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.

Table 1: Analysis of TNFα by Automated Ligator System

 | A | B | C | D
1 | Total Strategy Score | Thioester Total Score | Solubility Total Score | Segment Length Total Score
2 | 3.6308755 | | -2.2691245 | 5.9
3 | 2.93923281 | | -1.36076719 | 2.3
4 | 1.671421082 | | -2.228578918 | 5.9
5 | 1.166445944 | | -2.133554056 | 3.3
6 | 0.641778781 | | -3.058221219 | 5.7
7 | 0.474803254 | | -1.225196746 | -0.3
8 | 0.456230697 | | -3.243769303 | 5.7
9 | 0.149875414 | | -3.550124586 | 5.7
10 | -0.009322902 | | -1.709322902 | 3.7
11 | -0.049863909 | | -2.149863909 | 2.1
12 | -0.185756275 | | -2.885756275 | 2.7
13 | -0.235411993 | | -2.335411993 | 2.1
14 | -0.541767277 | | -2.641767277 | 2.1
15 | -0.544206498 | | -2.444206498 | 3.9
16 | -0.77109762 | | -1.47109762 | 0.7
17 | -0.793008474 | | -2.093008474 | 3.3
18 | -0.797800399 | | -1.697800399 | 2.9
19 | -0.850888487 | | -1.550888487 | 0.7
20 | -0.877398965 | | -1.977398965 | -0.9
21 | -1.235849188 | | -1.535849188 | 0.3
22 | -1.295764784 | | -2.395764784 | 3.1
23 | -1.317675638 | | -3.017675638 | 5.7
24 | -1.480746457 | | -3.380746457 | 3.9
25 | -1.503223721 | | -3.203223721 | 5.7
26 | -1.707411668 | | -1.407411668 | 1.7
27 | -1.782605486 | | -1.482605486 | 1.7
28 | -1.809579005 | | -3.509579005 | 5.7
29 | -2.145210694 | | -2.845210694 | 2.7
30 | -2.148352269 | | -1.848352269 | 3.7
31 | -2.172389147 | | -2.472389147 | 0.3
32 | -2.232078832 | | -2.332078832 | 4.1
33 | -2.3691245 | | -2.2691245 | 3.9
34 | -2.3691245 | | -2.2691245 | 3.9
35 | -2.381406335 | | -2.281406335 | 3.9
36 | -2.473752458 | | -1.573752458 | 1.1
37 | -2.481750063 | | -1.781750063 | 1.3
38 | -2.503660917 | | -2.403660917 | 3.9
39 | -2.615402789 | | -3.115402789 | 0.5

-2.816964472 2 -1.516964472 0.7

-2.998419621 6 -2.498419621 3.5

-3.050018538 4 -1.950018538 0.9

-3.052150272 4 -2.552150272 3.5

-3.073049026 4 -1.373049026 0.3

-3.073956979 2 -1.773956979 4.7

-3.183967705 6 -2.683967705 3.5

-3.262229955 4 -1.562229955 0.3

-3.285204338 6 -2.985204338 3.7

-3.30704548 4 -2.20704548 -3.1

-3.418064111 4 -1.718064111 2.3

-3.418290021 6 -2.718290021 1.3

-3.440200875 6 -3.340200875 3.9

-3.490322988 6 -2.990322988 3.5

-3.741661229 2 -1.041661229 -2.7

-3.742569182 0 -1.442569182 1.7

-3.786897118 6 -2.486897118 2.7

-3.825954677 4 -2.325954677 0.5

-3.972445202 6 -2.672445202 2.7

-3.976847028 4 -2.076847028 0.1

-3.977754982 2 -2.477754982 4.5

-4.02553329 6 -2.52553329 0.5

-4.1844049 6 -1.8844049 1.7

-4.278800486 6 -2.978800486 2.7

-4.328578918 4 -2.228578918 3.9

-4.331888574 6 -2.831888574 0.5

-4.340860754 4 -2.240860754 3.9

-4.354604069 4 -2.654604069 2.3

-4.383954915 4 -2.283954915 3.9

-4.410493991 8 -2.510493991 0.1

-4.574857208 4 -3.074857208 0.5

-4.614432174 4 -2.314432174 -0.3

-4.716849275 8 -2.816849275 0.1

-4.833554056 6 -2.133554056 1.3

-4.957250289 4 -2.457250289 1.5

-4.972882397 6 -1.872882397 0.9

-5.009472957 2 -1.909472957 0.9

-5.019470997 4 -2.119470997 1.1

-5.120944859 6 -2.820944859 1.7

80 -5.244658757 4 -2.944658757 3.7

81 -5.252525063 6 -2.552525063 1.3

82 -5.263605573 4 -2.763605573 1.5

83 -5.322997072 6 -2.822997072 3.5

84 -5.440631513 8 -3.540631513 0.1

85 -5.498799352 2 -1.198799352 -2.3

86 -5.505049957 4 -2.805049957 1.3

87 -5.54413816 6 -3.04413816 3.5

88 -5.576523878 8 -4.276523878 0.7

89 -5.626179596 8 -3.726179596 0.1

90 -5.629352356 6 -3.129352356 3.5

91 -5.656394866 8 -2.756394866 1.1

92 -5.738104024 6 -3.238104024 1.5

93 -5.909422356 6 -2.809422356 0.9

94 -5.944167753 4 -1.644167753 -2.3

95 -5.945075706 2 -2.045075706 2.1

96 -5.962750149 8 -3.062750149 1.1

97 -6.021604738 4 -1.721604738 1.7

98 -6.023466515 4 -2.123466515 0.1

99 -6.185756275 6 -2.885756275 0.7

100 -6.185756275 6 -2.885756275 0.7

101 -6.247693829 6 -2.347693829 0.1

102 -6.248601782 4 -2.748601782 4.5

103 -6.255601192 4 -2.555601192 -1.7

104 -6.268166569 6 -3.368166569 -2.9

105 -6.429746714 4 -2.329746714 -2.1

106 -6.430654667 2 -2.730654667 2.3

107 -6.468834917 6 -2.568834917 0.1

108 -6.46974287 4 -2.96974287 4.5

109 -6.554049112 6 -2.654049112 0.1

110 -6.554957066 4 -3.054957066 4.5

111 -6.592708914 6 -2.692708914 2.1

112 -6.686532387 8 -3.786532387 1.1

113 -6.69021694 2 -1.39021694 -1.3

114 -6.730123439 6 -2.430123439 1.7

115 -6.77109762 4 -1.47109762 -1.3

116 -6.77109762 4 -1.47109762 -1.3

117 -6.793008474 4 -2.093008474 1.3

118 -6.810082235 4 -1.710082235 0.9

119 -6.814199686 2 -1.514199686 0.7

120 -6.848384471 4 -2.148384471 1.3

121 -6.850888487 4 -1.550888487 -1.3

122 -6.850888487 4 -1.550888487 -1.3

123 -6.899064198 6 -2.999064198 2.1

124 -6.92540274 4 -2.42540274 1.5

125 -7.044078689 4 -2.544078689 -2.5

126 -7.209451794 4 -1.509451794 -1.7

127 -7.211979482 4 -2.511979482 1.3

128 -7.295764784 6 -2.395764784 1.1

129 -7.295764784 6 -2.395764784 1.1

130 -7.478694438 2 -1.378694438 -2.1

131 -7.482811889 0 -1.182811889 -2.3

132 -7.535978297 6 -4.235978297 0.7

133 -7.616998008 2 -1.516998008 -0.1

134 -7.622846435 6 -3.722846435 2.1

135 -7.654820196 6 -1.954820196 -1.7

136 -7.655728149 4 -2.355728149 2.7

137 -7.666663397 6 -3.366663397 1.7

138 -7.697558442 4 -3.197558442 1.5

139 -7.707411668 2 -1.407411668 -0.3

140 -7.713880237 4 -2.413880237 0.7

141 -7.717997689 2 -2.217997689 0.5

142 -7.762787665 2 -1.462787665 -0.3

143 -7.782605486 2 -1.482605486 -0.3

144 -7.837981483 2 -1.537981483 -0.3

145 -8.145210694 4 -2.845210694 0.7

146 -8.145991753 4 -2.445991753 -1.7

147 -8.148352269 4 -1.848352269 1.7

148 -8.148352269 4 -1.848352269 1.7

149 -8.20058669 4 -2.90058669 0.7

150 -8.232078832 4 -2.332078832 2.1

151 -8.285610211 0 -1.185610211 -3.1

152 -8.287454828 4 -2.387454828 2.1

153 -8.442917875 6 -3.942917875 1.5

154 -8.481750063 6 -1.781750063 -0.7

155 -8.481750063 6 -1.781750063 -0.7

156 -8.520796011 2 -2.220796011 -0.3

157 -8.591360154 6 -2.891360154 -1.7

158 -8.592268107 4 -3.292268107 2.7

159 -8.615402789 6 -3.115402789 -1.5

160 -8.615402789 6 -3.115402789 -1.5

161 -8.816964472 2 -1.516964472 -1.3

162 -8.816964472 2 -1.516964472 -1.3

163 -8.892723465 4 -1.992723465 -0.9

164 -9.050018538 4 -1.950018538 -1.1

165 -9.050018538 4 -1.950018538 -1.1

166 -9.052150272 4 -2.552150272 1.5

167 -9.052150272 4 -2.552150272 1.5

168 -9.073956979 2 -1.773956979 2.7

169 -9.129332975 2 -1.829332975 2.7

170 -9.21672228 6 -3.71672228 -1.5

171 -9.367614632 6 -3.467614632 -1.9

172 -9.368522585 4 -3.868522585 2.5

173 -9.378302426 4 -2.678302426 -0.7

174 -9.416300894 8 -3.916300894 -1.5

175 -9.418064111 4 -1.718064111 0.3

176 -9.418290021 6 -2.718290021 -0.7

177 -9.418290021 6 -2.718290021 -0.7

178 -9.473440107 4 -1.773440107 0.3

179 -9.681200962 4 -1.981200962 -1.7

180 -9.685318413 2 -1.785318413 -1.9

181 -9.742569182 0 -1.442569182 -0.3

182 -9.797945178 0 -1.497945178 -0.3

183 -9.977754982 2 -2.477754982 2.5

184 -9.988844489 4 -2.488844489 0.5

185 -10.00519978 6 -3.705199778 -2.3

186 -10.02553329 6 -2.52553329 -1.5

187 -10.02553329 6 -2.52553329 -1.5

188 -10.03313098 2 -2.533130978 2.5

189 -10.16677992 4 -2.666779923 -1.5

190 -10.17089737 2 -2.470897374 -1.7

191 -10.20998558 4 -2.709985577 0.5

192 -10.29519977 4 -2.795199773 0.5

193 -10.33188857 6 -2.831888574 -1.5

194 -10.33188857 6 -2.831888574 -1.5

195 -10.34801789 6 -3.848017893 -0.5

196 -10.35460407 4 -2.654604069 0.3

197 -10.3840966 6 -2.484096597 -1.9

198 -10.40998007 4 -2.709980066 0.3

199 -10.48811674 2 -1.788116736 -2.7

200 -10.57485721 4 -3.074857208 -1.5

201 -10.6302332 4 -3.130233205 -1.5

202 -10.69045188 6 -2.790451881 -1.9

203 -10.79164281 4 -2.491642811 -0.3

204 -10.95725029 4 -2.457250289 -0.5

205 -10.9736957 2 -2.473695696 -2.5

206 -11.00947296 2 -1.909472957 -1.1

207 -11.01262629 4 -2.512626286 -0.5

208 -11.0127839 4 -2.712783899 -0.3

209 -11.019471 4 -2.119470997 -0.9

210 -11.019471 4 -2.119470997 -0.9

211 -11.06484895 2 -1.964848953 -1.1

212 -11.09799809 4 -2.797998095 -0.3

213 -11.26360557 4 -2.763605573 -0.5

214 -11.31898157 4 -2.818981569 -0.5

215 -11.39597086 4 -2.095970856 -1.3

216 -11.41423412 6 -3.514234119 -1.9

217 -11.50504996 4 -2.805049957 -0.7

218 -11.50504996 4 -2.805049957 -0.7

219 -11.94507571 2 -2.045075706 0.1

220 -12.0004517 2 -2.100451703 0.1

221 -12.19876918 4 -2.098769178 -2.1

222 -12.33251081 4 -3.032510814 -1.3

223 -12.43065467 2 -2.730654667 0.3

224 -12.48603066 2 -2.786030664 0.3

225 -13.10876529 4 -3.608765292 -1.5

226 -13.13530914 4 -3.035309136 -2.1

227 -13.91156361 4 -3.611563614 -2.3
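The tabulated values appear consistent with the Total Strategy Score (column A) being the plain sum of the component scores in columns B through F. This is an inference from the numbers, not a formula stated in this part of the table. A minimal Python sketch, assuming that additive form, backs out the combined E + F penalty implied by columns A-D of a few rows. The function names are hypothetical and used only for illustration.

```python
def total_strategy_score(thioester, solubility, segment_length,
                         ala_penalty, ligation_penalty):
    """Sum the five component scores (additive form is an assumption)."""
    return (thioester + solubility + segment_length
            + ala_penalty + ligation_penalty)


def implied_penalty_sum(total, thioester, solubility, segment_length):
    """Combined E + F penalty implied by columns A-D of one row."""
    # Rounded to absorb floating-point noise in the tabulated decimals.
    return round(total - (thioester + solubility + segment_length), 6)


# Columns A-D of rows 80, 85, and 119, copied from the table above.
rows = {
    80: (-5.244658757, 4, -2.944658757, 3.7),
    85: (-5.498799352, 2, -1.198799352, -2.3),
    119: (-6.814199686, 2, -1.514199686, 0.7),
}

implied = {row: implied_penalty_sum(*values) for row, values in rows.items()}
for row, penalty in implied.items():
    print(row, penalty)
```

Under this assumption each implied value is an even negative integer, which matches the kind of E and F entries listed in the following columns (for example, -6 and -4).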

E F G

E = Total Ala Junction Site Penalty; F = Total Penalty for # of Ligations; G = Segments (from N- to C-terminus)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-2 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-4 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-2 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-4 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-4 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-2 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-2 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-4 0 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-4 0 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-4 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-6 -2 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-4 0 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-4 0 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 6)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRR (SEQ ID NO: 5)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALL (SEQ ID NO: 7)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 9)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 8)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 9)

H I

Segments (from N- to C-terminus) Segments (from N- to C-terminus)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 17) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 18) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEG (SEQ ID NO: 38)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 15) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 10) CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 11)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQWFGIIAL (SEQ ID NO: 33)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 20) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 18)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 40)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34)

AHWANPQAEGQLQWLNRR (SEQ ID NO: 22) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 10)

ANPQAEGQLQWLNRR (SEQ ID NO: 23) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 10)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 10) CPSTHVLLTHTISRI (SEQ ID NO: 24)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 14) CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 15) AIKSPCQRETPEG (SEQ ID NO: 38)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 10) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 26)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

(SEQ ID NO: 14)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 14) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 11)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 16) CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 26) AESGQVYFGIIAL (SEQ ID NO: 43)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19) AIKSPCQRETPEG (SEQ ID NO: 38)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 15) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 40)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 14) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 11)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 16) CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 14) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 18)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 16) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 11)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 17) AEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 35)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 14) CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13)

ANPQAEGQLQWLNRRANALL (SEQ ID NO: 27) ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 17) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 36)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRI (SEQ ID NO: 24)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 40)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEG (SEQ ID NO: 38)

VFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 26)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 16) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 18)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEG (SEQ ID NO: 38)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 20) AEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 35)

FKGQG (SEQ ID NO: 16)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRI (SEQ ID NO: 24)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21) AVSYQTKVNLLS (SEQ ID NO: 44)

(SEQ ID NO: 14)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRI (SEQ ID NO: 24)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 20) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 36)

(SEQ ID NO: 21)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 29) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQWFGIIAL (SEQ ID NO: 33)

(SEQ ID NO: 21) NRPDYLDF (SEQ ID NO: 45)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21) AVSYQTKVNLLS (SEQ ID NO: 44)

VFQLEKGDRLS (SEQ ID NO: 18)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 11) AEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 35)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

FKGQG (SEQ ID NO: 16)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLS (SEQ ID NO: 44)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25) AVSYQTKVNLLS (SEQ ID NO: 44)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 14) CPSTHVLLTHTISRI (SEQ ID NO: 24)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 29) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

NO: 25)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 14) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 26)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 18) AEINRPDYLDF (SEQ ID NO: 46)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLS (SEQ ID NO: 44)

NO: 25)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

NO: 25)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 15) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34)

(SEQ ID NO: 14)

ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21) AVSYQTKVNLLS (SEQ ID NO: 44)

AHWANPQAEGQLQWLNRR (SEQ ID NO: 22) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 15)

ANPQAEGQLQWLNRR (SEQ ID NO: 23) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 15)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 16) CPSTHVLLTHTISRI (SEQ ID NO: 24)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 30) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12)

AHWANPQAEGQLQWLNRR (SEQ ID NO: 22) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 17)

ANPQAEGQLQWLNRR (SEQ ID NO: 23) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 17)

(SEQ ID NO: 14)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 16) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 26)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 29) AIKSPCQRETPEG (SEQ ID NO: 38)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRI (SEQ ID NO: 24)

SYQTKVNLLS (SEQ ID NO: 15)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 30) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 18)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 31) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLS (SEQ ID NO: 44)

NO: 25)

(SEQ ID NO: 21)

ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12) CPSTHVLLTHTISRI (SEQ ID NO: 24)

ANPQAEGQLQWLNRRANALL (SEQ ID NO: 27) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 30) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19)

VNLLSAIKSPCQRETPEG (SEQ ID NO: 20)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 20)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 29) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 40)

AHWANPQAEGQLQWLNRR (SEQ ID NO: 22) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21)

ANPQAEGQLQWLNRR (SEQ ID NO: 23) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12)

VNLLS (SEQ ID NO: 19)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 31) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19)

(SEQ ID NO: 21) (SEQ ID NO: 42)

SYQTKVNLLS (SEQ ID NO: 15)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 31) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLS (SEQ ID NO: 44)

NO: 25)

(SEQ ID NO: 21)

(SEQ ID NO: 14)

(SEQ ID NO: 21)

ANPQAEGQLQWLNRRANALL (SEQ ID NO: 27) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25)

(SEQ ID NO: 14) VFQLEKGDRLS (SEQ ID NO: 18)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42)

(SEQ ID NO: 14)

SYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 17)

VNLLS (SEQ ID NO: 19)

SYQTKVNLLS (SEQ ID NO: 15)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 30) AVSYQTKVNLLS (SEQ ID NO: 44)

NO: 25)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 30) AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 16) CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 18)

SYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 17)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 30) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45)

CPSTHVLLTHTISRI (SEQ ID NO: 30)

SYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 17)

VNLLSAIKSPCQRETPEG (SEQ ID NO: 20)

VNLLS (SEQ ID NO: 19)

ANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 29) AIKSPCQRETPEG (SEQ ID NO: 38)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 19)

AHWANPQAEGQLQWLNRRANALL AVS YQTKVNLL S (SEQ ID NO: 44) ANGVELRDNQLWPSEGLYLIYSQVL FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 31)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 28) FKGQG (SEQ ID NO: 12)

(SEQ ID NO: 29)

FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 31)

VNLLSAIKSPCQRETPEG (SEQ ID NO: 20)

AHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 31) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 20)

NO: 31)

AHWANPQAEGQLQWLNRR (SEQ ID NO: 22) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21)

ANPQAEGQLQWLNRR (SEQ ID NO: 23) ANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 21)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 12)

FKGQGCPSTHVLLTHTISRI (SEQ ID NO: 31)

VNLLSAIKSPCQRETPEG (SEQ ID NO: 20)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 20)

(SEQ ID NO: 29)

(SEQ ID NO: 21)

NO: 25)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25)

NO: 31)

NO: 25)

AHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 28) ANGVELRDNQLWPSEGLYLIYSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 25)

NO: 31)

NO: 31)

J K

Segments (from N- to C-terminus) Segments (from N- to C-terminus)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 35) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 36) AESGQVYFGIIAL (SEQ ID NO: 43)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AESGQVYFGIIAL (SEQ ID NO: 43)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 40) AESGQVYFGIIAL (SEQ ID NO: 43)

AEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 35) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AESGQVYFGIIAL (SEQ ID NO: 43)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 11) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AESGQVYFGIIAL (SEQ ID NO: 43)

(SEQ ID NO: 40)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 36) AESGQVYFGIIAL (SEQ ID NO: 43)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 11) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AESGQVYFGIIAL (SEQ ID NO: 43)

ID NO: 32)

AESGQVYFGIIAL (SEQ ID NO: 43)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

(SEQ ID NO: 40)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AVSYQTKVNLLS (SEQ ID NO: 44) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AESGQVYFGIIAL (SEQ ID NO: 43)

AEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 35) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

AESGQVYFGIIAL (SEQ ID NO: 43)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45) AESGQVYFGIIAL (SEQ ID NO: 43)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 40) AESGQVYFGIIAL (SEQ ID NO: 43)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 18) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 35) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AESGQVYFGIIAL (SEQ ID NO: 43)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

CPSTHVLLTHTISRIAVSYQTKVNLLS (SEQ ID NO: 13) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AESGQVYFGIIAL (SEQ ID NO: 43)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AVSYQTKVNLLS (SEQ ID NO: 44) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 34) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

ID NO: 41)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 40) AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45) AESGQVYFGIIAL (SEQ ID NO: 43)

(SEQ ID NO: 42)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

GVFQLEKGDRLS (SEQ ID NO: 18)

AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

GVFQLEKGDRLS (SEQ ID NO: 18)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

(SEQ ID NO: 40)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 26) AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AVSYQTKVNLLS (SEQ ID NO: 44) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 42) AEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 37)

AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AVSYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 39) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

(SEQ ID NO: 40)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AVSYQTKVNLLS (SEQ ID NO: 44) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

ID NO: 41)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

(SEQ ID NO: 42)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

(SEQ ID NO: 42)

AESGQVYFGIIAL (SEQ ID NO: 43)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

(SEQ ID NO: 40)

(SEQ ID NO: 26)

CPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 26) AESGQVYFGIIAL (SEQ ID NO: 43)

AESGQVYFGIIAL (SEQ ID NO: 43)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AVSYQTKVNLLS (SEQ ID NO: 44) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AVSYQTKVNLLS (SEQ ID NO: 44) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

CPSTHVLLTHTISRI (SEQ ID NO: 24) AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 41)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45) AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45) AESGQVYFGIIAL (SEQ ID NO: 43)

AVSYQTKVNLLS (SEQ ID NO: 44) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AVSYQTKVNLLS (SEQ ID NO: 44) AIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 33)

AIKSPCQRETPEG (SEQ ID NO: 38) AEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 32)

(SEQ ID NO: 40)

AVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSAEINRPDYLDF (SEQ ID NO: 45) AESGQVYFGIIAL (SEQ ID NO: 43)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

(SEQ ID NO: 40)

AEINRPDYLDF (SEQ ID NO: 46) AESGQVYFGIIAL (SEQ ID NO: 43)

Table 2: Analysis of TNFa by Automated Ligator System with Helping Hand

A B C D

-1.767073676 4 -1.867073676 4.1

-1.789813174 6 -2.289813174 0.5

-2.058482236 2 -0.758482236 0.7

-2.075009269 4 -0.975009269 0.9

-2.186978489 2 -0.886978489 4.7

-2.386524513 4 -0.686524513 0.3

-2.450244071 6 -1.950244071 3.5

-2.477109396 4 -1.977109396 3.5

-2.481114978 4 -0.781114978 0.3

-2.493636429 6 -2.193636429 3.7

-2.502212217 6 -2.402212217 3.9

-2.543018112 6 -2.043018112 3.5

-2.559032055 4 -0.859032055 2.3

-2.727273274 6 -2.227273274 3.5

-2.79125679 6 -2.09125679 1.3

-2.935634519 4 -1.835634519 -3.1

-3.021284591 0 -0.721284591 1.7

-3.214289459 4 -1.114289459 3.9

-3.220430377 4 -1.120430377 3.9

-3.220830614 2 -0.520830614 -2.7

-3.24220245 6 -0.94220245 1.7

-3.244482819 6 -1.944482819 2.7

-3.269665456 4 -1.169665456 3.9

-3.337256861 6 -2.037256861 2.7

-3.364011598 4 -1.864011598 0.5

-3.439911751 2 -1.939911751 4.5

-3.463800905 6 -1.963800905 0.5

-3.521512022 6 -2.221512022 2.7

-3.639457774 4 -1.739457774 0.1

-3.648056067 6 -2.148056067 0.5

-3.759413814 4 -2.059413814 2.3

-3.766777028 6 -1.066777028 1.3

-3.769540384 4 -2.269540384 0.5

-3.856281256 8 -1.956281256 0.1

-3.959735498 4 -1.059735498 1.1

-3.976262531 6 -1.276262531 1.3

-4.036441199 6 -0.936441199 0.9

-4.040536417 8 -2.140536417 0.1

-4.054736478 2 -0.954736478 0.9

1 Total Strategy Score (A), Thioester Total Score (B), Solubility Total Score (C), Segment Length Total Score (D)

78 -4.158250347 4 -1.858250347 -0.3

79 -4.429659404 4 -1.929659404 1.5

80 -4.442584209 6 -2.142584209 1.7

81 -4.473363638 4 -2.173363638 3.7

82 -4.612532796 6 -2.112532796 3.5

83 -4.613914566 4 -2.113914566 1.5

84 -4.72310334 6 -2.22310334 3.5

85 -4.796787958 6 -2.296787958 3.5

86 -4.834636758 4 -2.134636758 1.3

87 -4.834680001 8 -3.534680001 0.7

88 -4.851163791 6 -2.351163791 1.5

89 -4.899399676 2 -0.599399676 -2.3

90 -4.922537853 2 -1.022537853 2.1

91 -4.979231693 8 -2.079231693 1.1

92 -5.066733818 8 -3.166733818 0.1

93 -5.122083877 4 -0.822083877 -2.3

94 -5.15950786 8 -3.25950786 0.1

95 -5.160802369 4 -0.860802369 1.7

96 -5.163486854 8 -2.263486854 1.1

97 -5.236822958 6 -2.136822958 0.9

98 -5.443912397 6 -2.143912397 0.7

99 -5.443912397 6 -2.143912397 0.7

100 -5.51506172 6 -1.21506172 1.7

101 -5.575335151 4 -2.075335151 4.5

102 -5.662767517 4 -1.762767517 0.1

103 -5.685905695 4 -2.185905695 4.5

104 -5.709912375 4 -2.009912375 -1.7

105 -5.746504237 4 -1.046504237 1.3

106 -5.759590312 4 -2.259590312 4.5

107 -5.774881174 6 -1.874881174 0.1

108 -5.797439113 2 -2.097439113 2.3

109 -5.801880234 4 -1.101880234 1.3

110 -5.885451718 6 -1.985451718 0.1

111 -5.947388717 6 -2.047388717 2.1

112 -5.955041117 4 -0.855041117 0.9

113 -5.955989741 4 -1.255989741 1.3

114 -5.959136336 6 -2.059136336 0.1

115 -5.980501346 6 -3.080501346 -2.9

116 -5.99510847 2 -0.69510847 -1.3

117 -5.996985137 4 -1.896985137 -2.1

118 -6.03554881 4 -0.73554881 -1.3

119 -6.03554881 4 -0.73554881 -1.3

120 -6.057099843 2 -0.757099843 0.7

121 -6.075444244 4 -0.775444244 -1.3

122 -6.075444244 4 -0.775444244 -1.3

123 -6.131643878 6 -2.231643878 2.1

124 -6.189684255 8 -3.289684255 1.1

125 -6.41373563 4 -1.91373563 1.5

126 -6.454725897 4 -0.754725897 -1.7

127 -6.477864074 4 -1.177864074 2.7

128 -6.504151124 4 -2.004151124 -2.5

129 -6.677410098 6 -0.977410098 -1.7

130 -6.715443478 6 -2.415443478 1.7

131 -6.789347219 2 -0.689347219 -2.1

132 -6.798916652 6 -1.898916652 1.1

133 -6.798916652 6 -1.898916652 1.1

134 -6.81440721 6 -3.51440721 0.7

135 -6.830891001 4 -2.330891001 1.5

136 -6.858499004 2 -0.758499004 -0.1

137 -6.891405944 0 -0.591405944 -2.3

138 -7.003705834 2 -0.703705834 -0.3

139 -7.041302743 2 -0.741302743 -0.3

140 -7.059081831 2 -0.759081831 -0.3

141 -7.09667874 2 -0.79667874 -0.3

142 -7.157841279 6 -3.257841279 2.1

143 -7.207974379 4 -1.907974379 0.7

144 -7.224176135 4 -0.924176135 1.7

145 -7.224176135 4 -0.924176135 1.7

146 -7.310033104 2 -1.810033104 0.5

147 -7.423639607 4 -2.123639607 0.7

148 -7.479015603 4 -2.179015603 0.7

149 -7.590875031 6 -0.890875031 -0.7

150 -7.590875031 6 -0.890875031 -0.7

151 -7.655107656 4 -1.955107656 -1.7

152 -7.678245833 4 -2.378245833 2.7

153 -7.692805105 0 -0.592805105 -3.1

154 -7.767073676 4 -1.867073676 2.1

155 -7.789813174 6 -2.289813174 -1.5

156 -7.789813174 6 -2.289813174 -1.5

157 -7.822449672 4 -1.922449672 2.1

158 -7.867876999 6 -3.367876999 1.5

159 -7.877791857 6 -2.177791857 -1.7

160 -7.896361733 4 -0.996361733 -0.9

161 -8.058482236 2 -0.758482236 -1.3

162 -8.058482236 2 -0.758482236 -1.3

163 -8.075009269 4 -0.975009269 -1.1

164 -8.075009269 4 -0.975009269 -1.1

165 -8.111432265 2 -1.811432265 -0.3

166 -8.186978489 2 -0.886978489 2.7

167 -8.242354486 2 -0.942354486 2.7

168 -8.477109396 4 -1.977109396 1.5

169 -8.477109396 4 -1.977109396 1.5

170 -8.559032055 4 -0.859032055 0.3

171 -8.614408052 4 -0.914408052 0.3

172 -8.690600481 4 -0.990600481 -1.7

173 -8.754779202 6 -3.254779202 -1.5

174 -8.771262992 4 -2.071262992 -0.7

175 -8.79125679 6 -2.09125679 -0.7

176 -8.79125679 6 -2.09125679 -0.7

177 -8.792659207 2 -0.892659207 -1.9

178 -8.830679354 4 -3.330679354 2.5

179 -8.854568508 8 -3.354568508 -1.5

180 -9.021284591 0 -0.721284591 -0.3

181 -9.030225377 6 -3.130225377 -1.9

182 -9.076660587 0 -0.776660587 -0.3

183 -9.439911751 2 -1.939911751 2.5

184 -9.445456504 4 -1.945456504 0.5

185 -9.463800905 6 -1.963800905 -1.5

186 -9.463800905 6 -1.963800905 -1.5

187 -9.495287747 2 -1.995287747 2.5

188 -9.54901795 6 -3.24901795 -2.3

189 -9.556027048 4 -2.056027048 0.5

190 -9.565501741 4 -2.065501741 -1.5

191 -9.594058368 2 -0.894058368 -2.7

192 -9.629711666 4 -2.129711666 0.5

193 -9.648056067 6 -2.148056067 -1.5

194 -9.648056067 6 -2.148056067 -1.5

195 -9.667560467 2 -1.967560467 -1.7

196 -9.759413814 4 -2.059413814 0.3

197 -9.769540384 4 -2.269540384 -1.5

-9.814789811 4 -2.114789811 0.3

-9.820427008 6 -3.320427008 -0.5

-9.82491638 4 -2.32491638 -1.5

-9.843082558 6 -1.943082558 -1.9

-9.959735498 4 -1.059735498 -0.9

-10.02733772 6 -2.12733772 -1.9

-10.05473648 2 -0.954736478 -1.1

-10.11011248 2 -1.010112475 -1.1

-10.24685567 4 -1.946855665 -0.3

-10.34798543 4 -1.047985428 -1.3

-10.35742621 4 -2.057426209 -0.3

-10.4296594 4 -1.929659404 -0.5

-10.43111083 4 -2.131110827 -0.3

-10.46895963 2 -1.968959628 -2.5

-10.4850354 4 -1.985035401 -0.5

-10.61391457 4 -2.113914566 -0.5

-10.66929056 4 -2.169290563 -0.5

-10.83463676 4 -2.134636758 -0.7

-10.92253785 2 -1.022537853 0.1

-10.97791385 2 -1.07791385 0.1

-11.05353512 6 -3.153535121 -1.9

-11.14938459 4 -1.049384589 -2.1

-11.54836719 4 -2.248367187 -1.3

-11.79743911 2 -2.097439113 0.3

-11.85281511 2 -2.15281511 0.3

-12.34976635 4 -2.249766348 -2.1

-12.70080071 4 -3.200800708 -1.5

-13.50219987 4 -3.202199869 -2.3

E F G

Total Penalty for # of Ligations (E), Total Ala Junction Site Penalty (F), Segments (from N- to C-terminus) (G)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-2 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-2 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-2 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-2 0 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 0 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-4 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-4 -2 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-4 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-4 0 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-4 0 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRR (SEQ ID NO: 47)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALLANGVELRDNQLWPSEGLYLIYSQVLFKGQG (SEQ ID NO: 48)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -2 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHWANPQAEGQLQWLNRRANALL (SEQ ID NO: 49)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-6 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-6 -2 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

-8 -4 VRSSSRTPSDKPVAHW (SEQ ID NO: 50)

-8 -4 VRSSSRTPSDKPV (SEQ ID NO: 51)

H I

Segments (from N- to C-terminus) Segments (from N- to C-terminus)

FKGQG (SEQ ID NO: 58) AIKSPCQRETPEG (SEQ ID NO: 53)

CPSTHVLLTHTISRIAVSYQTKVNLLS AEAKPWYEPIYLGGVFQLEKGDRLSA AIKSPCQRETPEG (SEQ ID NO: 53) EINRPDYLDF (SEQ ID NO: 79)

ANALLANGVELRDNQLWPSEGLYLI AIKSPCQRETPEGAEAKPWYEPIYLG YSQVLFKGQGCPSTHVLLTHTISRIAV GVFQLEKGDRL S AEINRPD YLDF AES SYQTKVNLLS (SEQ ID NO: 59) GQVYFGIIAL (SEQ ID NO: 76)

ANALLANGVELRDNQLWPSEGLYLI AEAKPWYEPIYLGGVFQLEKGDRLSA YSQVLFKGQGCPSTHVLLTHTISRIAV EINRPDYLDFAESGQVYFGIIAL (SEQ SYQTKVNLLSAIKSPCQRETPEG (SEQ ID NO: 75)

ID NO: 60)

CPSTHVLLTHTISRIAVSYQTKVNLLS AIKSPCQRETPEG (SEQ ID NO: 80) (SEQ ID NO: 55)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRIAVSYQTKVNLLS FKGQG (SEQ ID NO: 54) (SEQ ID NO: 55)

ANALLANGVELRDNQLWPSEGLYLI CPSTHVLLTHTISRIAVSYQTKVNLLS YSQVLFKGQG (SEQ ID NO: 52) (SEQ ID NO: 55)

CPSTHVLLTHTISRIAVSYQTKVNLLS AEINRPD YLDF AESGQVYFGIIAL AIKSPCQRETPEGAEAKPWYEPIYLG (SEQ ID NO: 81)

GVFQLEKGDRLS (SEQ ID NO: 61)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRIAVSYQTKVNLLS FKGQG (SEQ ID NO: 54) AIKSPCQRETPEG (SEQ ID NO: 53)

ANALLANGVELRDNQLWPSEGLYLI AIKSPCQRETPEGAEAKPWYEPIYLG YSQVLFKGQGCPSTHVLLTHTISRIAV GVFQLEKGDRLS (SEQ ID NO: 77) SYQTKVNLLS (SEQ ID NO: 59)

ANGVELRDNQLWPSEGLYLIYSQVL AIKSPCQRETPEGAEAKPWYEPIYLG FKGQGCPSTHVLLTHTISRIAVSYQTK GVFQLEKGDRLSAEINRPD YLDF AES VNLLS (SEQ ID NO: 62) GQVYFGIIAL (SEQ ID NO: 76)

ANGVELRDNQLWPSEGLYLIYSQVL AEAKPWYEPIYLGGVFQLEKGDRLSA FKGQGCPSTHVLLTHTISRIAVSYQTK EINRPDYLDFAESGQVYFGIIAL (SEQ VNLLSAIKSPCQRETPEG (SEQ ID NO: ID NO: 75)

63)

ANALLANGVELRDNQLWPSEGLYLI AVSYQTK VNLLSAIKSPCQRETPEG YSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 82)

(SEQ ID NO: 64)

AHWANPQAEGQLQWLNRR (SEQ ID ANALLANGVELRDNQLWPSEGLYLI NO: 65) YSQVLFKGQG (SEQ ID NO: 52)

ANPQAEGQLQWLNRR (SEQ ID NO: ANALLANGVELRDNQLWPSEGLYLI 66) YSQVLFKGQG (SEQ ID NO: 52)

ANALLANGVELRDNQLWPSEGLYLI CPSTHVLLTHTISRI (SEQ ID NO: 68) YSQVLFKGQG (SEQ ID NO: 52)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRIAVSYQTKVNLLS FKGQG (SEQ ID NO: 54) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRLS (SEQ ID NO: 61)

ANALLANGVELRDNQLWPSEGLYLI AIKSPCQRETPEG (SEQ ID NO: 80)

YSQVLFKGQGCPSTHVLLTHTISRIAV SYQTKVNLLS (SEQ ID NO: 59)

CPSTHVLLTHTISRIAVSYQTKVNLLS AIKSPCQRETPEGAEAKPWYEPIYLG

(SEQ ID NO: 55) GVFQLEKGDRL S AEINRPD YLDF

(SEQ ID NO: 83)

ANPQAEGQLQWLNRRANALLANGV CPSTHVLLTHTISRIAVSYQTKVNLLS ELRDNQLWPSEGLYLIYSQVLFKGQ (SEQ ID NO: 55)

G (SEQ ID NO: 56)

ANGVELRDNQLWPSEGLYLIYSQVL AIKSPCQRETPEGAEAKPWYEPIYLG FKGQGCPSTHVLLTHTISRIAVSYQTK GVFQLEKGDRLS (SEQ ID NO: 77) VNLLS (SEQ ID NO: 62)

ANALLANGVELRDNQLWPSEGLYLI CPSTHVLLTHTISRIAVSYQTKVNLLS YSQVLFKGQG (SEQ ID NO: 52) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPD YLDF

(SEQ ID NO: 69)

ANALLANGVELRDNQLWPSEGLYLI AVSYQTKVNLLSAIKSPCQRETPEGA YSQVLFKGQGCPSTHVLLTHTISRI EAKPWYEPIYLGGVFQLEKGDRLSAE

(SEQ ID NO: 64) INRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 84)

ANGVELRDNQLWPSEGLYLIYSQVL AVSYQTKVNLLSAIKSPCQRETPEG FKGQGCPSTHVLLTHTISRI (SEQ ID (SEQ ID NO: 82)

NO: 67)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEG

(SEQ ID NO: 82)

G (SEQ ID NO: 56)

ANALLANGVELRDNQLWPSEGLYLI AVSYQTKVNLLSAIKSPCQRETPEGA YSQVLFKGQGCPSTHVLLTHTISRI EAKPWYEPIYLGGVFQLEKGDRLS

(SEQ ID NO: 64) (SEQ ID NO: 85)

AHWANPQAEGQLQWLNRRANALL CPSTHVLLTHTISRIAVSYQTKVNLLS ANGVELRDNQLWPSEGLYLIYSQVL (SEQ ID NO: 55)

FKGQG (SEQ ID NO: 58)

ANPQAEGQLQWLNRRANALLANGV CPSTHVLLTHTISRIAVSYQTKVNLLS ELRDNQLWPSEGLYLIYSQVLFKGQ AIKSPCQRETPEG (SEQ ID NO: 53) G (SEQ ID NO: 56)

ANGVELRDNQLWPSEGLYLIYSQVL AIKSPCQRETPEG (SEQ ID NO: 80) FKGQGCPSTHVLLTHTISRIAVSYQTK VNLLS (SEQ ID NO: 62)

ANALLANGVELRDNQLWPSEGLYLI AIKSPCQRETPEGAEAKPWYEPIYLG

YSQVLFKGQGCPSTHVLLTHTISRIAV GVFQLEKGDRL S AEINRPD YLDF SYQTKVNLLS (SEQ ID NO: 59) (SEQ ID NO: 83)

CPSTHVLLTHTISRIAVSYQTKVNLLS AESGQVYFGIIAL (SEQ ID NO: 86) AIKSPCQRETPEGAEAKPWYEPIYLG GVFQLEKGDRL S AEINRPD YLDF

(SEQ ID NO: 69)

ANGVELRDNQLWPSEGLYLIYSQVL AVSYQTKVNLLSAIKSPCQRETPEGA FKGQGCPSTHVLLTHTISRI (SEQ ID EAKPWYEPIYLGGVFQLEKGDRLSAE NO: 67) INRPD YLDF AESGQVYFGIIAL (SEQ

ID NO: 84)

ANPQAEGQLQWLNRRANALL (SEQ ANGVELRDNQLWPSEGLYLIYSQVL ID NO: 70) FKGQG (SEQ ID NO: 54)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRI (SEQ ID NO: 68) FKGQG (SEQ ID NO: 54)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEGA

EAKPWYEPIYLGGVFQLEKGDRLSAE INRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 84)

G (SEQ ID NO: 56)

FKGQG (SEQ ID NO: 58)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQG (SEQ ID NO: 54)

AHWANPQAEGQLQWLNRRANALL CPSTHVLLTHTISRIAVSYQTKVNLLS ANGVELRDNQLWPSEGLYLIYSQVL AIKSPCQRETPEG (SEQ ID NO: 53) FKGQG (SEQ ID NO: 58)

ANPQAEGQLQWLNRRANALLANGV CPSTHVLLTHTISRIAVSYQTKVNLLS ELRDNQLWPSEGLYLIYSQVLFKGQ AIKSPCQRETPEGAEAKPWYEPIYLG

G (SEQ ID NO: 56) GVFQLEKGDRLS (SEQ ID NO: 61)

ANGVELRDNQLWPSEGLYLIYSQVL AVSYQTKVNLLSAIKSPCQRETPEGA FKGQGCPSTHVLLTHTISRI (SEQ ID EAKPWYEPIYLGGVFQLEKGDRLS NO: 67) (SEQ ID NO: 85)

ANALLANGVELRDNQLWPSEGLYLI AEAKPWYEPIYLGGVFQLEKGDRLS YSQVLFKGQGCPSTHVLLTHTISRIAV (SEQ ID NO: 78)

SYQTKVNLLSAIKSPCQRETPEG (SEQ

ID NO: 60)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEGA

EAKPWYEPIYLGGVFQLEKGDRLS

(SEQ ID NO: 85)

ANALLANGVELRDNQLWPSEGLYLI AEAKPWYEPIYLGGVFQLEKGDRLSA YSQVLFKGQGCPSTHVLLTHTISRIAV EINRPDYLDF (SEQ ID NO: 79) SYQTKVNLLSAIKSPCQRETPEG (SEQ

ID NO: 60)

ANGVELRDNQLWPSEGLYLIYSQVL AIKSPCQRETPEGAEAKPWYEPIYLG FKGQGCPSTHVLLTHTISRIAVSYQTK GVFQLEKGDRL S AEINRPD YLDF VNLLS (SEQ ID NO: 62) (SEQ ID NO: 83)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRIAVSYQTK VNLLS FKGQG (SEQ ID NO: 54) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPD YLDF

(SEQ ID NO: 69)

CPSTHVLLTHTISRIAVSYQTK VNLLS AIKSPCQRETPEG (SEQ ID NO: 80) (SEQ ID NO: 55)

ANALLANGVELRDNQLWPSEGLYLI AVSYQTKVNLLS (SEQ ID NO: 87) YSQVLFKGQGCPSTHVLLTHTISRI

(SEQ ID NO: 64)

AHWANPQAEGQLQWLNRRANALL CPSTHVLLTHTISRIAVSYQTK VNLLS ANGVELRDNQLWPSEGLYLIYSQVL (SEQ ID NO: 55)

FKGQG (SEQ ID NO: 58)

CPSTHVLLTHTISRIAVSYQTK VNLLS AIKSPCQRETPEG (SEQ ID NO: 80) (SEQ ID NO: 55)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRI (SEQ ID NO: 68) FKGQG (SEQ ID NO: 54)

AHWANPQAEGQLQWLNRRANALL CPSTHVLLTHTISRIAVSYQTK VNLLS ANGVELRDNQLWPSEGLYLIYSQVL AIKSPCQRETPEGAEAKPWYEPIYLG FKGQG (SEQ ID NO: 58) GVFQLEKGDRLS (SEQ ID NO: 61)

ANGVELRDNQLWPSEGLYLIYSQVL AEAKPWYEPIYLGGVFQLEKGDRLS FKGQGCPSTHVLLTHTISRIAVSYQTK (SEQ ID NO: 78)

VNLLSAIKSPCQRETPEG (SEQ ID NO:

63)

ANPQAEGQLQWLNRRANALLANGV CPSTHVLLTHTISRIAVSYQTK VNLLS ELRDNQLWPSEGLYLIYSQVLFKGQ (SEQ ID NO: 55)

G (SEQ ID NO: 56)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRI (SEQ ID NO: 68) FKGQG (SEQ ID NO: 54)

(SEQ ID NO: 64)

ANGVELRDNQLWPSEGLYLIYSQVL AEAKPWYEPIYLGGVFQLEKGDRLSA FKGQGCPSTHVLLTHTISRIAVSYQTK EINRPDYLDF (SEQ ID NO: 79) VNLLSAIKSPCQRETPEG (SEQ ID NO:

63)

(SEQ ID NO: 64)

ANALLANGVELRDNQLWPSEGLYLI AVSYQTK VNLLSAIKSPCQRETPEG

YSQVLFKGQGCPSTHVLLTHTISRI (SEQ ID NO: 82)

(SEQ ID NO: 64)

(SEQ ID NO: 64) INRPDYLDF (SEQ ID NO: 88)

GVFQLEKGDRLS (SEQ ID NO: 61)

ANPQAEGQLQWLNRRANALLANGV AIKSPCQRETPEGAEAKPWYEPIYLG ELRDNQLWPSEGLYLIYSQVLFKGQ GVFQLEKGDRL S AEINRPD YLDF AES GCPSTHVLLTHTISRIAVSYQTKVNLL GQVYFGIIAL (SEQ ID NO: 76) S (SEQ ID NO: 72)

ANGVELRDNQLWPSEGLYLIYSQVL AVSYQTKVNLLS (SEQ ID NO: 87) FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 67)

ANALLANGVELRDNQLWPSEGLYLI AIKSPCQRETPEG (SEQ ID NO: 80) YSQVLFKGQGCPSTHVLLTHTISRIAV SYQTKVNLLS (SEQ ID NO: 59)

CPSTHVLLTHTISRIAVSYQTKVNLLS AIKSPCQRETPEGAEAKPWYEPIYLG

(SEQ ID NO: 55) GVFQLEKGDRLS (SEQ ID NO: 77)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLS (SEQ ID NO: 87)

CPSTHVLLTHTISRIAVSYQTKVNLLS AEAKPWYEPIYLGGVFQLEKGDRLS AIKSPCQRETPEG (SEQ ID NO: 53) (SEQ ID NO: 78)

ANPQAEGQLQWLNRRANALLANGV CPSTHVLLTHTISRI (SEQ ID NO: 68) ELRDNQLWPSEGLYLIYSQVLFKGQ

G (SEQ ID NO: 56)

FKGQG (SEQ ID NO: 58)

(SEQ ID NO: 64)

NO: 67)

ANPQAEGQLQWLNRRANALLANGV AIKSPCQRETPEGAEAKPWYEPIYLG ELRDNQLWPSEGLYLIYSQVLFKGQ GVFQLEKGDRLS (SEQ ID NO: 77) GCPSTHVLLTHTISRIAVSYQTKVNLL

S (SEQ ID NO: 72)

NO: 67)

ANPQAEGQLQWLNRRANALLANGV CPSTHVLLTHTISRIAVSYQTK VNLLS ELRDNQLWPSEGLYLIYSQVLFKGQ AIKSPCQRETPEGAEAKPWYEPIYLG

G (SEQ ID NO: 56) GVFQLEKGDRL S AEINRPD YLDF

(SEQ ID NO: 69)

NO: 67)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEG

(SEQ ID NO: 82)

ANGVELRDNQLWPSEGLYLIYSQVL AVSYQTKVNLLSAIKSPCQRETPEGA FKGQGCPSTHVLLTHTISRI (SEQ ID EAKPWYEPIYLGGVFQLEKGDRLSAE NO: 67) INRPDYLDF (SEQ ID NO: 88)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQG (SEQ ID NO: 54)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLS (SEQ ID NO: 87)

AHWANPQAEGQLQWLNRRANALL CPSTHVLLTHTISRI (SEQ ID NO: 68) ANGVELRDNQLWPSEGLYLIYSQVL FKGQG (SEQ ID NO: 58)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRI (SEQ ID NO: 68) FKGQG (SEQ ID NO: 54)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEG

(SEQ ID NO: 82)

CPSTHVLLTHTISRIAVSYQTK VNLLS AEINRPD YLDF (SEQ ID NO: 89) AIKSPCQRETPEGAEAKPWYEPIYLG GVFQLEKGDRLS (SEQ ID NO: 61)

G (SEQ ID NO: 56)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEGA

EAKPWYEPIYLGGVFQLEKGDRLSAE INRPDYLDF (SEQ ID NO: 88)

AHWANPQAEGQLQWLNRR (SEQ ID ANALLANGVELRDNQLWPSEGLYLI NO: 65) YSQVLFKGQGCPSTHVLLTHTISRIAV

SYQTK VNLLS (SEQ ID NO: 59)

ANPQAEGQLQWLNRR (SEQ ID NO: ANALLANGVELRDNQLWPSEGLYLI 66) YSQVLFKGQGCPSTHVLLTHTISRIAV

SYQTK VNLLS (SEQ ID NO: 59)

ANPQAEGQLQWLNRRANALLANGV AVSYQTKVNLLSAIKSPCQRETPEG ELRDNQLWPSEGLYLIYSQVLFKGQ (SEQ ID NO: 82)

GCPSTHVLLTHTISRI (SEQ ID NO: 73)

SYQTK VNLL S AIKSPCQRETPEG (SEQ ID NO: 60)

ANALLANGVELRDNQLWPSEGLYLI AIKSPCQRETPEGAEAKPWYEPIYLG YSQVLFKGQGCPSTHVLLTHTISRIAV GVFQLEKGDRLS (SEQ ID NO: 77) SYQTK VNLLS (SEQ ID NO: 59)

G (SEQ ID NO: 56)

ANPQAEGQLQWLNRRANALLANGV AIKSPCQRETPEG (SEQ ID NO: 80) ELRDNQLWPSEGLYLIYSQVLFKGQ GCPSTHVLLTHTISRIAVSYQTKVNLL

S (SEQ ID NO: 72)

NO: 67)

AHWANPQAEGQLQWLNRRANALL CPSTHVLLTHTISRIAVSYQTK VNLLS ANGVELRDNQLWPSEGLYLIYSQVL AIKSPCQRETPEGAEAKPWYEPIYLG FKGQG (SEQ ID NO: 58) GVFQLEKGDRL S AEINRPD YLDF

(SEQ ID NO: 69)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLS (SEQ ID NO: 87)

(SEQ ID NO: 64)

SYQTK VNLLS (SEQ ID NO: 59)

GVFQLEKGDRLS (SEQ ID NO: 61)

ANGVELRDNQLWPSEGLYLIYSQVL CPSTHVLLTHTISRI (SEQ ID NO: 68) FKGQG (SEQ ID NO: 54)

AHWANPQAEGQLQWLNRRANALL AVSYQTKVNLLSAIKSPCQRETPEG ANGVELRDNQLWPSEGLYLIYSQVL (SEQ ID NO: 82)

FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 74)

ANPQAEGQLQWLNRRANALLANGV AVSYQTKVNLLSAIKSPCQRETPEGA ELRDNQLWPSEGLYLIYSQVLFKGQ EAKPWYEPIYLGGVFQLEKGDRLSAE GCPSTHVLLTHTISRI (SEQ ID NO: 73) INRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 84)

ANPQAEGQLQWLNRRANALL (SEQ ANGVELRDNQLWPSEGLYLIYSQVL ID NO: 70) FKGQGCPSTHVLLTHTISRIAVSYQTK

VNLLS (SEQ ID NO: 62)

VNLL S AIKSPCQRETPEG (SEQ ID NO: 63)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRIAVSYQTK

VNLLS (SEQ ID NO: 62)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRIAVSYQTK

VNLL S AIKSPCQRETPEG (SEQ ID NO: 63)

AHWANPQAEGQLQWLNRR (SEQ ID ANALLANGVELRDNQLWPSEGLYLI NO: 65) YSQVLFKGQGCPSTHVLLTHTISRI

(SEQ ID NO: 64)

ANPQAEGQLQWLNRR (SEQ ID NO: ANALLANGVELRDNQLWPSEGLYLI 66) YSQVLFKGQGCPSTHVLLTHTISRI

(SEQ ID NO: 64)

ANPQAEGQLQWLNRRANALLANGV AVSYQTKVNLLSAIKSPCQRETPEGA ELRDNQLWPSEGLYLIYSQVLFKGQ EAKPWYEPIYLGGVFQLEKGDRLS GCPSTHVLLTHTISRI (SEQ ID NO: 73) (SEQ ID NO: 85)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQG (SEQ ID NO: 54)

SYQTK VNLLS (SEQ ID NO: 59)

ANPQAEGQLQWLNRRANALLANGV AIKSPCQRETPEGAEAKPWYEPIYLG ELRDNQLWPSEGLYLIYSQVLFKGQ GVFQLEKGDRL S AEINRPD YLDF GCPSTHVLLTHTISRIAVSYQTKVNLL (SEQ ID NO: 83)

S (SEQ ID NO: 72)

ANGVELRDNQLWPSEGLYLIYSQVL AVSYQTKVNLLS (SEQ ID NO: 87)

FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 67)

AHWANPQAEGQLQWLNRRANALL AVSYQTKVNLLSAIKSPCQRETPEGA ANGVELRDNQLWPSEGLYLIYSQVL EAKPWYEPIYLGGVFQLEKGDRLSAE FKGQGCPSTHVLLTHTISRI (SEQ ID INRPDYLDFAESGQVYFGIIAL (SEQ NO: 74) ID NO: 84)

VNLLS (SEQ ID NO: 62)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRIAVSYQTK

VNLLS (SEQ ID NO: 62)

(SEQ ID NO: 64) (SEQ ID NO: 85)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLS (SEQ ID NO: 87)

G (SEQ ID NO: 56)

(SEQ ID NO: 64)

AHWANPQAEGQLQWLNRRANALL AVSYQTKVNLLSAIKSPCQRETPEGA ANGVELRDNQLWPSEGLYLIYSQVL EAKPWYEPIYLGGVFQLEKGDRLS FKGQGCPSTHVLLTHTISRI (SEQ ID (SEQ ID NO: 85)

NO: 74)

ANPQAEGQLQWLNRRANALL (SEQ ANGVELRDNQLWPSEGLYLIYSQVL ID NO: 70) FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 67)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 67)

(SEQ ID NO: 64)

(SEQ ID NO: 64)

VNLLS (SEQ ID NO: 62)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRIAVSYQTK

VNLLS (SEQ ID NO: 62)

G (SEQ ID NO: 56) GVFQLEKGDRLS (SEQ ID NO: 61)

G (SEQ ID NO: 56)

SYQTK VNLLS (SEQ ID NO: 59)

ANPQAEGQLQWLNRRANALLANGV AVSYQTKVNLLS (SEQ ID NO: 87) ELRDNQLWPSEGLYLIYSQVLFKGQ GCPSTHVLLTHTISRI (SEQ ID NO: 73)

SYQTKVNLLSAIKSPCQRETPEG (SEQ

ID NO: 60)

NO: 67)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEGA

EAKPWYEPIYLGGVFQLEKGDRLS

(SEQ ID NO: 85)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 67)

GCPSTHVLLTHTISRI (SEQ ID NO: 73)

SYQTKVNLLSAIKSPCQRETPEG (SEQ

ID NO: 60)

S YQTKVNLL S AIKSPCQRETPEG (SEQ ID NO: 60)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 67)

AHWANPQAEGQLQWLNRRANALL CPSTHVLLTHTISRIAVSYQTKVNLLS ANGVELRDNQLWPSEGLYLIYSQVL AIKSPCQRETPEGAEAKPWYEPIYLG FKGQG (SEQ ID NO: 58) GVFQLEKGDRLS (SEQ ID NO: 61)

AHWANPQAEGQLQWLNRRANALL AVSYQTKVNLLS (SEQ ID NO: 87) ANGVELRDNQLWPSEGLYLIYSQVL FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 74)

ANPQAEGQLQWLNRRANALLANGV AVSYQTK VNLLSAIKSPCQRETPEG ELRDNQLWPSEGLYLIYSQVLFKGQ (SEQ ID NO: 82)

GCPSTHVLLTHTISRI (SEQ ID NO: 73)

S YQTKVNLL S AIKSPCQRETPEG (SEQ ID NO: 60)

ANPQAEGQLQWLNRRANALLANGV AVSYQTK VNLLSAIKSPCQRETPEGA ELRDNQLWPSEGLYLIYSQVLFKGQ EAKPWYEPIYLGGVFQLEKGDRLSAE GCPSTHVLLTHTISRI (SEQ ID NO: 73) INRPDYLDF (SEQ ID NO: 88)

VNLLS (SEQ ID NO: 62)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRIAVSYQTK

VNLLS (SEQ ID NO: 62)

VNLLSAIKSPCQRETPEG (SEQ ID NO:

63)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQG (SEQ ID NO: 54)

ANPQAEGQLQWLNRRANALLANGV AIKSPCQRETPEG (SEQ ID NO: 80) ELRDNQLWPSEGLYLIYSQVLFKGQ GCPSTHVLLTHTISRIAVSYQTKVNLL

S (SEQ ID NO: 72)

(SEQ ID NO: 64)

S (SEQ ID NO: 72)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQG (SEQ ID NO: 54)

FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 74)

VNLL S AIKSPCQRETPEG (SEQ ID NO: 63)

FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 74)

AHWANPQAEGQLQWLNRRANALL AVSYQTKVNLLSAIKSPCQRETPEGA ANGVELRDNQLWPSEGLYLIYSQVL EAKPWYEPIYLGGVFQLEKGDRLSAE FKGQGCPSTHVLLTHTISRI (SEQ ID INRPDYLDF (SEQ ID NO: 88) NO: 74)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRIAVSYQTK

VNLL S AIKSPCQRETPEG (SEQ ID NO: 63)

VNLL S AIKSPCQRETPEG (SEQ ID NO: 63)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRIAVSYQTK

VNLL S AIKSPCQRETPEG (SEQ ID NO: 63)

(SEQ ID NO: 64)

NO: 67)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 67)

S (SEQ ID NO: 72)

NO: 74)

NO: 67)

AHWANPQAEGQLQWLNRRANALL ANGVELRDNQLWPSEGLYLIYSQVL

(SEQ ID NO: 71) FKGQGCPSTHVLLTHTISRI (SEQ ID

NO: 67)

NO: 74)

J K

Segments (from N- to C-terminus) Segments (from N- to C-terminus)

AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AIKSPCQRETPEGAEAKPWYEPIYLG GVFQLEKGDRL S AEINRPD YLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AIKSPCQRETPEGAEAKPWYEPIYLG AEINRPD YLDF AESGQVYFGIIAL GVFQLEKGDRL S (SEQ ID NO: 77) (SEQ ID NO: 81)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AEAKPWYEPIYLGGVFQLEKGDRLSA AESGQVYFGIIAL (SEQ ID NO: 86) EINRPDYLDF (SEQ ID NO: 79)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 75)

AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AESGQVYFGIIAL (SEQ ID NO: 86)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AIKSPCQRETPEGAEAKPWYEPIYLG AESGQVYFGIIAL (SEQ ID NO: 86) GVFQLEKGDRL S AEINRPD YLDF

(SEQ ID NO: 83)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AEINRPDYLDFAESGQVYFGIIAL

(SEQ ID NO: 81)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

CPSTHVLLTHTISRIAVSYQTKVNLLS AEAKPWYEPIYLGGVFQLEKGDRLSA AIKSPCQRETPEG (SEQ ID NO: 53) EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AVSYQTKVNLLSAIKSPCQRETPEG AEAKPWYEPIYLGGVFQLEKGDRLSA

(SEQ ID NO: 82) EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AEINRPDYLDFAESGQVYFGIIAL

(SEQ ID NO: 81)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AESGQVYFGIIAL (SEQ ID NO: 86)

AEINRPDYLDFAESGQVYFGIIAL

(SEQ ID NO: 81)

AESGQVYFGIIAL (SEQ ID NO: 86)

AVSYQTKVNLLSAIKSPCQRETPEGA E AKP WYEPIYLGG VFQLEKGDRL S AE INRPD YLDF AESGQVYFGIIAL (SEQ

ID NO: 84)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AIKSPCQRETPEGAEAKPWYEPIYLG AEINRPDYLDFAESGQVYFGIIAL GVFQLEKGDRL S (SEQ ID NO: 77) (SEQ ID NO: 81)

AEINRPDYLDFAESGQVYFGIIAL

(SEQ ID NO: 81)

AVSYQTKVNLLSAIKSPCQRETPEGA AEINRPD YLDF AESGQVYFGIIAL EAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 81)

(SEQ ID NO: 85)

(SEQ ID NO: 83)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AESGQVYFGIIAL (SEQ ID NO: 86)

CPSTHVLLTHTISRIAVSYQTKVNLLS AEAKPWYEPIYLGGVFQLEKGDRLSA AIKSPCQRETPEG (SEQ ID NO: 53) EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AVSYQTKVNLLSAIKSPCQRETPEG AEAKPWYEPIYLGGVFQLEKGDRLSA

(SEQ ID NO: 82) EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDFAESGQVYFGIIAL (SEQ ID NO: 75)

ID NO: 75)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AESGQVYFGIIAL (SEQ ID NO: 86)

AESGQVYFGIIAL (SEQ ID NO: 86)

CPSTHVLLTHTISRIAVSYQTKVNLLS AIKSPCQRETPEGAEAKPWYEPIYLG

(SEQ ID NO: 55) GVFQLEKGDRLSAEINRPD YLDF AES

GQVYFGIIAL (SEQ ID NO: 76)

CPSTHVLLTHTISRIAVSYQTKVNLLS AIKSPCQRETPEGAEAKPWYEPIYLG

(SEQ ID NO: 55) GVFQLEKGDRLSAEINRPD YLDF AES

GQVYFGIIAL (SEQ ID NO: 76)

AESGQVYFGIIAL (SEQ ID NO: 86)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AVSYQTKVNLLS (SEQ ID NO: 87) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRLSAEINRPD YLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AVSYQTKVNLLSAIKSPCQRETPEGA EAKPWYEPIYLGGVFQLEKGDRLSAE INRPD YLDF AESGQVYFGIIAL (SEQ

ID NO: 84)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

(SEQ ID NO: 83)

(SEQ ID NO: 85)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AESGQVYFGIIAL (SEQ ID NO: 86)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AVSYQTKVNLLSAIKSPCQRETPEGA AESGQVYFGIIAL (SEQ ID NO: 86) EAKPWYEPIYLGGVFQLEKGDRLSAE INRPD YLDF (SEQ ID NO: 88)

AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPD YLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AEINRPD YLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AVSYQTKVNLLSAIKSPCQRETPEG AEAKPWYEPIYLGGVFQLEKGDRLSA

(SEQ ID NO: 82) EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

(SEQ ID NO: 83)

GVFQLEKGDRL S (SEQ ID NO: 61)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AESGQVYFGIIAL (SEQ ID NO: 86)

CPSTHVLLTHTISRIAVSYQTKVNLLS AIKSPCQRETPEGAEAKPWYEPIYLG

(SEQ ID NO: 55) GVFQLEKGDRLSAEINRPD YLDF AES

GQVYFGIIAL (SEQ ID NO: 76)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPD YLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AESGQVYFGIIAL (SEQ ID NO: 86)

CPSTHVLLTHTISRIAVSYQTKVNLLS AIKSPCQRETPEGAEAKPWYEPIYLG

(SEQ ID NO: 55) GVFQLEKGDRLSAEINRPD YLDF AES

GQVYFGIIAL (SEQ ID NO: 76)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AVSYQTKVNLLSAIKSPCQRETPEG AEAKPWYEPIYLGGVFQLEKGDRLSA

(SEQ ID NO: 82) EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AVSYQTKVNLLS (SEQ ID NO: 87) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRLSAEINRPD YLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AESGQVYFGIIAL (SEQ ID NO: 86)

ID NO: 84)

AESGQVYFGIIAL (SEQ ID NO: 86)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AEINRPD YLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AVSYQTKVNLLSAIKSPCQRETPEGA AEINRPD YLDF AESGQVYFGIIAL EAKP WYEPIYLGG VFQLEKGDRLS (SEQ ID NO: 81)

(SEQ ID NO: 85)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AESGQVYFGIIAL (SEQ ID NO: 86)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AIKSPCQRETPEGAEAKPWYEPIYLG AESGQVYFGIIAL (SEQ ID NO: 86) GVFQLEKGDRL S AEINRPD YLDF

(SEQ ID NO: 83)

ID NO: 84)

AIKSPCQRETPEGAEAKPWYEPIYLG AEINRPDYLDF AESGQVYFGIIAL G VFQLEKGDRL S (SEQ ID NO: 77) (SEQ ID NO: 81)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AVSYQTKVNLLSAIKSPCQRETPEGA AESGQVYFGIIAL (SEQ ID NO: 86) E AKPWYEPIYLGG VFQLEKGDRL S AE INRPDYLDF (SEQ ID NO: 88)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AIKSPCQRETPEGAEAKPWYEPIYLG GVFQLEKGDRL S AEINRPDYLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AEAKPWYEPIYLGGVFQLEKGDRLSA EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AVSYQTKVNLLSAIKSPCQRETPEGA AEINRPDYLDF AESGQVYFGIIAL EAKPWYEPIYLGGVFQLEKGDRLS (SEQ ID NO: 81)

(SEQ ID NO: 85)

AVSYQTKVNLLSAIKSPCQRETPEG AEAKPWYEPIYLGGVFQLEKGDRLSA

(SEQ ID NO: 82) EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AVSYQTKVNLLSAIKSPCQRETPEG AEAKPWYEPIYLGGVFQLEKGDRLSA

(SEQ ID NO: 82) EINRPDYLDFAESGQVYFGIIAL (SEQ

ID NO: 75)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

CPSTHVLLTHTISRIAVSYQTKVNLLS AEINRPDYLDF AESGQVYFGIIAL AIKSPCQRETPEGAEAKPWYEPIYLG (SEQ ID NO: 81)

GVFQLEKGDRL S (SEQ ID NO: 61)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPD YLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPD YLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AESGQVYFGIIAL (SEQ ID NO: 86)

(SEQ ID NO: 83)

(SEQ ID NO: 69)

AEINRPD YLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

(SEQ ID NO: 83)

AVSYQTKVNLLS (SEQ ID NO: 87) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPD YLDF AES GQVYFGIIAL (SEQ ID NO: 76)

ID NO: 84)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEGA

EAKPWYEPIYLGGVFQLEKGDRLSAE INRPD YLDF AESGQVYFGIIAL (SEQ ID NO: 84)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEGA

EAKPWYEPIYLGGVFQLEKGDRLSAE INRPD YLDF AESGQVYFGIIAL (SEQ ID NO: 84)

AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 81)

AVSYQTKVNLLSAIKSPCQRETPEG AEAKPWYEPIYLGGVFQLEKGDRLSA

(SEQ ID NO: 82) EINRPD YLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

AVSYQTKVNLLSAIKSPCQRETPEG AEAKPWYEPIYLGGVFQLEKGDRLSA

(SEQ ID NO: 82) EINRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 75)

(SEQ ID NO: 85)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AVSYQTKVNLLS (SEQ ID NO: 87) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPDYLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AVSYQTKVNLLSAIKSPCQRETPEGA AESGQVYFGIIAL (SEQ ID NO: 86) EAKPWYEPIYLGGVFQLEKGDRLSAE INRPDYLDF (SEQ ID NO: 88)

AIKSPCQRETPEGAEAKPWYEPIYLG AESGQVYFGIIAL (SEQ ID NO: 86) GVFQLEKGDRL S AEINRPDYLDF

(SEQ ID NO: 83)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AVSYQTKVNLLSAIKSPCQRETPEGA EAKPWYEPIYLGGVFQLEKGDRLSAE INRPDYLDF AESGQVYFGIIAL (SEQ

ID NO: 84)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

ID NO: 84)

(SEQ ID NO: 85)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

(SEQ ID NO: 85)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AIKSPCQRETPEGAEAKPWYEPIYLG AEINRPDYLDF AESGQVYFGIIAL GVFQLEKGDRLS (SEQ ID NO: 77) (SEQ ID NO: 81)

AVSYQTKVNLLSAIKSPCQRETPEGA AESGQVYFGIIAL (SEQ ID NO: 86) E AKP WYEPIYLGG VFQLEKGDRL S AE INRPDYLDF (SEQ ID NO: 88)

AESGQVYFGIIAL (SEQ ID NO: 86)

(SEQ ID NO: 83)

CPSTHVLLTHTISRIAVSYQTKVNLLS AESGQVYFGIIAL (SEQ ID NO: 86) AIKSPCQRETPEGAEAKPWYEPIYLG GVFQLEKGDRL S AEINRPDYLDF

(SEQ ID NO: 69)

(SEQ ID NO: 83)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

(SEQ ID NO: 69)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AVSYQTKVNLLS (SEQ ID NO: 87) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPDYLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AVSYQTKVNLLS (SEQ ID NO: 87) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPDYLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AEAKPWYEPIYLGGVFQLEKGDRLSA AESGQVYFGIIAL (SEQ ID NO: 86) EINRPDYLDF (SEQ ID NO: 79)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEGA

EAKPWYEPIYLGGVFQLEKGDRLSAE INRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 84)

CPSTHVLLTHTISRI (SEQ ID NO: 68) AVSYQTKVNLLSAIKSPCQRETPEGA

EAKPWYEPIYLGGVFQLEKGDRLSAE INRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 84)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AESGQVYFGIIAL (SEQ ID NO: 86)

AEAKPWYEPIYLGGVFQLEKGDRLS AEINRPDYLDF AESGQVYFGIIAL

(SEQ ID NO: 78) (SEQ ID NO: 81)

AVSYQTKVNLLSAIKSPCQRETPEGA AESGQVYFGIIAL (SEQ ID NO: 86) EAKPWYEPIYLGG VFQLEKGDRL S AE INRPDYLDF (SEQ ID NO: 88)

AVSYQTKVNLLSAIKSPCQRETPEGA AESGQVYFGIIAL (SEQ ID NO: 86) EAKPWYEPIYLGG VFQLEKGDRLSAE INRPDYLDF (SEQ ID NO: 88)

AVSYQTKVNLLS (SEQ ID NO: 87) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPDYLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AVSYQTKVNLLS (SEQ ID NO: 87) AIKSPCQRETPEGAEAKPWYEPIYLG

GVFQLEKGDRL S AEINRPDYLDF AES GQVYFGIIAL (SEQ ID NO: 76)

AEINRPDYLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

AIKSPCQRETPEG (SEQ ID NO: 80) AEAKPWYEPIYLGGVFQLEKGDRLSA

EINRPDYLDF AESGQVYFGIIAL (SEQ ID NO: 75)

(SEQ ID NO: 83)

AVSYQTKVNLLSAIKSPCQRETPEGA AESGQVYFGIIAL (SEQ ID NO: 86) EAKPWYEPIYLGG VFQLEKGDRLSAE INRPDYLDF (SEQ ID NO: 88)

(SEQ ID NO: 83)

AEINRPD YLDF (SEQ ID NO: 89) AESGQVYFGIIAL (SEQ ID NO: 86)

Claims

1. A computer system for determining a chemical protein synthesis design for a protein sequence, the computer system comprising:

one or more processors; and

one or more computer readable hardware storage devices, wherein the one or more computer readable hardware storage devices comprise computer executable instructions executable by at least one of the one or more processors to cause the computer system to:

receive a data structure comprising the protein sequence;

identify a list of one or more possible peptide segments from the protein sequence;

generate a list of viable peptide segments from the one or more possible peptide segments;

evaluate one or more viable peptide segments from the list of viable peptide segments;

identify one or more potential ligation strategies for synthesizing the protein sequence;

determine an average peptide segment length for the one or more potential ligation strategies; and

identify an ideal ligation strategy based on the evaluated one or more viable peptide segments within the one or more potential ligation strategies and the average peptide segment length of the one or more potential ligation strategies.
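By way of illustration only, the enumeration of potential ligation strategies and the determination of an average peptide segment length recited in claim 1 may be sketched in Python as follows. The names `enumerate_strategies`, `average_segment_length`, and `toy_is_viable`, the toy length bounds, and the example sequence are assumptions introduced for this sketch and do not appear in the disclosure. The exhaustive recursion is one possible approach and is not asserted to be the approach used by the claimed system.

```python
from typing import Callable, Iterator, List


def enumerate_strategies(
    sequence: str,
    is_viable: Callable[[str, int, int], bool],
) -> Iterator[List[str]]:
    """Yield every split of `sequence` into consecutive viable segments."""
    n = len(sequence)

    def extend(start: int) -> Iterator[List[str]]:
        if start == n:  # the whole sequence has been covered
            yield []
            return
        for end in range(start + 1, n + 1):
            if is_viable(sequence, start, end):
                for rest in extend(end):
                    yield [sequence[start:end]] + rest

    return extend(0)


def average_segment_length(strategy: List[str]) -> float:
    """Mean number of residues per segment in one strategy."""
    return sum(len(segment) for segment in strategy) / len(strategy)


def toy_is_viable(sequence: str, start: int, end: int) -> bool:
    """Hypothetical rule for the demo: a segment starts at the N-terminus
    or at a Cys/Ala junction and is 3 to 8 residues long (toy bounds)."""
    starts_at_junction = start == 0 or sequence[start] in "CA"
    return starts_at_junction and 3 <= end - start <= 8


if __name__ == "__main__":
    for strategy in enumerate_strategies("GGGCGGGAGGG", toy_is_viable):
        ligations = len(strategy) - 1
        print(strategy, ligations, average_segment_length(strategy))
```

In this sketch each strategy is a list of segments ordered from N- to C-terminus, so the number of ligations is one less than the number of segments. How the evaluated segments and the average length are combined to select the ideal strategy is left to the scoring described elsewhere in the disclosure.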

2. The computer system of claim 1, wherein identifying the list of one or more possible peptide segments comprises identifying one or more of each Cys and Ala junction site in the protein sequence.

3. The computer system of claim 1 or claim 2, wherein viable peptide segments comprise the one or more possible peptide segments having an amino acid length of at least about 10 residues and no longer than about 80 residues.

4. The computer system of any of claims 1-3, wherein generating a list of viable peptide segments comprises identifying one or more viable peptide segments having a C-terminal residue that is not a forbidden residue.

5. The computer system of claim 4, wherein the forbidden residue is an amino acid residue selected from the group consisting of: Asp, Glu, Asn, Pro, and Gln.
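By way of illustration only, the segment identification and filtering recited in claims 2-5 may be sketched in Python as follows. The function names, the treatment of "about 10" and "about 80" as the exact bounds 10 and 80, and the decision to exempt the C-terminal segment of the protein from the forbidden-residue check are assumptions made for this sketch.

```python
# One-letter codes for Asp, Glu, Asn, Pro, Gln (claim 5).
FORBIDDEN_C_TERMINAL = set("DENPQ")
# Assumed exact bounds standing in for "about 10" and "about 80" (claim 3).
MIN_LEN, MAX_LEN = 10, 80


def junction_sites(sequence):
    """Indices where a segment may start: the N-terminus plus each Cys/Ala."""
    return [0] + [i for i, aa in enumerate(sequence) if i > 0 and aa in "CA"]


def viable_segments(sequence):
    """Return (start, end, segment) tuples that pass the length and
    C-terminal residue filters."""
    starts = junction_sites(sequence)
    ends = starts[1:] + [len(sequence)]  # a segment ends where the next begins
    viable = []
    for start in starts:
        for end in ends:
            if end <= start:
                continue
            segment = sequence[start:end]
            if not MIN_LEN <= len(segment) <= MAX_LEN:
                continue
            # Assumption: the last segment of the protein carries no
            # thioester, so the forbidden-residue rule is not applied to it.
            is_last = end == len(sequence)
            if not is_last and segment[-1] in FORBIDDEN_C_TERMINAL:
                continue
            viable.append((start, end, segment))
    return viable


if __name__ == "__main__":
    toy = "G" * 10 + "S" + "C" + "G" * 10  # hypothetical 22-residue target
    for start, end, segment in viable_segments(toy):
        print(start, end, segment)
```

Replacing the Ser at position 11 of the toy sequence with Asp removes the first segment from the list, because its C-terminal residue would then be forbidden.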

6. The computer system of any of claims 1-5, wherein evaluating the one or more viable peptide segments from the list of viable peptide segments comprises calculating one or more peptide segment metrics.

7. The computer system of claim 6, wherein calculating the one or more peptide segment metrics comprises one or more of:

calculating a first score based on the presence of a preferred thioester or an acceptable thioester;

calculating a second score based on an average solubility score;

calculating a third score based on a peptide segment length; and

calculating a penalty for each viable peptide segment comprising an Ala as an N-terminal ligation junction.

8. The computer system of claim 7, wherein the preferred thioester is selected from the group consisting of: Ala, Arg, Cys, Gly, His, Met, Phe, Ser, Trp, and Tyr, and wherein the acceptable thioester is selected from the group consisting of: Ile, Leu, Lys, Thr, and Val.

9. The computer system of claim 7, wherein one or more of the preferred thioester and the acceptable thioester are selected automatically by the computer system based on one or more defined lists or are manually selected by a user.

10. The computer system of claim 7, wherein a value of the second score ranges between 0 and -3.

11. The computer system of claim 7 or claim 10, wherein the average solubility score comprises a solubility score divided by the number of residues in a given peptide segment, the solubility score comprising a sum of a first value for each His, Lys, and Arg in the given segment and a second value for each Asp, Glu, Val, Ile, and Leu in the given peptide segment, wherein the average solubility is compared to a distribution of known solubilities for a set of peptides.

12. The computer system of claim 11, wherein a value of the second score comprises 0 when the average solubility is greater than or equal to a calculated average solubility within the distribution of known solubilities for the set of peptides, and wherein the value of the second score comprises a real number between 0 and -3 when the average solubility is less than the calculated average solubility within the distribution of known solubilities for the set of peptides.

13. The computer system of claim 12, wherein the real number is dependent upon and proportional to a degree by which the average solubility falls below the calculated average solubility.

14. The computer system of any of the preceding claims, wherein identifying the one or more potential ligation strategies for synthesizing the protein sequence is based on combinations of the one or more viable peptide segments that yield the protein sequence.

15. The computer system of any of the preceding claims, further comprising computer executable instructions executable by the at least one of the one or more processors to cause the computer system to associate a penalty score with a potential ligation strategy comprising an average peptide segment length of less than 40 residues.

16. The computer system of any of the preceding claims, further comprising computer executable instructions executable by the at least one of the one or more processors to cause the computer system to display the ideal ligation strategy.

17. The computer system of claim 1, further comprising computer executable instructions executable by the at least one of the one or more processors to cause the computer system to determine a total score for each of the one or more potential ligation strategies.

18. The computer system of claim 17, further comprising computer executable instructions executable by the at least one of the one or more processors to cause the computer system to rank-order the one or more potential ligation strategies based on the total score.

19. The computer system of claim 18, wherein the ideal ligation strategy comprises a highest total score of the rank-ordered one or more potential ligation strategies.

20. A method, implemented at a computer system that includes one or more processors, for determining a chemical protein synthesis design for a protein sequence, the method comprising:

receiving a data structure comprising the protein sequence;

identifying a list of one or more possible peptide segments from the protein sequence;

generating a list of viable peptide segments from the one or more possible peptide segments;

evaluating one or more viable peptide segments from the list of viable peptide segments;

identifying one or more potential ligation strategies for the protein sequence;

determining an average peptide segment length for the one or more potential ligation strategies; and

identifying an ideal ligation strategy based on the evaluated one or more viable peptide segments within the one or more potential ligation strategies and the average peptide segment length of the one or more potential ligation strategies.

21. The method of claim 20, further comprising:

determining a rank-ordered list of the one or more potential ligation strategies based on a total score for each of the one or more potential ligation strategies; and

displaying the ideal ligation strategy, wherein the ideal ligation strategy comprises a ligation strategy having a highest total score of the rank-ordered list.

22. A computer system for determining a chemical protein synthesis design for a protein sequence, the computer system comprising:

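one or more processors; and

one or more computer readable hardware storage devices, wherein the one or more computer readable hardware storage devices comprise computer executable instructions executable by at least one of the one or more processors to cause the computer system to: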

receive a data structure comprising the protein sequence;

evaluate one or more viable peptide segments from a list of viable peptide segments;

identify one or more potential ligation strategies for synthesizing the protein sequence; and

determine an ideal ligation strategy based on the evaluated one or more viable peptide segments within the one or more potential ligation strategies.
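
For illustration only, the following Python sketch shows one way the steps recited in claims 1-19 might be arranged: junction identification at Cys and Ala (claim 2), the length and C-terminal residue filters (claims 3-5), thioester and solubility scoring (claims 7, 8, and 11-13), the penalty for a short average segment length (claim 15), and rank-ordering by total score (claims 17-19). It is not the disclosed implementation. All function names, every numeric weight, and the reference solubility mean and spread are hypothetical values chosen here, because the claims do not specify them. The sketch also assumes that the final segment, which carries no thioester, is exempt from the forbidden-residue check, and it omits the length-based third score of claim 7.

```python
# Residue sets taken from claims 5 and 8 (one-letter codes).
FORBIDDEN_C_TERM = set("DENPQ")   # Asp, Glu, Asn, Pro, Gln
PREFERRED = set("ARCGHMFSWY")     # preferred thioesters
ACCEPTABLE = set("ILKTV")         # acceptable thioesters
MIN_LEN, MAX_LEN = 10, 80         # claim 3 length window
SHORT_AVG = 40                    # claim 15 threshold

# Hypothetical values: none of these numbers appear in the claims.
PREFERRED_SCORE, ACCEPTABLE_SCORE = 2.0, 1.0
POS_VALUE, NEG_VALUE = 1.0, -1.0  # "first value" and "second value" of claim 11
REF_MEAN, REF_SPREAD = 0.0, 0.5   # stand-in for the known-solubility distribution
ALA_PENALTY = -1.0
SHORT_AVG_PENALTY = -2.0


def cut_sites(seq):
    """Positions where a segment may begin: each internal Cys or Ala."""
    return [i for i in range(1, len(seq)) if seq[i] in "CA"]


def is_viable(seq, start, end):
    """Apply the length window and the forbidden C-terminal residue rule."""
    if not MIN_LEN <= end - start <= MAX_LEN:
        return False
    # Assumption: the last segment has no thioester, so it is exempt.
    return end == len(seq) or seq[end - 1] not in FORBIDDEN_C_TERM


def solubility_score(segment):
    """Return 0 at or above the reference mean, else a value down to -3."""
    raw = sum(POS_VALUE for r in segment if r in "HKR")
    raw += sum(NEG_VALUE for r in segment if r in "DEVIL")
    avg = raw / len(segment)
    if avg >= REF_MEAN:
        return 0.0
    # Proportional to the shortfall below the mean, capped at -3.
    return -3.0 * min(1.0, (REF_MEAN - avg) / REF_SPREAD)


def segment_score(seq, start, end):
    """Sum the solubility score, thioester score, and Ala junction penalty."""
    score = solubility_score(seq[start:end])
    if end < len(seq):  # segment ends in a thioester
        c_term = seq[end - 1]
        if c_term in PREFERRED:
            score += PREFERRED_SCORE
        elif c_term in ACCEPTABLE:
            score += ACCEPTABLE_SCORE
    if start > 0 and seq[start] == "A":  # Ala at the N-terminal junction
        score += ALA_PENALTY
    return score


def strategies(seq):
    """Enumerate every split of seq into consecutive viable segments."""
    sites = cut_sites(seq) + [len(seq)]
    results = []

    def extend(start, bounds):
        for end in sites:
            if end > start and is_viable(seq, start, end):
                if end == len(seq):
                    results.append(bounds + [end])
                else:
                    extend(end, bounds + [end])

    extend(0, [0])
    return results


def rank_strategies(seq):
    """Return (total score, segments) pairs, highest total score first."""
    ranked = []
    for bounds in strategies(seq):
        pairs = list(zip(bounds, bounds[1:]))
        total = sum(segment_score(seq, s, e) for s, e in pairs)
        if len(seq) / len(pairs) < SHORT_AVG:
            total += SHORT_AVG_PENALTY
        ranked.append((total, [seq[s:e] for s, e in pairs]))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

Under these assumptions, the first entry returned by `rank_strategies` corresponds to the "ideal ligation strategy" of claim 19. The exhaustive enumeration in `strategies` grows quickly with the number of junction sites, so a practical system would likely need pruning or memoization for long targets.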