WO2003058205A2

WO2003058205A2 - Methods of identifying putative effector proteins

Info

Publication number: WO2003058205A2
Application number: PCT/US2003/000911
Authority: WO
Inventors: Alan Collmer; James Alfano; David Schneider; Lisa Schechter
Original assignee: Cornell Research Foundation, Inc.; The Board Of Regents Of The University Of Nebraska; United States Department Of Agriculture
Priority date: 2002-01-11
Filing date: 2003-01-13
Publication date: 2003-07-17
Also published as: AU2003226441A1; WO2003058205A3; AU2003226441A8

Abstract

A system and method for identifying putative effector proteins that includes providing a predicted amino acid sequence and determining whether the predicted amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences, wherein satisfaction of the one or more rules indicates that the predicted amino acid sequence is likely an effector protein. A computer readable medium is also disclosed herein as having stored thereon instructions which, when executed by a processor, cause the processor to perform such a determination.

Description

METHODS OF IDENTIFYING PUTATIVE EFFECTOR PROTEINS

This application claims the benefit of U.S. Provisional Patent Application Serial No. 60/348,061 filed January 11, 2002, which is hereby incorporated by reference in its entirety.

STATEMENT OF GOVERNMENT SPONSORSHIP

This invention was made at least in part with funding received from the National Science Foundation under grants DBI-0077622 and CB-9982646, and the U.S. Department of Agriculture under Cooperative Agreement 58-1907-1-140 and CRIS No. 1907-2100-009-00D. The U.S. Government may have certain rights in this invention.

FIELD OF THE INVENTION

This invention relates generally to methods of mining putative effector proteins by identifying their open reading frames (ORFs) from genomic databases and determining whether the predicted protein is likely to be secreted.

BACKGROUND OF THE INVENTION

Type III protein secretion systems are central to the virulence of many bacteria, including animal pathogens in the genera Salmonella, Yersinia, Shigella, and Escherichia and plant pathogens in the genera Pseudomonas, Erwinia, Xanthomonas, Ralstonia, and Pantoea (Galan and Collmer, Science 284(5418):1322-1328 (1999)). Loss of the secretion system usually abolishes pathogenicity in mutants, whereas mutation of a single effector protein gene commonly has little or no effect because of apparent redundancies among the effectors (Cornelis and Van Gijsegem, Annu. Rev. Microbiol. 54:735-774 (2000)). This observation highlights both the collective importance of effectors in pathogenesis and the difficulty in identifying them through loss of function. Given this problem, effector protein genes have been alternatively sought through the identification of proteins secreted to the medium and genes coordinately regulated with those encoding the secretion machinery (Worley et al., Methods Enzymol. 326:97-104 (2000)). However, some effectors are poorly secreted in culture and/or expressed independently of the secretion system regulon (Knoop et al., J. Bacteriol. 173(22):7142-7150 (1991); van Dijk et al., J. Bacteriol. 181(16):4790-4797 (1999)). Thus, despite the availability of genomic sequence data for several pathogens that utilize type III secretion systems, only a limited and fragmentary inventory of the effectors underlying their pathogenicity is available. Methods for predicting bacterial protein sorting signals have provided powerful tools for functional genomics by enabling genome-wide assignments of proteins with predicted signal peptides, lipoprotein cleavage sites, and transmembrane domains. However, it has not yet been possible to make genome-wide predictions of which proteins travel the type I, II, III, or IV secretion pathways to the bacterial milieu or host targets.
The identification of secretion signals for the type III pathway is further complicated by uncertainty about whether targeting information resides in the mRNA or the protein, by the observation that some effectors possess multiple secretion signals (e.g., amino-terminal and chaperone-dependent), and by overlap between signals involved in pathway entry and in subsequent subcellular targeting within host cells (e.g., the myristoylation signals found in the amino-terminus of some P. syringae effector proteins) (Lloyd et al., Trends Microbiol. 9(8):367-371 (2001)).

In the controversy regarding the mRNA or protein location of targeting information, there are two primary observations supporting the mRNA signal hypothesis: (i) the first 15 or 17 amino acids of YopE or YopH, respectively, are sufficient to direct a CyaA (Bordetella pertussis adenyl cyclase) reporter through the pathway but no consensus sequence has been recognized in the amino-terminal 15 codons of these and other Yops, and (ii) frameshifts which completely alter the amino acid sequence in this region do not prevent secretion of YopE_1-15-Npt (neomycin phosphotransferase) reporter fusions (Anderson and Schneewind, Science 278(5340):1140-1143 (1997); Sory et al., Proc. Natl. Acad. Sci. USA 92(26):11998-12002 (1995)). Recent evidence for a protein signal features two observations involving native YopE: (i) frameshifts affecting codons 2-11 disrupt the amino-terminal secretion signal and render secretion dependent on the YopE chaperone, and (ii) although many alternative amino acids in this region can support secretion, amphipathicity is essential, and secretion can even be supported by a synthetic peptide of alternating serine and isoleucine residues in positions 2-11 (Lloyd et al., Mol. Microbiol. 39(2):520-531 (2001)). The mechanism by which the secretion machinery recognizes proteins with amphipathic amino-termini is unknown, and amphipathicity per se is too general a property to support efficient genome-wide searches for novel effector genes.

However, consensus sequences predicting secretability have been identified in seven proteins that are coordinately produced with and secreted by the Salmonella pathogenicity island 2 (SPI2) type III secretion system (Miao and Miller, Proc. Natl. Acad. Sci. USA 97(13):7539-7544 (2000)). The consensus occurs in the first 143 residues and includes an amino-terminal amphipathic region. The observation that the effector proteins of both plant and animal pathogens can be secreted by heterologous type III secretion systems argues for a universal signal (Cornelis and Van Gijsegem, Annu. Rev. Microbiol. 54:735-774 (2000)). However, the conservation in the Salmonella-translocated effectors (STE) secreted by SPI2 suggests that system-specific secretion information may be superimposed on the universal signal.

In contrast to Salmonella, P. syringae has a single type III secretion system. The Hrp system is known to secrete harpins (HrpZ and HrpW), the HrpA pilus subunit, and effector proteins commonly designated as Avrs (avirulence) or Hops (Hrp-dependent outer proteins). Deletion mutations have demonstrated that the amino-terminal 10-15 residues are required for the secretion of AvrPto and AvrB, respectively (Anderson et al., Proc. Natl. Acad. Sci. USA 96(22):12839-12843 (1999)). Fusions with an AvrRpt2 reporter have further demonstrated that additional signals for translocation into plant cells reside within the amino-terminal 58 residues of the Xanthomonas campestris pv. vesicatoria AvrBs2 protein (Mudgett et al., Proc. Natl. Acad. Sci. USA 97(24):13324-13329 (2000)). In general, the effector proteins of plant and animal pathogens appear to carry targeting information in the N-terminal portion, but the nature of that information is unclear.

It would be desirable, therefore, to identify a set of rules that can discriminate putative effector proteins from other proteins encoded by the genome. The present invention is directed to overcoming this and other deficiencies in the art.

SUMMARY OF THE INVENTION

One aspect of the present invention relates to a method of identifying putative effector proteins that includes: providing a predicted amino acid sequence; and determining whether the predicted amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences, wherein satisfaction of the one or more rules indicates that the predicted amino acid sequence is likely an effector protein.

Another aspect of the present invention relates to a system including: a determination system that determines whether a provided amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences.

A further aspect of the present invention relates to a computer readable medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the step of: determining whether a provided amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences.

Pseudomonas syringae pv. tomato DC3000 is an important model organism in molecular plant pathology whose pathogenicity is dependent upon effector proteins injected into host plant cells by the Hrp type III protein secretion system. A draft sequence of the DC3000 genome and use of that sequence for genome-wide identification of virulence-related genes in the Hrp regulon has been described (Fouts et al., Proc. Natl. Acad. Sci. USA 99(4):2275-2280 (2001)). One aspect of the present invention is the genome-wide investigation of the Hrp system by identifying characteristics that predict which DC3000 proteins travel the Hrp pathway and therefore are candidate effectors.

Application of the rules for candidate effector proteins in Pseudomonads and other plant and animal pathogens will allow for more effective and expeditious mining of effector proteins during genome-wide investigations of such organisms. It is expected, therefore, that the present invention will provide for the identification and subsequent testing and use of a significant number of effector proteins that are expressed and secreted by plant or animal pathogens possessing type III secretion systems. Such proteins are predicted to have exquisite biochemical activity in plants and other eukaryotes, and are believed to have a variety of beneficial uses.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a diagram illustrating a system containing a memory and query module for executing a series of instructions for performing a method of identifying a putative effector protein in accordance with the present invention.

Figure 2 is a diagram illustrating the method of mining genomes for candidate effector proteins.

Figures 3A-C illustrate the sequence alignments of the first 40 aa of P. syringae Hrp-secreted proteins and other relevant proteins and patterns that are predictive of export signals in secreted proteins. In Figure 3A, the first 40 aa of a nonredundant set of Hrp-secreted proteins and Avr proteins from P. syringae pathovars are coded by functional class. Below the aligned amino acids are bars indicating which positions are pertinent to the various export-associated patterns, expressed as six predictive rules (see Figure 5). The sequences of proteins from P. s. tomato DC3000 are indicated with a Pto subscript. In Figure 3B, the first 40 aa from representative P. s. tomato DC3000 proteins associated with the Hrp regulon but not secreted by the Hrp system are aligned: CorS (coronatine biosynthesis regulator); IaaL (N-(indole-3-acetyl)-L-lysine synthetase); HrcC (outer membrane Hrp translocator); and HrpL (alternative sigma factor). In Figure 3C, the first 40 aa of various identified ORFs are aligned, showing application of the export-signal rules to two ORFs in the AvrPphF locus and six ADP-ribosyl transferases produced by P. s. tomato DC3000 (HopPtoS1, HopPtoS2, and ORF31), P. aeruginosa (ExoS and ExoT), and chicken (NRT2_Chk). Proteins violating the export signal rules are listed below Figures 3B and 3C (with the violations). The GenBank accession numbers are M15194 (AvrA), M21965 (AvrB), M22219 (AvrC), Z21715 (AvrRpt2), NC_002759 (AvrRpm1), L20425 (AvrPto), AJ277495 (AvrPpiG1), AF232005 (HopPsyV), AF232004 (AvrE, HopPtoB (EEL ORF1), HopPtoA1 (CEL ORF5), HrpA, HrpK, HrpW, HrpZ), AAF67151 (AvrPphF ORF1), AAF67152 (AvrPphF ORF2), AAC34756 (HrcC), NP_252530 (ExoS), NP_248734 (ExoT), and P55807 (NRT2_Chk), all of which are hereby incorporated by reference in their entirety.

Figure 4 is a chart illustrating the DNA Rules lookup table, which contains the queries for the identified genomic DNA that may encode a putative candidate effector protein or regions of genomic DNA associated with the identified genomic DNA.

Figure 5 is a chart illustrating the Protein Rules lookup table, which contains the queries for determining whether a predicted amino acid sequence (encoded by the identified genomic DNA) satisfies one or more rules and, therefore, is likely to be a secreted effector protein.

Figure 6 is a diagram illustrating the steps in performing a secretion assay to assess whether a candidate effector protein is secreted by a type III secretion system.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a method and system for determining whether a predicted amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences. Satisfaction of the one or more rules indicates that an organism's genomic sequence encodes a putative effector protein (i.e., the amino acid sequence is likely an effector protein).

Referring to Figure 1, a putative effector protein detection system 10 of the present invention includes a processor 11, a memory storage device 12, a display 13, a printer 14 (optional), a user interface 15 (which includes suitable input devices), and an input/output (I/O) unit 17 which are coupled together by a bus system 16 or other link, respectively, although the system 10 may comprise other components, other numbers of the components, and other combinations of the components. The processor 11 may execute one or more programs of stored instructions for the method for determining whether a predicted amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences, as described herein. In this particular embodiment, programmed instructions for making such a determination are stored in memory 12 and are executed by processor 11, although some or all of those programmed instructions could be stored and retrieved from and also executed at other locations. A variety of different types of memory storage devices, such as a random access memory (RAM) or a read only memory (ROM) in the system or a floppy disk, hard disk, CD ROM, or other computer readable medium which is read from and/or written to by a magnetic, optical, or other reading and/or writing system that is coupled to the processor 11, can be used for memory 12.

The display 13 and printer 14 are used to show information to the operator. A variety of different devices can be used for the display 13, such as a CRT or flat panel display, and a variety of printing devices can be used as well.

The user interface 15 permits an operator to enter data into the system 10. A variety of different types of devices can be used for the user interface 15, such as a keyboard, a computer mouse, or an interactive display screen.

The I/O unit 17 in system 10 is used to couple the system to the Internet 18 for accessing information or services provided on remote systems 19, such as remote servers for execution of certain tasks or for accessing databases stored thereon. A variety of different interface devices can be used with a variety of different communication protocols.

In embodiments of the present invention, the system 10 has stored in its memory 12 instructions for implementation by the processor 11, such as the analysis of genomic DNA for compliance with DNA Rules lookup table 70 and the analysis of the predicted amino acid sequence with Protein Rules lookup table 80 described hereinafter. Thus, the system 10 of the present invention has stored thereon instructions which, when executed by a processor, cause the processor to perform the steps of: identifying genomic DNA that includes a start codon and contains at least 150 consecutive codons with no stop codon; determining whether the genomic DNA is associated with or satisfies any conditions identified in the DNA Rules lookup table 70; predicting the amino acid sequence encoded by the genomic DNA; and/or assessing whether the encoded protein is secreted by a type III secretion system, based on the satisfaction of one or more rules from the Protein Rules lookup table 80. The last step, of course, is requisite in performing the method of the present invention.

Referring to Figure 2, in use the system will first identify a portion of genomic DNA from a database of genomic DNA sequences at step 20. The portion of genomic DNA thus identified will include a start codon and at least 150 consecutive codons without a stop codon. This is a putative open reading frame (ORF). The identification of the genomic DNA satisfying these criteria can be carried out using, e.g., known DNA analysis software. Organisms whose genomes can be searched and predicted proteins analyzed with the set of rules identified herein include, without limitation, species from the genera Pseudomonas, Erwinia, Xanthomonas, Yersinia, Ralstonia, Salmonella, Shigella, pathogenic E. coli, Helicobacter, and other plant or animal pathogens. In most species, suitable start codons include ATG and GTG, although other suitable start codons that have been identified can also be utilized. In most species, the stop codon is typically TGA, TAG, or TAA.
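The ORF identification of step 20 can be sketched in Python as follows. This is only an illustrative sketch, not the software used in the described embodiment. The function name `find_candidate_orfs` is hypothetical, only the forward strand is scanned, and the 150-codon minimum is assumed to count the start codon itself; the text does not settle either point.

```python
START_CODONS = {"ATG", "GTG"}         # start codons named in the text
STOP_CODONS = {"TGA", "TAG", "TAA"}   # stop codons named in the text
MIN_CODONS = 150                      # minimum stop-free run, per the text


def find_candidate_orfs(dna, min_codons=MIN_CODONS):
    """Return (start, end) index pairs of forward-strand candidate ORFs.

    Each ORF begins at a start codon and extends to the first in-frame
    stop codon (or the end of the sequence). `end` is the index of the
    stop codon, so dna[start:end] excludes it. An ORF is kept only if it
    spans at least `min_codons` codons (assumption: the start codon is
    included in that count).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] in START_CODONS:
                end = pos
                # Walk codon by codon until a stop codon or the sequence end.
                while end + 3 <= len(dna) and dna[end:end + 3] not in STOP_CODONS:
                    end += 3
                if (end - pos) // 3 >= min_codons:
                    orfs.append((pos, end))
                pos = end + 3  # resume scanning after the stop codon
            else:
                pos += 3
    return orfs
```

A complete search would also scan the reverse complement and might consider alternative start codons within a reading frame; both are left out here for brevity.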

Optionally, at step 25 an analysis of the genomic DNA can be performed. The analysis of the genomic DNA is performed by the processor 11 in accordance with the DNA Rules lookup table 70 (Figure 4) or by an operator examining the identified genomic DNA and its associated regions within the genome. One component of this analysis determines whether the identified genomic DNA is associated with other genomic elements that are characteristic of effector protein coding regions. One such genomic element is the presence of a hrp-dependent promoter. Another such genomic element is the presence of the identified genomic DNA near a transposon (indicating horizontal acquisition of associated genes). These analyses can be performed in accordance with the Hidden Markov Model approach identified in Fouts et al., Proc. Natl. Acad. Sci. USA 99(4):2275-2280 (2001), which along with its supplemental information is hereby incorporated by reference in its entirety. Another component of the analysis is a counting of the nucleic acid bases in the identified genomic DNA to determine its GC content: ORFs of effector proteins are characterized by low GC content. As used herein, a low GC content refers to a GC content that is about 2 percent or more, preferably 5 percent or more, lower than the genome average for a particular organism.

Once the genomic DNA has been identified, the amino acid sequence encoded thereby is predicted at step 30. In this particular embodiment, conventional translation programs that are programmed to recognize codons and identify the encoded amino acid residue can be utilized, although other systems and methods for predicting amino acid sequences can also be utilized. Suitable software that can be used to predict amino acid sequence includes, without limitation, DNAStar and the Transeq software of the EMBOSS package (Sanger Center). This software can be loaded directly onto the system 10 or loaded onto a separate server or system for access by system 10, for execution of the request to translate the identified genomic DNA into a predicted amino acid sequence.
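The GC-content check of step 25 and the translation of step 30 can be sketched in Python as follows. This is a sketch under stated assumptions and does not reproduce DNAStar or Transeq. The names `gc_content`, `has_low_gc`, and `translate` are hypothetical, "2 percent lower" is assumed to mean percentage points below the genome average, and the genome average is supplied by the caller.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def gc_content(dna):
    """Return the fraction of G and C bases in a non-empty DNA string."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


def has_low_gc(orf_dna, genome_average, margin=0.02):
    """True if the ORF's GC fraction is at least `margin` below the average.

    Assumption: the text's "2 percent" is read as 0.02 in absolute terms
    (percentage points); pass margin=0.05 for the preferred threshold.
    """
    return genome_average - gc_content(orf_dna) >= margin


def translate(orf_dna):
    """Translate an ORF from its first codon up to the first stop codon."""
    orf_dna = orf_dna.upper()
    residues = []
    for i in range(0, len(orf_dna) - 2, 3):
        aa = CODON_TABLE.get(orf_dna[i:i + 3], "X")  # "X" for ambiguous codons
        if aa == "*":
            break
        residues.append(aa)
    if residues:
        residues[0] = "M"  # an initiator codon such as GTG is read as Met
    return "".join(residues)
```

For example, `translate("ATGTCTACCCAGTAA")` returns `"MSTQ"`, and `has_low_gc(orf, 0.58)` flags an ORF whose GC fraction is 0.56 or lower against a hypothetical genome average of 0.58.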

Having predicted the amino acid sequence of the encoded product of the ORF in step 30, an analysis is performed at step 40 to determine whether the N-terminal region of the encoded protein satisfies one or more rules from a set of rules that define properties of secreted effector proteins. The length of the N-terminal region to be analyzed is preferably at least about 40 amino acids. It is preferable to analyze up to about 50 amino acids from the N-terminal region, although larger sequence lengths, such as 60, 70, 80, 90, or even 100 or more amino acids can be analyzed. Based on the foregoing, it should be appreciated that the steps 20, 25, and 30 can be performed on separate systems and the predicted amino acid sequence merely provided to system 10 for execution of step 40.

The analysis can be performed by the system 10, although other systems containing processors and memory can be utilized. In this embodiment, processor 11 calls from memory 12 a Protein Rules lookup table 80 stored therein (Figure 5) to determine whether the predicted amino acid sequence complies with two or more of Rules (a)-(f). Thus, the analysis of the predicted amino acid sequence is merely a series of queries: i.e., whether or not the predicted amino acid sequence satisfies each of two or more Rules (a)-(f). The results of the query can be stored in memory 12 and visually displayed on display 13 or printed by printer 14. Each of the predictive rules in the lookup table 80 is based on the texture or local composition of the protein rather than its structure or conformation per se. From an examination of the alignment shown in Figure 3A, which illustrates the alignment of aa 1-40 from various effector proteins, Rules (a)-(f) emerge. With respect to Rule (e), "rich in polar amino acids" refers to a percentage of polar amino acids (e.g., serine, threonine, cysteine, asparagine, glutamine, and glycine) that is greater than a specified threshold level. In the analysis of Pseudomonas syringae DC3000, "rich in polar amino acids" was defined as at least 7 of serine, threonine, or glutamine (i.e., in combination) present in the first 40 or 50 amino acids analyzed.
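As an illustration of one such query, the DC3000 definition of Rule (e) given above can be sketched in Python. The function name `is_polar_rich` and its parameters are hypothetical. The remaining rules are set out in Figure 5 and are not sketched here.

```python
def is_polar_rich(protein, window=50, min_count=7, residues="STQ"):
    """Check the DC3000 form of Rule (e) on a one-letter protein sequence.

    Counts serine (S), threonine (T), and glutamine (Q) in combination
    within the first `window` residues and compares the total with
    `min_count`. The text allows a window of 40 or 50 residues; 50 is
    used as the default here.
    """
    n_terminal = protein.upper()[:window]
    return sum(n_terminal.count(r) for r in residues) >= min_count
```

Passing `window=40` applies the stricter of the two window lengths, and a different `residues` string or `min_count` could express another threshold for "rich in polar amino acids" in a different organism.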

Based on the query of the predicted amino acid sequence at step 40, it is possible to determine whether the protein encoded by the identified genomic DNA is likely to be secreted. Most effector proteins from Pseudomonas satisfy all of these rules, although several exceptions exist, as noted in Figures 3B and 3C. Therefore, proteins that satisfy one or more of the above rules, more preferably two or more of the above rules, three or more of the above rules, four or more of the above rules, five or more of the above rules, or all six of the above rules, can be putative effector proteins.

There is also a subset or class of effectors in animal pathogens whose N-terminal regions have the same texture and therefore satisfy the rules. The Yersinia YopE is one of these. Therefore, the analysis of protein products of candidate ORFs can be used to find additional members of this class of effector in bacteria other than Pseudomonas.

Depending on the number of rules satisfied by the N-terminal region of the predicted amino acid sequence for the protein encoded by the ORF, it is possible to assess the likelihood of whether the ORF encodes an effector protein. The more rules that are satisfied, the greater the likelihood that the encoded protein is indeed an effector protein. Rules (a), (c), and (e) are preferably satisfied, more preferably rules (a)-(e) are satisfied, and most preferably rules (a)-(f) are satisfied. This enables one of ordinary skill in the art to determine whether or not a secretion assay is to be performed at step 50 (Figure 2).
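The preference ordering described above can be sketched in Python as a tally over per-rule results. This is a hedged sketch: the function name `rank_candidate` and the tier labels are hypothetical, and the results for the individual rules are assumed to have been obtained already from the queries against the Protein Rules lookup table 80.

```python
ALL_RULES = set("abcdef")


def rank_candidate(satisfied):
    """Return (number of rules satisfied, preference tier).

    `satisfied` is any iterable of rule letters ("a" through "f") that the
    N-terminal region was found to satisfy. The tiers follow the stated
    preference: rules (a), (c), and (e); then rules (a)-(e); then (a)-(f).
    """
    satisfied = set(satisfied) & ALL_RULES
    if satisfied >= set("abcdef"):
        tier = "most preferred"
    elif satisfied >= set("abcde"):
        tier = "more preferred"
    elif satisfied >= set("ace"):
        tier = "preferred"
    elif satisfied:
        tier = "candidate"
    else:
        tier = "unsupported"
    return len(satisfied), tier
```

For example, `rank_candidate("ace")` returns `(3, "preferred")`, while `rank_candidate("bdf")` returns `(3, "candidate")`: both satisfy three rules, but only the first includes the preferred combination. The count and tier together could inform the decision whether to proceed to the secretion assay of step 50.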

Performance of the secretion assay will confirm whether or not the predicted protein is secreted by a type III secretion system. Because of the time and resources involved in performing a secretion assay, the method and systems of the present invention allow for an analysis of the likelihood that such secretion will or will not occur. Thus, the present invention will afford the practitioner a greater wealth of information upon which to base the decision to proceed or not to proceed with such a secretion assay. This should significantly streamline the process for identifying new effector proteins.

The basic procedures for performing the secretion assay are illustrated in Figure 6. Basically, the desired ORF is introduced into a DNA construct 100 which includes appropriate promoter and transcription termination (polyadenylation) signals suitable for use in the host cell to be transformed. The host cell includes a functional type III secretion system, allowing proteins secretable by the type III secretion system to be secreted into growth media. After preparing the DNA construct, it is introduced into a plasmid or cosmid 102 for introduction into the host cell 104 using standard procedures, and transformants 104' are selected and then cultured on growth media at step 130, which allows for expression of the protein encoded by the selected ORF. If the ORF does, in fact, encode an effector protein, then the encoded protein will be secreted into the growth media, which can be tested at step 140 for presence of the protein. Further testing can then be performed to determine the properties of the effector protein.

Preparing construct 100 and inserting the construct into the plasmid 102 at step 110 can be carried out using conventional recombinant techniques, such as restriction enzyme cleavage and ligation with DNA ligase. U.S. Patent No. 4,237,224 to Cohen and Boyer, which is hereby incorporated by reference in its entirety, describes the production of expression systems in the form of recombinant plasmids using restriction enzyme cleavage and ligation with DNA ligase. The DNA sequences are cloned into the vector using standard cloning procedures in the art, as described by Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1982), which is hereby incorporated by reference in its entirety.

Suitable vectors include, but are not limited to, viral vectors such as lambda vector system gt11, gt WES.tB, Charon 4, and plasmid vectors such as pBR322, pBR325, pACYC177, pACYC184, pUC8, pUC9, pUC18, pUC19, pLG339, pR290, pKC37, pKC101, SV40, pBluescript II SK +/- or KS +/- (see "Stratagene Cloning Systems" Catalog (1993) from Stratagene, La Jolla, Calif., which is hereby incorporated by reference in its entirety), pQE, pIH821, pGEX, pET series (see Studier et al., "Use of T7 RNA Polymerase to Direct Expression of Cloned Genes," Gene Expression Technology, vol. 185 (1990), which is hereby incorporated by reference in its entirety), and any derivatives thereof. Suitable vectors are continually being developed and identified.

These recombinant plasmids are then introduced into host cells at step 120 by means of known transformation and transfection techniques, and the host cells replicated in cultures. Recombinant molecules can be introduced into host cells via transformation, transduction, conjugation, mobilization, or electroporation. See Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1982), which is hereby incorporated by reference in its entirety. A variety of host-vector systems may be utilized to express the type III secretion system and the protein encoded by the desired ORF. Primarily, the vector system must be compatible with the host cell used. A preferred host-vector system is a bacterium transformed with bacteriophage DNA, plasmid DNA, or cosmid DNA. However, other host-vector systems can also be used, including microorganisms such as yeast containing yeast vectors; mammalian cell systems infected with virus (e.g., vaccinia virus, adenovirus, etc.); insect cell systems infected with virus (e.g., baculovirus); and plant cells infected by bacteria or transformed via particle bombardment (i.e., biolistics). The expression elements of these vectors vary in their strength and specificities. Depending upon the host-vector system utilized, any one of a number of suitable transcription elements can be used.

Different genetic signals and processing events control many levels of gene expression (e.g., DNA transcription and messenger RNA (mRNA) translation). Transcription of DNA is dependent upon the presence of a promoter which is a DNA sequence that directs the binding of RNA polymerase and thereby promotes mRNA synthesis. The DNA sequences of eukaryotic promoters differ from those of prokaryotic promoters. Furthermore, eukaryotic promoters and accompanying genetic signals may not be recognized in or may not function in a prokaryotic system, and, further, prokaryotic promoters are not recognized and do not function in eukaryotic cells.

Specific initiation signals are also required for efficient gene transcription and translation in prokaryotic cells. These transcription and translation initiation signals may vary in "strength" as measured by the quantity of gene specific messenger RNA and protein synthesized, respectively. The DNA expression vector, which contains a promoter, may also contain any combination of various "strong" transcription and/or translation initiation signals. Efficient translation of mRNA in prokaryotes requires a ribosome binding site called the Shine-Dalgarno ("SD") sequence on the mRNA. This sequence is a short nucleotide sequence of mRNA that is located before the start codon, usually ATG, which encodes the amino-terminal methionine of the protein. The SD sequences are complementary to the 3'-end of the 16S rRNA (ribosomal RNA) and probably promote binding of mRNA to ribosomes by duplexing with the rRNA to allow correct positioning of the ribosome. Thus, any SD-ATG combination that can be utilized by host cell ribosomes may be employed. Such combinations include, but are not limited to, SD-ATG combinations synthesized by recombinant techniques, the SD-ATG combination from the cro gene or the N gene of coliphage lambda, or from the Escherichia coli tryptophan E, D, C, B or A genes. For a review on maximizing gene expression, see Roberts and Lauer, Methods in Enzymology, 68:473 (1979), which is hereby incorporated by reference in its entirety.

Promoters vary in their "strength" (i.e., their ability to promote transcription). For the purposes of expressing a cloned DNA construct of the present invention, it is desirable to use strong promoters in order to obtain a high level of transcription and, hence, expression of the DNA construct. Depending upon the host cell system utilized, any one of a number of suitable promoters may be used. For instance, when cloning in Escherichia coli, its bacteriophages, or plasmids, promoters such as the T7 phage promoter, lac promoter, trp promoter, recA promoter, ribosomal RNA promoter, the PR and PL promoters of coliphage lambda and others, including but not limited to lacUV5, ompF, bla, lpp, and the like, may be used to direct high levels of transcription of adjacent DNA segments. Additionally, a hybrid trp-lacUV5 (tac) promoter or other Escherichia coli promoters produced by recombinant DNA or other synthetic DNA techniques may be used to provide for transcription of the inserted construct. Bacterial host cell strains and expression vectors may be chosen which inhibit the action of the promoter unless specifically induced. In certain operons, the addition of specific inducers is necessary for efficient transcription of the inserted DNA. For example, the lac operon is induced by the addition of lactose or IPTG (isopropylthio-beta-D-galactoside). A variety of other operons, such as trp, pro, etc., are under different controls.

Host cells can be transformed using the expression systems of the present invention, whereby the host cell is transformed with one or more of the DNA constructs of the present invention, as described above. Preferably, the host cells are present in a cell culture. Although any bacterial cell is suitable for use as a host cell, it is desirable in many instances to use as the host cell an organism that does not express a native type III secretion system but instead a recombinantly expressed type III secretion system. Two examples of such host cells are recombinant E. coli cells available from the American Type Culture Center (Manassas, Virginia) as deposit PTA-3287 (deposited April 13, 2001), which is E. coli DH5α containing plasmid pCPP2156 (the cloned hrp gene cluster of Erwinia chrysanthemi); and PTA-3288 (deposited April 13, 2001), which is E. coli DH5 containing plasmid pCPP430 (the cloned hrp gene cluster of Erwinia amylovora). Alternatively, the host cell can be one which expresses a native type III secretion system but the ORF is from a source organism which is a different strain or species (e.g., host cell is a pathovar of Erwinia amylovora or Erwinia chrysanthemi and the source organism is a different species or strain of Erwinia or from another biological genus altogether, such as Pseudomonas, Xanthomonas, etc.). Biological markers can be used to identify the host cells 104' carrying recombinant DNA molecules. In bacteria, these are commonly drug-resistance genes. Drug resistance is used to select bacteria that have taken up cloned DNA from the much larger population of bacteria that have not. 
Various dominant selectable markers are now known in the art, including: aminoglycoside phosphotransferase (APH), using the drug G418 for selection, which inhibits protein synthesis and where the APH inactivates G418; dihydrofolate reductase (DHFR):Mtx-resistant variant, using the drug methotrexate (Mtx) for selection, which inhibits DHFR and where the variant DHFR is resistant to Mtx; hygromycin-B-phosphotransferase (HPH), using the drug hygromycin-B, which inhibits protein synthesis and where the HPH inactivates hygromycin B; thymidine kinase (TK), using the drug aminopterin, which inhibits de novo purine and thymidylate synthesis and where the TK synthesizes thymidylate; xanthine-guanine phosphoribosyltransferase (XGPRT), using the drug mycophenolic acid, which inhibits de novo GMP synthesis and where XGPRT synthesizes GMP from xanthine; and adenosine deaminase (ADA), using the drug 9-β-D-xylofuranosyl adenine (Xyl-A), which damages DNA and where the ADA inactivates Xyl-A. Other selectable markers are continually being identified.

Once transformed host cells 104' have been selected, they can then be grown in suitable growth media at step 130 under conditions that promote expression of the protein encoded by the ORF of the recombinant construct 100. Effective conditions include optimal growth temperatures and nutrient media which will enable maximal growth of the host cells and maximal expression of the protein encoded by the ORF. Exemplary culture media include, without limitation, LM media and minimal media, both of which are known in the art. One of ordinary skill in the art can readily determine the optimal growth temperatures for particular strains of host cells and suitable nutrient media capable of optimizing host cell growth.

Once the transformed host cells 104' have been grown in culture, the protein 106 encoded by the ORF, if secreted, will be present in the growth medium. The protein 106 can be detected or isolated from the growth medium at step 140 using conventional protein isolation procedures.

Purified protein, if desired, may be obtained by several methods. The protein or polypeptide is preferably produced in purified form (preferably at least about 80%, more preferably 90%, pure) by conventional techniques. To isolate the protein 106, the recombinant host cells are propagated, the growth medium is centrifuged to separate cellular components from supernatant containing the secreted protein 106, and the supernatant is removed. The supernatant is then subjected to sequential ammonium sulfate precipitation. The fraction containing the protein 106 encoded by the ORF is subjected to gel filtration in an appropriately sized dextran or polyacrylamide column to separate the proteins. If necessary, the protein fraction may be further purified by HPLC. Banding by the protein 106 encoded by the ORF should be present in only the transformed cells but not non-recombinant cells grown under identical conditions as a control.

From the foregoing, it should be appreciated that the secretion assay can be performed in a variety of manners, all of which are described in greater detail in U.S. Patent Application Serial No. 09/350,852 to Bauer et al., filed July 9, 1999, which is hereby incorporated by reference in its entirety.

EXAMPLES

The following examples are provided to illustrate embodiments of the present invention but they are by no means intended to limit its scope.

Materials and Methods

The following materials and methods were employed in the analysis of the predictive rules identified herein.

Strains and Media: Escherichia coli strain DH5α was used for cloning experiments, and P. s. tomato DC3000 or derivatives and P. s. phaseolicola 3121 were used for secretion or translocation assays, respectively. Routine culture conditions for bacteria are similar to those described (van Dijk et al., J. Bacteriol. 181(16):4790-4797 (1999), which is hereby incorporated by reference in its entirety). Antibiotics were used at the following concentrations: 100 μg/ml ampicillin, 20 μg/ml chloramphenicol, 10 μg/ml gentamicin, 100 μg/ml rifampicin, 10 μg/ml kanamycin, and 20 μg/ml tetracycline.

Secretion Assays: All of the secretion assays used P. s. tomato DC3000 strains carrying a pML123 derivative containing a PCR-cloned ORF (encoding a candidate Hrp-secreted protein) fused to nucleotide sequences that encoded either the hemagglutinin or FLAG epitopes along with their native ribosome binding sites (Labes et al., Gene 89:37-46 (1990), which is hereby incorporated by reference in its entirety). Constructs carrying different epitope-tagged ORFs were electroporated into DC3000 and a DC3000 hrcC mutant and grown in Hrp-inducing conditions (Yuan & He, J. Bacteriol. 178:6399-6402 (1996), which is hereby incorporated by reference in its entirety). Additionally, all of the DC3000 strains also carried pCPP2318, a construct that contains blaM lacking signal peptide sequences (Charkowski et al., J. Bacteriol. 179:3866-3874 (1997), which is hereby incorporated by reference in its entirety). DC3000 cultures were separated into cell-bound and supernatant fractions as described (van Dijk et al., J. Bacteriol. 181(16):4790-4797 (1999), which is hereby incorporated by reference in its entirety). Proteins were separated with SDS/PAGE by standard procedures (Sambrook et al., Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Lab. Press) (1989), which is hereby incorporated by reference in its entirety), transferred to polyvinylidene difluoride membranes, and immunoblotted by using anti-FLAG (Sigma), anti-hemagglutinin (Roche Molecular Biochemicals), or anti-β-lactamase (5 Prime → 3 Prime) as primary antibodies. Primary antibodies were recognized by goat anti-rabbit IgG-alkaline phosphatase conjugate (Sigma), which was visualized by chemiluminescence using a Western-Light chemiluminescence detection system (Tropix, Bedford, MA) and X-Omat x-ray film.

Bioinformatic Techniques: Routine DNA analysis of the draft nucleotide sequence of P. s. tomato DC3000, available from the Internet site of The Institute for Genomic Research, used BLAST searches, the Artemis genome viewer and annotation tool available from the Internet site of The Sanger Institute, and LASERGENE software (DNAStar, Madison, WI). The two core motifs of the algorithm used to identify ORFs that shared general features with Hrp-secreted proteins, described for simplicity in standard Prosite syntax (Hofmann et al., Nucleic Acids Res. 27:215-219 (1999), which is hereby incorporated by reference in its entirety), were written as follows: <M-[CGHKNPQRSTY]-[ILV]-{DEFWY}-{DEMILVFWY}-{DE}-{DE}-{DE}-{DE}-{DE}-{DE}; and <M-[CGHKNPQRSTY]-[CGHKNPQRSTY]-[ILV]-{DEMILVFWY}-{DE}-{DE}-{DE}-{DE}-{DE}-{DE}. Briefly, the < at the left of the Met indicates that the following pattern must appear at the N terminus of the peptide (i.e., an ORF needed to start with Met). Characters in square brackets are alternatives for a single position (e.g., [ILV] denotes a single Ile, Leu, or Val residue). Characters in curly brackets are excluded (e.g., {DE} denotes any single residue except Asp or Glu). Dashes are used to separate residues. Other requirements were as follows: First, for an ORF to be selected it needed a minimum overall length of 150 residues. Second, to select ORFs that encoded polar amino termini, ORFs were required to have a minimum combined number of Ser, Thr, and Gln residues of 7 within the first 50 residues. Finally, the candidate sequences were screened to eliminate those containing the following Prosite patterns: [MIVLFYW]-[MIVLFYW]-[MIVLFYW]; [MLIV]-[MLIV]; [FYW]-[FWY]; [AG]-[AG]; [KR]-[KR]; and [NQ]-[NQ]. This step simply eliminates sequences containing runs of residues from these particular residue classes. This genomewide search yielded genes that were manually screened for validity as ORFs by using the Artemis genome viewer.
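By way of illustration only, the core-motif, length, and polar-residue requirements described above can be sketched in Python. This is a minimal sketch and not the implementation actually used: the function names `prosite_to_regex` and `is_candidate` are hypothetical, the translator handles only the subset of Prosite syntax that appears in the two core motifs, and the run-exclusion screen is omitted because the text does not state the window over which it was applied.

```python
import re

# The two core motifs, copied from the description (Prosite syntax).
CORE_MOTIFS = [
    "<M-[CGHKNPQRSTY]-[ILV]-{DEFWY}-{DEMILVFWY}-{DE}-{DE}-{DE}-{DE}-{DE}-{DE}",
    "<M-[CGHKNPQRSTY]-[CGHKNPQRSTY]-[ILV]-{DEMILVFWY}-{DE}-{DE}-{DE}-{DE}-{DE}-{DE}",
]


def prosite_to_regex(pattern):
    """Translate a small subset of Prosite syntax into a compiled regex.

    Supported: '<' (N-terminal anchor), [..] alternatives, {..} exclusions,
    'x' (any residue), and single-letter residues separated by dashes.
    """
    anchored = pattern.startswith("<")
    parts = []
    for element in pattern.lstrip("<").split("-"):
        if element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")  # excluded residues
        elif element.startswith("[") and element.endswith("]"):
            parts.append(element)                     # allowed alternatives
        elif element == "x":
            parts.append(".")                         # any residue
        else:
            parts.append(re.escape(element))          # literal residue
    return re.compile(("^" if anchored else "") + "".join(parts))


CORE_REGEXES = [prosite_to_regex(p) for p in CORE_MOTIFS]


def is_candidate(protein, min_length=150, min_polar=7, window=50):
    """Apply the length, core-motif, and Ser/Thr/Gln requirements."""
    protein = protein.upper()
    if len(protein) < min_length:
        return False
    if not any(rx.search(protein) for rx in CORE_REGEXES):
        return False
    polar = sum(protein[:window].count(aa) for aa in "STQ")
    return polar >= min_polar
```

A predicted amino acid sequence would be passed to `is_candidate` as a one-letter-code string; sequences that return `True` would still require the manual screening described above.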

Example 1 - Identification of Predictive Patterns in the N-Terminal Regions of Proteins Secreted by the P. syringae Hrp System

To enable genomewide identification of Hrp effector genes (regardless of the presence of 5' Hrp promoter sequences), the N-terminal regions of an enlarged set of Hrp-secreted proteins were examined for conserved patterns and properties. A training set of 28 nonredundant proteins thought to be secreted by the P. syringae Hrp system (as indicated by previous avirulence or secretion tests) was constructed, and whenever possible a protein family with a homolog from P. s. tomato DC3000 was represented (Figure 3A). Attempts to align or find motifs in the first 50 aa of these proteins using known programs failed. However, when these amino acids were examined on the basis of their biophysical properties and solvent-exposed substitutability, several patterns emerged, which are expressed as predictive rules in Figure 3A. In general, the rules define a specific pattern of solvent-exposed, equivalent amino acids that occur in the first five positions, an absence of acidic residues in the first 12 positions, and an overall amphipathicity and richness in polar amino acids in the N-terminal 50 or so residues. Notably, each of four representative proteins that are expressed by the Hrp regulon but are not secreted by the Hrp system (CorS, IaaL, HrcC, and HrpL) failed multiple rules (Figure 3B). Thus, these export signal rules appeared sufficiently specific to support a genomewide search for additional candidate effector genes, which could then be submitted to Hrp secretion tests.
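For illustration, the predictive rules can be evaluated mechanically once they are stated position by position. The following minimal Python sketch uses the wording of Rules (a)-(f) as recited in the claims rather than Figure 3A, which is not reproduced here. The names `check_export_rules` and `satisfies` are hypothetical. Two readings are assumptions: "between positions 5 and 50" is taken as inclusive, and Rule (e) is reduced to its countable part (7 or more Ser, Thr, or Gln residues), since amphipathicity is not assessed by this sketch.

```python
import re

# Rule (f): a run of four or more of these residues is a violation.
HYDROPHOBIC_RUN = re.compile(r"[MILVFYW]{4}")


def check_export_rules(protein):
    """Return a dict mapping rule letter to True (satisfied) or False."""
    seq = protein.upper()
    if len(seq) < 12:
        raise ValueError("need at least 12 residues to evaluate the rules")
    head = seq[:50]  # N-terminal 50 residues (positions 1-50)
    return {
        # (a) I, L, V, A, or P at position 3 or 4, but not both
        "a": (seq[2] in "ILVAP") != (seq[3] in "ILVAP"),
        # (b) position 5 is not M, I, L, F, Y, or W
        "b": seq[4] not in "MILFYW",
        # (c) no D or E in positions 1-12
        "c": not any(aa in "DE" for aa in seq[:12]),
        # (d) not more than one C in positions 5-50 (inclusive, assumed)
        "d": seq[4:50].count("C") <= 1,
        # (e) countable part only: at least 7 of S, T, Q in the first 50
        "e": sum(head.count(aa) for aa in "STQ") >= 7,
        # (f) no more than three consecutive residues from MILVFYW
        "f": HYDROPHOBIC_RUN.search(head) is None,
    }


def satisfies(protein, rules="abcdef"):
    """True if every rule named in `rules` is satisfied."""
    results = check_export_rules(protein)
    return all(results[r] for r in rules)
```

Calling `satisfies(sequence, "ace")` would test the subset of rules (a), (c), and (e), while the default tests all six.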

Example 2 - Global Analysis of the P. s. tomato DC3000 Genome for ORFs Predicted to Be Secreted by the Hrp System

An algorithm based on the export signal rules permitted a computer-based search for candidate Hrp-secreted proteins. The DC3000 genome was searched in all six reading frames for ORFs (at least 150 aa in length) with N termini (starting with Met) that satisfied the export rules. The large number of contigs and ambiguous nucleotides in the DC3000 draft sequence precluded an exhaustive search. The search process was based entirely on the direct translation of the contiguous sequences of unambiguous nucleotide codes, and no attempt was made to restrict the search to ORFs as defined by various gene-finding packages. This genomewide search yielded 400 hits that were manually screened for redundancy with known effectors and for validity as ORFs by using the Artemis genome viewer, based on the presence of BLASTP hits (Altschul et al., Nucleic Acids Res. 25:3389-3402 (1997), which is hereby incorporated by reference in its entirety), Glimmer 2.0 ORF calls (Delcher et al., Nucleic Acids Res. 27:4636-4641 (1999), which is hereby incorporated by reference in its entirety), ribosome binding sites, and transcription termination sequences (Ermolaeva et al., J. Mol. Biol. 301:27-33 (2000), which is hereby incorporated by reference in its entirety). The resulting 129 apparently valid, additional ORFs were then analyzed for features characteristic of known effectors in pathogenicity islands (or islets), such as atypical G+C% content and presence in the same region of known virulence factors, Hrp promoters, or mobile genetic elements.
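The six-frame translation step described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the search program actually used: the name `six_frame_orfs` is hypothetical, the standard genetic code is assumed, any codon containing an ambiguous nucleotide is treated as a break in the translation, and only the first Met of each uninterrupted stretch is reported as a start.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, offset):
    """Translate one frame; codons with ambiguous bases become 'X'."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_orfs(dna, min_length=150):
    """Yield (strand, frame offset, protein) for Met-initiated ORFs.

    Each frame is split at stop codons ('*') and ambiguous codons ('X');
    a stretch is reported from its first Met if at least min_length
    residues remain (assumed simplification).
    """
    dna = dna.upper()
    strands = (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1]))
    for strand, seq in strands:
        for offset in range(3):
            protein = translate(seq, offset)
            for segment in re.split(r"[*X]", protein):
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_length:
                    yield strand, offset, segment[start:]
```

Each contig of a draft sequence would be passed separately, and the proteins yielded would then be tested against the export signal rules.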

The pool of 129 ORFs was reduced to 32 effector candidates that shared several characteristics of Hrp-secreted proteins. The nucleotide sequence of all 32 ORFs is published in the supporting materials to Petnicki-Ocwieja et al., Proc. Natl. Acad. Sci. USA 99:7652-7657 (2002), as published on the Internet site for PNAS, both of which are hereby incorporated by reference in their entirety. An abbreviated list of the six most interesting ORFs based on BLASTP hits to other virulence genes is provided along with other relevant features in Table 1 below.

Table 1. Selected ORFs encoding candidate effector proteins that were identified by the genomewide search based on export-signal patterns

| Designation | New designation† | bp | G+C (%) | Hrp promoters within 10 kb‡ / Mobile DNA within 10 kb§ | Homolog (BLASTP E value) | GenBank accession no. |
|---|---|---|---|---|---|---|
| ORF29 | HopPtoL | 2700 | 61.0 | n / n | SPI-2 regulated SrfC (1e-21) | AAF74575 |
| ORF30¶ | HopPtoS2 | 795 | 46.5 | y / n | Clostridium exoenzyme C3 ADP-ribosyltransferase (1e-5); 20.5% identical to HopPtoS1‖ | NP_346979 |
| ORF31¶ | NA | 897 | 49.8 | y | Chicken ADP-ribosyltransferase (5e-3); also 71.7% identical to HopPtoS1‖ | P55807 |
| ORF32¶ | NA | 507 | 54.2 | y | Chicken ADP-ribosyltransferase (5e-3); also 51.3% identical to HopPtoS1‖ | P55807 |
| ORF33 | NA | 2823 | 55.2 | y | SepC insecticidal toxin (1e-128) | NP_065279 |
| ORF34 | NA | 534 | 63.5 | n | Lytic enzyme (3e-36) | BAA83137 |

NA, not available; n, no; y, yes.

† If the protein was determined to be Hrp-secreted by either secretion or translocation assays, the protein was given a Hop name.

‡ Indicates that the ORF is within 10 kb of a HrpL-responsive Hrp promoter identified in Fouts et al.

§ Indicates that a transposon, plasmid, or a phage-related sequence is within 10 kb.

¶ ORF was determined to possess an ART domain (pfam1129), further confirming its similarity to ADP-ribosyltransferases.

‖ Sequence comparisons were carried out with EMBOSS software.

Interestingly, this search found a homolog of a putative effector, SrfC, that is predicted to travel the type III pathway encoded by SPI2 of S. enterica (Worley et al., Mol. Microbiol. 36:749-761 (2000), which is hereby incorporated by reference in its entirety). A further indicator of the efficacy of the search was the finding of three additional ADP-ribosyltransferases, ORFs 30, 31, and 32, all with significant amino acid sequence identity to HopPtoS1 (Table 1).

Example 3 - Confirmation That Two ORFs Identified in the Export Signal- Based Search Encode Hrp-Secreted Proteins

To determine whether the genomewide search had identified any novel Hrp-secreted proteins, secretion assays were performed on two ORFs, 29 and 30, which seemed to be particularly promising candidates. As noted above, the products encoded by ORFs 29 and 30 share similarity with a putative type III effector from S. enterica, SrfC, and ADP-ribosyltransferases, respectively. Both ORFs were PCR-cloned into a broad-host-range vector fused to the FLAG epitope, and each construct was introduced into DC3000 wild-type and Hrp mutant strains. The epitope-tagged ORF29 and ORF30 proteins were secreted by DC3000 in a Hrp-dependent manner without leakage of a cytoplasmic marker protein, and consequently they were designated as HopPtoL and HopPtoS2, respectively.

Discussion of Examples 1-3

The export signal rules were developed from an enlarged set of Hrp-secreted P. syringae proteins to identify common characteristics in the first 50 aa of these proteins. These characteristics have permitted genomewide identification of novel proteins predicted to travel the Hrp pathway in P. s. tomato DC3000. Two of these ORFs were then tested and both were found to be Hrp-secreted. Several DC3000 proteins with homology to Avr proteins in other P. syringae strains were demonstrated to be secreted in a Hrp-dependent manner, and these were consequently designated as Hops. The iterative process of sequence pattern-based prediction and experimental testing pursued has yielded 22 confirmed Hrp-secreted proteins and an orderly process for eventual completion of the inventory of effector proteins. The export signal rules also permitted us to predict which of the two ORFs in the AvrPphF locus is the effector. This locus was previously described in P. s. phaseolicola (Tsiamis et al., EMBO J. 19:3204-3214 (2000), which is hereby incorporated by reference in its entirety) and is also present in DC3000. It was also found that AvrPphF locus ORF1 violates all of the rules, whereas AvrPphF locus ORF2 violates none (Figure 3C). Furthermore, ORF1 shares many of the general characteristics of type III chaperones (Cornelis & van Gijsegem, Annu. Rev. Microbiol. 54:734-774 (2000), which is hereby incorporated by reference in its entirety). Secretion of the ORF1 product was not detected in secretion assays, and only ORF2 was shown to be translocated into plant cells on the basis of its delivery of an AvrRpt2 reporter. Another demonstration of the selectivity of the export signal rules is that only the chicken ADP-ribosyltransferase NRT2_CHICK shows major violations of the rules even though this protein is more similar to HopPtoS1 and S2 than either of the type III-secreted ADP-ribosyltransferases from P. aeruginosa, ExoS and ExoT (Figure 3C).

The observation that there is no overall difference in the N-terminal residue patterns of effectors that are chaperone-associated or chaperone-independent suggests that entry into the pathway is the same for nascent and preformed effectors. Also, no significant difference was observed in the N-terminal residue patterns between effectors and accessory secretion factors such as the HrpA pilus subunit and the harpin-like proteins. Perhaps the simplest method for sorting proteins to be secreted from those to be injected is by timing, with those proteins entering the pathway before the Hrp pilus has connected with host cells being preferentially released to the milieu. There is presently little knowledge of how effector proteins (particularly those lacking chaperones) are targeted for entry into type III secretion pathways and what component(s) of the secretion machinery serve as gatekeepers. Recently, Lloyd et al. (Mol. Microbiol. 43:51-59 (2002), which is hereby incorporated by reference in its entirety) have elegantly demonstrated the importance of amphipathicity in the first 8 aa in type III secretion signals. The analysis presented here suggests that positional effects in the first few amino acids of the export signal are also important. The patterns observed herein suggest that solvent-exposed amino acids in the N terminus function as a "key" that is engaged by a receptor "lock" in the Hrp machinery. The key-way in the lock is likely to have a net negative charge (as suggested by the lack of acidic amino acids in the first 12 residues of Hops) and appears to recognize a specific pattern in the first five residues. This pattern occurs in almost all Hops. The subsequent residues of Hops (positions 6-50 or beyond) have the general property of amphipathicity (which seems to be the universal characteristic of type III effector proteins) without any positional specificity. Unlike the Salmonella-translocated effectors secreted by SPI2 (Miao & Miller, Proc. Natl. Acad. Sci. USA 97:7539-7544 (2000), which is hereby incorporated by reference in its entirety), many of the P. syringae Hops do not appear to be homologs. Thus, the pattern in the first five residues likely represents convergent evolution to fit a Hrp system receptor.

To determine whether the algorithm used to search for export-associated patterns would be useful in identifying type III-secreted proteins in other pathogens that use type III secretion systems, the genomes of P. aeruginosa PAO1 (available from the known Internet site) and R. solanacearum GMI1000 (available from the known Internet site) were searched. Whereas the algorithm yielded 129 ORFs in DC3000, it identified 54 and 73 ORFs in P. aeruginosa and R. solanacearum, respectively. Several, but not all, known type III effector genes were identified in these organisms. For example, the type III effectors ExoS and ExoT both were identified in P. aeruginosa, as well as several proteins secreted by the flagellar type III system. In R. solanacearum, the algorithm identified the P. syringae Avr homologs AvrPphD, AvrA, and AvrPpiC2, the Ralstonia PopA harpin-like protein, a hypothetical protein that is similar to PopC, and, interestingly, HrpV, a protein encoded within the hrp/hrc cluster, whose function is unknown.

The genomes of two nonpathogens, E. coli K12 MG1655 (available from the known Internet site) and Bacillus subtilis 168 (available from the known Internet site), were also searched, and the algorithm identified 54 and 40 ORFs, respectively. These latter bacteria do not have type III secretion systems other than the flagellar system, and it seems unlikely that all of these ORFs represent secreted proteins. Thus, genomewide searches with the current algorithm yield a collection of ORFs that is only enriched in type III-secreted proteins. However, as demonstrated, winnowing of this collection using other characteristics associated with effector genes, such as signatures of horizontal acquisition, can efficiently yield a subset (independent of 5' Hrp promoter sequences) that can be systematically tested for secretion. This process yields DC3000 effector genes unlikely to be found by other means, as demonstrated herein.
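One signature of horizontal acquisition mentioned in Example 2 is atypical G+C content, which is simple to compute. The following Python sketch is illustrative only: the names `gc_percent` and `has_atypical_gc` are hypothetical, and the 5-percentage-point deviation threshold is an arbitrary assumption, since no cutoff is stated in the text.

```python
def gc_percent(dna):
    """Percentage of G and C among the unambiguous bases of a sequence."""
    counted = [base for base in dna.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(base in "GC" for base in counted) / len(counted)


def has_atypical_gc(orf_dna, genome_gc, max_deviation=5.0):
    """Flag an ORF whose G+C% differs from the genome-wide value.

    genome_gc is the genome-wide G+C percentage supplied by the caller;
    max_deviation (percentage points) is an assumed, adjustable cutoff.
    """
    return abs(gc_percent(orf_dna) - genome_gc) > max_deviation
```

An ORF flagged in this way would be only one line of evidence, to be weighed together with proximity to Hrp promoters or mobile genetic elements before secretion testing.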

HopPtoS1 and HopPtoS2 share sequence similarity with ADP-ribosyltransferases, proteins that have long been implicated in bacterial pathogenesis in animals through the modification of host signal transduction pathways (Finlay & Falkow, Microbiol. Mol. Biol. Rev. 61:136-169 (1997), which is hereby incorporated by reference in its entirety), but until now have not been implicated in the bacterial pathogenesis of plants. The DC3000 genomic studies described in an earlier paper clearly show that several of the effectors in DC3000 are redundant (Fouts et al., Proc. Natl. Acad. Sci. USA 99:2275-2280 (2002), which is hereby incorporated by reference in its entirety). By using the pattern-based export prediction described herein, three ADP-ribosyltransferase genes (in addition to hopPtoS1) were identified in the genome of DC3000 that have putative N-terminal export signals. One of these, ORF32, may not be a functional gene because the ORF is truncated. The other two, HopPtoS2 and ORF31, are full-length genes based on sequence alignments.

HopPtoS2 is secreted by the Hrp system and ORF31 shares high amino acid sequence identity with the Hrp-secreted HopPtoS1. Interestingly, HopPtoS1 contains putative myristoylation and palmitoylation sites at its N terminus (as does AvrPphF ORF2; Figure 3C), whereas the other two do not, indicating that HopPtoS1 may be localized to the plasma membrane. Thus, there appear to be at least three Hrp-secreted ADP-ribosyltransferases and these may localize to different regions of the plant cell. The existence of these proteins in P. syringae is particularly noteworthy given that ADP-ribosyltransferase genes have not been identified in the bacterial plant pathogen genomes that have been published thus far (Simpson et al., Nature 406:151-159 (2000); Wood et al., Science 294:2317-2323 (2001); Goodner et al., Science 294:2323-2328 (2001); Salanoubat et al., Nature 415:497-502 (2002), each of which is hereby incorporated by reference in its entirety). Significantly, the genomewide search for export signals yielded a homolog of the S. enterica candidate effector SrfC, further adding to the growing list of effectors shared between plant and animal pathogens. It is also noteworthy that one of the 32 ORFs found by the genomewide search (ORF48) is a homolog of a bacterial catalase (BLASTP 1e-126), and another (ORF49) is a glucokinase homolog (BLASTP 3e-42). These putative effectors could have a role in oxidative stress and regulation of sugar metabolism, respectively.

Although the invention has been described in detail for the purpose of illustration, it is understood that such detail is solely for that purpose, and variations can be made therein by those skilled in the art without departing from the spirit and scope of the invention which is defined by the following claims.

Claims

What is Claimed:

1. A method of identifying putative effector proteins comprising: providing a predicted amino acid sequence; and determining whether the predicted amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences, wherein satisfaction of the one or more rules indicates that the predicted amino acid sequence is likely an effector protein.

2. The method according to claim 1, wherein the set of rules for known effector protein amino acid sequences comprises two or more rules selected from the group consisting of Rules (a)-(f):

Rule (a): I, L, V, A, or P are found in positions 3 or 4, but not in both;

Rule (b): Position 5 is not occupied by a M, I, L, F, Y, or W;

Rule (c): No D or E residue appears in positions 1-12;

Rule (d): Not more than one C appears between positions 5 and 50;

Rule (e): The first 50 residues are amphipathic and rich in polar residues, containing 7 or more of the group S, T, and Q;

Rule (f): No more than three consecutive residues from the group of M, I, L, V, F, Y, and W occur in the first 50 residues.

3. The method according to claim 2, wherein the set of rules for known effector protein amino acid sequences comprises (a), (c), and (e).

4. The method according to claim 2, wherein the set of rules for known effector protein amino acid sequences comprises (a) - (e).

5. The method according to claim 2, wherein the set of rules for known effector protein amino acid sequences comprises (a) - (f).

6. The method according to claim 1 further comprising: identifying genomic DNA, containing a start codon and at least 150 consecutive codons without a stop codon, from the genome of an organism; and predicting the amino acid sequence encoded by the identified genomic DNA.

7. The method according to claim 1 further comprising: determining whether the identified genomic DNA is downstream of a HrpL-dependent promoter region, whether the identified genomic DNA is associated with a transposon, whether the identified genomic DNA has a low GC content, or combinations thereof.

8. The method according to claim 1 further comprising: assessing whether the protein or polypeptide encoded by the identified genomic DNA is secreted by a host cell comprising a type III secretion system.

9. The method according to claim 8, wherein said assessing comprises: transforming a host cell comprising a type III secretion system with a DNA construct comprising promoter and transcription termination sequences operably coupled to identified genomic DNA; growing the transformed host cell in a suitable media under conditions effective to express the protein encoded by the identified genomic DNA; and evaluating whether the host cell secretes the protein into the media.

10. The method according to claim 8, wherein the host cell is an Escherichia coli cell recombinant for the type III secretion system.

11. The method according to claim 8, wherein the host cell possesses a native type III secretion system.

12. The method according to claim 1, wherein the organism is a species of Pseudomonas, Erwinia, or Xanthomonas.

13. The method according to claim 1, wherein the organism is a species of Yersinia, Ralstonia, Salmonella, Shigella, pathogenic E. coli, or Helicobacter.

14. A system comprising: a determination system that determines whether a provided amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences.

15. The system according to claim 14, wherein the set of rules comprise a combination of two or more rules selected from the group consisting of Rules (a)-(f):

Rule (a): I, L, V, A, or P are found in positions 3 or 4, but not in both;

Rule (b): Position 5 is not occupied by a M, I, L, F, Y, or W;

Rule (c): No D or E residue appears in positions 1-12;

Rule (d): Not more than one C appears between positions 5 and 50;

Rule (e): The first 50 residues are amphipathic and rich in polar residues, containing 7 or more of the group S, T, and Q;

Rule (f): No more than three consecutive residues from the group of M, I, L, V, F, Y, and W occur in the first 50 residues.

16. The system according to claim 15, wherein the set of rules for known effector protein amino acid sequences comprises (a), (c), and (e).

17. The system according to claim 15, wherein the set of rules for known effector protein amino acid sequences comprises (a) - (e).

18. The system according to claim 15, wherein the set of rules for known effector protein amino acid sequences comprises (a) - (f).

19. A computer readable medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the step of: determining whether a provided amino acid sequence satisfies one or more rules from a set of rules that applies to known effector protein amino acid sequences.

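20. The computer readable medium according to claim 19, wherein the set of rules comprise a combination of two or more of Rules (a)-(f):

Rule (a): I, L, V, A, or P are found in positions 3 or 4, but not in both;

Rule (b): Position 5 is not occupied by a M, I, L, F, Y, or W;

Rule (c): No D or E residue appears in positions 1-12;

Rule (d): Not more than one C appears between positions 5 and 50;

Rule (e): The first 50 residues are amphipathic and rich in polar residues, containing 7 or more of the group S, T, and Q;

Rule (f): No more than three consecutive residues from the group of M, I, L, V, F, Y, and W occur in the first 50 residues.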

21. The computer readable medium according to claim 20, wherein the set of rules comprise rules (a), (c), and (e).

22. The computer readable medium according to claim 20, wherein the set of rules comprise rules (a)-(e).

23. The computer readable medium according to claim 20, wherein the set of rules comprise rules (a)-(f).