WO2005042698A2

WO2005042698A2 - T cell epitopes useful in a hepatitis c virus vaccine and as diagnostic tools and methods for identifying same

Info

Publication number: WO2005042698A2
Application number: PCT/US2004/033942
Authority: WO
Inventors: Ole Lund; Claus Lundegaard; Morten Nielsen; Peder Worning; Robert J. Deans; Søren Buus; Søren BRUNAK
Original assignee: Pecos Labs, Inc.
Priority date: 2003-10-23
Filing date: 2004-10-15
Publication date: 2005-05-12
Also published as: WO2005042698A3

Abstract

In a first aspect the present invention consists of T cell epitopes in the Hepatitis C virus genome. More specifically, there are provided 429 linear peptide epitopes from the Hepatitis C virus genome. In a second aspect the invention also consists of variants of these sequences. In a third aspect this invention consists of a method to predict these sequences from genome data and to validate the predictions experimentally. In a fourth aspect the present invention provides compositions including these epitopes for use in a vaccine and to induce a T cell response in a subject, or as a diagnostic tool. Figure 2 shows a plot of the measured versus the predicted binding affinity score for four different prediction methods: (a) Rammensee matrix method, (b) hidden Markov model trained on sequences in the Rammensee dataset, (c) neural network trained with sparse sequence encoding, and (d) the Comb-II neural network method.

Description

APPLICATION FOR A PATENT

Inventors:

Ole Lund, Præstøgade 5b II tv., DK-2100 Kbh. Ø., Denmark.

Claus Lundegaard, Espegaardsvej 26A, DK-2880 Bagsværd, Denmark.

Morten Nielsen, Bogøvej 6, 3. th., DK-2000 Frederiksberg, Denmark.

Peder Worning, Nørrebrogade 30, 1. th, DK-2200 København N, Denmark.

Robert J. Deans, 415 Furman Drive, Claremont CA 91711, USA.

Søren Buus, Astilbehaven 160, 2830 Virum, Denmark.

Søren Brunak, Gammel Vartov vej 22, DK-2900 Hellerup, Denmark.

TITLE OF APPLICATION:

T cell epitopes useful in a Hepatitis C virus vaccine and as diagnostic tools and methods for identifying same

SUMMARY OF THE INVENTION

In a first aspect the present invention consists of T cell epitopes in the Hepatitis C virus genome. More specifically, there are provided 429 linear peptide epitopes from the Hepatitis C virus genome. In a second aspect the invention also consists of variants of these sequences. In a third aspect this invention consists of a method to predict these sequences from genome data and to validate the predictions experimentally. In a fourth aspect the present invention provides compositions including these epitopes for use in a vaccine and to induce a T cell response in a subject, or as a diagnostic tool. Figure 2 shows a plot of the measured versus the predicted binding affinity score for four different prediction methods: (a) Rammensee matrix method, (b) hidden Markov model trained on sequences in the Rammensee dataset, (c) neural network trained with sparse sequence encoding, and (d) the Comb-II neural network method.

FIELD OF THE INVENTION

The present invention relates to the identification of T-cell epitopes within Hepatitis C Virus, for diagnostic use as well as for use in prophylactic and therapeutic vaccines. The present invention also relates to subunit vaccines.

BACKGROUND OF THE INVENTION

Hepatitis C virus

Hepatitis C is a disease of the liver caused by the hepatitis C virus (HCV). Most chronically infected persons may not be aware of their infection because they are initially asymptomatic. Infected persons serve as a source of transmission to others. Chronic liver disease is the tenth leading cause of death among adults in the United States, and accounts for approximately 25,000 deaths annually, or approximately 1% of all deaths. Population-based studies indicate that 40% of chronic liver disease is HCV-related, resulting in an estimated 8,000-10,000 deaths each year. Current estimates of medical and work-loss costs of HCV-related acute and chronic liver disease are greater than $600 million annually. After acute infection, 15%-25% of persons appear to resolve their infection (http://www.cdc.gov/mmwr/preview/mmwrhtml/00055154.htm). No vaccine is currently available to prevent hepatitis C, and treatment for chronic hepatitis C is too costly for most persons in developing countries to afford (http://www.who.int/inf-fs/en/fact164.html). The high mutation frequency of the viral genome allows escape from immune clearance and makes the infection refractory to drug treatment. The lack of in vitro culture methodology and of small animal models of infection has prevented development of an effective vaccine.

Prediction of epitopes

We describe a novel neural network method to predict T-cell epitopes. A novel representation of the input has been developed consisting of a combination of sparse encoding, Blosum encoding and hidden Markov models. We measure the performance of the method by use of both the Pearson correlation coefficient and sensitivity/specificity plots, and make a detailed comparison of its performance to that of a series of other methods. We demonstrate that the present method has a performance that is substantially higher than that of the other methods, especially for the high affinity binders. The method is derived from data consisting of peptides with associated measured binding affinity to the HLA class I molecule A204. By use of mutual information calculations we show that peptides that bind to the HLA-A2 molecule display a signal of higher order sequence correlations. Neural networks are able to integrate such correlations when predicting a binding affinity for a peptide sequence. It is this feature, combined with the ability of the neural network to train on data with continuous binding affinities and the use of novel sequence encoding strategies, that gives the presented method an edge over the other methods. Finally we use the method to predict T-cell epitopes for the genome of Hepatitis C virus and discuss possible applications of the prediction method to guide the process of rational vaccine design.

Introduction

The hallmark of the immune system is its ability to recognize and distinguish between self (friend) and non-self (enemy). The T cells do this by recognizing peptides that are bound to Major Histocompatibility Complex (MHC) molecules. A number of methods for predicting the binding of peptides to MHC molecules have been developed (reviewed by Schirle M, Weinschenk T, Stevanovic S. Combining computer algorithms with experimental approaches permits the rapid and accurate identification of T cell epitopes from defined antigens. J Immunol Methods. 2001 Nov 1; 257(1-2): 1-16) since the first motif methods were presented (Rothbard JB, Taylor WR. A sequence pattern common to T cell epitopes. EMBO J. 1988 Jan; 7(1): 93-100; Sette A, Buus S, Appella E, Smith JA, Chesnut R, Miles C, Colon SM, Grey HM. Prediction of major histocompatibility complex binding regions of protein antigens by sequence pattern analysis. Proc Natl Acad Sci U S A. 1989 May; 86(9): 3296-300). The discovery of allele-specific motifs (Falk K, Rötzschke O, Stevanovic S, Jung G, Rammensee HG. Allele-specific motifs revealed by sequencing of self-peptides eluted from MHC molecules. Nature. 1991 May 23; 351(6324): 290-6) led to the development of more accurate algorithms (Pamer EG, Davis CE, So M. Expression and deletion analysis of the Trypanosoma brucei rhodesiense cysteine protease in Escherichia coli. Infect Immun. 1991 Mar; 59(3): 1074-8; Rötzschke O, Falk K, Stevanovic S, Jung G, Walden P, Rammensee HG. Exact prediction of a natural T cell epitope. Eur J Immunol. 1991 Nov; 21(11): 2891-4). In the simple types of prediction tools it is assumed that the amino acids at each position along the peptide sequence each contribute a given binding energy, and that these contributions can be added up to yield the overall binding energy of the peptide (Meister GE, Roberts CG, Berzofsky JA, De Groot AS. Two novel T cell epitope prediction algorithms based on MHC-binding motifs; comparison of predicted and published epitopes from Mycobacterium tuberculosis and HIV protein sequences. Vaccine. 1995 Apr; 13(6): 581-91). Similar types of approaches are used by the EpiMatrix method (Schafer JR, Jesdale BM, George JA, Kouttab NM, De Groot AS. Prediction of well-conserved HIV-1 ligands using a matrix-based algorithm, EpiMatrix. Vaccine. 1998 Nov; 16(19): 1880-4), the BIMAS method (Parker KC, Bednarek MA, Coligan JE. Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J Immunol. 1994 Jan 1; 152(1): 163-75) and the SYFPEITHI method (Rammensee HG, Friede T, Stevanovic S. MHC ligands and peptide motifs: first listing. Immunogenetics. 1995; 41(4): 178-228). These predictions, however, fail to recognize correlated effects: the effect on the binding affinity of a given amino acid at one position can be influenced by which amino acids are present at other positions in the peptide. Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule. Artificial neural networks (ANN) are ideally suited to take such correlations into account, and neural network methods for predicting whether or not a peptide binds an MHC molecule have earlier been developed (Brusic V, Rudy G, Harrison, LC, Prediction of MHC binding peptides using artificial neural networks, in: Complex systems: mechanism of adaptation. Eds. Stonier, RJ and Yu, XS, 10S, 1994, 253-260).

Brusic et al. use a sparse (orthogonal) encoding of a 20 amino acid alphabet and reduced alphabets with 9 and 6 letters (Brusic V, Rudy G, Harrison, LC, Prediction of MHC binding peptides using artificial neural networks, in: Complex systems: mechanism of adaptation. Eds. Stonier, RJ and Yu, XS, 10S, 1994, 253-260). The conventional sparse encoding of the amino acids ignores their chemical similarities. Here we use a combination of several sequence encoding strategies in order to take these similarities into account explicitly. The different encoding schemes are defined in terms of Blosum matrices and hidden Markov models in addition to the conventional sparse encoding.

More detailed predictions of peptide binding have been made by dividing binding affinities into classes of affinity ranges, and by inverting the networks it was found that the different classes are associated with different binding sequence motifs (Adams HP, Koziol JA. Prediction of binding to MHC class I molecules. J Immunol Methods. 1995 Sep 25; 185(2): 181-90). Neural networks have also been trained to predict MHC binding using different affinity thresholds (Gulukota K, Sidney J, Sette A, DeLisi C. Two complementary methods for predicting peptides binding major histocompatibility complex molecules. J Mol Biol. 1997 Apr 18; 267(5): 1258-67). Mamitsuka trained the transition and emission probabilities of a fully connected hidden Markov model using a steepest descent algorithm so as to minimize the differences between the predicted and target probabilities for each peptide (Mamitsuka H. Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov models. Proteins. 1998 Dec 1; 33(4): 460-74). Using this method, better results were obtained than with neural networks or hidden Markov models. We have earlier developed matrix methods (Lauemøller SL, Holm A, Hilden J, Brunak S, Holst Nissen M, Stryhn A, Østergaard Pedersen L, Buus S. Quantitative predictions of peptide binding to MHC class I molecules using specificity matrices and anchor-stratified calibrations. Tissue Antigens. 2001 57: 405-14) and ANNs which are special in that they give quantitative (continuous) values for binding affinities between peptides and the human MHC molecule HLA-A2 (Buus et al., Submitted). Our improved method combines the neural network approach with a hidden Markov model encoding to achieve more accurate prediction of the peptide/MHC binding affinity.

SUMMARY OF THE INVENTION

In a first aspect the present invention consists of T cell epitopes in Hepatitis C Virus (HCV). More specifically, there are provided 197 linear peptide epitopes from the HCV genome. In a second aspect the invention also consists of variants of these sequences. In a third aspect this invention consists of a method to predict these sequences from genome data and to validate the predictions experimentally. In a fourth aspect the present invention provides compositions including these epitopes for use in a vaccine and to induce a T cell response in a subject, or as a diagnostic tool.

BRIEF DESCRIPTION OF THE TABLES AND FIGURES

         Sparse                         Blosum 50
# Set    # Hid   # Bin   Performance    # Hid   # Bin   Performance
1        5       3       0.920          4       1       0.920
2        6       2       0.879          6       1       0.908
3        3       2       0.880          10      2       0.900
4        10      5       0.882          3       2       0.889
5        8       2       0.862          2       3       0.937

Table 1. Test performance for five fold cross validation test/training using sparse and Blosum 50 encoding, respectively. The 2nd and 5th columns (# Hid) give the number of neurons in the hidden layer, and the 3rd and 6th columns (# Bin) give the number of bins in the balanced training for the networks with the optimal test performance for sparse and Blosum 50 sequence encoding, respectively.

Method
Rammensee    0.761 +/- 0.016    0.296 +/- 0.073    0.066 +/- 0.116
HMM          0.804 +/- 0.014    0.332 +/- 0.061    0.142 +/- 0.096
NN_Sparse    0.877 +/- 0.011    0.438 +/- 0.065    0.345 +/- 0.090
NN_Bl50      0.899 +/- 0.010    0.498 +/- 0.064    0.382 +/- 0.099
Comb-I       0.906 +/- 0.009    0.508 +/- 0.063    0.392 +/- 0.092
Comb-II      0.912 +/- 0.009    0.508 +/- 0.054    0.420 +/- 0.080

Table 2. The Pearson correlation coefficient between the predicted score and the measured binding affinity for the 528 peptides in the Buus dataset. The six methods in the table are Rammensee: Score matrix method by HG. Rammensee, HMM: Hidden Markov model trained on sequence data in the Rammensee dataset, NN_Sparse: Neural network with sparse sequence encoding, NN_Bl50: Neural network with Blosum 50 sequence encoding, Comb-I: Combination of neural network with sparse and Blosum 50 sequence encoding, and Comb-II: Combination of neural network with sparse and Blosum 50 encoding, respectively, integrated with hidden Markov model input. The numbers given in the table are calculated using the Bootstrap method (Press WH, Flannery, BP, Teukolsky, SA, Vetterling, WT. Numerical recipes. Cambridge University Press, 1989) with 500 dataset realizations. The correlation values are estimated as average values over the 500 dataset realizations and the error-bars as the associated standard deviations.

HCV 1 9.103 3002 54 177

Table 3. Epitope predictions for the HCV genome calculated using the combined neural network method TEPIPRED. The 2nd column gives the number of proteins, the 3rd column the number of basepairs, the 4th column the total number of possible 9mer peptides, the 5th column the number of predicted high binders, the 6th column the number of intermediate binders and the 7th column the % of intermediate binders with respect to the total number of peptides found in the genome of HCV.

Figure 1. Mutual information matrices calculated for two different datasets. The left panel shows the mutual information matrix calculated for a data set consisting of 313 peptides defined as the Rammensee dataset combined with peptides from the Buus dataset with a binding affinity stronger than 500 nM. The right panel shows the mutual information matrix calculated for a set of 313 random peptides extracted from the Mycobacterium tuberculosis genome. In the upper row the mutual information plot is calculated using the conventional 20-letter amino acid alphabet. In the lower row the calculation is done using a 6-letter amino acid alphabet defined in the text.

Figure 2. Scatter plot of the measured versus the predicted binding affinity score for the 528 peptides in the Buus dataset. The figure shows the performance for four different prediction methods. The insert to each figure shows an enlargement of the part of the plot that corresponds to a binding affinity stronger than 500 nM. (a) Rammensee matrix method, (b) Hidden Markov model trained on sequences in the Rammensee dataset, (c) Neural network trained with sparse sequence encoding, and (d) The Comb-II neural network method.

Figure 3. Sensitivity-specificity plot for a measured binding affinity of 500 nM for a series of linear combinations of the two neural network methods corresponding to Blosum 50 and sparse sequence encoding, respectively. The curves were calculated by use of the Bootstrap method (Press WH, Flannery, BP, Teukolsky, SA, Vetterling, WT. Numerical recipes. Cambridge University Press, 1989) using 500 data set realizations. (a) 428 peptides in the test/train dataset, (b) 100 peptides in the evaluation set. The optimal performance is determined as the curve that lies closest to the upper right corner of the graph. In the upper graph we determine the optimal performance to be the orange curve, corresponding to a combination of the two neural network methods with 70% weight on the Blosum 50 encoded prediction and 30% weight on the sparse encoded prediction. This set of weights also results in close to optimal performance in the lower graph. The inserts to the graphs show an enlargement of the high-specificity part of the curves excluding error-bars.

Figure 4. Sensitivity-specificity curves calculated from the 528-peptide data set. Six methods are shown in the graphs: Rammensee: Matrix method by Rammensee (Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics. 1999 Nov; 50(3-4): 213-9), HMM: Hidden Markov Model trained on data from the Rammensee database, SEQ: Neural network with sparse sequence encoding, Bl50: neural network with Blosum 50 sequence encoding, Comb-I: combination of neural network with Blosum 50 and sparse encoding, respectively, with 70% weight on the Blosum prediction, and Comb-II: combination of neural network with hidden Markov model input with Blosum 50 and sparse encoding, respectively. The upper graph (a) shows the curves for a measured affinity threshold of 50 nM. The lower graph (b) shows the curves corresponding to a measured affinity threshold of 500 nM. The sensitivity-specificity curves were calculated as described in Fig. 3 using 528 dataset realizations. The inserts to the graphs show an enlargement of the high-specificity part of the curves excluding the error-bars.

BEST METHOD OF CARRYING OUT THE INVENTION

The following examples describe the best methods for carrying out the invention.

EXAMPLE 1: PREDICTION OF EPITOPES

Materials and Methods

Two sets of data were used to derive the prediction method. One set was used to train and test the neural networks, and consists of 528 nine-mer peptides for which the binding affinity to the HLA class I molecule A204 has been measured by the method described by Buus et al. (Buus S, Stryhn A, Winther K, Kirkby N, Pedersen LO. Receptor-ligand interactions measured by an improved spun column chromatography technique. A high efficiency and high throughput size separation method. Biochim Biophys Acta. 1995 Apr 13; 1243(3): 453-60, NOTE also newer method.). The data set consists of 76 peptides with a binding affinity stronger than 50 nM, 144 with a binding affinity stronger than 500 nM (a count that includes the 76 peptides stronger than 50 nM), 159 with a binding affinity between 500 and 50000 nM, and 225 peptides with a binding affinity weaker than 50000 nM. This data set is hereafter referred to as the "Buus data set". The second data set was used to train the hidden Markov model. This dataset was constructed from sequences downloaded from the Syfpeithi database (http://syfpeithi.bmi-heidelberg.com; Rammensee H, et al. (1995), Rammensee H, et al. (1999)). All sequences from the database were downloaded and clustered into the nine super-types (A1, A2, A3, A24, B7, B27, B44, B58, and B62) and 3 outlier types (A29, B8 and B46) described by Sette and Sidney, 1999 (Sette A, Sidney J. The nine major HLA class I super-types account for the vast preponderance of HLA-A and -B polymorphism. Immunogenetics. 1999 Nov; 50(3-4): 201-12). The sequences in the A2 super-type cluster were aligned manually and trimmed into 211 unique nine amino acid long peptides. This dataset is hereafter referred to as the "Rammensee dataset".

Mutual information

One important difference between linear prediction methods like un-gapped hidden Markov models and non-linear prediction methods like neural networks with hidden layers is their capability to integrate higher order sequence correlations into the prediction score. A measure of the degree of higher order sequence correlations in a set of aligned amino acid sequences can be obtained by calculating the mutual information matrix. For the case of peptide nine-mers, this is a 9x9 matrix where each matrix element is calculated using the formula

M_{ij} = \sum_{a,b} P_{ij}(ab) \log \frac{P_{ij}(ab)}{P_i(a) P_j(b)}

Here the summation is over the 20 letters in the conventional amino acid alphabet and i, j refer to positions in the peptide. P_{ij}(ab) is the probability of mutually finding the amino acid a at position i and amino acid b at position j. P_i(a) is the probability of finding the amino acid a at position i irrespective of the content at the other positions, and likewise for P_j(b). A positive value in the mutual information matrix indicates that prior knowledge of the amino acid content at position i will provide information about the amino acid content at position j. The statistical reliability of a mutual information calculation relies crucially on the size of the corresponding dataset. In the mutual information calculation one seeks to estimate 400 amino acid pair frequencies at each position in the matrix. Such estimates are naturally associated with large uncertainties when dealing with small datasets. In Fig. 1(a) and (b) we show the mutual information matrix calculated for 2 different sets of nine-mer alignments. The first dataset was constructed so as to obtain the largest possible positive set, by combining peptides from the Rammensee dataset with the peptides from the Buus dataset that have a measured binding affinity stronger than 500 nM. This set contains 313 unique sequences. The second dataset was constructed as a negative set by extracting 313 unique random peptides from the Mycobacterium tuberculosis genome. The mutual information content is calculated using the conventional 20 amino acid alphabet. The figure demonstrates a signal of mutual information between the 7 non-anchor residues (1, 3, 4, 5, 6, 7 and 8) in the dataset defined by peptides that bind to the HLA molecule. It is worth remarking that the mutual information content between either of the two anchor positions (2 and 9) and all other positions is substantially lower than the mutual information content between any two non-anchor positions.

The significance of the mutual information content calculations can be improved by applying a suitable reduced sequence alphabet in the calculations. In Fig. 1(c) and (d) we show the mutual information matrices for the two datasets described above calculated using a reduced 6-letter alphabet derived from side-chain surface (the alphabet is defined as A="GAS", B="CTDV", C="P", D="NLIQMEH", E="KFRY" and F="W"). The matrices in Fig. 1(c) and (d) display a similar behavior to the plots in Fig. 1(a) and (b), however with the difference that the signal of mutual information in the dataset derived from low and non-binding peptides has been substantially decreased compared to that of the dataset containing binding peptides.
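The mutual information calculation described above can be illustrated with a minimal sketch. It assumes the probabilities are estimated as plain observed frequencies in the aligned peptide set and uses the natural logarithm; the base of the logarithm, the function names and the absence of any pseudo-count correction are assumptions made for illustration and are not stated in the text.

```python
import math
from collections import Counter

# Reduced 6-letter alphabet as defined in the text.
REDUCED_ALPHABET = {
    "A": "GAS", "B": "CTDV", "C": "P",
    "D": "NLIQMEH", "E": "KFRY", "F": "W",
}
# Lookup from each of the 20 amino acids to its reduced letter.
TO_REDUCED = {aa: group for group, members in REDUCED_ALPHABET.items() for aa in members}


def reduce_alphabet(peptide):
    """Translate a peptide into the reduced 6-letter alphabet."""
    return "".join(TO_REDUCED[aa] for aa in peptide)


def mutual_information_matrix(peptides):
    """Return M[i][j] for a list of aligned, equal-length peptides.

    Probabilities are observed frequencies (an assumption; no
    pseudo-counts or small-sample corrections are applied).
    """
    n = len(peptides)
    length = len(peptides[0])
    # Single-position frequencies P_i(a).
    single = [Counter(p[i] for p in peptides) for i in range(length)]
    matrix = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            # Joint frequencies P_ij(ab).
            pairs = Counter((p[i], p[j]) for p in peptides)
            total = 0.0
            for (a, b), count in pairs.items():
                p_ab = count / n
                p_a = single[i][a] / n
                p_b = single[j][b] / n
                total += p_ab * math.log(p_ab / (p_a * p_b))
            matrix[i][j] = total
    return matrix
```

Applying `reduce_alphabet` to every peptide before calling `mutual_information_matrix` gives the reduced-alphabet variant, in which only 36 pair frequencies per matrix element need to be estimated instead of 400.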

Both hidden Markov models and neural networks were applied to construct the optimal prediction scheme for HLA A2 peptide binding.

A hidden Markov model was generated for the A2 HLA type based on the sequences in the Rammensee dataset. The model was constructed using the hmmbuild command from the Hmmer package (S.R. Eddy. HMMER: Profile hidden Markov models for biological sequence analysis (http://hmmer.wustl.edu/). 2001; S.R. Eddy. Profile hidden Markov models. Bioinformatics 14:755-763, 1998) using the following command: "hmmbuild -F --fast --pam /usr/cbs/bio/src/blast-2.1.2/data/BLOSUM62 --gapmax 0.7 hmmerfile fastafile". Here fastafile is an input file containing the sequences in the Rammensee dataset in FASTA format (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html), and hmmerfile is the output file generated by the hmmer program.

An epitope similarity score S for the nine amino acid long peptide is calculated as

where P_i is the probability of finding the given amino acid at position i in the hidden Markov model and Q_i is the probability of finding that amino acid in the Swiss-Prot database (Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000 Jan 1; 28(1): 45-8). These probabilities can be calculated from the output from the hmmbuild program as described in the manual.
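A score built from these two probabilities can be sketched as follows. The sketch assumes the score is a sum over the nine positions of the log-odds of the model probability against the background probability, which is the usual form for profile scores; the log-odds form, the natural logarithm and the names `position_probs` and `background` are assumptions for illustration, not a statement of the formula used in the application.

```python
import math


def similarity_score(peptide, position_probs, background):
    """Sum of per-position log-odds scores (assumed form).

    position_probs[i][aa] plays the role of the model probability
    at position i; background[aa] plays the role of the database
    frequency of the amino acid.
    """
    return sum(
        math.log(position_probs[i][aa] / background[aa])
        for i, aa in enumerate(peptide)
    )
```

Under this assumed form, a peptide whose residues are all more frequent in the model than in the background receives a positive score.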

Neural networks

Two types of encoding of the peptides were used in the neural network training. The first is the conventional sparse encoding where each amino acid is encoded as a 20 digit binary number (a single 1 and 19 zeros). The second is the Blosum 50 encoding in which each amino acid is encoded as the Blosum 50 scores for replacing the amino acid with each of the 20 amino acids (Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992 Nov 15; 89(22): 10915-9). Other Blosum encoding schemes were tried and we find that all encodings with Blosum matrices corresponding to a clustering threshold in the range 30-70% give comparable performance. In the following we will use the Blosum 50 matrix when we refer to Blosum sequence encoding. The sparse and the Blosum sequence encodings constitute two different approaches to representing sequence information to the neural network. In the sparse encoding the neural network is given very precise information about the sequence that corresponds to a given training example. One can say that the network learns a lot about a little. The neural network learns that a specific series of amino acids corresponds to a certain binding affinity value. In the Blosum encoding, on the other hand, the network is given more general and less precise information about a sequence. The Blosum matrix contains prior knowledge about which amino acids are similar and dissimilar to each other. The Blosum encoding for Leucine (L) has for instance positive encoding values at input neurons corresponding to Isoleucine (I), Methionine (M), Phenylalanine (F) and Valine (V) and negative encoding values at input neurons corresponding to for instance Asparagine (N) and Aspartic acid (D). This encoding helps the network to generalize, i.e. when a positive example with a Leucine at a given position is presented to the network, the parameters in the neural network corresponding to the above similar and dissimilar amino acids are also adjusted in a way so that the network appears to have seen positive examples with I, M, F and V and negative examples with N and D at that specific amino acid position. This ability to generalize the input data is of course highly beneficial for neural network training when the number of training data is limited. The use of Blosum sequence encoding might, on the other hand, even in situations where data is not a limiting factor, be an important aid to guide the neural network training, simply because the Blosum matrix encodes a subtle evolutionary and chemical relationship between the 20 amino acids.
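The two encodings differ only in the 20-number row that is substituted for each amino acid, which the following minimal sketch illustrates. The amino acid ordering and the function names are assumptions for illustration. The Blosum 50 matrix itself is not reproduced here: for Blosum encoding, a caller would pass a dictionary whose rows hold the Blosum 50 scores in place of the identity rows.

```python
# Order of the 20 amino acids along the input neurons (an assumed,
# arbitrary ordering; any fixed ordering works).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def sparse_matrix():
    """Rows for sparse encoding: a single 1 and 19 zeros per amino acid."""
    return {a: [1 if a == b else 0 for b in AMINO_ACIDS] for a in AMINO_ACIDS}


def encode(peptide, matrix):
    """Concatenate the 20-value row of each residue into one input vector.

    With sparse_matrix() this is the sparse encoding. Passing a dict
    holding Blosum 50 rows instead (not included in this sketch) would
    give the Blosum encoding.
    """
    vector = []
    for aa in peptide:
        vector.extend(matrix[aa])
    return vector
```

For a nine-mer the result has 9 x 20 = 180 values, matching the 180 input neurons of the network.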

We develop the method with optimal predictive performance in a two-step procedure. In the first round the method is optimized on a subset of 428 of the 528 peptides in the Buus dataset, and its performance is evaluated on an independent evaluation set of the remaining 100 peptides. In this manner we minimize the risk of over-fitting. In the second round the method is retrained on the full set of data using the parameter settings obtained in the first round.

The test and training of the neural networks is performed in a 5-fold manner by splitting the 428 peptides into five sets of training and test data. The splitting is performed such that all test and training sets have approximately the same distribution of high, low and non-binding peptides. The training data is used to perform feed-forward and back-propagation and the test data to define the stopping criteria for the network training as described by Baldi and Brunak (Baldi P, Brunak S. Bioinformatics. The machine learning approach. Second Edition. The MIT Press. 2001).

The performance of the neural networks is measured using the Pearson correlation coefficient on the test set (Press WH, Flannery, BP, Teukolsky, SA, Vetterling, WT. Numerical recipes. Cambridge University Press, 1989).
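For reference, the Pearson correlation coefficient between predicted and measured values can be computed as in the following minimal sketch. The function name is chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)
```

Because the coefficient is unchanged by any linear rescaling of the predictions with positive slope, it measures how well the predictions rank and track the measured affinities rather than whether they match them in absolute value.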

The neural network architecture used is a conventional feed forward network (Baldi P, Brunak S. Bioinformatics. The machine learning approach. Second Edition. The MIT Press. 2001) with an input layer with 180 neurons, one hidden layer with 2-10 neurons and a single neuron output layer. The 180 neurons in the input layer encode the 9 amino acids in the peptide sequence with each amino acid represented by 20 neurons. The back-propagation procedure was used to update the weights in the network.
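The forward pass of such a network can be sketched in a few lines. The sketch assumes sigmoid activation functions, a bias weight per neuron and small random initial weights; none of these details are stated in the text, and the function names are illustrative. Weight updates by back-propagation are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs=180, n_hidden=2, seed=0):
    """Small random weights; the last entry of each row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(inputs, network):
    """Feed an encoded peptide through the hidden layer to the output neuron."""
    hidden_weights, output_weights = network
    # zip stops at the shorter sequence, so the bias is added separately.
    hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)) + row[-1])
              for row in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden))
                   + output_weights[-1])
```

With a sigmoid output neuron (an assumption), the prediction always lies between 0 and 1, the same range as the transformed binding affinities used as training targets.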

We transform the measured binding affinities in order to place the output values used in the training and testing of the neural networks on a scale between 0 and 1. The transformation is defined as 1 - log(a)/log(50000), where a is the measured binding affinity in nM and log() is the base-ten logarithmic function. In this transformation high binding peptides, with a measured affinity stronger than 50 nM, are assigned an output value above 0.638, intermediate binding peptides, with an affinity stronger than 500 nM, an output value above 0.426, and peptides with an affinity weaker than 500 nM an output value below 0.426. Peptides that have an affinity weaker than 50000 nM are assigned an output value of 0.0.
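The transformation and its two thresholds can be checked with a short sketch. The function name is illustrative, and the clamping of values above 1 (affinities stronger than 1 nM) is an assumption added here to keep the output on the stated 0 to 1 scale.

```python
import math


def transform_affinity(affinity_nm, max_nm=50000.0):
    """Map a binding affinity in nM onto a 0-1 scale.

    Affinities weaker than max_nm are assigned 0.0 as described in the
    text; the upper clamp at 1.0 is an assumption of this sketch.
    """
    value = 1.0 - math.log10(affinity_nm) / math.log10(max_nm)
    return min(1.0, max(0.0, value))


print(round(transform_affinity(50), 3))   # threshold for high binders
print(round(transform_affinity(500), 3))  # threshold for intermediate binders
```

Since the transformation is a ratio of two logarithms, the result does not depend on the base of the logarithm.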

Since the distribution of binding affinities for the peptides in the training and test sets is highly non-uniform, with a great over-representation of low and non-binder peptides, it is important that the network training is done in a balanced manner. This is done by partitioning the training data into a number of N subsets (bins) such that the i'th bin contains peptides with a transformed binding affinity between (i-1)/N and i/N. In a balanced training, data from each bin is presented to the neural network with an equal frequency. For each of the five training and test sets, a series of network trainings were performed, each with a different number of hidden neurons (2, 3, 4, 6, 8 and 10) and a different number of bins (1, 2, 3, 4 and 5) in the balancing of the training. For each series, a single network with the highest test performance was finally selected.
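The balancing scheme can be sketched as follows. The handling of values that fall exactly on a bin boundary, the skipping of empty bins and the function names are assumptions made for illustration.

```python
import random


def partition_into_bins(examples, n_bins):
    """Place (peptide, transformed_affinity) pairs into n_bins equal-width bins.

    Bin i (1-based) covers transformed affinities from (i-1)/N to i/N;
    a value of exactly 1.0 goes into the last bin (an assumption).
    """
    bins = [[] for _ in range(n_bins)]
    for peptide, target in examples:
        index = min(int(target * n_bins), n_bins - 1)
        bins[index].append((peptide, target))
    return bins


def balanced_sample(bins, rng):
    """Pick a bin uniformly among the non-empty bins, then an example within it."""
    non_empty = [b for b in bins if b]
    return rng.choice(rng.choice(non_empty))
```

With a single bin the sampling reduces to ordinary uniform sampling of the training data, which is why the series of trainings includes N = 1 as the unbalanced case.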

The number of hidden neurons, the number of bins in the balancing of the training and the corresponding test performance for each of the five training and test sets is given in Table 1 for the sparse and the Blosum sequence encoding, respectively.

The predictions for the 100 peptides in the evaluation set are obtained as an average of the five predictions for the sparse and the Blosum encoding and the Pearson correlation coefficient is found to be 0.866 and 0.926, respectively.

Results

Combination of more than one neural network prediction.

In Fig. 2(c) we show the test performance on the 528 peptides in the Buus dataset for neural networks trained using sparse sequence encoding. From the plot it is clear that the method has a tendency to under-predict the peptide binding affinity: the least-squares straight-line fit has a slope less than unity and a non-zero intercept with the y-axis. A similar behavior is observed for the neural network trained using Blosum sequence encoding. A simple way to correct for this under-prediction is to perform a linear transformation on the output neuron. Being linear, this transformation does not change the Pearson correlation coefficient. It does, however, enable a direct interpretation of the neural network output in terms of actual binding affinities, and it allows a direct comparison of the output from many neural networks. The slope and intercept for the two network types are estimated from the 428 peptides in the training/test set and are 0.7315 and 0.094 for the sparse encoding and 0.877 and 0.04655 for the Blosum encoding, respectively.

We combine the output from the two networks trained using sparse and Blosum sequence encoding, respectively, in a simple manner, as a weighted sum of the two. To select the weight that corresponds to the optimal performance, we plot the sensitivity/specificity curve for a series of weighted sum combinations of the two network outputs. The sensitivity is defined as the ratio TP/AP. Here TP (true positives) is the number of data points for which both the predicted score is above a given prediction threshold value and the measured binding affinity is above a given affinity threshold value. AP (actual positives) is the total number of data points that have a measured binding affinity above the affinity threshold value. The specificity is defined as the ratio TP/PP. Here PP (predicted positives) is the total number of predictions with a score above the prediction threshold value. In a sensitivity/specificity plot, the curve for the perfect method is the one where the area under the curve is unity.

The curves and the associated error-bars are estimated using the Bootstrap method (Press WH, Flannery BP, Teukolsky SA, Vetterling WT. Numerical Recipes. Cambridge University Press, 1989). N datasets were constructed by randomly drawing N data points with replacement from the original dataset of M peptides. For each of the N datasets a sensitivity-specificity curve was calculated, and the data and error-bars shown on the curves in Fig. 3 are the mean and standard deviation estimated from the N sensitivity-specificity curve realizations.
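The sensitivity and specificity definitions and the Bootstrap estimate described above can be illustrated with a minimal Python sketch. The function names, the default of 500 realizations, and the choice to draw each realization with the same size as the original dataset are assumptions made for illustration only. Note that the ratio TP/PP, called specificity here, is the quantity often called precision elsewhere.

```python
import random
import statistics


def sens_spec(predicted, measured, pred_threshold, affinity_threshold):
    """Return (sensitivity, specificity) as defined in the text:
    sensitivity = TP/AP and specificity = TP/PP."""
    tp = sum(1 for p, m in zip(predicted, measured)
             if p > pred_threshold and m > affinity_threshold)
    ap = sum(1 for m in measured if m > affinity_threshold)
    pp = sum(1 for p in predicted if p > pred_threshold)
    sensitivity = tp / ap if ap else float("nan")
    specificity = tp / pp if pp else float("nan")
    return sensitivity, specificity


def bootstrap_sens_spec(predicted, measured, pred_threshold,
                        affinity_threshold, n_realizations=500, seed=0):
    """Estimate mean and standard deviation of both ratios by resampling.

    ASSUMPTION: each realization draws len(predicted) points with
    replacement; realizations with an undefined ratio are skipped.
    """
    rng = random.Random(seed)
    size = len(predicted)
    sens_values, spec_values = [], []
    for _ in range(n_realizations):
        idx = [rng.randrange(size) for _ in range(size)]
        sens, spec = sens_spec([predicted[i] for i in idx],
                               [measured[i] for i in idx],
                               pred_threshold, affinity_threshold)
        if sens == sens and spec == spec:  # NaN is not equal to itself
            sens_values.append(sens)
            spec_values.append(spec)
    return ((statistics.mean(sens_values), statistics.stdev(sens_values)),
            (statistics.mean(spec_values), statistics.stdev(spec_values)))
```

Repeating the calculation over a range of prediction thresholds gives one point per threshold, which together trace a curve of the kind shown in Fig. 3.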

In order to find the optimal combination of the sparse and Blosum encoded networks, we construct the sensitivity/specificity curves for a series of weighted sum combinations. In Fig. 3(a) the sensitivity/specificity curves are shown for a measured binding affinity threshold value equal to 0.426, corresponding to a binding affinity of 500 nM. The optimal combination is found to have a weight close to 0.7 on the Blosum encoded network and a weight close to 0.3 on the sparse encoded network. In Fig. 3(b) this set of weights is also seen to improve the prediction accuracy for the 100 peptides in the evaluation set. The improvement is less clear there, however, because of the small number of binding peptides in the evaluation set, which contains 31 peptides with a binding affinity stronger than 500 nM.
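As a hedged illustration, the linear correction and the weighted sum can be sketched in Python using the slope, intercept and weight values quoted in the text. The sketch assumes that the straight-line fit expresses the raw network output as slope times the measured value plus intercept, so that the correction inverts that line; the text does not state the direction of the fit, and the function names are hypothetical.

```python
# (slope, intercept) pairs quoted in the text for the two network types.
SPARSE_FIT = (0.7315, 0.094)
BLOSUM_FIT = (0.877, 0.04655)


def rescale(raw_output, slope, intercept):
    """Undo under-prediction.

    ASSUMPTION: raw_output = slope * measured + intercept, so the corrected
    value is obtained by inverting the fitted line.
    """
    return (raw_output - intercept) / slope


def combined_prediction(sparse_raw, blosum_raw, blosum_weight=0.7):
    """Weighted sum of the two rescaled outputs (weights 0.7 and 0.3)."""
    sparse = rescale(sparse_raw, *SPARSE_FIT)
    blosum = rescale(blosum_raw, *BLOSUM_FIT)
    return blosum_weight * blosum + (1.0 - blosum_weight) * sparse
```

Because both outputs are placed on the same scale before they are summed, the weight can be varied on its own when tracing the sensitivity/specificity curves.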

The neural network training and testing is next repeated using the full data set in a five-fold training. The combined method, hereafter referred to as Comb-I, is defined using the parameters for the linear transformation and the weights on the Blosum and the sparse encoded neural networks, respectively, estimated above.

Integration of data from the Rammensee database in the neural network training.

In Fig. 2(b) we show the performance of the hidden Markov model evaluated on the 528 peptides in the Buus dataset. The plot displays a reasonable correlation between the hidden Markov model score and the measured binding affinity. This correlation demonstrates that the sequences in the Rammensee dataset contain valuable information and that the neural network training could benefit from an integration of the Rammensee sequence data into the training dataset. It is, however, not obvious how such an integration should be done. The Rammensee data are binary in nature: they state that a given peptide binds to the HLA molecule but not how strongly. The data in the Buus dataset, on the other hand, are continuous in that each peptide is associated with a binding affinity. It turns out that a fruitful procedure for integrating the Rammensee data into the neural network training is to use the 9 score values generated by the hidden Markov model as additional input to the neural network. The hidden Markov model is trained on the data from the Rammensee database. Two neural networks, each with 189 input neurons (9x20+9), are next trained in a five-fold manner as described above, using the 9 hidden Markov model scores combined with the sparse or the Blosum sequence encoding in the input layer, respectively. In the final combined method the prediction value is calculated as the simple average, with equal weight, of the two neural network predictions, including a linear transformation of the two output neurons to correct for under-prediction. This method is hereafter referred to as the Comb-II method (or T-PRED) and is the one used in the HCV genome predictions described below.
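The 189-value input layer (9x20+9) can be illustrated for the sparse encoding with the following Python sketch. The ordering of the amino acid alphabet, the placement of the 9 hidden Markov model scores after the sequence encoding, and the function names are assumptions for illustration; the hidden Markov model scores are taken as given.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ASSUMPTION: alphabet order is arbitrary


def sparse_encode(peptide):
    """Orthogonal (one-hot) encoding: 20 values per residue."""
    vector = []
    for residue in peptide:
        one_hot = [0.0] * len(AMINO_ACIDS)
        one_hot[AMINO_ACIDS.index(residue)] = 1.0  # ValueError if unknown
        vector.extend(one_hot)
    return vector


def network_input(peptide, hmm_scores):
    """Concatenate the 9x20 sequence encoding with the 9 HMM scores."""
    if len(peptide) != 9 or len(hmm_scores) != 9:
        raise ValueError("expected a 9-mer peptide and 9 HMM scores")
    return sparse_encode(peptide) + list(hmm_scores)
```

For the Blosum encoded network, each one-hot block would be replaced by the 20 substitution scores of the residue, leaving the input size unchanged.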

Neural network methods compared to hidden Markov model methods and the matrix method by Rammensee.

In Table 2, we give the test performance, measured in terms of the Pearson correlation coefficient, for the 528 peptides in the Buus dataset for six different prediction methods. The first is the matrix method by Rammensee (http://syfpeithi.bmi-heidelberg.com/), the second is the hidden Markov model trained on the Rammensee dataset, and the other four are neural network methods trained using sparse sequence encoding, Blosum sequence encoding, the linear combination of the two, and the linear combination including input from the hidden Markov model, respectively. The numbers given in the table are calculated using the Bootstrap method with 500 dataset realizations. The correlation values are estimated as average values over the 500 dataset realizations and the error-bars as the associated standard deviations. From the results shown, it is clear that the neural network methods all have a higher predictive performance than both the method by Rammensee and the hidden Markov model. Both of the latter methods are derived from binary affinity data, in that all peptides that bind to the HLA-A2 molecule are assigned equal weight in the training. Furthermore, they are both linear methods that do not include higher order correlations between the amino acids in the peptide sequence when calculating an output score. Neural networks, on the other hand, can be trained on data with continuous binding affinities and, if the network contains a hidden layer, include higher order sequence correlations in the output score. To estimate the importance of the ability of the neural network to train on continuous data and the importance of integrating higher order sequence correlations in the prediction score, we transformed the Buus dataset into binary data by assigning peptides with a measured binding affinity stronger than 500 nM an output value of 0.9, and all other peptides a value of 0.1. In a five-fold test/training of a neural network using sparse sequence encoding, the test performance on the 528 peptides in the Buus dataset was found to be 0.838 +/- 0.013 and 0.856 +/- 0.013 for networks trained without and with a hidden layer, respectively. These numbers should be compared to the 0.877 +/- 0.011 obtained for a neural network with a hidden layer trained and tested in a similar manner using continuous affinity data. The result hence confirms the importance both of training the prediction method on data with continuous binding affinities and of the ability of the neural network method to integrate higher order sequence correlations in the prediction score. The difference in predictive performance between the neural network and the linear methods is most significant for datasets defined by peptides with a binding affinity stronger than 50 nM, indicating that the signal of higher order sequence correlation is most strongly present in peptides that bind strongly to the HLA-A2 molecule. The same conclusion can be drawn from the plots displayed in Fig. 2. Here the test performance for the 528 peptides is shown as a scatter plot of the measured binding affinity versus the prediction score for four of the six methods above. Again it is clear that all the neural network methods, and in particular the two combined methods, have a higher predictive performance than both the Rammensee and the hidden Markov model methods.
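The binarization used in this control experiment can be sketched as follows. The helper `transformed_affinity` is an assumption that is not stated in this section: it uses the log-transform 1 - log(a)/log(50000), chosen here only because it reproduces the value 0.426 quoted for 500 nM. Both function names are hypothetical.

```python
import math


def transformed_affinity(affinity_nm, max_nm=50000.0):
    """ASSUMPTION: map an affinity in nM to a score via 1 - log(a)/log(50000).

    This form is not given in the text; it is consistent with the quoted
    correspondence between 500 nM and a threshold value of 0.426.
    """
    return 1.0 - math.log(affinity_nm) / math.log(max_nm)


def binary_target(affinity_nm, threshold_nm=500.0):
    """Binders stronger than the threshold get 0.9, all other peptides 0.1."""
    return 0.9 if affinity_nm < threshold_nm else 0.1
```

A lower nM value means stronger binding, so "stronger than 500 nM" is expressed here as an affinity value below 500.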

In Fig. 4 we show the sensitivity-specificity curves calculated for the data in the 528 peptide set using the four different neural network methods as well as the method by Rammensee and the hidden Markov model method. All curves are estimated using the Bootstrap method described above. The upper graph shows the sensitivity-specificity curves for the six methods calculated for a measured affinity threshold corresponding to 500 nM, and the lower graph shows the curves for a measured affinity threshold corresponding to 50 nM. In both graphs it is clear that the combined neural network methods have a performance superior to that of the other four methods. In both graphs all four neural network methods, and in particular the two combined methods, have a performance that is substantially higher than that of the Rammensee method. The plot further demonstrates that the integration of the data from the Rammensee database in the training of the neural networks, in the form of the hidden Markov model input data, has increased the specificity of the combined neural network method substantially. For an affinity threshold of 500 nM the plot shows that at a specificity of 0.975 the combined neural network method Comb-II has a sensitivity of 0.54, whereas the combined neural network method Comb-I, which does not include HMM data, has a sensitivity of only 0.22. In Fig. 4(a) the largest sensitivity gap between the combined neural network method Comb-II and the method of Rammensee is found at a specificity value of 0.7, corresponding to a difference of 0.38 in sensitivity, or a difference in the number of true positive predictions of 29 out of a total of 76 high binding peptides in the data set. In Fig. 4(b) the largest sensitivity gap between the two methods is found at a specificity value of 0.88, corresponding to a difference of 0.37 in sensitivity, or a difference in the number of true positive predictions of 54 out of a total of 144 intermediate binding peptides in the data set.

Conclusions and discussion

We describe a novel method for predicting the binding affinity of peptides to the HLA-A2 molecule. The method is a combination of a series of neural networks that take as input a peptide sequence as well as the scores of that sequence from a hidden Markov model (HMM) trained to recognize HLA-A2 binding peptides. The method combines two types of neural network predictions. In half of the networks the amino acid sequence is encoded using a classical orthogonal sparse encoding, and in the other half the amino acids are encoded as their Blosum50 scores to the 20 different amino acids. We show that a combined approach, where the final prediction is calculated as a linear combination of the two network predictions, leads to an improved performance over simpler neural network approaches. We also show that the use of the Blosum50 matrix to encode the peptide sequence leads to an increased performance over the classical orthogonal sparse encoding. The Blosum sequence encoding is beneficial for the neural network training, especially in situations where data are limited. The Blosum encoding helps the neural network to generalize, so that the parameters in the network corresponding to similar and dissimilar amino acids are adjusted simultaneously for each sequence example.

A detailed comparison of the derived neural network method with linear methods, such as the matrix method by Rammensee and a first order hidden Markov model, has been carried out. The predictive performance was measured in terms of both the Pearson correlation coefficient and sensitivity/specificity plots. For both measures it was demonstrated that the neural network method has a predictive performance superior to that of the linear methods.

Analysis of the mutual information in peptides that bind HLA-A2 revealed correlations between the amino acids located between the anchor positions. Neural networks with hidden units can take such correlations into account, but simpler methods such as neural networks without hidden units, matrix methods, and first order hidden Markov models cannot. It is this ability to integrate higher order sequence correlations into the prediction score, combined with the fact that neural networks can be trained on data with continuous binding affinities, that gives the present method an edge over the other methods in the comparison.

By calculating the mutual information we show that there exist correlations between different positions in peptides that can bind HLA-A2. Previous studies have shown that the sequence information contained in a motif correlates with the predictive power that can be obtained (Gorodkin J, Lund O, Andersen CA, Brunak S. Using sequence motifs for enhanced neural network prediction of protein distance constraints. Proc Int Conf Intell Syst Mol Biol. 1999;:95-105). Here we show that the extra predictive power obtained using neural networks can be attributed to the mutual information between positions in a motif. Other published strategies capable of dealing with higher order sequence correlations rely on neural networks (Brusic V, Rudy G, Harrison LC. Prediction of MHC binding peptides using artificial neural networks. In: Complex systems: mechanism of adaptation. Eds. Stonier RJ and Yu XS. IOS, 1994, 253-260) and decision trees (Savoie CJ, Kamikawaji N, Sasazuki T, Kuhara S. Use of BONSAI decision trees for the identification of potential MHC class I peptide epitope motifs. Pac Symp Biocomput. 1999:182-9). Another approach is taken by Mamitsuka (Mamitsuka H. Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov models. Proteins. 1998 Dec 1;33(4):460-74), who trains a fully connected hidden Markov model. In this approach sequence correlations can be handled because different motifs can take different paths through the hidden Markov model. Three-dimensional models have also been used to predict MHC-peptide binding (Altuvia Y, Schueler O, Margalit H. Ranking potential binding peptides to MHC molecules by a computational threading approach. J Mol Biol. 1995 Jun 2;249(2):244-50; Logean A, et al. (2001)). This approach may give information that is complementary to what can be obtained from the sequence alone, and one possible way to improve the predictive accuracy could be to combine predictions based on sequence with predictions based on structure.

Another way to improve performance is to get more data. Brusic uses neural networks to select peptides for experimental testing (Brusic V, Bucci K, Schonbach C, Petrovsky N, Zeleznikow J, Kazura JW. Efficient discovery of immune response targets by cyclical refinement of QSAR models of peptide binding. J. Mol Graph Model. 2001; 19(5): 405-11, 467), which in turn are used to improve the networks. We have used "query by committee", to select the most informative peptides for experimental testing (Buus et al., Submitted).

Whole genome predictions

Next we perform peptide binding affinity predictions on a whole-genome scale on the Hepatitis C Virus genome. The genome for entry "NC_001433" was downloaded from GenBank (Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank. Nucleic Acids Res. 2002 Jan 1;30(1):17-20).

The HLA-A2 epitopes were selected as those having a combined neural network score above 0.42. For the other supertypes, epitopes were selected if they had a hidden Markov model score above 7.
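The selection step can be pictured as a sliding window over a protein sequence followed by a threshold on the predicted score. The Python sketch below is illustrative only: `score_fn` is a hypothetical stand-in for a trained predictor (the combined neural network for HLA-A2, or a hidden Markov model with a threshold of 7 for the other supertypes), and the window length of 9 is taken from the 9-mer peptides discussed in the text.

```python
def ninemers(protein, length=9):
    """Return (start, peptide) for every window; start is 0-based."""
    return [(start, protein[start:start + length])
            for start in range(len(protein) - length + 1)]


def select_epitopes(protein, score_fn, threshold=0.42, length=9):
    """Keep windows whose predicted score is above the threshold.

    score_fn is a HYPOTHETICAL callable mapping a peptide string to a
    score; it is not part of the source text.
    """
    selected = []
    for start, peptide in ninemers(protein, length):
        score = score_fn(peptide)
        if score > threshold:
            selected.append((start, peptide, score))
    return selected
```

A protein of length L yields L - 8 overlapping 9-mers, so the cost of a whole-genome scan grows linearly with the total length of the translated sequence.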

According to the literature (Yewdell JW, Bennink JR. Immunodominance in major histocompatibility complex class I-restricted T lymphocyte responses. Annu Rev Immunol. 1999;17:51-88), the affinity of a given peptide to the MHC complex is by far the most important factor in the chain of events leading to a CTL-response. However, other factors are likely to be important as well. Since the MHC-I molecule only binds peptides within a certain length span (Yewdell JW, Bennink JR. Immunodominance in major histocompatibility complex class I-restricted T lymphocyte responses. Annu Rev Immunol. 1999;17:51-88), the processing of the full length polypeptide is a crucial step. The major participant in this processing is the proteasome, which cleaves polypeptides with a preference for certain positions. A predicted score for a given position being cleaved can be obtained using the software NetChop (Kesmir C, Nussbaum AK, Schild H, Detours V, Brunak S. Prediction of proteasome cleavage motifs by neural networks. Protein Eng. 2002 Apr;15(4):287-96). Taking this score into account when selecting the final set of candidate peptides will make it more likely that a given peptide in the set will be able to induce a CTL-response.

For virus infected cells it has been shown that the likelihood of a certain peptide being bound and presented by the MHC-I complex is roughly proportional to the expression level, probably because there has to be a certain amount of free polypeptide in the cytoplasm. If this is the case, another factor is important as well when considering putative epitopes from intracellular bacterial infections. Here the amount of free polypeptide in the cytoplasm also depends on the protein being excreted from the bacteria. To be excreted, the native polypeptide must contain a signal peptide at the N-terminus, and a predicted score for this feature can be obtained using the software SignalP (Nielsen H, Engelbrecht J, Brunak S, von Heijne G. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 1997 Jan;10(1):1-6).

Based on these considerations, we arrive at the following scoring scheme for selecting peptides that are likely to induce a CTL-response. Only peptides that have a high predicted binding affinity (less than 50 nM) are passed through this second, reranking scoring scheme. The scoring scheme is constructed from the predicted probability that the mother protein (the protein of which the peptide is a part) is exported, i.e. the probability of having an N-terminal signal peptide, using the prediction server SignalP (Nielsen H, Engelbrecht J, Brunak S, von Heijne G. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 1997 Jan;10(1):1-6), combined with the score that the proteasome will cleave the mother protein at the C-terminal end of the peptide, using the prediction server NetChop (Kesmir C, Nussbaum AK, Schild H, Detours V, Brunak S. Prediction of proteasome cleavage motifs by neural networks. Protein Eng. 2002 Apr;15(4):287-96). The binding affinity score was calculated using the combined neural network method. The scoring scheme is defined by the following equation

S = A + 0.2*C - B,

where A is the predicted binding affinity, B is the predicted average probability that the peptide is cleaved at an interior site, and C is the predicted signal peptide probability.

EXAMPLE 2

Selection of variant sequences

Besides the peptides identified by the methods described in Example 1, a number of variants of these peptides may be useful, for example in a vaccine or as a diagnostic tool. These variant peptides may differ in that the amino acids found in one or more positions of the original peptide are replaced by different amino acids. These different amino acids may be selected in a number of ways. A hydrophobic amino acid may for example be replaced by another hydrophobic amino acid. Groups of interchangeable amino acids may for example be selected as polar (N and Q), charged (D, E, K, R and H), acidic (D, E), basic (K, R, H), ambivalent (P, T, S, C, A, G, Y and W) or hydrophobic (F, L, I, M and V). Another way to construct groups of similar amino acids is to group those that have a substitution score above a given threshold, such as 0 (zero), according to an amino acid substitution matrix such as Blosum 62 (Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992 Nov 15;89(22):10915-9). Yet another way to select variant peptides is to replace one or more amino acids with other amino acids with the aim of creating a variant peptide that is predicted to bind almost as well as, or better than, the original peptide. The prediction may for example involve the use of a neural network method, a hidden Markov model or a matrix method. Amino acids may also be replaced by amino acids other than the 20 amino acids normally found in proteins. The variant peptides may also contain additional amino acids or fewer amino acids than the original peptide. The variant peptides may for example be one amino acid longer or shorter at either the N-terminal or the C-terminal end.
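As one possible illustration of the group-based substitution rule, the following Python sketch enumerates all variants that differ from an original peptide at exactly one position, using the amino acid groups listed above. Restricting the enumeration to single substitutions, and treating a residue that belongs to several groups as interchangeable with the union of those groups, are assumptions made here for brevity. The function names are hypothetical.

```python
# Groups of interchangeable amino acids as listed in the text.
GROUPS = {
    "polar": "NQ",
    "charged": "DEKRH",
    "acidic": "DE",
    "basic": "KRH",
    "ambivalent": "PTSCAGYW",
    "hydrophobic": "FLIMV",
}


def interchangeable(residue):
    """Residues sharing at least one group with the given residue."""
    partners = set()
    for members in GROUPS.values():
        if residue in members:
            partners.update(members)
    partners.discard(residue)
    return sorted(partners)


def single_substitution_variants(peptide):
    """All peptides differing from the original at exactly one position."""
    variants = []
    for position, residue in enumerate(peptide):
        for replacement in interchangeable(residue):
            variants.append(
                peptide[:position] + replacement + peptide[position + 1:])
    return variants
```

Each variant could then be passed to a binding predictor so that only those predicted to bind almost as well as, or better than, the original peptide are retained.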

EXAMPLE 3

Measure binding of epitopes to MHC molecules

The binding of peptides to the HLA molecules can be verified experimentally, for example by the method described by Sylvester-Hvid et al. (Sylvester-Hvid C, Kristensen N, Blicher T, Ferre H, Lauemoller SL, Wolf XA, Lamberth K, Nissen MH, Pedersen LO, Buus S. Establishment of a quantitative ELISA capable of determining peptide - MHC class I interaction. Tissue Antigens. 2002 Apr;59(4):251-8). This can be used to verify that peptides that are predicted to bind a given HLA molecule actually bind that molecule. The results from such experiments can be used to select a subset of peptides that will undergo further testing to verify their utility in a vaccine.

EXAMPLE 4

Epitope Stimulation of Cytolytic T Cell Function in vitro

The ability of individual peptides to be recognized by naïve, normal human T cells can be measured by antigen stimulation using peptide-pulsed antigen presenting cells. Peptides purified following chemical synthesis will be added to cultures of human antigen-presenting cells (APC), in order to test stimulation of normal human T cells. This is a measurement of the abundance of epitope specific T cells in normal human blood products, as well as a determination that the binding affinity of the peptides for class I HLA is sufficient to trigger T cell activation.

Human monocytes, from peripheral blood of 3 HLA-A2 human donors, will be purified using adherence to glass wool and grown in culture for 2 days prior to peptide loading. Culture conditions will , and will include commercially available lots of GM-CSF and IL-4 at reported concentrations. It has been demonstrated that exogenous peptides can be added in excess to such antigen presenting cells and displace peptides bound on cell surface class I HLA receptors.

These peptide pulsed cells will then be washed, and incubated with freshly isolated human T cell populations. Samples will be cultured in 96 well microtiter plates for a period of 7 days. Each culture will contain approximately 5 x 10e6 normal human T cells. T cell activation will be measured in two ways - the induction of T cell activation markers and by monitoring of IL-2 production.

Flow cytometry using FITC-conjugated anti-human CD69 will be performed on duplicate samples of all cultures. CD69 is a proven T cell activation marker used in similar studies, and will be used in double-staining experiments with anti-CD3 antibody to determine T cell numbers.

Controls will consist of T cells incubated in identical conditions but lacking peptide-pulsed dendritic cells, and T cells exposed to a positive control peptide-APC population. A positive control peptide from prior investigation of HIV T cell HLA-A2 peptides will be used. Results will be considered positive if parallel numbers of CD69+ T cells are generated in test peptide exposures compared to positive control, and if baseline numbers remain statistically reduced.

In parallel, duplicate wells will be harvested and assayed by ELISA for the induction of IL-2, an interleukin expressed by activated T cells. Controls will be as described above, and individual peptides will be considered positive if parallel levels of IL-2 are generated in test peptide exposures compared to positive control, and if baseline numbers remain statistically reduced.

EXAMPLE 5

Abundance of epitope-specific T cells in normal and patient blood

Diagnostic kits are now available which allow flow cytometric detection of antigen specific T cells. These assays use the same principles employed in the receptor binding assays described above, adding synthesized peptides to recombinant HLA molecules, but then extend the assay by measuring the ability of T cells to bind the HLA/peptide complex. This is accomplished by generating fluorescently conjugated tetrameric complexes of peptide/HLA using avidin-biotin conjugation, and then measuring the number of fluorescently labeled T cells in a blood product.

This is a tool for monitoring quantitative responses to vaccination or for measuring antigen specific T cells in patient blood samples. It is also a surrogate assay to measure the potency of a T cell response during healthy donor vaccine trials. Each peptide will be employed according to protocol, and HLA-A2 donor bloods (3/sample) will be tested for baseline levels of T cell recognition. It is expected that positive numbers in the range of 1:10⁴ to 1:10⁵ will be obtained from the naïve T cell pool.

Baseline numbers will be generated using non-HLA-A2 donors as well as non-peptide-loaded complexes. Again, HIV T cell epitopes will be tested as controls. Samples which have tested positive in the in vitro T cell stimulation assay will be chosen for testing by this method.

Assays will also be performed on peripheral blood T cells obtained from patients who have recently sero-converted to HCV positivity. It is expected that significantly higher numbers (1:10² to 1:10³) of epitope specific T cells will be measured, should specific epitopes be involved in the disease response. This determination would be proof of the utility of specific epitopes for diagnostic purposes.

EXAMPLE 6

Demonstration of an effective DNA vaccine formulation for stimulating T cell responses in transgenic mice expressing the human HLA-A2 receptor

In order to prove the vaccine utility of predicted T cell epitope peptides, formulation of epitopes in an accepted delivery platform, followed by induction of T cell responses in an animal model, can be demonstrated. Because small animal models for HCV infectivity do not exist, pre-clinical proof of concept is limited to immunogenicity studies.

It has been reported that DNA vaccines can present multiple T cell epitopes in clustered and simple linear array on a vector, without excessive immunodominance of one or few epitopes. In our initial attempts to validate our epitope selection and testing criteria, we will construct and measure the effectiveness of a DNA vaccine to stimulate T cell responses in a transgenic mouse model expressing human HLA-A2.

Peptide epitopes will be chosen based on T cell stimulation indices, followed by generation of a short DNA vaccine construct to express these peptides intracellularly. Genetic constructs successful in this approach have previously been identified, and we will use this strategy in our experiments. A synthetic mini-gene will be constructed from overlapping oligonucleotides and confirmed by DNA sequence analysis. The peptide motifs will be flanked by known mouse proteasome cleavage sites for enhanced processing. The expression cassette will be driven by a standard CMV promoter in a commercially available expression plasmid.

Transgenic mice expressing HLA-A2 molecules in the complete absence of H-2 class I molecules, in an H-2Kb, H-2Db double KO context, have been created and used to determine the immunological response to viral T cell peptides known to be HLA-A2 specific.

Experimental groups will be organized as follows:

Control 1: (n = 4) animals receiving DNA vaccine cassette lacking peptide epitopes

Control 2: (n = 4) animals receiving DNA vaccine cassette with HIV peptide control

Test Group 1: (n = 4) animals receiving DNA vaccine with 5 sequential epitopes

Test Group 2: (n = 4) animals receiving DNA vaccine with 3 sequential epitopes

Test Group 3: (n = 4) animals receiving DNA vaccine with 1 epitope

Animals will be injected i.m. and boosted at 4 week intervals. Peripheral blood cells will be monitored for epitope specific T cells as described in Example 5, with pre-immunization blood cells measured for baseline levels.

Flow cytometric analysis of collected blood cell populations will be performed as described in Aim 3b, with the exception that secondary staining will be performed using anti-mouse CD3, anti-mouse CD4, and anti-mouse CD8 fluorescent antibodies to gain T cell number and subset information.

Claims

CLAIMS 1. A cytotoxic Hepatitis C virus T-cell epitope, the epitope being selected from the group of peptides consisting of GMGWAGWLL (SEQ ID NO:1), DLMGYIPLV (SEQ ID NO:2), RALAHGVRV (SEQ ID NO:3), FLLALLSCL (SEQ ID NO:4), MIMHTPGCV (SEQ ID NO:5), FLVSQLFTF (SEQ ID NO:6), MMNWSPTTA (SEQ ID NO:7), QLLRIPQAV (SEQ ID NO:8), LLRIPQAW (SEQ ID NO:9), MVAGAHWGV (SEQ ID NO: 10), SMVGNWAKV (SEQ ID NO:11), CMVDYPYRL (SEQ ID NO:12), RLWHYPCTV (SEQ ID NO:13), YVGGVEHRL (SEQ ID NO:14), LLSTTEWQI (SEQ ID NO:15), LIHLHRNIV (SEQ ID NO:16), YLYGIGSAV (SEQ ID NO: 17), FLLLADARV (SEQ ID NO:18), MLLIAQAEA (SEQ ID NO:19), RLVPGAAYA (SEQ ID NO:20), ALYGVWPLL (SEQ ID NO:21), LLLALPPRA (SEQ ID NO:22), LTLSPYYKV (SEQ ID NO:23), FLARLIWWL (SEQ ID NO:24), LLLAILGPL (SEQ ID NO:25), VLQAGITRV (SEQ ID NO:26), GITRVPYFV (SEQ ID NO:27), GLIRACMLV (SEQ ID NO:28), YVYDHLTPL (SEQ ID NO:29), DLAVAVEPV (SEQ ID NO:30), LLGCIITSL (SEQ ID NO:31), FLATCVNGV (SEQ ID NO:32), CVNGVCWTV (SEQ ID NO:33), LLCPSGHW (SEQ ID NO:34), YLNTPGLPV (SEQ ID NO:35), YLVAYQATV (SEQ ID NO:36), RLGAVQNEV (SEQ ID NO:37), CMSADLEW (SEQ ID NO:38), ALAAYCLTT (SEQ ID NO:39), ILSGRPAVI (SEQ ID NO:40), SLMAFTASt (SEQ ID NO:41), ILAGYGAGV (SEQ ID NO:42), ILSPGALW (SEQ ID NO:43), KLLPRLPGL (SEQ ID NO:44), VAAEEYVEV (SEQ ID NO:45), RLHRYAPVC (SEQ ID NO:46), EMGGNITRV (SEQ ID NO:47), SLLRHHSMV (SEQ ID NO:48), VLDDHYRDV (SEQ ID NO:49), ALYDWSTL (SEQ ID NO:50), KLQDCTMLV (SEQ ID NO:51), MLVNGDDLV (SEQ ID NO:52), IMYAPTLWA (SEQ ID NO:53), ILMTHFFSI (SEQ ID NO:54), ALDCQIYGA (SEQ ID NO:55), KLGVPPLRV (SEQ ID NO:56), MLCLLLLSV (SEQ ID NO:57), LLLSVGVGI (SEQ ID NO:58), ITTGGPITY (SEQ ID NO:59), IVDVQYLYG (SEQ ID NO:60), STDSTTILG (SEQ ID NO:61), ASQVCGPVY (SEQ ID NO:62), FTVFKVRMY (SEQ ID NO:63), STEDLVNLL (SEQ ID NO:64), LHGPTPLLY (SEQ ID NO:65), SLDPTFTIE (SEQ ID NO:66), CTCGSSDLY (SEQ ID NO:67), LSAFSLHSY (SEQ ID NO:68), LTDPSHITA (SEQ ID NO:69), LLSPRPISY (SEQ ID NO:70), YSSMPPLEG (SEQ ID N0:71), PGDPPQPEY (SEQ ID NO:72), DWCCSMSY (SEQ 
ID NO:73), KKCPMGFSY (SEQ ID NO:74), VTRVGDFHY (SEQ ID NO:75), CSIYPGHVS (SEQ ID NO:76), SFDPIRAVE (SEQ ID NO:77), VQDCNCSIY (SEQ ID NO:78), FTEAMTRYS (SEQ ID NO.79), VFQVGLNQY (SEQ ID NO:80), IFDITKLLL (SEQ ID N0:81), VLEDGVNYA (SEQ ID NO:82), ETDVLLLSN (SEQ ID NO:83), STLPGNPAI (SEQ ID NO:84), PTDPRRRSR (SEQ ID NO:85), STNPKPQRK (SEQ ID NO:86), RSELSPLLL (SEQ ID NO:87), VTNDCSNSS (SEQ ID NO:88), FSDMETKLI (SEQ ID NO:89), DYPYRLWHY (SEQ ID NO:90), RYAPVCKPL (SEQ ID N0:91), AYSQQTRGL (SEQ ID N0:92), RLIWWLQYF (SEQ ID NO:93), VFSDMETKL (SEQ ID NO:94), CYDAGCAWY (SEQ ID NO:95), CYSIEPLDL (SEQ ID NO:96), TYCKFLADG (SEQ ID NO:97), TYSWGENET (SEQ ID NO:98), YYSMVGNWA (SEQ ID NO:99), LYGVWPLLL (SEQ ID NO: 100), YYKVFLARL (SEQ ID NO: 101), LMTHFFSIL (SEQ ID NO.102), EYILLLFLL (SEQ ID NO:103), VYHGAGSKT (SEQ ID NO:104), TRPPQGNWF (SEQ ID NO.105), AYAAQGYKV (SEQ ID NO: 106), FYAHRFNAS (SEQ ID NO: 107), SYTWTGALI (SEQ ID NO:108), VWDWICTVL (SEQ ID NO: 109), AIKWEYILL (SEQ ID NO:110), CACLWMMLL (SEQ ID NO:111), KYLFNWAVK (SEQ ID NO:112), VLSDFKTWL (SEQ ID NO:113), PYIEQGMQL (SEQ ID NO:114), PYCWHYAPR (SEQ ID NO:115), RHTPVNSWL (SEQ ID NO.H6), RMILMTHFF (SEQ ID NO:117), MYTNVDQDL (SEQ ID NO:118), DYVPPWHG (SEQ ID NO:119), AYMSKAHGI (SEQ ID NO:120), VFLARLIWW (SEQ ID NO:121), RYSAPPGDP (SEQ ID NO:122), VFFCAAWYI (SEQ ID NO:123), RVCEKMALY (SEQ ID NO:124), VYDHLTPLR (SEQ ID NO:125), VILDSFDPI (SEQ ID NO:126), SYGFQYSPG (SEQ ID NO:127), AYYSMVGNW (SEQ ID NO:128), QYSPGQRVE (SEQ ID NO:129), SFSIFLLAL (SEQ ID NO: 130), RVCEKMALY (SEQ ID NO: 131), WFSDMETK (SEQ ID NO:132), IVFPDLGVR (SEQ ID N0.133), TVFKVRMYV (SEQ ID NO: 134), KVFLARLIW (SEQ ID NO: 135), AVMGPSYGF (SEQ ID NO:136), KYLFNWAVK (SEQ ID NO:137), AVCTRGVAK (SEQ ID NO:138), KLTPPHSAK (SEQ ID NO:139), STNPKPQRK (SEQ ID NO:140), RVLEDGVNY (SEQ ID NO:141), GLNAVAYYR (SEQ ID NO:142), QLFTFSPRR (SEQ ID NO: 143), EVDGVRLHR (SEQ ID NO: 144), EVRNVSGIY (SEQ ID NO: 145), GVPPLRVWR (SEQ ID NO:146), QTFQVAHLH (SEQ ID NO:147), 
SPRPISYLK (SEQ ID NO:148), YLFNWAVKT (SEQ ID NO:149), RLLAPITAY (SEQ ID NO:150), SVPAEILRK (SEQ ID NO:151), RLGVRATRK (SEQ ID NO:152), YLLPRRGPR (SEQ ID NO:153), EVFCVQPEK (SEQ ID NO: 154), WCAAILRR (SEQ ID NO: 155), LLTLSPYYK (SEQ ID NO:156), KLAALTGTY (SEQ ID NO:157), LVFFCAAWY (SEQ ID NO:158), SLRQKKVTF (SEQ ID NO:159), GVLAGLAYY (SEQ ID NO:160), SVFLVSQLF (SEQ ID NO:161), TVNFTVFKV (SEQ ID NO:162), KLGVPPLRV (SEQ ID NO:163), LIRACMLVR (SEQ ID NO:164), LVNTWKSKK (SEQ ID NO:165), ITRVPYFVR (SEQ ID NO:166), KHPEATYTK (SEQ ID NO:167), LLCPSGHW (SEQ ID NO:168), KTKRNTNRR (SEQ ID NO:169), SIYPGHVSG (SEQ ID NO:170), SSIPTTTIR (SEQ ID NO:171), TLPQDAVSR (SEQ ID NO:172), VTLTHPITK (SEQ ID NO:173), PIPPPRRKR (SEQ ID NO:174), RHADWPVR (SEQ ID NO:175), RHTPVNSWL (SEQ ID NO:176), RRCRASGVL (SEQ ID NO:177), RRGPRLGVR (SEQ ID NO:178), GRTWAQPGY (SEQ ID NO:179), RRYETVQDC (SEQ ID NO:180), RRGDSRGSL (SEQ ID NO:181), YRFVTPGER (SEQ ID NO:182), TRAEAHLQV (SEQ ID NO:183), GRAATCGKY (SEQ ID NO:184), AHWGVLAGL (SEQ ID NO:185), RRSRNLGKV (SEQ ID NO:186), GRRQPIPKA (SEQ ID NO:187), YHGAGSKTL (SEQ ID NO:188), LRDWAHAGL (SEQ ID NO:189), HRFNASGCP (SEQ ID NO:190), GRDKNQVDG (SEQ ID NO:191), RRHVGPGEG (SEQ ID NO: 192), THFFSILLA (SEQ ID NO: 193), RRQPIPKAR (SEQ ID NO: 194), TRPPQGNWF (SEQ ID NO: 195), GRLVPGAAY (SEQ ID NO:196), GHVSGHRMA (SEQ ID NO:197), RHRARSVRA (SEQ ID NO:198), HRNIVDVQY (SEQ ID NO:199), TRDPTTPLA (SEQ ID NO:200), GHVKNGSMR (SEQ ID NO.201), LRVWRHRAR (SEQ ID NO.202), NRRPQDVKF (SEQ ID NO:203), RRKRTWLT (SEQ ID NO:204), RHVDLLVGA (SEQ ID NO:205), THVTGGRVA (SEQ ID NO:206), ARALAHGVR (SEQ ID NO:207), RRGKEILLG (SEQ ID NO:208), AHFLSQTKQ (SEQ ID NO:209), RRGRTGRGR (SEQ ID NO:210), GHTHVTGGR (SEQ ID N0.211), SRCWVALTP (SEQ ID NO:212), TRVESENKV (SEQ ID NO:213), WHYPCTVNF.(SEQ ID NO:214), GQIVGGVYL (SEQ ID NO:215), GHWGIFRA (SEQ ID NO;216), ARRPEGRTW (SEQ ID NO:217), KHPEATYTK (SEQ ID N0.218), FHYVTGMTT (SEQ ID NO:219), HRARSVRAK (SEQ ID NO:220), LHGPTPLLY (SEQ ID NO:221), 
SRAQRRGRT (SEQ ID NO:222), TRVPYFVRA (SEQ ID NO:223), GRPAVIPDR (SEQ ID NO:224), ERSQPRGRR (SEQ ID NO:225), RRRSRNLGK (SEQ ID NO:226), RHHSMVYST (SEQ ID NO:227), PRRKRTWL (SEQ ID NO:228), GRKPARLIV (SEQ ID NO:229), RRPEGRTWA (SEQ ID NO:230), GRDAIILLT (SEQ ID NO:231), DHYRDVLKE (SEQ ID NO:232), ERLHGLSAF (SEQ ID NO:233), TRYSAPPGD (SEQ ID NO:234), RGYKGVWRG (SEQ ID NO:235), RCFDSTVTE (SEQ ID NO:236), YRLWHYPCT (SEQ ID NO:237), DRSELSPLL (SEQ ID NO:238), GHYVQMAFM (SEQ ID NO:239), NEGMGWAGW (SEQ ID NO:240), GQIVGGVYL (SEQ ID NO:241), EEWFQVGL (SEQ ID NO:242), AEAHLQVWV (SEQ ID NO:243), VEFLVNTWK (SEQ ID NO:244), GEDWCCSM (SEQ ID NO:245), GEIPFYGKA (SEQ ID NO:246), GEAGEDWC (SEQ ID NO:247), REISVPAEI (SEQ ID NO:248), GEINRVASC (SEQ ID NO:249), RNVSGIYHV (SEQ ID NO:250), EAIKGGRHL (SEQ ID NO:251), VEVTRVGDF (SEQ ID NO:252), GLTHIDAHF (SEQ ID NO:253), LEWTSTWV (SEQ ID NO:254), VEHRLNAAC (SEQ ID NO:255), AEILRKPRK (SEQ ID NO:256), VDMVAGAHW (SEQ ID NO:257), EEYVEVTRV (SEQ ID NO:258), GRHLIFCHS (SEQ ID NO:259), GERCDLEDR (SEQ ID NO:260), SQLDLSGWF (SEQ ID NO:261), RGRSGIYRF (SEQ ID NO:262), GSIGLGKVL (SEQ ID NO:263), QEMGGNITR (SEQ ID NO:264), KEILLGPAD (SEQ ID NO:265), AQGYKVLVL (SEQ ID NO:266), SSTQSLVSW (SEQ ID NO:267), NTRPPQGNW (SEQ ID NO:268), ITYSTYCKF (SEQ ID NO:269), KFPPALPIW (SEQ ID NO:270), SSIPTTTIR (SEQ ID NO:271), HSTDSTTIL (SEQ ID NO:272), YSPGEINRV (SEQ ID NO:273), STNPKPQRK (SEQ ID NO:274), TTLPALSTG (SEQ ID NO:275), LTDPSHITA (SEQ ID NO:276), TLSPYYKVF (SEQ ID NO:277), TSPLTTQNT (SEQ ID NO:278), STLPGNPAI (SEQ ID NO:279), FSLDPTFTI (SEQ ID NO:280), STLPQAVMG (SEQ ID NO:281), LTTQNTLLF (SEQ ID NO:282), YSWGENETD (SEQ ID NO:283), HSAKSKFGY (SEQ ID NO:284), VWGTTDRF (SEQ ID NO:285), ARRPEGRTW (SEQ ID NO:286), SSLTITQLL (SEQ ID NO:287), TVLSDFKTW (SEQ ID NO:288), STDSTTILG (SEQ ID NO:289), ESMETTMRS (SEQ ID NO:290), SSPPAVPQT (SEQ ID NO:291), TTALWSQL (SEQ ID NO:292), ITAETAKRR (SEQ ID NO:293), ITSPLTTQN (SEQ ID NO:294), TSGDWWA (SEQ ID NO:295), FSDMETKLI
(SEQ ID NO:296), SSDQRPYCW (SEQ ID NO:297), ASITSPLTT (SEQ ID NO:298), ITTGGPITY (SEQ ID NO:299), TSCSSNVSV (SEQ ID NO:300), ITVPHPNIE (SEQ ID NO:301), CSTPCSGSW (SEQ ID NO:302), NVDQDLVGW (SEQ ID NO:303), LTTGSWIV (SEQ ID NO:304), KSTKVPAAY (SEQ ID NO:305), LSAPSLKAT (SEQ ID NO:306), YNPPLLESW (SEQ ID NO:307), HVSPTHYVP (SEQ ID NO:308), GLGLNAVAY (SEQ ID NO:309), GPGEGAVQW (SEQ ID NO:310), YLYGIGSAV (SEQ ID NO:311), LLSPRPISY (SEQ ID NO:312), SLRQKKVTF (SEQ ID NO:313), ILSPGALW (SEQ ID NO:314), GVDGHTHVT (SEQ ID NO:315), RLLAPITAY (SEQ ID NO:316), GPKGPITQM (SEQ ID NO:317), HVSGHRMAW (SEQ ID NO:318), EMKAKASTV (SEQ ID NO:319), RVGDFHYVT (SEQ ID NO:320), YVGGPLTNS (SEQ ID NO:321), DQRPYCWHY (SEQ ID NO:322), SIYPGHVSG (SEQ ID NO:323), FQYSPGQRV (SEQ ID NO:324), GLRDLAVAV (SEQ ID NO:325), VLSDFKTWL (SEQ ID NO:326), DLEWTSTW (SEQ ID NO:327), LLRHHSMVY (SEQ ID NO:328), ALRAFTEAM (SEQ ID NO:329), EMGGNITRV (SEQ ID NO:330), IQYLAGLST (SEQ ID NO:331), DLSDGSWST (SEQ ID NO:332), ILGPLMVLQ (SEQ ID NO:333), CQRGYKGVW (SEQ ID NO:334), ILLGPADSF (SEQ ID NO:335), VTRVGDFHY (SEQ ID NO:336), YVYDHLTPL (SEQ ID NO:337), DGGCSGGAY (SEQ ID NO:338), VAGGHYVQM (SEQ ID NO:339), IMYAPTLWA (SEQ ID NO:340), RVLEDGVNY (SEQ ID NO:341), YGIGSAWS (SEQ ID NO:342), ARRPEGRTW (SEQ ID NO:343), ALPPRAYAM (SEQ ID NO:344), GQIVGGVYL (SEQ ID NO:345), YVPPWHGC (SEQ ID NO:346), SQLDLSGWF (SEQ ID NO:347), RPRWFMLCL (SEQ ID NO:348), SPRGSRPSW (SEQ ID NO:349), APPPSWDQM (SEQ ID NO:350), APRPCGIVP (SEQ ID NO:351), LPRLPGLPF (SEQ ID NO:352), EPEPDVAVL (SEQ ID NO:353), ARRPEGRTW (SEQ ID NO:354), GPKGPITQM (SEQ ID NO:355), KPRKFPPAL (SEQ ID NO:356), APPGDPPQP (SEQ ID NO:357), TPIPAASQL (SEQ ID NO:358), RPDYNPPLL (SEQ ID NO:359), AVNHIRSVW (SEQ ID NO:360), APPIPPPRR (SEQ ID NO:361), RARPRWFML (SEQ ID NO:362), HPNIEEVAL (SEQ ID NO:363), RPSWGPTDP (SEQ ID NO:364), SPRPISYLK (SEQ ID NO:365), DPPQPEYDL (SEQ ID NO:366), WPAPPGARS (SEQ ID NO:367), APNYSRALW (SEQ ID NO:368), RPIDEFAQG (SEQ ID NO:369), APPGARSMT (SEQ
ID NO:370), RAQAPPPSW (SEQ ID NO:371), SPPAVPQTF (SEQ ID NO:372), APTLWARMI (SEQ ID NO:373), CPSGHWGI (SEQ ID NO:374), QPGYPWPLY (SEQ ID NO:375), PPPSWDQMW (SEQ ID NO:376), LPIWARPDY (SEQ ID NO:377), APPSAASAF (SEQ ID NO:378), TPSPAPNYS (SEQ ID NO:379), SPAPNYSRA (SEQ ID NO:380), RPAVIPDRE (SEQ ID NO:381), APLGGAARA (SEQ ID NO:382), RAATCGKYL (SEQ ID NO:383), SPLTTQNTL (SEQ ID NO:384), QPRGRRQPI (SEQ ID NO:385), GPRLGVRAT (SEQ ID NO:386), SPGEINRVA (SEQ ID NO:387), NPAIASLMA (SEQ ID NO:388), FRKHPEATY (SEQ ID NO:389), DPSHITAET (SEQ ID NO:390), RPCGIVPAS (SEQ ID NO:391), NTRPPQGNW (SEQ ID NO:392), HPITKYIMA (SEQ ID NO:393), PPWHGCPL (SEQ ID NO:394), AVIPDREVL (SEQ ID NO:395), PPIPPPRRK (SEQ ID NO:396), TPPGSITVP (SEQ ID NO:397), IPAASQLDL (SEQ ID NO:398), TPSPVWGT (SEQ ID NO:399), GPTDPRRRS (SEQ ID NO:400), RARSVRAKL (SEQ ID NO:401), VPGAAYALY (SEQ ID NO:402), TPIDTTIMA (SEQ ID NO:403), AALTGTYVY (SEQ ID NO:404), AASCGGAVF (SEQ ID NO:405), TPAETSVRL (SEQ ID NO:406), EARQAIRSL (SEQ ID NO:407), KPTLHGPTP (SEQ ID NO:408), DPTTPLARA (SEQ ID NO:409), AVMGPSYGF (SEQ ID NO:410), AVAVEPWF (SEQ ID NO:411), RVASSTQSL (SEQ ID NO:412), ALAHGVRVL (SEQ ID NO:413), DPRRRSRNL (SEQ ID NO:414), GPGEGAVQW (SEQ ID NO:415), EPDVAVLTS (SEQ ID NO:416), RGRSGIYRF (SEQ ID NO:417), RAYLNTPGL (SEQ ID NO:418), MPSTEDLVN (SEQ ID NO:419), RLIWWLQYF (SEQ ID NO:420), VPHPNIEEV (SEQ ID NO:421), TPGERPSGM (SEQ ID NO:422), HVSGHRMAW (SEQ ID NO:423), ATLGFGAYM (SEQ ID NO:424), NPKPQRKTK (SEQ ID NO:425), QPIPKARRP (SEQ ID NO:426), FPDLGVRVC (SEQ ID NO:427), AAALCSAMY (SEQ ID NO:428), and VPAPEFFTE (SEQ ID NO:429), and variations thereof.

2. A cytotoxic Hepatitis C virus T-cell epitope, the epitope being selected from the group of peptides consisting of GMGWAGWLL (SEQ ID NO:1), DLMGYIPLV (SEQ ID NO:2), RALAHGVRV (SEQ ID NO:3), FLLALLSCL (SEQ ID NO:4), MIMHTPGCV (SEQ ID NO:5), FLVSQLFTF (SEQ ID NO:6), MMNWSPTTA (SEQ ID NO:7), QLLRIPQAV (SEQ ID NO:8), LLRIPQAW (SEQ ID NO:9), MVAGAHWGV (SEQ ID NO:10), SMVGNWAKV (SEQ ID NO:11), CMVDYPYRL (SEQ ID NO:12), RLWHYPCTV (SEQ ID NO:13), YVGGVEHRL (SEQ ID NO:14), LLSTTEWQI (SEQ ID NO:15), LIHLHRNIV (SEQ ID NO:16), YLYGIGSAV (SEQ ID NO:17), FLLLADARV (SEQ ID NO:18), MLLIAQAEA (SEQ ID NO:19), RLVPGAAYA (SEQ ID NO:20), ALYGVWPLL (SEQ ID NO:21), LLLALPPRA (SEQ ID NO:22), LTLSPYYKV (SEQ ID NO:23), FLARLIWWL (SEQ ID NO:24), LLLAILGPL (SEQ ID NO:25), VLQAGITRV (SEQ ID NO:26), GITRVPYFV (SEQ ID NO:27), GLIRACMLV (SEQ ID NO:28), YVYDHLTPL (SEQ ID NO:29), DLAVAVEPV (SEQ ID NO:30), LLGCIITSL (SEQ ID NO:31), FLATCVNGV (SEQ ID NO:32), CVNGVCWTV (SEQ ID NO:33), LLCPSGHW (SEQ ID NO:34), YLNTPGLPV (SEQ ID NO:35), YLVAYQATV (SEQ ID NO:36), RLGAVQNEV (SEQ ID NO:37), CMSADLEW (SEQ ID NO:38), ALAAYCLTT (SEQ ID NO:39), ILSGRPAVI (SEQ ID NO:40), SLMAFTASI (SEQ ID NO:41), ILAGYGAGV (SEQ ID NO:42), ILSPGALW (SEQ ID NO:43), KLLPRLPGL (SEQ ID NO:44), VAAEEYVEV (SEQ ID NO:45), RLHRYAPVC (SEQ ID NO:46), EMGGNITRV (SEQ ID NO:47), SLLRHHSMV (SEQ ID NO:48), VLDDHYRDV (SEQ ID NO:49), ALYDWSTL (SEQ ID NO:50), KLQDCTMLV (SEQ ID NO:51), MLVNGDDLV (SEQ ID NO:52), IMYAPTLWA (SEQ ID NO:53), ILMTHFFSI (SEQ ID NO:54), ALDCQIYGA (SEQ ID NO:55), KLGVPPLRV (SEQ ID NO:56), MLCLLLLSV (SEQ ID NO:57), LLLSVGVGI (SEQ ID NO:58), ITTGGPITY (SEQ ID NO:59), IVDVQYLYG (SEQ ID NO:60), STDSTTILG (SEQ ID NO:61), ASQVCGPVY (SEQ ID NO:62), FTVFKVRMY (SEQ ID NO:63), STEDLVNLL (SEQ ID NO:64), LHGPTPLLY (SEQ ID NO:65), SLDPTFTIE (SEQ ID NO:66), CTCGSSDLY (SEQ ID NO:67), LSAFSLHSY (SEQ ID NO:68), LTDPSHITA (SEQ ID NO:69), LLSPRPISY (SEQ ID NO:70), YSSMPPLEG (SEQ ID NO:71), PGDPPQPEY (SEQ ID NO:72), DWCCSMSY (SEQ ID NO:73),
KKCPMGFSY (SEQ ID NO:74), VTRVGDFHY (SEQ ID NO:75), CSIYPGHVS (SEQ ID NO:76), SFDPIRAVE (SEQ ID NO:77), VQDCNCSIY (SEQ ID NO:78), FTEAMTRYS (SEQ ID NO:79), VFQVGLNQY (SEQ ID NO:80), IFDITKLLL (SEQ ID NO:81), VLEDGVNYA (SEQ ID NO:82), ETDVLLLSN (SEQ ID NO:83), STLPGNPAI (SEQ ID NO:84), PTDPRRRSR (SEQ ID NO:85), STNPKPQRK (SEQ ID NO:86), RSELSPLLL (SEQ ID NO:87), VTNDCSNSS (SEQ ID NO:88), FSDMETKLI (SEQ ID NO:89), DYPYRLWHY (SEQ ID NO:90), RYAPVCKPL (SEQ ID NO:91), AYSQQTRGL (SEQ ID NO:92), RLIWWLQYF (SEQ ID NO:93), VFSDMETKL (SEQ ID NO:94), CYDAGCAWY (SEQ ID NO:95), CYSIEPLDL (SEQ ID NO:96), TYCKFLADG (SEQ ID NO:97), TYSWGENET (SEQ ID NO:98), YYSMVGNWA (SEQ ID NO:99), LYGVWPLLL (SEQ ID NO:100), YYKVFLARL (SEQ ID NO:101), LMTHFFSIL (SEQ ID NO:102), EYILLLFLL (SEQ ID NO:103), VYHGAGSKT (SEQ ID NO:104), TRPPQGNWF (SEQ ID NO:105), AYAAQGYKV (SEQ ID NO:106), FYAHRFNAS (SEQ ID NO:107), SYTWTGALI (SEQ ID NO:108), VWDWICTVL (SEQ ID NO:109), AIKWEYILL (SEQ ID NO:110), CACLWMMLL (SEQ ID NO:111), KYLFNWAVK (SEQ ID NO:112), VLSDFKTWL (SEQ ID NO:113), PYIEQGMQL (SEQ ID NO:114), PYCWHYAPR (SEQ ID NO:115), RHTPVNSWL (SEQ ID NO:116), RMILMTHFF (SEQ ID NO:117), MYTNVDQDL (SEQ ID NO:118), DYVPPWHG (SEQ ID NO:119), AYMSKAHGI (SEQ ID NO:120), VFLARLIWW (SEQ ID NO:121), RYSAPPGDP (SEQ ID NO:122), VFFCAAWYI (SEQ ID NO:123), RVCEKMALY (SEQ ID NO:124), VYDHLTPLR (SEQ ID NO:125), VILDSFDPI (SEQ ID NO:126), SYGFQYSPG (SEQ ID NO:127), AYYSMVGNW (SEQ ID NO:128), QYSPGQRVE (SEQ ID NO:129), SFSIFLLAL (SEQ ID NO:130), RVCEKMALY (SEQ ID NO:131), WFSDMETK (SEQ ID NO:132), IVFPDLGVR (SEQ ID NO:133), TVFKVRMYV (SEQ ID NO:134), KVFLARLIW (SEQ ID NO:135), AVMGPSYGF (SEQ ID NO:136), KYLFNWAVK (SEQ ID NO:137), AVCTRGVAK (SEQ ID NO:138), KLTPPHSAK (SEQ ID NO:139), STNPKPQRK (SEQ ID NO:140), RVLEDGVNY (SEQ ID NO:141), GLNAVAYYR (SEQ ID NO:142), QLFTFSPRR (SEQ ID NO:143), EVDGVRLHR (SEQ ID NO:144), EVRNVSGIY (SEQ ID NO:145), GVPPLRVWR (SEQ ID NO:146), QTFQVAHLH (SEQ ID NO:147), SPRPISYLK
(SEQ ID NO:148), YLFNWAVKT (SEQ ID NO:149), RLLAPITAY (SEQ ID NO:150), SVPAEILRK (SEQ ID NO:151), RLGVRATRK (SEQ ID NO:152), YLLPRRGPR (SEQ ID NO:153), EVFCVQPEK (SEQ ID NO:154), WCAAILRR (SEQ ID NO:155), LLTLSPYYK (SEQ ID NO:156), KLAALTGTY (SEQ ID NO:157), LVFFCAAWY (SEQ ID NO:158), SLRQKKVTF (SEQ ID NO:159), GVLAGLAYY (SEQ ID NO:160), SVFLVSQLF (SEQ ID NO:161), TVNFTVFKV (SEQ ID NO:162), KLGVPPLRV (SEQ ID NO:163), LIRACMLVR (SEQ ID NO:164), LVNTWKSKK (SEQ ID NO:165), ITRVPYFVR (SEQ ID NO:166), KHPEATYTK (SEQ ID NO:167), LLCPSGHW (SEQ ID NO:168), KTKRNTNRR (SEQ ID NO:169), SIYPGHVSG (SEQ ID NO:170), SSIPTTTIR (SEQ ID NO:171), TLPQDAVSR (SEQ ID NO:172), VTLTHPITK (SEQ ID NO:173), PIPPPRRKR (SEQ ID NO:174), RHADWPVR (SEQ ID NO:175), RHTPVNSWL (SEQ ID NO:176), RRCRASGVL (SEQ ID NO:177), RRGPRLGVR (SEQ ID NO:178), GRTWAQPGY (SEQ ID NO:179), RRYETVQDC (SEQ ID NO:180), RRGDSRGSL (SEQ ID NO:181), YRFVTPGER (SEQ ID NO:182), TRAEAHLQV (SEQ ID NO:183), GRAATCGKY (SEQ ID NO:184), AHWGVLAGL (SEQ ID NO:185), RRSRNLGKV (SEQ ID NO:186), GRRQPIPKA (SEQ ID NO:187), YHGAGSKTL (SEQ ID NO:188), LRDWAHAGL (SEQ ID NO:189), HRFNASGCP (SEQ ID NO:190), GRDKNQVDG (SEQ ID NO:191), RRHVGPGEG (SEQ ID NO:192), THFFSILLA (SEQ ID NO:193), RRQPIPKAR (SEQ ID NO:194), TRPPQGNWF (SEQ ID NO:195), GRLVPGAAY (SEQ ID NO:196), GHVSGHRMA (SEQ ID NO:197), RHRARSVRA (SEQ ID NO:198), HRNIVDVQY (SEQ ID NO:199), TRDPTTPLA (SEQ ID NO:200), GHVKNGSMR (SEQ ID NO:201), LRVWRHRAR (SEQ ID NO:202), NRRPQDVKF (SEQ ID NO:203), RRKRTWLT (SEQ ID NO:204), RHVDLLVGA (SEQ ID NO:205), THVTGGRVA (SEQ ID NO:206), ARALAHGVR (SEQ ID NO:207), RRGKEILLG (SEQ ID NO:208), AHFLSQTKQ (SEQ ID NO:209), RRGRTGRGR (SEQ ID NO:210), GHTHVTGGR (SEQ ID NO:211), SRCWVALTP (SEQ ID NO:212), TRVESENKV (SEQ ID NO:213), WHYPCTVNF (SEQ ID NO:214), GQIVGGVYL (SEQ ID NO:215), GHWGIFRA (SEQ ID NO:216), ARRPEGRTW (SEQ ID NO:217), KHPEATYTK (SEQ ID NO:218), FHYVTGMTT (SEQ ID NO:219), HRARSVRAK (SEQ ID NO:220), LHGPTPLLY (SEQ ID NO:221),
SRAQRRGRT (SEQ ID NO:222), TRVPYFVRA (SEQ ID NO:223), GRPAVIPDR (SEQ ID NO:224), ERSQPRGRR (SEQ ID NO:225), RRRSRNLGK (SEQ ID NO:226), RHHSMVYST (SEQ ID NO:227), PRRKRTWL (SEQ ID NO:228), GRKPARLIV (SEQ ID NO:229), RRPEGRTWA (SEQ ID NO:230), GRDAIILLT (SEQ ID NO:231), DHYRDVLKE (SEQ ID NO:232), ERLHGLSAF (SEQ ID NO:233), TRYSAPPGD (SEQ ID NO:234), RGYKGVWRG (SEQ ID NO:235), RCFDSTVTE (SEQ ID NO:236), YRLWHYPCT (SEQ ID NO:237), DRSELSPLL (SEQ ID NO:238), GHYVQMAFM (SEQ ID NO:239), NEGMGWAGW (SEQ ID NO:240), GQIVGGVYL (SEQ ID NO:241), EEWFQVGL (SEQ ID NO:242), AEAHLQVWV (SEQ ID NO:243), VEFLVNTWK (SEQ ID NO:244), GEDWCCSM (SEQ ID NO:245), GEIPFYGKA (SEQ ID NO:246), GEAGEDWC (SEQ ID NO:247), REISVPAEI (SEQ ID NO:248), GEINRVASC (SEQ ID NO:249), RNVSGIYHV (SEQ ID NO:250), EAIKGGRHL (SEQ ID NO:251), VEVTRVGDF (SEQ ID NO:252), GLTHIDAHF (SEQ ID NO:253), LEWTSTWV (SEQ ID NO:254), VEHRLNAAC (SEQ ID NO:255), AEILRKPRK (SEQ ID NO:256), VDMVAGAHW (SEQ ID NO:257), EEYVEVTRV (SEQ ID NO:258), GRHLIFCHS (SEQ ID NO:259), GERCDLEDR (SEQ ID NO:260), SQLDLSGWF (SEQ ID NO:261), RGRSGIYRF (SEQ ID NO:262), GSIGLGKVL (SEQ ID NO:263), QEMGGNITR (SEQ ID NO:264), KEILLGPAD (SEQ ID NO:265), AQGYKVLVL (SEQ ID NO:266), SSTQSLVSW (SEQ ID NO:267), NTRPPQGNW (SEQ ID NO:268), ITYSTYCKF (SEQ ID NO:269), KFPPALPIW (SEQ ID NO:270), SSIPTTTIR (SEQ ID NO:271), HSTDSTTIL (SEQ ID NO:272), YSPGEINRV (SEQ ID NO:273), STNPKPQRK (SEQ ID NO:274), TTLPALSTG (SEQ ID NO:275), LTDPSHITA (SEQ ID NO:276), TLSPYYKVF (SEQ ID NO:277), TSPLTTQNT (SEQ ID NO:278), STLPGNPAI (SEQ ID NO:279), FSLDPTFTI (SEQ ID NO:280), STLPQAVMG (SEQ ID NO:281), LTTQNTLLF (SEQ ID NO:282), YSWGENETD (SEQ ID NO:283), HSAKSKFGY (SEQ ID NO:284), VWGTTDRF (SEQ ID NO:285), ARRPEGRTW (SEQ ID NO:286), SSLTITQLL (SEQ ID NO:287), TVLSDFKTW (SEQ ID NO:288), STDSTTILG (SEQ ID NO:289), ESMETTMRS (SEQ ID NO:290), SSPPAVPQT (SEQ ID NO:291), TTALWSQL (SEQ ID NO:292), ITAETAKRR (SEQ ID NO:293), ITSPLTTQN (SEQ ID NO:294), TSGDWWA (SEQ ID NO:295), FSDMETKLI
(SEQ ID NO:296), SSDQRPYCW (SEQ ID NO:297), ASITSPLTT (SEQ ID NO:298), ITTGGPITY (SEQ ID NO:299), TSCSSNVSV (SEQ ID NO:300), ITVPHPNIE (SEQ ID NO:301), CSTPCSGSW (SEQ ID NO:302), NVDQDLVGW (SEQ ID NO:303), LTTGSWIV (SEQ ID NO:304), KSTKVPAAY (SEQ ID NO:305), LSAPSLKAT (SEQ ID NO:306), YNPPLLESW (SEQ ID NO:307), HVSPTHYVP (SEQ ID NO:308), GLGLNAVAY (SEQ ID NO:309), GPGEGAVQW (SEQ ID NO:310), YLYGIGSAV (SEQ ID NO:311), LLSPRPISY (SEQ ID NO:312), SLRQKKVTF (SEQ ID NO:313), ILSPGALW (SEQ ID NO:314), GVDGHTHVT (SEQ ID NO:315), RLLAPITAY (SEQ ID NO:316), GPKGPITQM (SEQ ID NO:317), HVSGHRMAW (SEQ ID NO:318), EMKAKASTV (SEQ ID NO:319), RVGDFHYVT (SEQ ID NO:320), YVGGPLTNS (SEQ ID NO:321), DQRPYCWHY (SEQ ID NO:322), SIYPGHVSG (SEQ ID NO:323), FQYSPGQRV (SEQ ID NO:324), GLRDLAVAV (SEQ ID NO:325), VLSDFKTWL (SEQ ID NO:326), DLEWTSTW (SEQ ID NO:327), LLRHHSMVY (SEQ ID NO:328), ALRAFTEAM (SEQ ID NO:329), EMGGNITRV (SEQ ID NO:330), IQYLAGLST (SEQ ID NO:331), DLSDGSWST (SEQ ID NO:332), ILGPLMVLQ (SEQ ID NO:333), CQRGYKGVW (SEQ ID NO:334), ILLGPADSF (SEQ ID NO:335), VTRVGDFHY (SEQ ID NO:336), YVYDHLTPL (SEQ ID NO:337), DGGCSGGAY (SEQ ID NO:338), VAGGHYVQM (SEQ ID NO:339), IMYAPTLWA (SEQ ID NO:340), RVLEDGVNY (SEQ ID NO:341), YGIGSAWS (SEQ ID NO:342), ARRPEGRTW (SEQ ID NO:343), ALPPRAYAM (SEQ ID NO:344), GQIVGGVYL (SEQ ID NO:345), YVPPWHGC (SEQ ID NO:346), SQLDLSGWF (SEQ ID NO:347), RPRWFMLCL (SEQ ID NO:348), SPRGSRPSW (SEQ ID NO:349), APPPSWDQM (SEQ ID NO:350), APRPCGIVP (SEQ ID NO:351), LPRLPGLPF (SEQ ID NO:352), EPEPDVAVL (SEQ ID NO:353), ARRPEGRTW (SEQ ID NO:354), GPKGPITQM (SEQ ID NO:355), KPRKFPPAL (SEQ ID NO:356), APPGDPPQP (SEQ ID NO:357), TPIPAASQL (SEQ ID NO:358), RPDYNPPLL (SEQ ID NO:359), AVNHIRSVW (SEQ ID NO:360), APPIPPPRR (SEQ ID NO:361), RARPRWFML (SEQ ID NO:362), HPNIEEVAL (SEQ ID NO:363), RPSWGPTDP (SEQ ID NO:364), SPRPISYLK (SEQ ID NO:365), DPPQPEYDL (SEQ ID NO:366), WPAPPGARS (SEQ ID NO:367), APNYSRALW (SEQ ID NO:368), RPIDEFAQG (SEQ ID NO:369), APPGARSMT (SEQ
ID NO:370), RAQAPPPSW (SEQ ID NO:371), SPPAVPQTF (SEQ ID NO:372), APTLWARMI (SEQ ID NO:373), CPSGHWGI (SEQ ID NO:374), QPGYPWPLY (SEQ ID NO:375), PPPSWDQMW (SEQ ID NO:376), LPIWARPDY (SEQ ID NO:377), APPSAASAF (SEQ ID NO:378), TPSPAPNYS (SEQ ID NO:379), SPAPNYSRA (SEQ ID NO:380), RPAVIPDRE (SEQ ID NO:381), APLGGAARA (SEQ ID NO:382), RAATCGKYL (SEQ ID NO:383), SPLTTQNTL (SEQ ID NO:384), QPRGRRQPI (SEQ ID NO:385), GPRLGVRAT (SEQ ID NO:386), SPGEINRVA (SEQ ID NO:387), NPAIASLMA (SEQ ID NO:388), FRKHPEATY (SEQ ID NO:389), DPSHITAET (SEQ ID NO:390), RPCGIVPAS (SEQ ID NO:391), NTRPPQGNW (SEQ ID NO:392), HPITKYIMA (SEQ ID NO:393), PPWHGCPL (SEQ ID NO:394), AVIPDREVL (SEQ ID NO:395), PPIPPPRRK (SEQ ID NO:396), TPPGSITVP (SEQ ID NO:397), IPAASQLDL (SEQ ID NO:398), TPSPVWGT (SEQ ID NO:399), GPTDPRRRS (SEQ ID NO:400), RARSVRAKL (SEQ ID NO:401), VPGAAYALY (SEQ ID NO:402), TPIDTTIMA (SEQ ID NO:403), AALTGTYVY (SEQ ID NO:404), AASCGGAVF (SEQ ID NO:405), TPAETSVRL (SEQ ID NO:406), EARQAIRSL (SEQ ID NO:407), KPTLHGPTP (SEQ ID NO:408), DPTTPLARA (SEQ ID NO:409), AVMGPSYGF (SEQ ID NO:410), AVAVEPWF (SEQ ID NO:411), RVASSTQSL (SEQ ID NO:412), ALAHGVRVL (SEQ ID NO:413), DPRRRSRNL (SEQ ID NO:414), GPGEGAVQW (SEQ ID NO:415), EPDVAVLTS (SEQ ID NO:416), RGRSGIYRF (SEQ ID NO:417), RAYLNTPGL (SEQ ID NO:418), MPSTEDLVN (SEQ ID NO:419), RLIWWLQYF (SEQ ID NO:420), VPHPNIEEV (SEQ ID NO:421), TPGERPSGM (SEQ ID NO:422), HVSGHRMAW (SEQ ID NO:423), ATLGFGAYM (SEQ ID NO:424), NPKPQRKTK (SEQ ID NO:425), QPIPKARRP (SEQ ID NO:426), FPDLGVRVC (SEQ ID NO:427), AAALCSAMY (SEQ ID NO:428), and VPAPEFFTE (SEQ ID NO:429).

3. A method for predicting peptides that are epitopes or can be used as diagnostic tools, the method comprising: predicting which peptides bind to an MHC molecule with high affinity using a neural network with at least one of the following features:
• Some or all of the inputs to the neural network are generated using a hidden Markov model
• Some or all of the inputs are encoded by an amino acid substitution matrix different from an identity matrix
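The second feature of claim 3 can be illustrated with a minimal Python sketch that contrasts sparse (identity-matrix) encoding with encoding by a substitution matrix. The residue groups and the 2/1/0 scores below are hypothetical stand-ins chosen only so the example is self-contained; they are not the matrix used by the inventors, and a real system would load a published substitution matrix instead. The function names are likewise illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Hypothetical residue groups, used only to build a toy substitution matrix.
# A real system would load a published matrix (for example a BLOSUM matrix).
GROUPS = ("AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P")


def group_of(residue):
    """Return the index of the toy group that contains `residue`."""
    for index, members in enumerate(GROUPS):
        if residue in members:
            return index
    raise ValueError(f"unknown residue: {residue!r}")


def sparse_row(residue):
    """Identity-matrix (sparse) encoding: 1 at the residue's own position."""
    group_of(residue)  # raises ValueError for a non-standard residue
    return [1.0 if other == residue else 0.0 for other in AMINO_ACIDS]


def substitution_row(residue):
    """Toy substitution encoding: 2 for identity, 1 for same group, else 0."""
    own = group_of(residue)
    return [
        2.0 if other == residue else 1.0 if group_of(other) == own else 0.0
        for other in AMINO_ACIDS
    ]


def encode(peptide, row_fn):
    """Concatenate one 20-value row per residue into a flat input vector."""
    vector = []
    for residue in peptide:
        vector.extend(row_fn(residue))
    return vector


peptide = "GMGWAGWLL"  # SEQ ID NO:1 from the claims
sparse_inputs = encode(peptide, sparse_row)
matrix_inputs = encode(peptide, substitution_row)
print(len(sparse_inputs), len(matrix_inputs))  # 9 residues x 20 values each
```

Both encodings give a 9-mer the same input length, so they can feed networks of identical shape; only the substitution encoding lets chemically similar residues produce similar input vectors.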

4. A method for predicting peptides that are epitopes or can be used as diagnostic tools, the method comprising: predicting which peptides bind to an MHC molecule with high affinity using a neural network with both of the following features:
• Some of the inputs to the neural network are generated using a hidden Markov model
• Some of the inputs are encoded by amino acid substitution matrices

A method for predicting peptides that are epitopes or can be used as diagnostic tools, the method comprising: predicting which peptides bind to an MHC molecule with high affinity using a combination of at least two neural networks, each of which has a different input encoding of the sequence. The inputs to the neural networks may be generated using a hidden Markov model and/or an amino acid substitution matrix.
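The combination of at least two networks with different input encodings can be sketched as follows. This is a minimal illustration under stated assumptions: the claim does not say how the outputs are combined, so simple averaging is assumed here; each "network" is replaced by a single sigmoid unit with placeholder zero weights; and the hydrophobicity encoding and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVLIMFWC")  # hypothetical grouping for the second encoding


def sparse_encoding(peptide):
    """20 inputs per residue, with 1 at the residue's own position."""
    return [1.0 if aa == other else 0.0 for aa in peptide for other in AMINO_ACIDS]


def hydrophobic_encoding(peptide):
    """One input per residue: 1 if hydrophobic, else 0 (toy alternative)."""
    return [1.0 if aa in HYDROPHOBIC else 0.0 for aa in peptide]


def make_scorer(weights, bias):
    """Stand-in for a trained network: a single sigmoid unit."""
    def score(inputs):
        if len(inputs) != len(weights):
            raise ValueError("input length does not match weights")
        total = bias + sum(w * x for w, x in zip(weights, inputs))
        return 1.0 / (1.0 + math.exp(-total))
    return score


def combined_score(peptide, members):
    """Average the outputs of several (encoder, scorer) pairs (assumed rule)."""
    scores = [scorer(encoder(peptide)) for encoder, scorer in members]
    return sum(scores) / len(scores)


# Placeholder weights (all zero), not trained values.
members = [
    (sparse_encoding, make_scorer([0.0] * 180, 0.0)),
    (hydrophobic_encoding, make_scorer([0.0] * 9, 0.0)),
]
print(combined_score("GMGWAGWLL", members))
```

Keeping each encoder paired with its own scorer means a further network, for example one fed by hidden Markov model scores, could be added to `members` without changing the combination step.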

5. The prediction of the neural network is combined with a prediction or measurement of one of the following:
• Proteasomal cleavage sites
• MHC binding
• Presence of the sequence or related sequence(s) in patent databases
• TAP binding
• Gene or protein expression level
• Function of the protein
• Localization of the protein
• Similarity to self proteins

6. A method according to one of the preceding claims where the epitopes are derived from hepatitis C virus.

7. A vaccine using a limited number such as at least 1, 2, 3, 4, 5, 8, 16, 32, 64, 128, 256, 512 of the peptides listed in claim 1 or 2.

8. A diagnostic tool using a limited number such as at least 1, 2, 3, 4, 5, 8, 16, 32, 64, 128, 256, 512 of the peptides listed in claim 1 or 2.

9. A vaccine using a limited number such as at least 1, 2, 3, 4, 5, 8, 16, 32, 64, 128, 256, 512 of the peptides predicted by the method described in claim 3, 4 or 5.

10. A diagnostic tool using a limited number such as at least 1, 2, 3, 4, 5, 8, 16, 32, 64, 128, 256, 512 of the peptides predicted by the method described in claim 3, 4 or 5.