WO2014150924A2 - Accurate typing of hla through exome sequencing - Google Patents

Accurate typing of hla through exome sequencing Download PDF

Info

Publication number
WO2014150924A2
WO2014150924A2 PCT/US2014/024555 US2014024555W WO2014150924A2 WO 2014150924 A2 WO2014150924 A2 WO 2014150924A2 US 2014024555 W US2014024555 W US 2014024555W WO 2014150924 A2 WO2014150924 A2 WO 2014150924A2
Authority
WO
WIPO (PCT)
Prior art keywords
hla
alleles
read
contigs
reads
Prior art date
Application number
PCT/US2014/024555
Other languages
French (fr)
Other versions
WO2014150924A3 (en
Inventor
Xiao Yang
Chang Liu
Michael C. ZODY
John D. PFEIFER
Original Assignee
The Broad Institute, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Broad Institute, Inc. filed Critical The Broad Institute, Inc.
Priority to US14/772,347 priority Critical patent/US10176294B2/en
Publication of WO2014150924A2 publication Critical patent/WO2014150924A2/en
Publication of WO2014150924A3 publication Critical patent/WO2014150924A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30Data warehousing; Computing architectures

Definitions

  • Embodiments of the present disclosure are directed to Human leukocyte antigen (HLA) detection at the allelic level using sequencing data.
  • HLA Human leukocyte antigen
  • Allelic HLA typing refers to sequencing-based typing to determine variations in coding DNA sequences that alter the protein sequences. This is also commonly referred to as high resolution typing or four-digit typing. There are also alleles that bear synonymous mutations and mutations within noncoding DNA, but resolution of these alleles is rarely necessary in clinical practice.
  • cDNA full-length or partial complementary DNA
  • gDNA genomic DNA
  • HLA typing at the allelic level is typically performed by Sanger sequencing of selected exons of Class I genes (HLA-A, -B and -C) and Class II genes (HLA-DRB1 and - DQB1). HLA typing using exome sequencing (exome-seq) data is in its infancy.
  • exome sequencing exome-seq
  • the nomenclature of HLA alleles follows the guidelines from the WHO Nomenclature Committee for Factors of the HLA System (http://hla.alleles.org/nomenclature/naming.html).
  • Embodiments of the present disclosure present bioinformatic systems, methods, for allelic HLA typing (as well as and computer readable media having instructions for performing methods of HLA typing) using, for example, Illumina exome-sequencing data (e.g., 101 basepair, paired-end reads).
  • Illumina exome-sequencing data e.g., 101 basepair, paired-end reads.
  • Embodiments of the present disclosure overcome a major bioinformatic hurdle in applying targeted Illumina sequencing to allelic HLA typing.
  • qualified exome- seq datasets from research projects can be analyzed to revisit HLA-disease associations with allelic resolution.
  • some embodiments provide a way to deep sequencing of captured HLA genes alone rather than the whole exome, which enables multiplexing samples for high-throughput typing of bone marrow donors at reduced cost.
  • the HLA genes can also be bundled with existing oncogene panels(3) for targeted Illumina sequencing in order to prepare leukemic patients for transplantation while characterizing the molecular profiles of their diseases.
  • a set of contigs may be assembled from target (e.g. HLA-A) specific reads to substantially represent haplotype exons of the target gene (in some embodiments, fully represent haplotype exons of the target gene).
  • target specific reads may be obtained using two stages of filtering. For example, exome-sequencing data may be first filtered against alleles of HLA genes obtained from the, for example, IMGT/HLA database ( Figure 11), to narrow the searching space.
  • reads that concurrently align to target and off-target genes may be filtered, as these reads may introduce ambiguities to an assembly. It has been observed that up to 9% of reads are multi- mapped, where HLA-A and -DRB1 have higher percentages of multi-mapped reads compared with HLA-B, -C, and -DQB 1 genes.
  • distinct strategies to facilitate sequence assembly in the highly polymorphic HLA region are adopted. For example, in some embodiments, a read pair is first converted to a haplotype sequence when both ends are aligned to the same reference allele.
  • degenerate bases are used as placeholders at uncertain positions in the contig. These gaps may be resolved based on information from reads merged into the contig at a later point, for example ( Figure 3).
  • degenerate bases can also serve as an indicator of insufficient read support, which can be factored in during allelic pair inference (for example). Additionally, assembly of contigs sharing longer substrings with higher frequencies may be prioritized in some embodiments.
  • identification of relevant alleles whose exons should match the contigs assembled at a previous step may be accomplished.
  • the quality of an allele may be measured by, for example, summation of Hamming distances between individual exons and their best match in the contigs.
  • sequence similarity over the length of matching substrings ( Figure 4) may be prioritized. This may be done for at least one of two considerations. First, in some embodiments, alleles with nearly identical sequences require differentiation. Second, exons may be only partially sequenced which results in degenerate bases in the contig. Accordingly, in some embodiments, candidate alleles are qualified within a hamming distance of, for example, 2, which allows inclusion of potential no vel alleles closely related to existing alleles.
  • an additional step for an HLA typing process includes, for example, inferring homozygous and/or heterozygous allelic pairs by considering all possible allelic pairs and checking if the allelic pairs are consistent with the data as represented by contigs.
  • the principle of parsimony is utilized which ensures allelic pairs, fully supported by the data, may be prioritized over pairs consistent with but not fully supported by the data ( Figure 5a).
  • the final allelic pairs in some embodiments, should also explain as much of the information present in the data as possible ( Figure 5b). Meanwhile, phasing information among discrete exons may be inferred, in some embodiments ( Figure 5 b) since exome-seq may not fully capture introns.
  • a computer- implemented method for HLA typing comprising one or more of the following steps (and in some embodiments, a plurality of the steps, and in some embodiments, all of the following steps): accessing sequencing data that contains HLA information stored in one or more databases, accessing HLA sequences of alleles of one or more HLA genes and the multiple sequence alignment of these HLA sequences stored in one or more databases, and searching for target-specific reads/read-pairs from the sequencing data.
  • Searching may comprise first filtering the sequencing data against HLA sequences to obtain a first set of reads/read-pairs aligned to these sequences, and second filtering the first set to eliminate reads/read-pairs concurrently aligned to both sequences of the target HLA gene and off-target HLA genes to obtain a second set of target HLA reads/read-pairs.
  • the method may further include assembling a plurality of reads/read-pairs uniquely aligned to a target HLA gene into one or more contigs. Assembling may comprise merging read-pairs having both ends aligned to the same reference allele to form single-reads/haplotype sequences, and assembling one or more contigs from the haplotype sequences.
  • the assembly of contigs having longer substrings with higher frequencies may be prioritized.
  • the method may also include identifying relevant target alleles having exons matching the assembled contigs, where identifying may comprise decomposing target alleles into exons, calculating the Hamming distance (HD) between individual target allele exons and corresponding best matches from the assembled contigs, and prioritizing sequence similarity over the length of matching substrings between the contigs and the target allele exons.
  • identifying may comprise decomposing target alleles into exons, calculating the Hamming distance (HD) between individual target allele exons and corresponding best matches from the assembled contigs, and prioritizing sequence similarity over the length of matching substrings between the contigs and the target allele exons.
  • HD Hamming distance
  • relevant alleles within a first predetermined HD are selected as candidate alleles.
  • the HLA typing method may also include inferring at least one of homozygous or heterozygous allelic pairs from the candidate alleles, where inferring may comprise selecting alleles from the candidate alleles having a HD less than a second predetermined distance.
  • first filtering may include soft-clipping during alignment.
  • a read/read-pair may be included in the first set upon the read/read-pair having a high quality suffix-prefix alignment with the HLA sequences of known alleles.
  • HLA typing method may further include dividing paired-end reads that align to the same known allele into substrings, where upon the prefix and suffix of the paired-end reads overlapping, and the substrings are substantially equal in length and not empty, the read-pairs are merged to form single-reads/haplotype sequences. Additionally, upon one read of a read-pair merging at a position including a sequencing error, degenerate bases may be encoded in the merged haplotype sequence, as well as upon the overlap being empty, the gap between the read-pair is encoded with degenerative bases. In some embodiments, upon degenerate bases being present and their intersection being not empty, the base at the in tersection is used to assemble the con tig.
  • Assembly of the contigs from the haplotype sequences may comprise merging haplotype sequences upon the haplotype sequences sharing common /- ⁇ mers. / may be initially set to the value of the maximum read length and then iteratively decreased by a fixed amount until a minimum threshold is reached.
  • Haplotype sequences containing sequencing errors may comprise reads/read-pairs having low frequency ⁇ -mers, and, in some embodiments, such read/read-pairs may be eliminated from assembly into contigs.
  • assembly includes decomposing haplotype sequences into a set of /-mers and sorting the results in a decreasing order of frequency. For example, for each / ⁇ mer in the sorted list, contigs which share a specific /-mer are assembled.
  • reads/read-pairs may be merged upon being concordant at each alignment position
  • assembly further comprises trimming the string;
  • * contig assembly may be carried out iteratively; * tracking at least one of the removal of previously assembled contigs and the creation of new contigs, where tracking may comprise a union-find algorithm;
  • candidate alleles may comprise alleles with HD of 0, upon such alleles corresponding to two or more types of protein coding sequences;
  • candidate alleles may comprise alleles with a HD of less than or equal to a predetermined threshold
  • the HLA typing method may further comprise inspecting multiple sequence alignment (MSA) of the alleles obtained from the database.
  • MSA multiple sequence alignment
  • the method may further then comprise dividing the MSA into different exons.
  • Some embodiments of the present disclosure are directed to a computer-implemented method for HLA typing which may include one or more of the following steps (and in some embodiments, a plurality of such steps, and in some embodiments, all such steps): accessing sequencing data that contains HLA information, filtering the sequencing data against known allelles of one or more HLA genes to obtain a first set of reads/read-pairs aligned with known HLA alleles, and assembling contigs from the first set of reads/read-pairs. Assembly may comprise filtering out reads/read-pairs aligning to known, non-target HLA alleles, merging paired-end reads, and generating contigs from merged paired-end reads.
  • the method may further comprise identifying alleles from the contigs, where identifying may comprise dividing each allele into exons, identifying a best match for each exon in the contigs, and calculating and summing Hamming distance (HD) between contigs and known target alleles.
  • the method may further include inferring allelic pairs, where inferring may comprise pairing candidate alleles, scoring all allelic pairs based on candidate alleles, and reporting allelic pairs with best scores.
  • an HLA typing system comprises at least one computer having at least one processor, the at least one processor configured with computer instructions operating thereon to perform a method of HLA typing according to any such embodiments disclosed above or otherwise herein.
  • a non-transitory computer readable medium which comprises computer-executable instructions recorded thereon for causing a computer to perform one or more of the HLA typing methods disclosed above or otherwise herein.
  • This application file contains at least one drawing executed in color.
  • Figure 1 is a workflow of allelic HLA typing using exome-seq data according to some embodiments of the present disclosure.
  • Figure 2a-c are comparisons of HLA typing results among conventional methods (Sanger sequencing), an existing method using exome-seq data (HLAminer) and some embodiments of the present disclosure, using 15 exome-seq datasets from Illumina platforms.
  • Figure 2a illustrates fold coverage at exons of each of the five target HLA genes where coverage data from all 15 samples are pooled. Median coverage (shaded lines) and range (shaded areas) are plotted over alignment positions for individual HLA genes. Dotted lines represent exon boundaries.
  • Figure 2b illustrate comparison of HLA typing results between an embodiment of the present disclosure and the conventional Sanger method (e.g., after only the first round of sequencing).
  • Figure 2c illustrates concordance rates of HLA typing results by an embodiment of the present disclosure and HLAminer as compared with conventional Sanger-based method.
  • Figures 3a-b illustrate ambiguities resulting upon merging paired-end reads (r 0 , ri).
  • r 0 and ri align to alleleo, where the aligned regions partly overlap.
  • Allelei contains an additional base C 300 compared to alleleo, within the gap between ro and rj, resulting in two legitimate merging results r' and r".
  • Figures 4a-b illustrate examples of preferring similarity to length when identifying a best match of an exon and the consideration of non-zero distance alleles. Specifically, Figure 4a illustrates the alignment of the second exons of the heterogeneous alleles A*02:01 :01:01 and A*02: l 1:01 of sample HG01953 and the two contigs a and c ⁇ identified by methods according to some embodiments of the disclosure.
  • Contig c3 ⁇ 4 is identical to exon 2 of A*02:01:01:01 and contig Cj is a proper prefix of exon 2 of A*02: 11 :01. Because exon 2 of the two alleles differ only by the two bases highlighted, c, naturally serves as a candidate hit for exon 2 of A*02: 11 :01 that has a length identical to exon 2 and a similarity over 99%. Nevertheless, Cj is shorter but has a 100% similarity. A*02: 11 :01 would have been missed without tolerating missing bases.
  • Figure 4b illustrates the alignment of the seventh exons of candidate alleles C*04:01 :01 :01 and C*04:30 and contig c3 ⁇ 4 identified by methods according to some embodiments of the disclosure.
  • c3 ⁇ 4 is identical to exon 7 of C*04:01 :01 :01 and it is the only contig that can serve as a candidate hit for exon 7 of C*04:30. Without tolerating non-zero distance alleles, C*04:30 would have been missed.
  • Figures 5a-b illustrate an example of the determination of allelic pairs.
  • Figure 5a illustrates the principle of parsimony. Assuming two contigs, Contig a and Contigb, have been assembled and two candidate alleles, alleleo and allele ⁇ have been identified. Alleleo differs from allele i by one base, highlighted in red.
  • three allelic pairs may be checked: (alleleo, allelei), (alleleo, alleleo) and (allelei, allelei). However, in some embodiments, it cannot be ruled out the possibility of (alleleo, allelei) or (allelei, allelei) being the correct answer as Contigb may have supported allelei if the missing bases were TGG.
  • the homozygous pair (alleleo, alleleo), in some embodiments, is preferred as it is based on the fewest assumptions and sufficient to represent both contigs.
  • Figure 5b illustrates that the allelic pair may capture as much information as presented in the data. For example, assuming four potential alleles (B*55:02:01, B*56: l l, B*35:03:01, B*35:60) have been identified for HLA-B of sample HG01873, for each exon of exons 1-4, two different haplotypes are present, and may be assigned with label A or B. A plurality of the alleles (e.g., all) may be supported as their individual exons are present in contigs.
  • Figure 6 illustrates the diversity of the HLA genotypes of included samples.
  • the HLA genotypes of 13 samples included in a study are mapped onto the phylogenetic trees of known alleles of HLA-A, -B, -C, -DRB1 and -DQB1 genes generated using RAxML (v.7.3.3).
  • the alleles at the four-digit resolution are highlighted in color corresponding to each gene.
  • Figure 7 illustrates variations in fold coverage among different target HLA genes and their exons. The median fold coverage is plotted for individual exons of each HLA gene. Error bars show 95% confidence intervals.
  • Figure 8 illustrates an example of exom-seq data of the individual NA 18526, according to some embodiments, that has inadequate coverage over target HLA genes.
  • the five target HLA genes of this individual were typed previously.
  • no typing result may be reported for any of the target gene as the assembled contigs may not unambiguously support any candidate allele due to insufficient exon coverage.
  • the alignment in some embodiments, is generated using BLAST between contigs of a target gene and one of the known cDNA sequences of this gene. In certain experiments, for DQB1 gene, no contig was produced.
  • HLAminer reported typing results for HLA-A, B, C and DRB 1 genes, all of which are discordant with the known type except HLA-B, where the known type is among one of the seven equally supported candidates. 4 other samples with inadequate coverage are HG01515, HG01049, HG00731 and A18605.
  • Figures 9a-e illustrate differential fold coverage at variant positions between validated allelic pairs for each individual.
  • the fold coverage is plotted for each allele of a heterozygous allelic pair at positions where they differ. In case of homozygosity, the coverage is plotted for all alignment positions. Dotted lines are exon boundaries.
  • Figure 10 illustrates a size of allelic bias in discrete exons. Differences in fold coverage between two heterozygous alleles at each variant position are plotted as the percentage of total coverage at the same variant position. Results are shown for different exons and genes of each sample; involved allelic pairs are shown to the right of each panel.
  • Figure 11 is a table illustrating sequences of typed HLA genes included in the reference.
  • Figure 12 is a table illustrating characteristics of exome-seq data used for in silico HLA typing.
  • Figures 13a-g illustrates HLA typing results using exome-seq data and the laboratory validation.
  • Figure 14 illustrates a sample report generated by an embodiment of the present disclosure.
  • Figures 15a-b illustrate a statistical analysis of allelic bias within individual exons.
  • Figures 16 and 17 represent example systems for at least one of conducting HLA typing analysis, analyzing data from such analysis, such analysis including at least one of qualifying, quantifying sequencing data, as well as including, in some embodiments, collecting and/or storing data for the analysis and/or produced as a result of such analysis.
  • Embodiments of the present disclosure overcome a major bioinformatics hurdle in applying targeted Illumina sequencing to allelic HLA typing.
  • Previous methods for HLA typing from traditional and next generation sequencing methods have relied on the assumption that more reads align results to the correct alleles (4) (2).
  • this assumption does not wholly apply to exome-sequencing data, as there are often large differences in the coverage of different exons ( Figure 2a, Figure 7).
  • sequencing data may be obtained from one or more sources including, but not limited to, e.g., whole genome sequencing datasets, whole exome sequencing datasets, qualified exome-sequencing datasets, datasets obtained from deep sequencing of captured HLA genes alone rather than the whole exome, one or more sets of contigs assembled from target (e.g.
  • HLA-A, HLA-B, HLA-C, etc. specific reads to substantially represent haplotype exons of the target gene (in some embodiments, fully represent haplotype exons of the target gene), and/or datasets with inclusion of gDNA sequences to enhance the ability to capture reads spanning intron-exon boundaries, and, alternatively, wherein gDNA sequences may include additional sequences where capture probes for HLA introns are included
  • the sequencing datasets may comprise HLA genes bundled, e.g., with existing oncogene panels (3) for targeted Illumina sequencing, e.g., in order to prepare leukemic patients for transplantation while characterizing the molecular profiles of their diseases.
  • FIG. 1 A high-level overview of a process for HLA typing according to some of the embodiments of the present disclosure is provided in Figure 1. Accordingly, some embodiments of the present disclosure were utilized in (and will be illustrated with respect to) an example experiment to derive allelic HLA types, and will be disclosed with specific details provided below.
  • the HLA types were derived from 15 Illumina exome-sequencing datasets: 13 subjects of five ethnicities, with two datasets are duplicates (obtained from multiple sources, including the 1000 Genomes Project (7)).
  • the characteristics of the datasets and diversity of included HLA types are summarized in Figure 12 and Figure 6, respectively. Briefly, after read filtering, the median coverage of different exons of individual genes ranged from approximately ten to a thousand fold (Figure 2a).
  • the HLA typing method (according to some embodiments of the disclosure) efficiently identified allelic pairs. Specifically, at a resolution that tolerates synonymous mutations in coding sequences, 74 out of 75 allelic pairs were correctly reported at an overall concordance rate of 99% compared with conventional typing ( Figure 13). In the example, the only case of discordance occurred to HLA-A of HG01872, where in addition to the correct allelic pair, one more pair with one base difference was reported. Interestingly, reads support for both exist in the exome-seq data.
  • the exome-seq data comprised fifteen (15) exome-seq datasets from 13 subjects were analyzed, including two duplicate sequencing datasets for two subjects (NA19240 and NA20313).
  • the sources of these data are the 1000 Genomes Project (1 1 datasets; available at ftp://ftp-trace.ncbi.nih.gov/1000genonnes/ftp/data/) and an internal validation project at Genome Technology Access Center (GTAC) at Washington University (four datasets; available upon request).
  • GTAC Genome Technology Access Center
  • the ethnicities of the subjects and Run ID of the datasets are listed in Figure 12.
  • the protocols for library preparation and exome capture were described previously (8).
  • allelic bias As sequencing of captured exons is susceptible to allelic bias (9), the coverage of heterozygous alleles at positions where they differ from each other was compared ( Figure 9 and Figure 15). Statistically significant allelic biases are most frequently observed at exons 2 through 4 of Class I genes and exon 2 of Class II genes, and neighboring exons can also exhibit preference to different haplotypes. The patterns of allelic bias are highly reproducible in the two replicates examined. Almost all the bias at individual variant positions (99%) are within 80% of the total coverage (Figure 10). In some embodiments, this does not affect the HLA typing.
  • exome-seq data do not cover polymorphisms within introns (Figure 2b); while this information is rarely necessary in current practice, in some embodiments it can be obtained if capture probes for HLA introns are included.
  • exome sequencing data may be aligned against a multi-FASTA file that contains all known alleles of HL genes available from the IMGT/HLA database ( Figure 11).
  • the inclusion of gDNA sequences enhances the ability to capture reads spanning intron-exon boundaries, although they are available for only a fraction of alleles in the database.
  • soft-clipping is done during alignment. Specifically, in some embodiments, it is sufficient to retain a read when it has a high quality suffix-prefix alignment with cDMA sequences.
  • Novoalign http://w ⁇ 'w.novocraft.com/ ' rnain/mdex.php
  • the aligner where no more than one edit distance was allowed.
  • Information is kept when a read or read pair is aligned to multiple HLA genes.
  • the alignment result is recorded in a compressed BAM format, from which reads/read-pairs aligned to a target HLA gene (e.g. HLA -A) with a customized BED file registering all alleles of this gene are extracted.
  • a target HLA gene e.g. HLA -A
  • a customized BED file registering all alleles of this gene are extracted.
  • reads/read-pairs that aligned to non-target genes e.g. all non-HLA-A genes in the reference
  • Reads/read-pairs uniquely aligned to the target gene may be used in the assembly step. They are first oriented to be consistent with the alignment so that their reverse complementary strands need not be considered. Paired-end reads aligned to the same reference allele are merged to form single reads based on the following method (for example): let a paired-end reads ( ⁇ , ?] ) that aligned to the same allele be divided into substrings
  • o 0 and O l should be equal in length.
  • o 0 and O l may be merged to form string o m , where the IUPAC (International Union of Pure and Applied Chemistry) characters, or degenerate bases, may be used to encode two different nucleotides merging at a position with one of them being a possible sequencing error.
  • IUPAC International Union of Pure and Applied Chemistry
  • the fragment length for certain read pairs may not uniquely be determined.
  • the uncertainties regarding the degenerate bases and fragment length may be resolved by identifying additional fragments that are likely to be obtained from the same genomic location during assembly ( Figure 3).
  • Reads containing sequencing errors may contain low frequency &-mers (substrings with length k set to be half of the read length) that occur only once or twice in the data(6). Since such reads may also belong to non-target genes in exome sequencing, they may be excluded from assembly rather than being corrected.
  • Each read may then be initialized to be a contig, and the base frequency may be recorded at each position. Assuming that contigs sharing longer substrings are more likely to be from the same haplotype, the comparison of contigs sharing longer substrings, in some embodiments, is prioritized. High-frequency substrings may also, in some embodiments, be inspected first, as haplotype sequences, such as exons with a higher read support, can be recovered more reliably. Specifically, contigs may be merged if they share common /-mers, where / may foe initially set to the value of the maximum read length and then may be iteratively decreased by a fixed amount until a minimum threshold (e.g., default of 40) is reached.
  • a minimum threshold e.g., default of 40
  • contigs For each / value, contigs may be decomposed into a set of /-mers, and may be sorted in a decreasing order of frequency. For each /-mer in the sorted list, contigs sharing this /-mer through alignment may be assembled, in which the relative positions of any two contigs can be determined in constant time by matching the /-mer. Because insertion/deletion errors are rare in Alumina sequencing, when generating the full alignment, they may be disallowed. In some embodiments, two contigs may be merged only if they are concordant at each alignment position. In some embodiments, if degenerate bases in IUPAC code are present, their intersection should not be empty and the base present in this intersection may be used in the assembled contig.
  • the base frequency for each position is accrued. This may be used later to identify regions of a contig with low base support ( ⁇ 2) that may result from insufficient exon capture or sequencing errors. As such regions do not provide reliable haplotype sequence information and may prevent further contig merging, they may be replaced with degenerate base 'N'. The prefix or suffix of a contig that consists of a string of s may then be trimmed, whereas internal and intermittent s may be retained. Further contig merging is sought, when possible, by repeating the above steps. To track the removal of existing contigs and the creation of new ones, a union -find algorithm may be used.
  • target exons are expected to be well represented by assembled contigs.
  • each allele of a target gene may be decomposed into exons, for which the best hit may be identified (i.e. a matching substring) among the contigs.
  • the quality of a hit may be determined by at least one of the length and similarity (and in some embodiments, both) of matched substrings.
  • a shorter hit with a higher similarity in some embodiments, may be considered to have a higher quality ( Figure 4).
  • Hamming distance may be used to quantify the differences between an exon and its hit.
  • hits covering a minimum percentage of an exon (default 85%) and within a maximum Hamming distance of 2 are considered (for example). Then, an overall distance may be calculated for each allele by summation of Hamming distances between all its exons and their best hits in the contigs.
  • An ideal candidate allele in some embodiments for example, should have a Hamming distance of zero. However, a true allele may have a. non-zero distance for several reasons (for example): (1) small exons and part of long exons may not be effectively captured in exome sequencing ( Figure 4), and/or (2) the subject may have a novel allele with a small number of mutations compared to a known allele.
  • ail alleles within a distance threshold (default 2).
  • the threshold is an adjustable parameter (e.g., by the user)
  • a higher threshold is unlikely to be meaningful as many known alleles have almost identical protein coding sequences.
  • the following situations may be excluded from calculation of the overall Hamming distance: (1 ) incompletely documented short exons ( ⁇ 25bp) and/or (2) exons with no hits in the contigs. These exclusions may prevent unjustified penalties on partially sequenced alleles in the database, or loss of novel alleles bearing inserted or deleted bases that inflate the Hamming distance dramatically when compared to a known allele.
  • exonic positions of mismatch compared to the best hit termed U (stands for unsupported) positions
  • exonic positions not fully covered by the best hit termed M (stands for missing) positions
  • total length of the covered region termed C (stands for missing) positions
  • short exons excluded from distance calculation are reported: (1) exonic positions of mismatch compared to the best hit, termed U (stands for unsupported) positions, (2) exonic positions not fully covered by the best hit, termed M (stands for missing) positions, (3) total length of the covered region, and/or (4) the short exons excluded from distance calculation.
  • allelic pairs Although all alleles conforming to the distance constraint may be reported, in some embodiments, a subset may be chosen as the candidate list to represent the input data: an allele may be excluded from the candidate list if it contains any exon with no hit; the candidate fist may comprise alleles only with distance zero on condition that they correspond to more than one type of protein coding sequences; otherwise, all alleles with distances less than or equal to one may be included in the candidate list.
  • a plurality of alleles may be chosen with replacement from the candidate list and a scoring scheme may be utilized to measure the distance between the selected allelic pair and the remaining alleles. More specifically, and for example, let the candidate list include n alleles be ⁇ a x , a 2 , ..., a n ) . The score of each allelic pair may be initialized to be zero. Next, multiple sequence alignment (MSA) of these alleles may be obtained. For alleles having the same protein coding sequence, only one of them may be included in the MSA. Then the MSA may be divided into different exons.
  • MSA multiple sequence alignment
  • function A returns the Hamming distance between two strings except that, in some embodiments (1) any U position in s l may be skipped for comparison, and/or (2) for any M position in s i or s when differs from s 5 may be increased by one.
  • the former is to account for the fact that any unsupported base in the data, preferably should not be penalized, and the latter is to penalize the bases in s t or 5 - lacking read coverage when differing from bases having read support in s ⁇ .
  • ⁇ 5 may be increased by the number of U positions in s i and 5 - such that a correct homozygous allele pair may have a higher score compared to an alternative. Note that when allele a. is not included in the MSA, 5. may be obtained from the allele in the MSA that shares the same coding sequence with a i . After the exons have been examined, the pairs with the same minimum score may be reported as the typing result.
  • Evaluation Measure in some embodiments, to measure the accuracy and to quantify the effort required to resolve ambiguities of the typing results, concordance rale may be measured. For example, let the number of inferred allelic pairs for m HLA genes be ⁇ n v n 2 ,..., n m ) , among them (C j , c 2 ,...,c m ) allelic pairs are consistent with the conventional typing results at four-digit level (same protein sequences from typed exons). Accordingly, the concordance rate is given by ⁇ i( c , ⁇ /» ; )/ ⁇ 3 ⁇ 4 ⁇ Any ambiguity or wrong prediction decreases this score.
  • HLA-DRB1 For HLA-DRB1 gene, a variation at codon 86 (GGT/GTG) shows dichotomy among most HLA-DRB1 alleles, and a Z primer specific for this variation was used preemptively during the first round of sequencing to reduce cis/trans ambiguities. Data analysis was performed with uTYPE 6.0 (Life Technologies, Brown Deer, WI). Equivalent allele pairs were reported with letter strings specified by the National Marrow Donor Program
  • some embodiments of the present disclosure significantly outperform HLAminer(2), the only other available program capable of HLA typing using exome-seq data.
  • HLAminer cannot finish computation even after 10 days of running (3 datasets tested).
  • HLAminer reported candidate alleles ranked by likelihood without inferring the final allelic pair(s).
  • the estimated overall concordance rate for HLAminer was 46% compared with conventional typing results, consistent with the previous report.
  • the performance comparison between some embodiments of the present disclosure and HLAminer for individual genes is shown in Figure 2c.
  • the example experiment was executed at a workstation having a six-core 880 MHz and 2400 MHz AMD processors with 250 GB RAM running GNU/Linux x86_64.
  • the method according to embodiments of the present disclosure finished within 2 minutes, while the experiment with HLAminer required 10 minutes to complete.
  • the example was performed (according to some embodiments), using the following:
  • HLAminer (v.1.0.5) may be used, with the script HPTASRwgs_classI-II.sh having a parameter "-i 1"; Novoalign (v.2.07.07) run on 9 cores with parameters "-t 30 -r all -1 80 -e 1 -i 230 140"; and/or
  • RAxML (v.7.3.3) to generate phylogenetic trees with parameters "-p 78960 -f a -x 12345 -# 1000 -m GTRGAMMA".
  • Various implementations of the embodiments disclosed above, in particular at least some of the methods/processes disclosed, may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
  • ASICs application specific integrated circuits
  • These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • Such computer programs include machine instructions for a programmable processor, for example, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language.
  • machine-readable medium refers to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal.
  • machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • the subject matter described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor and the like) for displaying information to the user and a keyboard and/or a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer.
  • a display device e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor and the like
  • a keyboard and/or a pointing device e.g., a mouse or a trackball
  • this program can be stored, executed and operated by the dispensing unit, remote control, PC, laptop, smart-phone, media player or personal data assistant ("PDA").
  • PDA personal data assistant
  • feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • Certain embodiments of the subject matter described herein may be implemented in a computing system and/or devices that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front- end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the subject matter described herein), or any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN”), a wide area network (“WAN”), and the Internet.
  • LAN local area network
  • WAN wide area network
  • the Internet the global information network
  • the computing system may include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a processor which may include instructions operating thereon for carrying out one and/or another disclosed method, which may communicate with one or more databases - of which, may store the sequencing data required for different embodiments of the disclosure.
  • the processor may include computer instructions operating thereon for accomplishing any and all of the methods and processes disclosed in the present disclosure.
  • Input/output means may also be included, and can be any such input output means known in the art (e.g., display, printer, keyboard, microphone, speaker, transceiver, and the like).
  • the processor and at least the database can be contained in a personal computer or client computer which may operate and/or collect data from the sequencer.
  • the processor also may communicate with other computers via a network (e.g., intranet, internet, VPN).
  • Figure 17 illustrates a system according to some embodiments which may be established as a server-client based system, in which the client computers are in communication with databases, and the like.
  • the client computers may communicate with the server via a network (e.g., intranet, internet, VPN).
  • a network e.g., intranet, internet, VPN.

Abstract

Embodiments of the present disclosure present bioinformatic systems, methods, for allelic HLA typing (as well as and computer readable media having instructions for performing methods of HLA typing) using, for example, Illumina exome-sequencing data (e.g., 101 basepair, paired-end reads).

Description

ACCURATE TYPING OF HLA THROUGH EXOME SEQUENCING
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH & DEVELOPMENT
[0001] Inventions disclosed herein were made with government support under NIH Grant No. HHSN272200900018C, awarded by the National Institute of Health. The U.S. Government has certain rights in the inventions.
FIELD OF THE DISCLOSURE
[0002] Embodiments of the present disclosure are directed to Human leukocyte antigen (HLA) detection at the allelic level using sequencing data.
BACKGROUND
[0003] Allelic HLA typing refers to sequencing-based typing to determine variations in coding DNA sequences that alter the protein sequences. This is also commonly referred to as high resolution typing or four-digit typing. There are also alleles that bear synonymous mutations and mutations within noncoding DNA, but resolution of these alleles is rarely necessary in clinical practice. In the IMGT/HLA database (1), the majority of HLA alleles are represented by full-length or partial complementary DNA (cDNA) sequences. Some HLA alleles have both cDNA and genomic DNA (gDNA) sequences deposited in the database. For simplicity, we also term a cDNA and/or gDNA sequence of an HLA gene an allele.
[0004] HLA typing at the allelic level is typically performed by Sanger sequencing of selected exons of Class I genes (HLA-A, -B and -C) and Class II genes (HLA-DRB1 and - DQB1). HLA typing using exome sequencing (exome-seq) data is in its infancy. Currently, there is no methodology available to perform HLA typing at the allelic level with the necessary accuracy given the typical read length of Illumina and the level of polymorphism within the region. Indeed, a recent report(2) demonstrated that allelic HLA typing is challenging to obtain from Illumina exome-seq data even at high coverage. [0005] The nomenclature of HLA alleles follows the guidelines from the WHO Nomenclature Committee for Factors of the HLA System (http://hla.alleles.org/nomenclature/naming.html).
SUMMARY OF THE EMBODIMENTS
[0006] Embodiments of the present disclosure present bioinformatic systems, methods, for allelic HLA typing (as well as and computer readable media having instructions for performing methods of HLA typing) using, for example, Illumina exome-sequencing data (e.g., 101 basepair, paired-end reads).
[0007] Embodiments of the present disclosure overcome a major bioinformatic hurdle in applying targeted Illumina sequencing to allelic HLA typing. For example, qualified exome- seq datasets from research projects can be analyzed to revisit HLA-disease associations with allelic resolution. In addition, some embodiments provide a way to deep sequencing of captured HLA genes alone rather than the whole exome, which enables multiplexing samples for high-throughput typing of bone marrow donors at reduced cost. In some embodiments, the HLA genes can also be bundled with existing oncogene panels(3) for targeted Illumina sequencing in order to prepare leukemic patients for transplantation while characterizing the molecular profiles of their diseases.
[0008] Previous methods have relied on the assumption that more reads align to the correct alleles(4) (2). However, this assumption does not wholly apply to exome-sequencing data, as there are often large differences in the coverage of different exons (Figure 2a, Figure 7). In some embodiments of the present disclosure, a set of contigs may be assembled from target (e.g. HLA-A) specific reads to substantially represent haplotype exons of the target gene (in some embodiments, fully represent haplotype exons of the target gene). In some embodiments, the target specific reads may be obtained using two stages of filtering. For example, exome-sequencing data may be first filtered against alleles of HLA genes obtained from the, for example, IMGT/HLA database (Figure 11), to narrow the searching space.
[0009] In some embodiments, due to homology between alleles of different HLA genes, reads that concurrently align to target and off-target genes may be filtered, as these reads may introduce ambiguities to an assembly. It has been observed that up to 9% of reads are multi- mapped, where HLA-A and -DRB1 have higher percentages of multi-mapped reads compared with HLA-B, -C, and -DQB 1 genes. [0010] Unlike existing short read assemblers (5) (6), in some embodiments, distinct strategies to facilitate sequence assembly in the highly polymorphic HLA region are adopted. For example, in some embodiments, a read pair is first converted to a haplotype sequence when both ends are aligned to the same reference allele. This is accomplished in order to effectively elongate the input reads. Furthermore, because many fragments may have sequencing gaps between both ends, in some embodiments, degenerate bases are used as placeholders at uncertain positions in the contig. These gaps may be resolved based on information from reads merged into the contig at a later point, for example (Figure 3). In some embodiments, degenerate bases can also serve as an indicator of insufficient read support, which can be factored in during allelic pair inference (for example). Additionally, assembly of contigs sharing longer substrings with higher frequencies may be prioritized in some embodiments.
[0011] In some embodiments, identification of relevant alleles whose exons should match the contigs assembled at a previous step may be accomplished. In such embodiments, the quality of an allele may be measured by, for example, summation of Hamming distances between individual exons and their best match in the contigs. During a matching process, for example, sequence similarity over the length of matching substrings (Figure 4) may be prioritized. This may be done for at least one of two considerations. First, in some embodiments, alleles with nearly identical sequences require differentiation. Second, exons may be only partially sequenced which results in degenerate bases in the contig. Accordingly, in some embodiments, candidate alleles are qualified within a hamming distance of, for example, 2, which allows inclusion of potential no vel alleles closely related to existing alleles.
[0012] In some embodiments, an additional step for an HLA typing process includes, for example, inferring homozygous and/or heterozygous allelic pairs by considering all possible allelic pairs and checking if the allelic pairs are consistent with the data as represented by contigs. To achieve this, in some embodiments, the principle of parsimony is utilized which ensures allelic pairs, fully supported by the data, may be prioritized over pairs consistent with but not fully supported by the data (Figure 5a). The final allelic pairs, in some embodiments, should also explain as much of the information present in the data as possible (Figure 5b). Meanwhile, phasing information among discrete exons may be inferred, in some embodiments (Figure 5 b) since exome-seq may not fully capture introns.
[0013] Accordingly, in some embodiments of the present disclosure, a computer- implemented method for HLA typing is provided, with the method comprising one or more of the following steps (and in some embodiments, a plurality of the steps, and in some embodiments, all of the following steps): accessing sequencing data that contains HLA information stored in one or more databases, accessing HLA sequences of alleles of one or more HLA genes and the multiple sequence alignment of these HLA sequences stored in one or more databases, and searching for target-specific reads/read-pairs from the sequencing data. Searching may comprise first filtering the sequencing data against HLA sequences to obtain a first set of reads/read-pairs aligned to these sequences, and second filtering the first set to eliminate reads/read-pairs concurrently aligned to both sequences of the target HLA gene and off-target HLA genes to obtain a second set of target HLA reads/read-pairs. The method may further include assembling a plurality of reads/read-pairs uniquely aligned to a target HLA gene into one or more contigs. Assembling may comprise merging read-pairs having both ends aligned to the same reference allele to form single-reads/haplotype sequences, and assembling one or more contigs from the haplotype sequences. The assembly of contigs having longer substrings with higher frequencies may be prioritized. The method may also include identifying relevant target alleles having exons matching the assembled contigs, where identifying may comprise decomposing target alleles into exons, calculating the Hamming distance (HD) between individual target allele exons and corresponding best matches from the assembled contigs, and prioritizing sequence similarity over the length of matching substrings between the contigs and the target allele exons.
[0014] In some embodiments, relevant alleles within a first predetermined HD are selected as candidate alleles.
[0015] In some embodiments, the HLA typing method may also include inferring at least one of homozygous or heterozygous allelic pairs from the candidate alleles, where inferring may comprise selecting alleles from the candidate alleles having a HD less than a second predetermined distance.
[0016] In some embodiments, first filtering may include soft-clipping during alignment.
[0017] In some embodiments, a read/read-pair may be included in the first set upon the read/read-pair having a high quality suffix-prefix alignment with the HLA sequences of known alleles.
[0018] In some embodiments, HLA typing method may further include dividing paired-end reads that align to the same known allele into substrings, where upon the prefix and suffix of the paired-end reads overlapping, and the substrings are substantially equal in length and not empty, the read-pairs are merged to form single-reads/haplotype sequences. Additionally, upon one read of a read-pair merging at a position including a sequencing error, degenerate bases may be encoded in the merged haplotype sequence, as well as upon the overlap being empty, the gap between the read-pair is encoded with degenerative bases. In some embodiments, upon degenerate bases being present and their intersection being not empty, the base at the in tersection is used to assemble the con tig.
[0019] Assembly of the contigs from the haplotype sequences, in some embodiments, may comprise merging haplotype sequences upon the haplotype sequences sharing common /- mers. / may be initially set to the value of the maximum read length and then iteratively decreased by a fixed amount until a minimum threshold is reached.
[0020] Haplotype sequences containing sequencing errors may comprise reads/read-pairs having low frequency έ-mers, and, in some embodiments, such read/read-pairs may be eliminated from assembly into contigs.
[0021] In some embodiments, assembly includes decomposing haplotype sequences into a set of /-mers and sorting the results in a decreasing order of frequency. For example, for each /■■ mer in the sorted list, contigs which share a specific /-mer are assembled.
[0022] Other embodiments may include one or more of the following features, at least one or more of which can stand on their own as separate embodiments, or combined with any number of the above embodiments:
* relative positions of any two contigs may be determined in constant time by matching the /-mer;
® insertion/deletion errors may be disallowed during contig assembly;
* reads/read-pairs may be merged upon being concordant at each alignment position;
* the frequency of bases of each contig for each position is accrued for identifying insufficient exon capture or sequencing errors during assembly, where base regions of any contig having low base support may be replaced with a degenerative base, and where upon a prefix or suffix of a contig is a string of degenerative bases, assembly further comprises trimming the string;
* contig assembly may be carried out iteratively; * tracking at least one of the removal of previously assembled contigs and the creation of new contigs, where tracking may comprise a union-find algorithm;
* an allele may be excluded from the candidate list upon the allele including an exon with no alignment found in the contigs;
* candidate alleles may comprise alleles with HD of 0, upon such alleles corresponding to two or more types of protein coding sequences;
* candidate alleles may comprise alleles with a HD of less than or equal to a predetermined threshold; and/or
* selecting two alleles with replacement from the candidate list and scoring a distance between the selected allelic pair and the remaining alleles, where the score of each allelic pair may be initialized at 0, and the HLA typing method may further comprise inspecting multiple sequence alignment (MSA) of the alleles obtained from the database. In such embodiments, for alleles having the same protein coding sequence, only one allele is included in the MSA, and the method may further then comprise dividing the MSA into different exons.
[0023] Some embodiments of the present disclosure are directed to a computer-implemented method for HLA typing which may include one or more of the following steps (and in some embodiments, a plurality of such steps, and in some embodiments, all such steps): accessing sequencing data that contains HLA information, filtering the sequencing data against known allelles of one or more HLA genes to obtain a first set of reads/read-pairs aligned with known HLA alleles, and assembling contigs from the first set of reads/read-pairs. Assembly may comprise filtering out reads/read-pairs aligning to known, non-target HLA alleles, merging paired-end reads, and generating contigs from merged paired-end reads. The method may further comprise identifying alleles from the contigs, where identifying may comprise dividing each allele into exons, identifying a best match for each exon in the contigs, and calculating and summing Hamming distance (HD) between contigs and known target alleles. The method may further include inferring allelic pairs, where inferring may comprise pairing candidate alleles, scoring all allelic pairs based on candidate alleles, and reporting allelic pairs with best scores.
[0024] In some embodiments, an HLA typing system is provided and comprises at least one computer having at least one processor, the at least one processor configured with computer instructions operating thereon to perform a method of HLA typing according to any such embodiments disclosed above or otherwise herein.
[0025] In some embodiments, a non-transitory computer readable medium is provided which comprises computer-executable instructions recorded thereon for causing a computer to perform one or more of the HLA typing methods disclosed above or otherwise herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] This application file contains at least one drawing executed in color.
[0027] Figure 1 is a workflow of allelic HLA typing using exome-seq data according to some embodiments of the present disclosure.
[0028] Figure 2a-c are comparisons of HLA typing results among conventional methods (Sanger sequencing), an existing method using exome-seq data (HLAminer) and some embodiments of the present disclosure, using 15 exome-seq datasets from Illumina platforms. Specifically, Figure 2a illustrates fold coverage at exons of each of the five target HLA genes where coverage data from all 15 samples are pooled. Median coverage (shaded lines) and range (shaded areas) are plotted over alignment positions for individual HLA genes. Dotted lines represent exon boundaries. Figure 2b illustrate comparison of HLA typing results between an embodiment of the present disclosure and the conventional Sanger method (e.g., after only the first round of sequencing). The numbers of allelic pairs in five categories indicated are plotted for different genes of each sample. Most ambiguities encountered by Sanger sequencing are resolved only after subsequent sequencing or PCRs with sequence specific primers. For the full typing results, see Figure 13. Figure 2c illustrates concordance rates of HLA typing results by an embodiment of the present disclosure and HLAminer as compared with conventional Sanger-based method.
[0029] Figures 3a-b illustrate ambiguities resulting upon merging paired-end reads (r0, ri). (a) r0 and ri align to alleleo, where the aligned regions partly overlap. The bases 300 indicate disagreement, which is denoted by a degenerate base S = {G, C} in the merged read r'. With r2 aligning to the same region, base C is selected in the newly merged read r". (b) r0 and ri align to both allele! and allele2. Allelei contains an additional base C 300 compared to alleleo, within the gap between ro and rj, resulting in two legitimate merging results r' and r". r " is supported upon the identification of an additional read r2 that is consistent with r ' '. [0030] Figures 4a-b illustrate examples of preferring similarity to length when identifying a best match of an exon and the consideration of non-zero distance alleles. Specifically, Figure 4a illustrates the alignment of the second exons of the heterogeneous alleles A*02:01 :01:01 and A*02: l 1:01 of sample HG01953 and the two contigs a and c} identified by methods according to some embodiments of the disclosure. Contig c¾ is identical to exon 2 of A*02:01:01:01 and contig Cj is a proper prefix of exon 2 of A*02: 11 :01. Because exon 2 of the two alleles differ only by the two bases highlighted, c, naturally serves as a candidate hit for exon 2 of A*02: 11 :01 that has a length identical to exon 2 and a similarity over 99%. Nevertheless, Cj is shorter but has a 100% similarity. A*02: 11 :01 would have been missed without tolerating missing bases. Figure 4b illustrates the alignment of the seventh exons of candidate alleles C*04:01 :01 :01 and C*04:30 and contig c¾ identified by methods according to some embodiments of the disclosure. c¾ is identical to exon 7 of C*04:01 :01 :01 and it is the only contig that can serve as a candidate hit for exon 7 of C*04:30. Without tolerating non-zero distance alleles, C*04:30 would have been missed.
[0031] Figures 5a-b illustrate an example of the determination of allelic pairs. Specifically, Figure 5a illustrates the principle of parsimony. Assuming two contigs, Contiga and Contigb, have been assembled and two candidate alleles, alleleo and allele^have been identified. Alleleo differs from allele i by one base, highlighted in red. In some embodiments, three allelic pairs may be checked: (alleleo, allelei), (alleleo, alleleo) and (allelei, allelei). However, in some embodiments, it cannot be ruled out the possibility of (alleleo, allelei) or (allelei, allelei) being the correct answer as Contigb may have supported allelei if the missing bases were TGG. However, the homozygous pair (alleleo, alleleo), in some embodiments, is preferred as it is based on the fewest assumptions and sufficient to represent both contigs. Figure 5b illustrates that the allelic pair may capture as much information as presented in the data. For example, assuming four potential alleles (B*55:02:01, B*56: l l, B*35:03:01, B*35:60) have been identified for HLA-B of sample HG01873, for each exon of exons 1-4, two different haplotypes are present, and may be assigned with label A or B. A plurality of the alleles (e.g., all) may be supported as their individual exons are present in contigs. To that end, there are a total of 10 possible allelic combinations (4 homozygous and 6 heterozygous) of these four alleles. However, data indicates that all four exons are heterozygous, and only the pair B*55:02:01 and B*35:03:01 result in heterozygosity (both A and B alleles) at all four exons. [0032] Figure 6 illustrates the diversity of the HLA genotypes of included samples. The HLA genotypes of 13 samples included in a study are mapped onto the phylogenetic trees of known alleles of HLA-A, -B, -C, -DRB1 and -DQB1 genes generated using RAxML (v.7.3.3). The alleles at the four-digit resolution are highlighted in color corresponding to each gene.
[0033] Figure 7 illustrates variations in fold coverage among different target HLA genes and their exons. The median fold coverage is plotted for individual exons of each HLA gene. Error bars show 95% confidence intervals.
[0034] Figure 8 illustrates an example of exom-seq data of the individual NA 18526, according to some embodiments, that has inadequate coverage over target HLA genes. The five target HLA genes of this individual were typed previously. In some embodiments, no typing result may be reported for any of the target gene as the assembled contigs may not unambiguously support any candidate allele due to insufficient exon coverage. The alignment, in some embodiments, is generated using BLAST between contigs of a target gene and one of the known cDNA sequences of this gene. In certain experiments, for DQB1 gene, no contig was produced. HLAminer, on the other hand, reported typing results for HLA-A, B, C and DRB 1 genes, all of which are discordant with the known type except HLA-B, where the known type is among one of the seven equally supported candidates. 4 other samples with inadequate coverage are HG01515, HG01049, HG00731 and A18605.
[0035] Figures 9a-e illustrate differential fold coverage at variant positions between validated allelic pairs for each individual. The fold coverage is plotted for each allele of a heterozygous allelic pair at positions where they differ. In case of homozygosity, the coverage is plotted for all alignment positions. Dotted lines are exon boundaries.
[0036] Figure 10 illustrates a size of allelic bias in discrete exons. Differences in fold coverage between two heterozygous alleles at each variant position are plotted as the percentage of total coverage at the same variant position. Results are shown for different exons and genes of each sample; involved allelic pairs are shown to the right of each panel.
[0037] Figure 11 is a table illustrating sequences of typed HLA genes included in the reference.
[0038] Figure 12 is a table illustrating characteristics of exome-seq data used for in silico HLA typing. [0039] Figures 13a-g illustrates HLA typing results using exome-seq data and the laboratory validation.
[0040] Figure 14 illustrates a sample report generated by an embodiment of the present disclosure.
[0041] Figures 15a-b illustrate a statistical analysis of allelic bias within individual exons.
[0042] Figures 16 and 17 represent example systems for at least one of conducting HLA typing analysis, analyzing data from such analysis, such analysis including at least one of qualifying, quantifying sequencing data, as well as including, in some embodiments, collecting and/or storing data for the analysis and/or produced as a result of such analysis.
DETAILED DESCRIPTION OF SOME OF THE EMBODIMENTS
[0043] Embodiments of the present disclosure overcome a major bioinformatics hurdle in applying targeted Illumina sequencing to allelic HLA typing. Previous methods for HLA typing from traditional and next generation sequencing methods have relied on the assumption that more reads align results to the correct alleles (4) (2). However, this assumption does not wholly apply to exome-sequencing data, as there are often large differences in the coverage of different exons (Figure 2a, Figure 7).
[0044] Described herein are systems and methods where datasets from newly generated sequencing reads as well as datasets containing pre-existing sequence may be analyzed to captured HLA genes. In various embodiments, sequencing data may be obtained from one or more sources including, but not limited to, e.g., whole genome sequencing datasets, whole exome sequencing datasets, qualified exome-sequencing datasets, datasets obtained from deep sequencing of captured HLA genes alone rather than the whole exome, one or more sets of contigs assembled from target (e.g. HLA-A, HLA-B, HLA-C, etc.) specific reads to substantially represent haplotype exons of the target gene (in some embodiments, fully represent haplotype exons of the target gene), and/or datasets with inclusion of gDNA sequences to enhance the ability to capture reads spanning intron-exon boundaries, and, alternatively, wherein gDNA sequences may include additional sequences where capture probes for HLA introns are included In some embodiments, the sequencing datasets may comprise HLA genes bundled, e.g., with existing oncogene panels (3) for targeted Illumina sequencing, e.g., in order to prepare leukemic patients for transplantation while characterizing the molecular profiles of their diseases. [0045] A high-level overview of a process for HLA typing according to some of the embodiments of the present disclosure is provided in Figure 1. Accordingly, some embodiments of the present disclosure were utilized in (and will be illustrated with respect to) an example experiment to derive allelic HLA types, and will be disclosed with specific details provided below. In the example, the HLA types were derived from 15 Illumina exome-sequencing datasets: 13 subjects of five ethnicities, with two datasets are duplicates (obtained from multiple sources, including the 1000 Genomes Project (7)). The characteristics of the datasets and diversity of included HLA types are summarized in Figure 12 and Figure 6, respectively. Briefly, after read filtering, the median coverage of different exons of individual genes ranged from approximately ten to a thousand fold (Figure 2a). The coverage decayed towards exon boundaries and varied significantly among exons (Figure 7), which may be likely due to bias toward certain exons during exome capturing. In the embodiment utilized in the example, datasets lacking coverage at informative exons (Figure 8) were rejected with no alleles or allelic pairs reported.
[0046] For the datasets, the HLA typing method (according to some embodiments of the disclosure) efficiently identified allelic pairs. Specifically, at a resolution that tolerates synonymous mutations in coding sequences, 74 out of 75 allelic pairs were correctly reported at an overall concordance rate of 99% compared with conventional typing (Figure 13). In the example, the only case of discordance occurred to HLA-A of HG01872, where in addition to the correct allelic pair, one more pair with one base difference was reported. Interestingly, reads support for both exist in the exome-seq data.
[0047] In the example, the advantage of clonal sequencing was also harnessed to distinguish cis/trans ambiguities encountered during Sanger sequencing (which are traditionally resolved by additional sequencing or PCR with sequence specific primers). Moreover, polymorphisms were detected in exons not covered by conventional methods, and the system/method was able to exclude additional allelic pairs (Figure 2b and Figure 13). A report was produced providing results of the example (Figure 14).
[0048] As noted above, for the illustrative example, the exome-seq data comprised fifteen (15) exome-seq datasets from 13 subjects were analyzed, including two duplicate sequencing datasets for two subjects (NA19240 and NA20313). The sources of these data are the 1000 Genomes Project (1 1 datasets; available at ftp://ftp-trace.ncbi.nih.gov/1000genonnes/ftp/data/) and an internal validation project at Genome Technology Access Center (GTAC) at Washington University (four datasets; available upon request). The ethnicities of the subjects and Run ID of the datasets are listed in Figure 12. The protocols for library preparation and exome capture were described previously (8). The protocols for the internal validation project at GTAC were similar, except that 3ug of DNA was used as input, 6 cycles of PCR were performed to enrich for proper adaptor ligated fragments, and the Agilent SureSelect Human All Exon v.2 kit was upgraded to v.3 kit. Sequencing was performed on Illumina platforms as 2 x 101 bp pair-end reads, and the read count and coverage data for all samples are listed in Figure 12.
[0049] As sequencing of captured exons is susceptible to allelic bias (9), the coverage of heterozygous alleles at positions where they differ from each other was compared (Figure 9 and Figure 15). Statistically significant allelic biases are most frequently observed at exons 2 through 4 of Class I genes and exon 2 of Class II genes, and neighboring exons can also exhibit preference to different haplotypes. The patterns of allelic bias are highly reproducible in the two replicates examined. Almost all the bias at individual variant positions (99%) are within 80% of the total coverage (Figure 10). In some embodiments, this does not affect the HLA typing.
[0050] It is worth noting that exome-seq data do not cover polymorphisms within introns (Figure 2b); while this information is rarely necessary in current practice, in some embodiments it can be obtained if capture probes for HLA introns are included.
[0051] In some embodiments, where no allelic pairs are reported from datasets with adequate coverage, clues from reported relevant alleles can be further investigated by Sanger sequencing (for example).
[0052] Accordingly, provided below are additional details for a method of HLA typing, according to some embodiments, which was used for the noted example.
[0053] Scouting for target-specific reads/read-pairs. Ini tially, exome sequencing data may be aligned against a multi-FASTA file that contains all known alleles of HL genes available from the IMGT/HLA database (Figure 11). The inclusion of gDNA sequences enhances the ability to capture reads spanning intron-exon boundaries, although they are available for only a fraction of alleles in the database. To account for this, soft-clipping is done during alignment. Specifically, in some embodiments, it is sufficient to retain a read when it has a high quality suffix-prefix alignment with cDMA sequences. In the present example, Novoalign (http://w\\'w.novocraft.com/'rnain/mdex.php) was used as the aligner, where no more than one edit distance was allowed. Information is kept when a read or read pair is aligned to multiple HLA genes.
[0054] The alignment result is recorded in a compressed BAM format, from which reads/read-pairs aligned to a target HLA gene (e.g. HLA -A) with a customized BED file registering all alleles of this gene are extracted. Likewise, reads/read-pairs that aligned to non-target genes (e.g. all non-HLA-A genes in the reference) are extracted in the same manner.
[00551 Assembly. Reads/read-pairs uniquely aligned to the target gene may be used in the assembly step. They are first oriented to be consistent with the alignment so that their reverse complementary strands need not be considered. Paired-end reads aligned to the same reference allele are merged to form single reads based on the following method (for example): let a paired-end reads (^, ?] ) that aligned to the same allele be divided into substrings
(ro, 0o> ri> °i)> wnere r^ a d r[ denote the prefix and suffix of r0 and Tj that do not overlap in the alignments, whereas o0 and Ol denote the overlapping substrings. Note that in some embodiments, o0 and Ol should be equal in length. When o0 and Ol are not empty, they may be merged to form string om , where the IUPAC (International Union of Pure and Applied Chemistry) characters, or degenerate bases, may be used to encode two different nucleotides merging at a position with one of them being a possible sequencing error. When o0 and Ojare empty, the sequencing gap between r^ and r may be encoded with degenerate bases ' '.
Due to the existence of alleles that contain length polymorphisms (i.e. indels), the fragment length for certain read pairs may not uniquely be determined. The uncertainties regarding the degenerate bases and fragment length may be resolved by identifying additional fragments that are likely to be obtained from the same genomic location during assembly (Figure 3).
[0056] Reads containing sequencing errors may contain low frequency &-mers (substrings with length k set to be half of the read length) that occur only once or twice in the data(6). Since such reads may also belong to non-target genes in exome sequencing, they may be excluded from assembly rather than being corrected.
[0057] Each read may then be initialized to be a contig, and the base frequency may be recorded at each position. Assuming that contigs sharing longer substrings are more likely to be from the same haplotype, the comparison of contigs sharing longer substrings, in some embodiments, is prioritized. High-frequency substrings may also, in some embodiments, be inspected first, as haplotype sequences, such as exons with a higher read support, can be recovered more reliably. Specifically, contigs may be merged if they share common /-mers, where / may foe initially set to the value of the maximum read length and then may be iteratively decreased by a fixed amount until a minimum threshold (e.g., default of 40) is reached. For each / value, contigs may be decomposed into a set of /-mers, and may be sorted in a decreasing order of frequency. For each /-mer in the sorted list, contigs sharing this /-mer through alignment may be assembled, in which the relative positions of any two contigs can be determined in constant time by matching the /-mer. Because insertion/deletion errors are rare in Alumina sequencing, when generating the full alignment, they may be disallowed. In some embodiments, two contigs may be merged only if they are concordant at each alignment position. In some embodiments, if degenerate bases in IUPAC code are present, their intersection should not be empty and the base present in this intersection may be used in the assembled contig.
[0058] Meanwhile, in some embodiments, the base frequency for each position is accrued. This may be used later to identify regions of a contig with low base support (< 2) that may result from insufficient exon capture or sequencing errors. As such regions do not provide reliable haplotype sequence information and may prevent further contig merging, they may be replaced with degenerate base 'N'. The prefix or suffix of a contig that consists of a string of s may then be trimmed, whereas internal and intermittent s may be retained. Further contig merging is sought, when possible, by repeating the above steps. To track the removal of existing contigs and the creation of new ones, a union -find algorithm may be used.
[0059] Identification of relevant alleles. With adequate coverage, target exons, in some embodiments, are expected to be well represented by assembled contigs. Next, each allele of a target gene may be decomposed into exons, for which the best hit may be identified (i.e. a matching substring) among the contigs. The quality of a hit may be determined by at least one of the length and similarity (and in some embodiments, both) of matched substrings. A shorter hit with a higher similarity, in some embodiments, may be considered to have a higher quality (Figure 4). Hamming distance may be used to quantify the differences between an exon and its hit. In some embodiments, hits covering a minimum percentage of an exon (default 85%) and within a maximum Hamming distance of 2 are considered (for example). Then, an overall distance may be calculated for each allele by summation of Hamming distances between all its exons and their best hits in the contigs. An ideal candidate allele, in some embodiments for example, should have a Hamming distance of zero. However, a true allele may have a. non-zero distance for several reasons (for example): (1) small exons and part of long exons may not be effectively captured in exome sequencing (Figure 4), and/or (2) the subject may have a novel allele with a small number of mutations compared to a known allele. Therefore, in some embodiments, it is reasonable to consider ail alleles within a distance threshold (default 2). Although the threshold is an adjustable parameter (e.g., by the user), a higher threshold is unlikely to be meaningful as many known alleles have almost identical protein coding sequences. Note, the following situations may be excluded from calculation of the overall Hamming distance: (1 ) incompletely documented short exons (< 25bp) and/or (2) exons with no hits in the contigs. These exclusions may prevent unjustified penalties on partially sequenced alleles in the database, or loss of novel alleles bearing inserted or deleted bases that inflate the Hamming distance dramatically when compared to a known allele. For each candidate allele identified, the following information may be reported: (1) exonic positions of mismatch compared to the best hit, termed U (stands for unsupported) positions, (2) exonic positions not fully covered by the best hit, termed M (stands for missing) positions, (3) total length of the covered region, and/or (4) the short exons excluded from distance calculation.
[0060] inferring allelic pairs , Although all alleles conforming to the distance constraint may be reported, in some embodiments, a subset may be chosen as the candidate list to represent the input data: an allele may be excluded from the candidate list if it contains any exon with no hit; the candidate fist may comprise alleles only with distance zero on condition that they correspond to more than one type of protein coding sequences; otherwise, all alleles with distances less than or equal to one may be included in the candidate list.
[0061] Next, a plurality of alleles, for example a pair, may be chosen with replacement from the candidate list and a scoring scheme may be utilized to measure the distance between the selected allelic pair and the remaining alleles. More specifically, and for example, let the candidate list include n alleles be {ax, a2, ..., an) . The score of each allelic pair may be initialized to be zero. Next, multiple sequence alignment (MSA) of these alleles may be obtained. For alleles having the same protein coding sequence, only one of them may be included in the MSA. Then the MSA may be divided into different exons. Within each exon, a set of variable positions may be identified, i.e., alignment positions with more than one type of nucleotides. Then, subsequences specified by the variable positions may be obtained from all sequences in the MSA, which results in a list of strings L = (s s2,...,sm) where m≤n . L captures non-redundant haplotype information of the exonic region. For each allelic pair (α,., α .) with 1 < < ;' < « , its score may be incremented by∑mw.(h(Si, s,), h(sj, s,)) + S, where <5
1=1
is initialized to be zero, and function A returns the Hamming distance between two strings except that, in some embodiments (1) any U position in sl may be skipped for comparison, and/or (2) for any M position in si or s when differs from s 5 may be increased by one.
The former is to account for the fact that any unsupported base in the data, preferably should not be penalized, and the latter is to penalize the bases in st or 5 - lacking read coverage when differing from bases having read support in s} . In addition, <5 may be increased by the number of U positions in si and 5 - such that a correct homozygous allele pair may have a higher score compared to an alternative. Note that when allele a. is not included in the MSA, 5. may be obtained from the allele in the MSA that shares the same coding sequence with ai . After the exons have been examined, the pairs with the same minimum score may be reported as the typing result.
[0062] Evaluation Measure, in some embodiments, to measure the accuracy and to quantify the effort required to resolve ambiguities of the typing results, concordance rale may be measured. For example, let the number of inferred allelic pairs for m HLA genes be {nv n2,..., nm) , among them (Cj, c2,...,cm) allelic pairs are consistent with the conventional typing results at four-digit level (same protein sequences from typed exons). Accordingly, the concordance rate is given by∑ i(c,■/»;)/∑¾■· Any ambiguity or wrong prediction decreases this score. While the calculation of concordance rate for some embodiments is straightforward, since HLAminer does not infer allelic pairs but only assigns a likelihood score to each candidate allele, alleles with top two highest scores to be the predicted allelic pair are chosen (according to some embodiments). If more than two alleles share the same highest score, all of the possible pairing among them may be considered. When only one allele is reported, a homozygous allelic pair may be considered.
[0063] Laboratory validation of HLA typing. Allelic HLA typing was performed using SeCore HLA Sequencing Reagents (Life Technologies, Brown Deer, WI) per the manufacturer's instructions. Briefly, exon 2-4 of HLA-A, -B and -C, exon 2 of HLA-DRB 1, and exon 2 and 3 of HLA-DQB 1 were amplified in five PCR reactions followed by Sanger sequencing from both the 5' and 3' ends. Next, some cis/trans ambiguities from the first round of sequencing were resolved by sequencing one haplotype of a heterozygous allelic pair with allele specific primers, called Z primers. For HLA-DRB1 gene, a variation at codon 86 (GGT/GTG) shows dichotomy among most HLA-DRB1 alleles, and a Z primer specific for this variation was used preemptively during the first round of sequencing to reduce cis/trans ambiguities. Data analysis was performed with uTYPE 6.0 (Life Technologies, Brown Deer, WI). Equivalent allele pairs were reported with letter strings specified by the National Marrow Donor Program
(http://bioinformatics.nmdp.org/HLA/Allele_Codes/Allele_Code_Lists/Allele_Code_Lists.as px) and the IMGT/HLA Database(l). Cases with unresolved ambiguities were further examined by the sequence specific primer (SSP) method. This was performed using the Olerup-SSP kit (Olerup SSP AB, Stockholm, Sweden). Panels of allele specific primers were selected for genes with ambiguous typing results, and the patterns of positive PCR amplifications were correlated with specific alleles using worksheets provided by the manufacturer. In addition, sequence specific oligonucleotide probe (SSOP) hybridization was performed for all samples using the LabType SSO kit (One Lambda, Canoga Park, CA) on a Luminex platform (Luminex, Austin, TX).
[0064] Accordingly, some embodiments of the present disclosure significantly outperform HLAminer(2), the only other available program capable of HLA typing using exome-seq data. However, without read filtering, HLAminer cannot finish computation even after 10 days of running (3 datasets tested). Using filtered reads as input, HLAminer reported candidate alleles ranked by likelihood without inferring the final allelic pair(s). After testing all datasets, the estimated overall concordance rate for HLAminer was 46% compared with conventional typing results, consistent with the previous report. The performance comparison between some embodiments of the present disclosure and HLAminer for individual genes is shown in Figure 2c.
[0065] The example experiment was executed at a workstation having a six-core 880 MHz and 2400 MHz AMD processors with 250 GB RAM running GNU/Linux x86_64. The method according to embodiments of the present disclosure finished within 2 minutes, while the experiment with HLAminer required 10 minutes to complete. Moreover, the example was performed (according to some embodiments), using the following:
HLAminer (v.1.0.5) may be used, with the script HPTASRwgs_classI-II.sh having a parameter "-i 1"; Novoalign (v.2.07.07) run on 9 cores with parameters "-t 30 -r all -1 80 -e 1 -i 230 140"; and/or
RAxML (v.7.3.3) to generate phylogenetic trees with parameters "-p 78960 -f a -x 12345 -# 1000 -m GTRGAMMA".
[0066] Various implementations of the embodiments disclosed above, in particular at least some of the methods/processes disclosed, may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
[0067] Such computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, for example, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term "machine-readable medium" refers to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
[0068] To provide for interaction with a user, the subject matter described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor and the like) for displaying information to the user and a keyboard and/or a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer. For example, this program can be stored, executed and operated by the dispensing unit, remote control, PC, laptop, smart-phone, media player or personal data assistant ("PDA"). Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
[0069] Certain embodiments of the subject matter described herein may be implemented in a computing system and/or devices that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front- end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the subject matter described herein), or any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.
[0070] The computing system according to some such embodiments described above may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
[0071] For example, as shown in Figure 16, a processor which may include instructions operating thereon for carrying out one and/or another disclosed method, which may communicate with one or more databases - of which, may store the sequencing data required for different embodiments of the disclosure. As noted, the processor may include computer instructions operating thereon for accomplishing any and all of the methods and processes disclosed in the present disclosure. Input/output means may also be included, and can be any such input output means known in the art (e.g., display, printer, keyboard, microphone, speaker, transceiver, and the like). Moreover, in some embodiments, the processor and at least the database can be contained in a personal computer or client computer which may operate and/or collect data from the sequencer. The processor also may communicate with other computers via a network (e.g., intranet, internet, VPN).
[0072] Similarly, Figure 17 illustrates a system according to some embodiments which may be established as a server-client based system, in which the client computers are in communication with databases, and the like. The client computers may communicate with the server via a network (e.g., intranet, internet, VPN). [0073] Any and all references to publications or other documents, including but not limited to, patents, patent applications, articles, webpages, books, etc., presented in the present application, are herein incorporated by reference in their entirety.
[0074] Although a few variations have been described in detail above, other modifications are possible. For example, any logic flow depicted in the accompanying figures and described herein does not require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of at least some of the following exemplary claims.
[0075] Example embodiments of the devices, systems and methods have been described herein. As noted elsewhere, these embodiments have been described for illustrative purposes only and are not limiting. Other embodiments are possible and are covered by the disclosure, which will be apparent from the teachings contained herein. Thus, the breadth and scope of the disclosure should not be limited by any of the above-described embodiments but should be defined only in accordance with claims supported by the present disclosure and their equivalents. Moreover, embodiments of the subject disclosure may include methods, systems and devices which may further include any and all elements from any other disclosed methods, systems, and devices, including any and all elements corresponding to HLA typing. In other words, elements from one or another disclosed embodiments may be interchangeable with elements from other disclosed embodiments. In addition, one or more features/elements of disclosed embodiments may be removed and still result in patentable subject matter (and thus, resulting in yet more embodiments of the subject disclosure).
REFERENCES
1. Robinson, J., Mistry, K., McWilliam, H., Lopez, R., Parham, P. and Marsh, S.G. (201 1) The IMGT/HLA database. Nucleic Acids Res, 39, Dl 171 -1 176.
2. Warren, R.L., Choe, G., Freeman, D.J., Castellarin, M., Munro, S., Moore, R. and Holt, R.A.
(2012) Derivation of HLA types from shotgun sequence datasets. Genome Med, 4, 95.
3. Spencer, D.H., Abel, H.J., Lockwood, CM., Payton, J.E., Szankasi, P., Kelley, T.W., Kulkarni, S., Pfeifer, J.D. and Duncavage, EJ. (2013) Detection of FLT3 Internal Tandem Duplication in Targeted, Short-Read-Length, Next-Generation Sequencing Data. JMol Diagn, 15, 81 -93.
4. Wang, C, Krishnakumar, S., Wilhelmy, J., Babrzadeh, F., Stepanyan, L., Su, L.F., Levinson, D., Fernandez-Vina, M.A., Davis, R.W., Davis, M.M. et al. (2012) High-throughput, high- fidelity HLA genotyping with deep sequencing. Proc Natl Acad Sci US A, 109, 8676-8681.
5. Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C. and Jaffe, D. (2008) ALLPATHS: de novo assembly of whole -genome shotgun microreads. Genome Res, 18, 810-820.
6. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K. et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res, 20, 265-272. Abecasis, G.R., Altshuler, D., Auton, A., Brooks, L.D., Durbin, R.M., Gibbs, R.A., Hurles, M.E. and McVean, G.A. (2010) A map of human genome variation from population-scale sequencing. Nature, 467, 1061 -1073.
Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Durbin, R.M., Handsaker, R.E., Kang, H.M., Marth, G.T. and McVean, G.A. (2012) An integrated map of genetic variation from 1 ,092 human genomes. Nature, 491, 56-65.
Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., LeProust, E.M., Brockman, W., Fennell, T., Giannoukos, G., Fisher, S., Russ, C. et al. (2009) Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol, 27, 182-189.

Claims

currently claimed:
A computer-implemented method for HLA typing, the method comprising:
accessing sequencing data that contains HLA information stored in one or more databases;
accessing HLA sequences of alleles of one or more HLA genes and the multiple sequence alignment of these HLA sequences stored in one or more databases; searching for target-specific reads/read-pairs from the sequencing data, wherein searching comprises:
first filtering the sequencing data against HLA sequences to obtain a first set of reads/read-pairs aligned to these sequences; and
second filtering the first set to eliminate reads/read-pairs concurrently aligned to both sequences of the target HLA gene and off-target HLA genes to obtain a second set of target HLA reads/read-pairs;
assembling a plurality of reads/read-pairs uniquely aligned to a target HLA gene into one or more contigs, wherein assembling comprises:
merging read-pairs having both ends aligned to the same reference allele to form single-reads/haplotype sequences; and
assembling one or more contigs from the haplotype sequences, wherein assembly of contigs having longer substrings with higher frequencies is prioritized;
and
identifying relevant target alleles having exons matching the assembled contigs, wherein identifying comprises:
decomposing target alleles into exons;
calculating the Hamming distance (HD) between individual target allele exons and corresponding best matches from the assembled contigs; and prioritizing sequence similarity over the length of matching substrings between the contigs and the target allele exons.
2. The method according to claim 1, wherein relevant alleles within a first predetermined HD are selected as candidate alleles.
3. The method of claim 1 , further comprising inferring at least one of homozygous or heterozygous allelic pairs from the candidate alleles, wherein inferring comprises selecting alleles from the candidate alleles having a HD less than a second predetermined distance.
4. The method of claim 1, wherein first filtering includes soft-clipping during alignment.
5. The method of claim 4, wherein a read/read-pair is included in the first set upon the read/read-pair having a high quality suffix-prefix alignment with the HLA sequences of known alleles.
6. The method of any of claim 1-5, further comprising dividing paired-end reads that align to the same known allele into substrings, wherein upon the prefix and suffix of the paired-end reads overlapping, and the substrings being substantially equal in length and not empty, the read-pairs are merged to form single-reads/haplotype sequences.
7. The method of claim 6, wherein upon one read of a read-pair merging at a position including a sequencing error, degenerate bases are encoded in the merged haplotype sequence.
8. The method of claim 7, wherein upon the overlap being empty, the gap between the read-pair is encoded with degenerative bases.
9, The method of any of claim 1-8, wherein assembly of the contigs from the haplotype sequences comprises merging haplotype sequences upon the haplotype sequences sharing common /-mers.
The method of claim 9, wherein / is initially set to the value of the maximum read length and then iteratively decreased by a fixed amount until a minimum threshold is reached.
1 1. The method of any of claims 1-10, wherein haplotype sequences containing sequencing errors comprise reads/read-pairs having low frequency &-mers, and wherein such read/read-pairs are eliminated from assembly into contigs.
12. The method of any of claims 1-1 1 , wherein assembly includes decomposing haplotype sequences into a set of /-mers and sorting the results in a decreasing order of frequency.
13. The method of claim 12, wherein for each /-mer in the sorted list, contigs which share a specific /-mer are assembled.
14. The method of any of claims 1-13, wherein the relative positions of any two contigs are determined in constant time by matching the /-mer.
The method of any of claims 1-14, wherein during assembly, insertion/ deletion are disallowed.
16. The method of any of claims 1-15, wherein reads/read-pairs are merged upon being concordant at each alignment position.
17. The method of any of claims 7 or 8, wherein upon degenerate bases being present and their intersection being not empty, the base at the intersection is used to assemble the contig.
18. The method of any of claims 1-17, wherein during assembly, the frequency of bases of each contig for each position is accrued for identifying insufficient exon capture or sequencing errors.
19. The method of claim 18, wherein base regions of any contig having low base support are replaced with a degenerative base.
20. The method of claim 19, wherein upon a prefix or suffix of a contig is a string of degenerative bases, assembly further comprises trimming the string,
21. The method of any of claims 1-20, wherein assembly is carried out iteratively.
22. The method of any of claims 1 -21 , further comprising tracking at least one of the removal of previously assembled contigs and the creation of new contigs.
23. The method of claim 22, wherein tracking comprises a union-find algorithm.
24. The method of any of claims 1 -23, wherein an allele is excluded from the candidate list upon the allele including an exon with no alignment found in the contigs.
25. The method of any of claims 1-24, wherein candidate alleles comprise alleles with HD of 0, upon such alleles corresponding to two or more types of protein coding sequences.
26. The method of any of claims 1-24, wherein candidate alleles comprise alleles with a HD of less than or equal to a predetermined threshold.
27. The method of any of claims 1-26, further comprising selecting two alleles with replacement from the candidate list and scoring a distance between the selected allelic pair and the remaining alleles.
28. The method of claim 27, wherein the score of each allelic pair is initialized at 0, and wherein the method further comprises inspecting multiple sequence alignment (MSA) of the alleles obtained from the database.
29. The method of claim 28, wherein for alleles having the same protein coding sequence, only one allele is included in the MSA.
30. The method of claim 29, further comprising dividing the MSA into different exo s.
31. A computer-implemented method for HLA typing, the method comprising:
accessing sequencing data that contains HLA information;
filtering the sequencing data against known allelles of one or more HLA genes to obtain a first set of reads/read-pairs aligned with known HLA alleles;
assembling contigs from the first set of reads/read-pairs, wherein assembly comprises: filtering out reads/read-pairs aligning to known, non-target HLA alleles;
merging paired-end reads; and
generating contigs from merged paired-end reads;
identifying alleles from the contigs, wherein identifying comprises:
dividing each allele into exons;
identifying a best match for each exon in the contigs; and calculating and summing Hamming distance (HD) between contigs and known target alleles;
and
inferring allelic pairs, wherein inferring comprises:
pairing candidate alleles;
scoring all allelic pairs based on candidate alleles; and
reporting allelic pairs with best scores.
32. An HLA typing system comprising a computer having a processor, the processor configured with computer instructions operating thereon to perform a method of HLA typing according to any of claims 1-31 and 37-41.
33. The system of claim 32, wherein the method also includes the step of accessing sequencing data that contains HLA information stored in one or more databases.
34. The system of claim 32 or 33, wherein the method also includes the step of accessing HLA sequences of alleles of one or more HLA genes and the multiple sequence alignment of these HLA sequences stored in one or more databases.
35. A non-transitory computer readable medium comprising computer-executable instructions recorded thereon for causing a computer to perform the method according to any of claims 1-31 and 37-41.
36. A method for HLA typing according to any one or more of the embodiments disclosed herein.
37. A computer-implemented method for HLA typing, the method comprising: searching for target-specific reads/read-pairs from sequencing data, wherein searching comprises:
first filtering the sequencing data against HLA sequences to obtain a first set of reads/read-pairs aligned to these sequences; and second filtering the first set to eliminate reads/read-pairs concurrently aligned to both sequences of the target HLA gene and off-target HLA genes to obtain a second set of target HLA reads/read-pairs.
38. The method of claim 37, further comprising assembling a plurality of reads/read-pairs uniquely aligned to a target HLA gene into one or more contigs.
39. The method of claim 38, wherein assembling comprises:
merging read-pairs having both ends aligned to the same reference allele to form single-reads/haplotype sequences; and
assembling one or more contigs from the haplotype sequences, wherein assembly of contigs having longer substrings with higher frequencies is prioritized.
40. The method of any of claims 37-39, further comprising identifying relevant target alleles having exons matching the assembled contigs.
41. The method of any of claims 37-40, wherein identifying comprises:
decomposing target alleles into exons;
calculating the Hamming distance (HD) between individual target allele exons and corresponding best matches from the assembled contigs; and prioritizing sequence similarity over the length of matching substrings between the contigs and the target allele exons.
42. The method of any of claims 1-31 and 37-41, wherein the sequencing data is obtained from one or more datasets selected from: whole genome sequencing datasets, whole exome sequencing datasets, qualified exome-sequencing datasets, datasets obtained from deep sequencing of captured HLA genes alone, one or more sets of contigs assembled from target specific reads to substantially or fully represent haplotype exons of the target gene, datasets with inclusion of gDNA sequences to enhance the ability to capture reads spanning intron-exon boundaries, and gDNA sequences that include additional sequences where capture probes for HLA introns are included.
PCT/US2014/024555 2013-03-15 2014-03-12 Accurate typing of hla through exome sequencing WO2014150924A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/772,347 US10176294B2 (en) 2013-03-15 2014-03-12 Accurate typing of HLA through exome sequencing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361793961P 2013-03-15 2013-03-15
US61/793,961 2013-03-15

Publications (2)

Publication Number Publication Date
WO2014150924A2 true WO2014150924A2 (en) 2014-09-25
WO2014150924A3 WO2014150924A3 (en) 2014-11-13

Family

ID=51581622

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/024555 WO2014150924A2 (en) 2013-03-15 2014-03-12 Accurate typing of hla through exome sequencing

Country Status (2)

Country Link
US (1) US10176294B2 (en)
WO (1) WO2014150924A2 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10426824B1 (en) 2010-05-14 2019-10-01 The General Hospital Corporation Compositions and methods of identifying tumor specific neoantigens
EP3626835A1 (en) 2018-09-18 2020-03-25 Sistemas Genómicos, S.L. Method for genotypically identifying both alleles of at least one locus of a subject's hla gene
US10801070B2 (en) 2013-11-25 2020-10-13 The Broad Institute, Inc. Compositions and methods for diagnosing, evaluating and treating cancer
US10835585B2 (en) 2015-05-20 2020-11-17 The Broad Institute, Inc. Shared neoantigens
US10975442B2 (en) 2014-12-19 2021-04-13 Massachusetts Institute Of Technology Molecular biomarkers for cancer immunotherapy
US10993997B2 (en) 2014-12-19 2021-05-04 The Broad Institute, Inc. Methods for profiling the t cell repertoire
US11452768B2 (en) 2013-12-20 2022-09-27 The Broad Institute, Inc. Combination therapy with neoantigen vaccine
US11549149B2 (en) 2017-01-24 2023-01-10 The Broad Institute, Inc. Compositions and methods for detecting a mutant variant of a polynucleotide
US11725237B2 (en) 2013-12-05 2023-08-15 The Broad Institute Inc. Polymorphic gene typing and somatic change detection using sequencing data
US11793867B2 (en) 2017-12-18 2023-10-24 Biontech Us Inc. Neoantigens and uses thereof

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101815529B1 (en) * 2016-07-29 2018-01-30 (주)신테카바이오 Human Haplotyping System And Method
CN111213210A (en) * 2017-09-06 2020-05-29 河谷控股Ip有限责任公司 HLA tissue matching and methods therefor
US20230220466A1 (en) * 2019-09-10 2023-07-13 The Regents Of The University Of California Immune cell sequencing methods
CN111798924B (en) * 2020-07-07 2024-03-26 博奥生物集团有限公司 Human leukocyte antigen typing method and device
CN112669903B (en) * 2020-12-29 2024-04-02 北京旌准医疗科技有限公司 HLA typing method and equipment based on Sanger sequencing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009049889A1 (en) * 2007-10-16 2009-04-23 Roche Diagnostics Gmbh High resolution, high throughput hla genotyping by clonal sequencing
US20100261189A1 (en) * 2008-10-03 2010-10-14 Roche Molecular Systems, Inc. System and method for detection of HLA Variants

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009049889A1 (en) * 2007-10-16 2009-04-23 Roche Diagnostics Gmbh High resolution, high throughput hla genotyping by clonal sequencing
US20100261189A1 (en) * 2008-10-03 2010-10-14 Roche Molecular Systems, Inc. System and method for detection of HLA Variants

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BOEGEL, S ET AL.: 'HLA Typing From RNA-Seq Sequence Reads.' GENOME MED. vol. 4, no. 12, 2012, page 102 *
WANG, C ET AL.: 'High-Throughput, High-fidelity HLA Genotyping With Deep Sequencing.' PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES vol. 109, no. 22, 2012, pages 8676 - 8681 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10426824B1 (en) 2010-05-14 2019-10-01 The General Hospital Corporation Compositions and methods of identifying tumor specific neoantigens
US10801070B2 (en) 2013-11-25 2020-10-13 The Broad Institute, Inc. Compositions and methods for diagnosing, evaluating and treating cancer
US11834718B2 (en) 2013-11-25 2023-12-05 The Broad Institute, Inc. Compositions and methods for diagnosing, evaluating and treating cancer by means of the DNA methylation status
US11725237B2 (en) 2013-12-05 2023-08-15 The Broad Institute Inc. Polymorphic gene typing and somatic change detection using sequencing data
US11452768B2 (en) 2013-12-20 2022-09-27 The Broad Institute, Inc. Combination therapy with neoantigen vaccine
US10975442B2 (en) 2014-12-19 2021-04-13 Massachusetts Institute Of Technology Molecular biomarkers for cancer immunotherapy
US10993997B2 (en) 2014-12-19 2021-05-04 The Broad Institute, Inc. Methods for profiling the t cell repertoire
US11939637B2 (en) 2014-12-19 2024-03-26 Massachusetts Institute Of Technology Molecular biomarkers for cancer immunotherapy
US10835585B2 (en) 2015-05-20 2020-11-17 The Broad Institute, Inc. Shared neoantigens
US11549149B2 (en) 2017-01-24 2023-01-10 The Broad Institute, Inc. Compositions and methods for detecting a mutant variant of a polynucleotide
US11793867B2 (en) 2017-12-18 2023-10-24 Biontech Us Inc. Neoantigens and uses thereof
EP3626835A1 (en) 2018-09-18 2020-03-25 Sistemas Genómicos, S.L. Method for genotypically identifying both alleles of at least one locus of a subject's hla gene

Also Published As

Publication number Publication date
US20160125128A1 (en) 2016-05-05
US10176294B2 (en) 2019-01-08
WO2014150924A3 (en) 2014-11-13

Similar Documents

Publication Publication Date Title
US10176294B2 (en) Accurate typing of HLA through exome sequencing
Song et al. Capturing the phylogeny of Holometabola with mitochondrial genome data and Bayesian site-heterogeneous mixture models
Liu et al. ATHLATES: accurate typing of human leukocyte antigen through exome sequencing
Posada-Cespedes et al. Recent advances in inferring viral diversity from high-throughput sequencing data
Hu et al. Bioinformatics resources for SARS-CoV-2 discovery and surveillance
Lu et al. Oxford Nanopore MinION sequencing and genome assembly
Léveillé-Bourret et al. Resolving rapid radiations within angiosperm families using anchored phylogenomics
Pylro et al. Data analysis for 16S microbial profiling from different benchtop sequencing platforms
EP3049557B1 (en) Methods and systems for large scale scaffolding of genome assemblies
Magi et al. Characterization of MinION nanopore data for resequencing analyses
Hernandez et al. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer
KR102199322B1 (en) Noninvasive prenatal molecular karyotyping from maternal plasma
KR102638152B1 (en) Verification method and system for sequence variant calling
Vohr et al. A phylogenetic approach for haplotype analysis of sequence data from complex mitochondrial mixtures
Deshpande et al. RNA-seq data science: From raw data to effective interpretation
Ockendon et al. Optimization of next‐generation sequencing transcriptome annotation for species lacking sequenced genomes
Tigano et al. Chromosome-level assembly of the Atlantic silverside genome reveals extreme levels of sequence diversity and structural genetic variation
Kuipers et al. Within-patient genetic diversity of SARS-CoV-2
Fritz et al. Haploflow: strain-resolved de novo assembly of viral genomes
US11289177B2 (en) Computer method and system of identifying genomic mutations using graph-based local assembly
Formenti et al. Merfin: improved variant filtering and polishing via k-mer validation
Kamoun et al. Improving prokaryotic transposable elements identification using a combination of de novo and profile HMM methods
US10424395B2 (en) Computation pipeline of single-pass multiple variant calls
Goswami et al. RNA-Seq for revealing the function of the transcriptome
Tetikol et al. Population-specific genome graphs improve high-throughput sequencing data analysis: A case study on the Pan-African genome

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14769174

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 14769174

Country of ref document: EP

Kind code of ref document: A2