NZ791625A - Variant classifier based on deep neural networks - Google Patents
Variant classifier based on deep neural networks
- Publication number
- NZ791625A
- Authority
- NZ
- New Zealand
- Prior art keywords
- variant
- neural network
- feature
- sequence
- metadata
Abstract
We introduce a variant classifier that uses trained deep neural networks to predict whether a given variant is somatic or germline. Our model has two deep neural networks: a convolutional neural network (CNN) and a fully-connected neural network (FCNN), and two inputs: a DNA sequence with a variant and a set of metadata features correlated with the variant. The metadata features represent the variant’s mutation characteristics, read mapping statistics, and occurrence frequency. The CNN processes the DNA sequence and produces an intermediate convolved feature. A feature sequence is derived by concatenating the metadata features with the intermediate convolved feature. The FCNN processes the feature sequence and produces probabilities for the variant being somatic, germline, or noise. A transfer learning strategy is used to train the model on two mutation datasets. Results establish advantages and superiority of our model over traditional classifiers.
Description
Atty. Docket No.: ILLM 1007-3WO/IPPCT
VARIANT CLASSIFIER BASED ON DEEP NEURAL NETWORKS
PRIORITY APPLICATIONS
This application claims priority to or the benefit of the following applications:
US Provisional Patent Application No. 62/656,741, entitled “VARIANT CLASSIFIER BASED ON
DEEP NEURAL NETWORKS,” filed on April 12, 2018, (Atty. Docket No. ILLM 1007-1/IPPRV); and
Netherlands Application No. 2020861, entitled “VARIANT CLASSIFIER BASED ON DEEP
NEURAL NETWORKS,” filed on May 2, 2018, (Atty. Docket No. ILLM 1007-4/IPNL).
The priority applications are hereby incorporated by reference for all purposes.
FIELD OF THE TECHNOLOGY DISCLOSED
The technology disclosed relates to artificial intelligence type computers and digital data processing
systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge
based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with
uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
In particular, the technology disclosed relates to using deep neural networks such as convolutional neural networks
(CNNs) and fully-connected neural networks (FCNNs) for analyzing data.
BACKGROUND
The subject matter discussed in this section should not be assumed to be prior art merely as a result of
its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter
provided as background should not be assumed to have been previously recognized in the prior art. The subject
matter in this section merely represents different approaches, which in and of themselves can also correspond to
implementations of the claimed technology.
Next-generation sequencing has made large amounts of sequenced data available for variant
classification. Sequenced data are highly correlated and have complex interdependencies, which has hindered the
application of traditional classifiers like support vector machines to the variant classification task. Advanced
classifiers that are capable of extracting high-level features from sequenced data are thus desired.
Deep neural networks are a type of artificial neural network that uses multiple nonlinear and complex
transforming layers to successively model high-level features and provide feedback via backpropagation. Deep
neural networks have evolved with the availability of large training datasets, the power of parallel and distributed
computing, and sophisticated training algorithms. Deep neural networks have facilitated major advances in
numerous domains such as computer vision, speech recognition, and natural language processing.
Convolutional neural networks and recurrent neural networks are components of deep neural networks.
Convolutional neural networks have succeeded particularly in image recognition with an architecture that comprises
convolution layers, nonlinear layers, and pooling layers. Recurrent neural networks are designed to utilize sequential
information of input data with cyclic connections among building blocks like perceptrons, long short-term memory
units, and gated recurrent units. In addition, many other emergent deep neural networks have been proposed for
limited contexts, such as deep spatio-temporal neural networks, multi-dimensional recurrent neural networks, and
convolutional auto-encoders.
The goal of training deep neural networks is optimization of the weight parameters in each layer,
which gradually combines simpler features into complex features so that the most suitable hierarchical
representations can be learned from data. A single cycle of the optimization process is organized as follows. First,
given a training dataset, the forward pass sequentially computes the output in each layer and propagates the function
signals forward through the network. In the final output layer, an objective loss function measures error between the
inferenced outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to
backpropagate error signals and compute gradients with respect to all weights throughout the neural network.
Finally, the weight parameters are updated using optimization algorithms based on stochastic gradient descent.
Whereas batch gradient descent performs parameter updates for each complete dataset, stochastic gradient descent
provides stochastic approximations by performing the updates for each small set of data examples. Several
optimization algorithms stem from stochastic gradient descent. For example, the Adagrad and Adam training
algorithms perform stochastic gradient descent while adaptively modifying learning rates based on update frequency
and moments of the gradients for each parameter, respectively.
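The training cycle described above (forward pass, loss, backward pass, update on a small set of examples) can be sketched in a few lines. This is a minimal illustration, not the training code of the technology disclosed: the one-parameter model y = w * x, the synthetic data, and the names `minibatch_sgd`, `learning_rate`, and `batch_size` are all assumptions made for the example.

```python
import random
from typing import List, Tuple


def minibatch_sgd(
    examples: List[Tuple[float, float]],
    learning_rate: float = 0.1,
    batch_size: int = 4,
    epochs: int = 200,
    seed: int = 0,
) -> float:
    """Fit the toy model y = w * x by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w = 0.0  # single weight parameter, initialized at zero
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # a new random partition into mini-batches each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Forward pass gives w * x; the loss is mean squared error.
            # Backward pass: d(loss)/dw by the chain rule, averaged over the batch.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            # Update step: move the weight against the gradient.
            w -= learning_rate * grad
    return w


# Synthetic examples generated from y = 3x (made up for this sketch).
toy_examples = [(x, 3.0 * x) for x in (0.1 * i for i in range(1, 21))]
learned_w = minibatch_sgd(toy_examples)
print(round(learned_w, 4))
```

Setting `batch_size` to the size of the dataset turns the same loop into batch gradient descent, with one update per complete pass over the data.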
Another core element in the training of deep neural networks is regularization, which refers to
strategies intended to avoid overfitting and thus achieve good generalization performance. For example, weight
decay adds a penalty term to the objective loss function so that weight parameters converge to smaller absolute
values. Dropout randomly removes hidden units from neural networks during training and can be considered an
ensemble of possible subnetworks. To enhance the capabilities of dropout, a new activation function, maxout, and a
variant of dropout for recurrent neural networks called rnnDrop have been proposed. Furthermore, batch
normalization provides a new regularization method through normalization of scalar features for each activation
within a mini-batch and learning each mean and variance as parameters.
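As a rough illustration of batch normalization, the sketch below normalizes one scalar activation across a mini-batch and then rescales it. It assumes the usual formulation with a learned scale `gamma` and shift `beta`; those names, the `eps` constant, and the sample values are illustrative choices, not details taken from this document.

```python
import math
from typing import List


def batch_norm(activations: List[float], gamma: float = 1.0,
               beta: float = 0.0, eps: float = 1e-5) -> List[float]:
    """Normalize one scalar activation across a mini-batch, then scale and shift."""
    mean = sum(activations) / len(activations)
    variance = sum((a - mean) ** 2 for a in activations) / len(activations)
    # eps keeps the division stable when the batch variance is close to zero.
    scale = gamma / math.sqrt(variance + eps)
    return [scale * (a - mean) + beta for a in activations]


# Four made-up activation values of the same neuron within one mini-batch.
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

With `gamma = 1` and `beta = 0` the output has approximately zero mean and unit variance; during training the two parameters would be learned along with the weights.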
Given that sequenced data are multi- and high-dimensional, deep neural networks have great promise
for bioinformatics research because of their broad applicability and enhanced prediction power. Convolutional
neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery,
pathogenic variant identification, and gene expression inference. A hallmark of convolutional neural networks is the
use of convolution filters. Unlike traditional classification approaches that are based on elaborately-designed and
manually-crafted features, convolution filters perform adaptive learning of features, analogous to a process of
mapping raw input data to the informative representation of knowledge. In this sense, the convolution filters serve as
a series of motif scanners, since a set of such filters is capable of recognizing relevant patterns in the input and
updating themselves during the training procedure. Recurrent neural networks can capture long-range dependencies
in sequential data of varying lengths, such as protein or DNA sequences.
Therefore, an opportunity arises to use deep neural networks for variant classification.
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings, like reference characters generally refer to like parts throughout the different views.
Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the
principles of the technology disclosed. In the following description, various implementations of the technology
disclosed are described with reference to the following drawings.
illustrates an environment in which the variant classifier operates according to one
implementation.
illustrates an example input sequence with a variant flanked by upstream and downstream
bases.
shows the one-hot encoding scheme used to encode the input sequence.
shows one implementation of a metadata correlator that correlates each unclassified variant
with respective values of mutation characteristics, read mapping statistics, and occurrence frequency.
highlights some examples of context metadata features correlated with the variant.
highlights some examples of sequencing metadata features correlated with the variant.
highlights some examples of functional metadata features correlated with the variant.
highlights some examples of population metadata features correlated with the variant.
highlights one example of an ethnicity metadata feature correlated with the variant.
shows an architectural example of variant classification performed by the variant classifier.
shows an algorithmic example of variant classification performed by the variant classifier.
depicts one implementation of training the variant classifier according to a transfer learning
strategy, followed by evaluation and testing of the trained variant classifier.
shows performance results of the variant caller (also referred to herein as Sojourner) on exonic
data. These results, quantified by sensitivity and specificity, establish Sojourner’s advantages and superiority over a
non-deep neural network classifier.
shows the improvement in false positive rate using Sojourner versus the non-deep neural
network classifier when classifying variants over exons.
shows the mean absolute tumor mutational burden (TMB) error using Sojourner versus the
non-deep neural network classifier when classifying variants over exons.
shows the improvement in mean absolute TMB error using Sojourner versus the non-deep
neural network classifier when classifying variants over exons.
shows performance results of Sojourner on CDS (coding DNA sequence) data. These results,
quantified by sensitivity and specificity, establish Sojourner’s advantages and superiority over the non-deep neural
network classifier.
shows similar false positive rate using Sojourner versus the non-deep neural network classifier
when classifying variants over coding regions.
shows the mean absolute TMB error using Sojourner versus the non-deep neural network
classifier when classifying variants over coding regions.
shows similar mean absolute TMB error using Sojourner versus the non-deep neural network
classifier when classifying variants over exons.
shows a computer system that can be used to implement the variant classifier.
DETAILED DESCRIPTION
The following discussion is presented to enable any person skilled in the art to make and use the
technology disclosed, and is provided in the context of a particular application and its requirements. Various
modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general
principles defined herein may be applied to other implementations and applications without departing from the spirit
and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the
implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed
herein.
The discussion is organized as follows. First, an introduction describing some of the technical
problems addressed by various implementations is presented, followed by an overview of the variant classifier and
an explanation of terminology used throughout the discussion. Next, an example environment in which the variant
classifier operates is discussed at a high level along with a sequencing process and a variant annotation/call
application. Then, various data structures fed as input to the variant classifier are discussed together with a data
correlation model and some metadata samples. Next, an architectural example of variant classification performed by
the variant classifier is presented, followed by an algorithmic example of the same. Then, a transfer learning strategy
used to train the variant classifier is discussed in conjunction with strategies for evaluating and testing the variant
classifier. Next, performance results that establish advantages and superiority of the variant classifier over a non-deep
neural network classifier are presented. Finally, various particular implementations are discussed.
Introduction
The transformation of a normal cell into a cancer cell takes place through a sequence of discrete
genetic events called somatic mutations. Tumor mutational burden (TMB) is a measurement of the number of
somatic mutations per megabase of sequenced DNA and is used as a quantitative indicator for predicting response to
cancer immunotherapy. Germline variant filtering is an important preprocessing step for obtaining accurate TMB
assessments because only somatic variants are used for calculating TMB and germline variants are far more
common than somatic variants (100-1000×).
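The definition of TMB and the effect of unfiltered germline variants can be made concrete with a small calculation. This is a sketch under stated assumptions: the function name `tumor_mutational_burden`, the 2 Mb panel size, the variant counts, and the 1% leak rate are hypothetical numbers chosen only to show the arithmetic.

```python
def tumor_mutational_burden(somatic_variant_count: int, sequenced_bases: int) -> float:
    """Return the number of somatic mutations per megabase of sequenced DNA."""
    if somatic_variant_count < 0:
        raise ValueError("somatic_variant_count must be non-negative")
    if sequenced_bases <= 0:
        raise ValueError("sequenced_bases must be positive")
    return somatic_variant_count / (sequenced_bases / 1_000_000)


# Hypothetical sample: 20 true somatic variants on a 2 Mb panel, with germline
# variants 100 times more common (the low end of the 100-1000x range in the text).
panel_bases = 2_000_000
true_somatic = 20
germline = 100 * true_somatic

clean_tmb = tumor_mutational_burden(true_somatic, panel_bases)

# If just 1% of germline variants were mislabeled as somatic, they would be
# counted in the numerator and inflate the estimate.
leaked = round(germline * 0.01)
inflated_tmb = tumor_mutational_burden(true_somatic + leaked, panel_bases)
print(clean_tmb, inflated_tmb)
```

In this made-up case a 1% germline leak doubles the estimate, which is why the filtering step carries so much weight.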
We introduce a variant classifier that uses trained deep neural networks to predict whether a given
variant is somatic or germline. Our model has two deep neural networks: a convolutional neural network (CNN) and
a fully-connected neural network (FCNN). Our model receives two inputs: a DNA sequence with a variant and a set
of metadata features correlated with the variant.
The first input to the model is the DNA sequence. We regard the DNA sequence as an image with
multiple channels that numerically encode the four types of nucleotide bases, A, C, G, and T. The DNA sequence,
spanning the variant, is one-hot encoded to conserve the position-specific information of each individual base in the
sequence.
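A minimal sketch of one-hot encoding for a DNA sequence follows. The channel order A, C, G, T matches the order in which the bases are listed above, but the function name `one_hot_encode` and the choice to encode any other symbol (for example N) as an all-zero row are assumptions of this example.

```python
from typing import List

BASES = "ACGT"  # one channel per nucleotide base


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode each base as a 4-element row with a single 1 in its channel.

    Row i describes position i, so position-specific information is preserved.
    Symbols outside A, C, G, T become all-zero rows (an assumption of this sketch).
    """
    return [[1 if base == channel else 0 for channel in BASES]
            for base in sequence.upper()]


encoded = one_hot_encode("ACGT")
for base, row in zip("ACGT", encoded):
    print(base, row)
```

The result has one row per position and four columns, which is the sense in which the sequence can be treated like an image with multiple channels.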
The convolutional neural network receives the one-hot encoded DNA sequence because it is capable of
preserving the spatial locality relationships within the sequence. The convolutional neural network processes the
DNA sequence through multiple convolution layers and produces one or more intermediate convolved features. The
convolution layers use convolution filters to detect features within the DNA sequence. The convolution filters act
as motif detectors that scan the DNA sequence for low-level motif features and produce signals of different
strengths depending on the underlying sequence patterns. The convolution filters are automatically learned after
training on thousands and millions of training examples of somatic and germline variants.
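To show how a convolution filter can behave as a motif detector, the sketch below slides a single filter over a one-hot encoded sequence and applies a rectified linear unit. The filter weights are set by hand to match the motif TATA purely for illustration; in the model described here the filters are learned from training examples, and the names `conv1d_relu` and `bias` are assumptions.

```python
from typing import List

BASES = "ACGT"


def one_hot(sequence: str) -> List[List[int]]:
    """One row per position, one column per base."""
    return [[1 if base == channel else 0 for channel in BASES] for base in sequence]


def conv1d_relu(encoded: List[List[int]], kernel: List[List[float]],
                bias: float = 0.0) -> List[float]:
    """Slide one filter along the sequence and apply a rectified linear unit."""
    width = len(kernel)
    signal = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(len(BASES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        signal.append(max(0.0, total))  # ReLU keeps only positive responses
    return signal


# Hand-set filter that responds to the motif "TATA" (learned in a real model).
tata_filter = [[float(v) for v in row] for row in one_hot("TATA")]
signal = conv1d_relu(one_hot("GGTATAGG"), tata_filter, bias=-2.0)
print(signal)
```

The strongest response appears at the window that contains the full motif, while windows with a partial match are suppressed by the negative bias and the ReLU.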
The second input to the model is the set of metadata features correlated with the variant. The metadata
features represent the variant’s mutation characteristics, read mapping statistics, and occurrence frequency.
Examples of mutation characteristics are variant type, amino acid impact, evolutionary conservation, and clinical
significance. Examples of read mapping statistics are variant allele frequency, read depth, and base call quality
score. Examples of occurrence frequency are allele frequencies in sequenced populations and ethnic subpopulations.
Some of the metadata features are encoded using categorical data such as one-hot or Boolean values,
while others are encoded using continuous data such as percentage and probability values. The metadata features
lack locality relationships because they are correlated only with the variant. This makes them suitable for processing
by the fully-connected neural network.
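The mix of categorical and continuous encodings can be sketched as a single flat feature vector. The specific features chosen here (a three-way variant type, a coding-region flag, and two allele frequencies), their order, and the name `encode_metadata` are hypothetical; the text lists the kinds of metadata but not a concrete layout.

```python
from typing import List

# Hypothetical category set; the text names "variant type" but not its values.
VARIANT_TYPES = ("SNV", "insertion", "deletion")


def encode_metadata(variant_type: str, in_coding_region: bool,
                    variant_allele_frequency: float,
                    population_allele_frequency: float) -> List[float]:
    """Build one flat metadata vector for a single variant."""
    if variant_type not in VARIANT_TYPES:
        raise ValueError(f"unknown variant type: {variant_type!r}")
    for value in (variant_allele_frequency, population_allele_frequency):
        if not 0.0 <= value <= 1.0:
            raise ValueError("frequencies must lie in [0, 1]")
    # Categorical feature: one-hot over the variant types.
    type_one_hot = [1.0 if variant_type == t else 0.0 for t in VARIANT_TYPES]
    # Boolean feature: a single 0/1 value.
    coding_flag = [1.0 if in_coding_region else 0.0]
    # Continuous features: kept as probabilities.
    return type_one_hot + coding_flag + [variant_allele_frequency,
                                         population_allele_frequency]


metadata_vector = encode_metadata("insertion", True, 0.48, 0.01)
print(metadata_vector)
```

Because no element of this vector has a meaningful neighbor, its order is arbitrary, which is consistent with feeding it to fully-connected layers rather than convolution layers.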
First, a feature sequence is derived by concatenating the metadata features with the intermediate
convolved features. The fully-connected neural network then processes the feature sequence through multiple
fully-connected layers. The densely connected neurons of the fully-connected layers detect high-level features encoded in
the feature sequence. Finally, a classification layer of the fully-connected neural network outputs probabilities for
the variant being somatic, germline, or noise. Having the noise category improves classification along the somatic
and germline categories.
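The concatenation and classification steps can be sketched as follows. This is an illustrative forward pass only: the layer sizes, the random untrained weights, the concatenation order (convolved features first), and the use of softmax in the classification layer are assumptions of the example rather than details confirmed by the text.

```python
import math
import random
from typing import Dict, List, Tuple

CLASSES = ("somatic", "germline", "noise")
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def dense(inputs: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """Fully-connected layer: every output neuron sees every input value."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng: random.Random, n_in: int, n_out: int) -> Layer:
    """Untrained placeholder weights; a real model would learn these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(convolved: List[float], metadata: List[float],
             hidden_layer: Layer, output_layer: Layer) -> Dict[str, float]:
    feature_sequence = list(convolved) + list(metadata)  # concatenation step
    hidden = relu(dense(feature_sequence, *hidden_layer))
    return dict(zip(CLASSES, softmax(dense(hidden, *output_layer))))


rng = random.Random(0)
convolved_feature = [0.0, 2.0, 0.5]          # made-up CNN output
metadata_features = [1.0, 0.0, 0.48, 0.01]   # made-up metadata vector
hidden_layer = random_layer(rng, n_in=7, n_out=5)
output_layer = random_layer(rng, n_in=5, n_out=3)
probabilities = classify(convolved_feature, metadata_features,
                         hidden_layer, output_layer)
print(probabilities)
```

The three outputs sum to one, so a variant that resembles neither class can place its probability mass on noise instead of being forced into somatic or germline.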
Pairs of batch normalization and rectified linear unit nonlinearity are interspersed between the
convolutional layers and the fully-connected layers to enhance learning rates and reduce overfitting. The model is
pre-trained on somatic and germline variants from The Cancer Genome Atlas (TCGA) dataset and then fine-tuned
on the TruSight Tumor (TST) dataset according to a transfer learning strategy. Results demonstrate the effectiveness
and efficiency of our model on validation data held-out from the TST dataset. These results, quantified by sensitivity
and specificity, establish advantages and superiority of our model over traditional classifiers.
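The pre-train then fine-tune pattern can be illustrated with a toy model. Nothing below uses TCGA or TST data: the two tiny synthetic datasets, the one-parameter model y = w * x, the learning rates, and the function names are stand-ins chosen only to show weights learned on one dataset being reused as the starting point for a second, smaller one.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]


def mse(w: float, data: Dataset) -> float:
    """Mean squared error of the toy model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w: float, data: Dataset, learning_rate: float, epochs: int) -> float:
    """Full-batch gradient descent, starting from the weight passed in."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w


# Synthetic stand-ins: a source set for pre-training and a smaller,
# slightly different target set for fine-tuning.
source_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
target_data = [(1.0, 2.5), (2.0, 5.0)]

# Step 1: pre-train from scratch on the source dataset.
pretrained_w = train(0.0, source_data, learning_rate=0.05, epochs=100)
# Step 2: fine-tune on the target dataset, starting from the pre-trained
# weight and using a smaller learning rate and fewer epochs.
fine_tuned_w = train(pretrained_w, target_data, learning_rate=0.01, epochs=20)

print(mse(pretrained_w, target_data), mse(fine_tuned_w, target_data))
```

The fine-tuned weight fits the target data better than the pre-trained one while still starting from what was learned on the source data, which is the motivation for this kind of strategy when the second dataset is small.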
Terminology
All literature and similar material cited in this application, including, but not limited to, patents, patent
applications, articles, books, treatises, and web pages, regardless of the format of such literature and similar
materials, are expressly incorporated by reference in their entirety. In the event that one or more of the incorporated
literature and similar materials differs from or contradicts this application, including but not limited to defined
terms, term usage, described techniques, or the like, this application controls.
As used herein, the following terms have the meanings indicated.
Some portions of this application, particularly the drawings, refer to the variant classifier as
“Sojourner”.
A base refers to a nucleotide base or nucleotide, A (adenine), C (cytosine), T (thymine), or G
(guanine).
The term “chromosome” refers to the heredity-bearing gene carrier of a living cell, which is derived
from chromatin strands comprising DNA and protein components (especially histones). The conventional
internationally recognized individual human genome chromosome numbering system is employed herein.
The term “site” refers to a unique position (e.g., chromosome ID, chromosome position and
orientation) on a reference genome. In some implementations, a site may be a residue, a sequence tag, or a segment's
position on a sequence. The term “locus” may be used to refer to the specific location of a nucleic acid sequence or
polymorphism on a reference chromosome.
The term “sample” herein refers to a sample, typically derived from a biological fluid, cell, tissue,
organ, or organism containing a nucleic acid or a mixture of nucleic acids containing at least one nucleic acid
sequence that is to be sequenced and/or phased. Such samples include, but are not limited to, sputum/oral fluid,
amniotic fluid, blood, a blood fraction, fine needle biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.),
urine, peritoneal fluid, pleural fluid, tissue explant, organ culture and any other tissue or cell preparation, or fraction
or derivative thereof or isolated therefrom. Although the sample is often taken from a human subject (e.g., patient),
samples can be taken from any organism having chromosomes, including, but not limited to dogs, cats, horses,
{00691484.DOCX }
Atty. Docket No.: ILLM 1007-3WO/IPPCT
goats, sheep, cattle, pigs, etc. The sample may be used directly as obtained from the biological source or following a
pretreatment to modify the character of the sample. For example, such pretreatment may include preparing plasma
from blood, diluting viscous fluids and so forth. Methods of pretreatment may also involve, but are not limited to,
filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration,
amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, etc.
The term “sequence” includes or represents a strand of nucleotides coupled to each other. The
nucleotides may be based on DNA or RNA. It should be understood that one sequence may include multiple sub-sequences.
For example, a single sequence (e.g., of a PCR amplicon) may have 350 nucleotides. The sample read
may include multiple sub-sequences within these 350 nucleotides. For instance, the sample read may include first
and second flanking sub-sequences having, for example, 20-50 nucleotides. The first and second flanking sub-sequences
may be located on either side of a repetitive segment having a corresponding sub-sequence (e.g., 40-100
nucleotides). Each of the flanking sub-sequences may include (or include portions of) a primer sub-sequence (e.g.,
-30 nucleotides). For ease of reading, the term “sub-sequence” will be referred to as “sequence,” but it is
understood that two sequences are not necessarily separate from each other on a common strand. To differentiate the
various sequences described herein, the sequences may be given different labels (e.g., target sequence, primer
sequence, flanking sequence, reference sequence, and the like). Other terms, such as “allele,” may be given different
labels to differentiate between like objects.
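The relationship between a read and its flanking and repetitive sub-sequences can be sketched as simple slicing. The function name, the flank lengths, and the synthetic read below are hypothetical and chosen only to fall inside the example ranges given above.

```python
def split_read(read, left_len, right_len):
    """Split a read into (left flank, middle segment, right flank)."""
    if left_len < 0 or right_len < 0 or left_len + right_len > len(read):
        raise ValueError("flank lengths do not fit inside the read")
    end = len(read) - right_len
    return read[:left_len], read[left_len:end], read[end:]


# Hypothetical 100-nucleotide read: 20-nt flanks around a 60-nt repeat.
read = "A" * 20 + "CAG" * 20 + "T" * 20
left, middle, right = split_read(read, 20, 20)
```

The three pieces are contiguous slices of one strand, which matches the note above that the labeled sequences are not necessarily separate from each other.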
The term “paired-end sequencing” refers to sequencing methods that sequence both ends of a target
fragment. Paired-end sequencing may facilitate detection of genomic rearrangements and repetitive segments, as
well as gene fusions and novel transcripts. Methodology for paired-end sequencing is described in PCT publication
WO07010252, PCT application Serial No. PCTGB2007/003798 and US patent application publication US
2009/0088327, each of which is incorporated by reference herein. In one example, a series of operations may be
performed as follows: (a) generate clusters of nucleic acids; (b) linearize the nucleic acids; (c) hybridize a first
sequencing primer and carry out repeated cycles of extension, scanning and deblocking, as set forth above; (d)
“invert” the target nucleic acids on the flow cell surface by synthesizing a complementary copy; (e) linearize the
resynthesized strand; and (f) hybridize a second sequencing primer and carry out repeated cycles of extension,
scanning and deblocking, as set forth above. The inversion operation can be carried out by delivering reagents as set
forth above for a single cycle of bridge amplification.
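The outcome of operations (a) through (f) can be illustrated in silico: one read comes from each end of the fragment, and the second is reported as the reverse complement because it is read from the inverted strand. This is a simplified sketch with hypothetical names and it does not model the chemistry.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_len):
    """Return (read1, read2) taken from the two ends of a fragment."""
    if not 0 < read_len <= len(fragment):
        raise ValueError("read_len must be between 1 and the fragment length")
    read1 = fragment[:read_len]                        # first sequencing primer
    read2 = reverse_complement(fragment[-read_len:])   # after strand inversion
    return read1, read2


read1, read2 = paired_end_reads("AACCGGTTAC", 4)
```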
The term “reference genome” or “reference sequence” refers to any particular known genome
sequence, whether partial or complete, of any organism which may be used to reference identified sequences from a
subject. For example, a reference genome used for human subjects as well as many other organisms is found at the
National Center for Biotechnology Information at ncbi.nlm.nih.gov. A “genome” refers to the complete genetic
information of an organism or virus, expressed in nucleic acid sequences. A genome includes both the genes and the
noncoding sequences of the DNA. The reference sequence may be larger than the reads that are aligned to it. For
example, it may be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times
larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger. In one
example, the reference genome sequence is that of a full length human genome. In another example, the reference
genome sequence is limited to a specific human chromosome such as chromosome 13. In some implementations, a
reference chromosome is a chromosome sequence from human genome version hg19. Such sequences may be
referred to as chromosome reference sequences, although the term reference genome is intended to cover such
sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal
regions (such as strands), etc., of any species. In various implementations, the reference genome is a
consensus sequence or other combination derived from multiple individuals. However, in certain applications, the
reference sequence may be taken from a particular individual.
The term “read” refers to a collection of sequence data that describes a fragment of a nucleotide sample
or reference. The term “read” may refer to a sample read and/or a reference read. Typically, though not necessarily,
a read represents a short sequence of contiguous base pairs in the sample or reference. The read may be represented
symbolically by the base pair sequence (in ATCG) of the sample or reference fragment. It may be stored in a
memory device and processed as appropriate to determine whether the read matches a reference sequence or meets
other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence
information concerning the sample. In some cases, a read is a DNA sequence of sufficient length (e.g., at least about
bp) that can be used to identify a larger sequence or region, e.g., that can be aligned and specifically assigned to a
chromosome or genomic region or gene.
Next-generation sequencing methods include, for example, sequencing by synthesis technology
(Illumina), pyrosequencing (454), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time
sequencing (Pacific Biosciences) and sequencing by ligation (SOLiD sequencing). Depending on the
sequencing methods, the length of each read may vary from about 30 bp to more than 10,000 bp. For example,
the sequencing-by-ligation method using a SOLiD sequencer generates nucleic acid reads of about 50 bp. For another
example, Ion Torrent sequencing generates nucleic acid reads of up to 400 bp and 454 pyrosequencing generates
nucleic acid reads of about 700 bp. For yet another example, single-molecule real-time sequencing methods may
generate reads of 10,000 bp to 15,000 bp. Therefore, in certain implementations, the nucleic acid sequence reads
have a length of 30-100 bp, 50-200 bp, or 50-400 bp.
The terms “sample read”, “sample sequence” or “sample fragment” refer to sequence data for a
genomic sequence of interest from a sample. For example, the sample read comprises sequence data from a PCR
amplicon having a forward and reverse primer sequence. The sequence data can be obtained from any select
sequence methodology. The sample read can be, for example, from a sequencing-by-synthesis (SBS) reaction, a
sequencing-by-ligation reaction, or any other suitable sequencing methodology for which it is desired to determine
the length and/or identity of a repetitive element. The sample read can be a consensus (e.g., averaged or weighted)
sequence derived from multiple sample reads. In certain implementations, providing a reference sequence comprises
identifying a locus-of-interest based upon the primer sequence of the PCR amplicon.
The term “raw fragment” refers to sequence data for a portion of a genomic sequence of interest that at
least partially overlaps a designated position or secondary position of interest within a sample read or sample
fragment. Non-limiting examples of raw fragments include a duplex stitched fragment, a simplex stitched fragment,
a duplex un-stitched fragment and a simplex un-stitched fragment. The term “raw” is used to indicate that the raw
fragment includes sequence data having some relation to the sequence data in a sample read, regardless of whether
the raw fragment exhibits a supporting variant that corresponds to and authenticates or confirms a potential variant
in a sample read. The term “raw fragment” does not indicate that the fragment necessarily includes a supporting
variant that validates a variant call in a sample read. For example, when a sample read is determined by a variant
call application to exhibit a first variant, the variant call application may determine that one or more raw fragments
lack a corresponding type of “supporting” variant that may otherwise be expected to occur given the variant in the
sample read.
The terms “mapping”, “aligned,” “alignment,” or “aligning” refer to the process of comparing a read or
tag to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If
the reference sequence contains the read, the read may be mapped to the reference sequence or, in certain
implementations, to a particular location in the reference sequence. In some cases, alignment simply tells whether or
not a read is a member of a particular reference sequence (i.e., whether the read is present or absent in the reference
sequence). For example, the alignment of a read to the reference sequence for human chromosome 13 will tell
whether the read is present in the reference sequence for chromosome 13. A tool that provides this information may
be called a set membership tester. In some cases, an alignment additionally indicates a location in the reference
sequence where the read or tag maps to. For example, if the reference sequence is the whole human genome
sequence, an alignment may indicate that a read is present on chromosome 13, and may further indicate that the read
is on a particular strand and/or site of chromosome 13.
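The two levels of alignment described above, set membership and location, can be sketched with exact string matching. Production aligners tolerate mismatches and use indexed references; this illustration assumes exact matches only, and its function names are hypothetical.

```python
def align_exact(read, reference):
    """Return all 0-based start positions where the read matches exactly."""
    if not read:
        raise ValueError("read must be non-empty")
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions


def is_member(read, reference):
    """Set membership test: is the read present in the reference at all?"""
    return bool(align_exact(read, reference))


hits = align_exact("ACGT", "ACGTACGTTT")
```

An empty result corresponds to a read that is absent from the reference, while a non-empty result additionally reports where the read maps.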
The term “indel” refers to the insertion and/or the deletion of bases in the DNA of an organism. A
micro-indel represents an indel that results in a net change of 1 to 50 nucleotides. In coding regions of the genome,
unless the length of an indel is a multiple of 3, it will produce a frameshift mutation. Indels can be contrasted with
point mutations. An indel inserts or deletes nucleotides from a sequence, while a point mutation is a form of
substitution that replaces one of the nucleotides without changing the overall number in the DNA. Indels can also be
contrasted with a Tandem Base Mutation (TBM), which may be defined as substitution at adjacent nucleotides
(primarily substitutions at two adjacent nucleotides, but substitutions at three adjacent nucleotides have been
observed).
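The multiple-of-3 rule for frameshifts can be checked directly from allele lengths. The sketch below assumes the variant lies in a coding region and that the reference and alternate alleles are given as plain strings; the function name and the returned fields are hypothetical.

```python
def describe_length_change(ref_allele, alt_allele):
    """Classify a variant by net length change and flag frameshifts."""
    net = len(alt_allele) - len(ref_allele)
    if net == 0:
        # Same length: no bases gained or lost, so the reading frame is kept.
        return {"kind": "substitution", "net_change": 0, "frameshift": False}
    return {
        "kind": "insertion" if net > 0 else "deletion",
        "net_change": net,
        # A net change that is not a multiple of 3 shifts the reading frame.
        "frameshift": net % 3 != 0,
    }


two_base_insertion = describe_length_change("A", "ACG")
three_base_deletion = describe_length_change("ACGT", "A")
```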
The term “variant” refers to a nucleic acid sequence that is different from a nucleic acid reference.
Typical nucleic acid sequence variants include, without limitation, single nucleotide polymorphism (SNP), short
deletion and insertion polymorphisms (Indel), copy number variation (CNV), microsatellite markers or short tandem
repeats and structural variation. Somatic variant calling is the effort to identify variants present at low frequency in
the DNA sample. Somatic variant calling is of interest in the context of cancer treatment. Cancer is caused by an
accumulation of mutations in DNA. A DNA sample from a tumor is generally heterogeneous, including some
normal cells, some cells at an early stage of cancer progression (with fewer mutations), and some late-stage cells
(with more mutations). Because of this heterogeneity, when sequencing a tumor (e.g., from an FFPE sample),
somatic mutations will often appear at a low frequency. For example, an SNV might be seen in only 10% of the reads
covering a given base. A variant that is to be classified as somatic or germline by the variant classifier is also
referred to herein as the “variant under test”.
The term “noise” refers to a mistaken variant call resulting from one or more errors in the sequencing
process and/or in the variant call application.
The term “variant frequency” represents the relative frequency of an allele (variant of a gene) at a
particular locus in a population, expressed as a fraction or percentage. For example, the fraction or percentage may
be the fraction of all chromosomes in the population that carry that allele. By way of example, sample variant
frequency represents the relative frequency of an allele/variant at a particular locus/position along a genomic
sequence of interest over a “population” corresponding to the number of reads and/or samples obtained for the
genomic sequence of interest from an individual. As another example, a baseline variant frequency represents the
relative frequency of an allele/variant at a particular locus/position along one or more baseline genomic sequences
where the “population” corresponds to the number of reads and/or samples obtained for the one or more baseline
genomic sequences from a population of normal individuals.
The term “variant allele frequency (VAF)” refers to the percentage of sequenced reads observed
matching the variant divided by the overall coverage at the target position. VAF is a measure of the proportion of
sequenced reads carrying the variant.
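As a worked illustration of the VAF definition, the following computes the proportion of reads carrying the variant out of the coverage at the position. The function name and the read counts are hypothetical.

```python
def variant_allele_frequency(variant_reads, coverage):
    """Return the fraction of reads at a position that carry the variant."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    if not 0 <= variant_reads <= coverage:
        raise ValueError("variant_reads must lie between 0 and coverage")
    return variant_reads / coverage


# Hypothetical position covered by 500 reads, 50 of which carry the variant.
vaf = variant_allele_frequency(50, 500)
vaf_percent = 100.0 * vaf
```

A value near 10% is consistent with the low-frequency somatic example given earlier for heterogeneous tumor samples.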
The terms “position”, “designated position”, and “locus” refer to a location or coordinate of one or
more nucleotides within a sequence of nucleotides. The terms “position”, “designated position”, and “locus” also
refer to a location or coordinate of one or more base pairs in a sequence of nucleotides.
The term “haplotype” refers to a combination of alleles at adjacent sites on a chromosome that are
inherited together. A haplotype may be one locus, several loci, or an entire chromosome depending on the number
of recombination events that have occurred between a given set of loci, if any occurred.
The term “threshold” herein refers to a numeric or non-numeric value that is used as a cutoff to
characterize a sample, a nucleic acid, or portion thereof (e.g., a read). A threshold may be varied based upon
empirical analysis. The threshold may be compared to a measured or calculated value to determine whether the
source giving rise to such value should be classified in a particular manner. Threshold values can be
identified empirically or analytically. The choice of a threshold is dependent on the level of confidence that the user
wishes to have to make the classification. The threshold may be chosen for a particular purpose (e.g., to balance
sensitivity and selectivity). As used herein, the term “threshold” indicates a point at which a course of analysis may
be changed and/or a point at which an action may be triggered. A threshold is not required to be a predetermined
number. Instead, the threshold may be, for instance, a function that is based on a plurality of factors. The threshold
may be adaptive to the circumstances. Moreover, a threshold may indicate an upper limit, a lower limit, or a range
between limits.
In some implementations, a metric or score that is based on sequencing data may be compared to the
threshold. As used herein, the terms “metric” or “score” may include values or results that were determined from the
sequencing data or may include functions that are based on the values or results that were determined from the
sequencing data. Like a threshold, the metric or score may be adaptive to the circumstances. For instance, the metric
or score may be a normalized value. As an example of a score or metric, one or more implementations may use
count scores when analyzing the data. A count score may be based on number of sample reads. The sample reads
may have undergone one or more filtering stages such that the sample reads have at least one common characteristic
or quality. For example, each of the sample reads that are used to determine a count score may have been aligned
with a reference sequence or may be assigned as a potential allele. The number of sample reads having a common
characteristic may be counted to determine a read count. Count scores may be based on the read count. In some
implementations, the count score may be a value that is equal to the read count. In other implementations, the count
score may be based on the read count and other information. For example, a count score may be based on the read
count for a particular allele of a genetic locus and a total number of reads for the genetic locus. In some
implementations, the count score may be based on the read count and previously-obtained data for the genetic locus.
In some implementations, the count scores may be normalized scores between predetermined values. The count
score may also be a function of read counts from other loci of a sample or a function of read counts from other
samples that were concurrently run with the sample-of-interest. For instance, the count score may be a function of
the read count of a particular allele and the read counts of other loci in the sample and/or the read counts from other
samples. As one example, the read counts from other loci and/or the read counts from other samples may be used to
normalize the count score for the particular allele.
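Two of the count-score variants described above can be sketched as follows: one based on the read count for an allele and the total reads at its locus, and one normalized by read counts from other loci. The specific formulas (a simple ratio and division by the mean) are illustrative assumptions, since the text leaves the exact function open.

```python
def count_score(allele_reads, locus_reads):
    """Fraction of the reads at a locus that support a particular allele."""
    if locus_reads <= 0:
        raise ValueError("locus_reads must be positive")
    return allele_reads / locus_reads


def normalized_count_score(allele_reads, other_loci_reads):
    """Scale an allele's read count by the mean read count of other loci."""
    if not other_loci_reads:
        raise ValueError("at least one other locus is required")
    mean_other = sum(other_loci_reads) / len(other_loci_reads)
    if mean_other <= 0:
        raise ValueError("mean read count of other loci must be positive")
    return allele_reads / mean_other


# Hypothetical counts: 30 reads support the allele out of 120 at the locus.
score = count_score(30, 120)
norm = normalized_count_score(30, [100, 150, 50])
```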
The terms “coverage” or “fragment coverage” refer to a count or other measure of a number of sample
reads for the same fragment of a sequence. A read count may represent a count of the number of reads that cover a
corresponding fragment. Alternatively, the coverage may be determined by multiplying the read count by a
designated factor that is based on historical knowledge, knowledge of the sample, knowledge of the locus, etc.
The term “read depth” (conventionally a number followed by “×”) refers to the number of sequenced
reads with overlapping alignment at the target position. This is often expressed as an average or percentage
exceeding a cutoff over a set of intervals (such as exons, genes, or panels). For example, a clinical report might say
that a panel average coverage is 1,105× with 98% of targeted bases covered >100×.
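The two summary statistics in the clinical report example, average depth and percentage of bases above a cutoff, can be computed from per-base depths as in this sketch. The function name and the depth values are hypothetical.

```python
def depth_summary(depths, cutoff):
    """Return (mean depth, percent of bases with depth above the cutoff)."""
    if not depths:
        raise ValueError("depths must be non-empty")
    mean_depth = sum(depths) / len(depths)
    pct_above = 100.0 * sum(d > cutoff for d in depths) / len(depths)
    return mean_depth, pct_above


# Hypothetical per-base read depths over four targeted positions.
mean_depth, pct_above = depth_summary([120, 80, 200, 150], cutoff=100)
```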
The terms “base call quality score” or “Q score” refer to a PHRED-scaled probability ranging from 0-
inversely proportional to the probability that a single sequenced base is incorrect. For example, a T base call with Q
of 20 is considered likely correct with a confidence P-value of 0.01. Any base call with Q<20 should be considered
low quality, and any variant identified where a substantial proportion of sequenced reads supporting the variant are
of low quality should be considered potentially false positive.
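The standard PHRED relationship, Q = -10 log10(P) where P is the probability that the base call is wrong, reproduces the example above: Q of 20 corresponds to P of 0.01. The helper names below are hypothetical, and the default of 20 simply mirrors the low-quality cutoff stated in the text.

```python
import math


def phred_to_error_prob(q):
    """Convert a PHRED quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a PHRED quality score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def low_quality_fraction(quals, min_q=20):
    """Fraction of supporting base calls whose quality is below min_q."""
    if not quals:
        raise ValueError("quals must be non-empty")
    return sum(q < min_q for q in quals) / len(quals)


p_q20 = phred_to_error_prob(20)
frac_low = low_quality_fraction([35, 12, 18, 30])
```

A variant whose supporting reads show a high `frac_low` would be a candidate false positive under the guidance above.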
The terms “variant reads” or “variant read number” refer to the number of sequenced reads supporting
the presence of the variant.
Environment
We describe a system and various implementations for variant classification using a so-called
Sojourner variant classifier. The system and processes are described with reference to the figure. Because the figure is an
architectural diagram, certain details are intentionally omitted to improve the clarity of the description. The
discussion of the figure is organized as follows. First, the modules of the figure are introduced, followed by their
interconnections. Then, the use of the modules is described in greater detail.
The figure illustrates an environment 100 in which the variant classifier 104 operates according to one
implementation. The environment 100 includes the following processing engines: variant classifier 104,
concatenator 112, and metadata correlator 116. The environment 100 also includes the following databases:
sified variants 124, input sequences 102, metadata features 126, and feature sequences 122.
The processing engines and databases of the figure, designated as modules, can be implemented in
hardware or software, and need not be divided up in precisely the same blocks as shown in the figure. Some of the
modules can also be implemented on different processors, computers, or servers, or spread among a number of
different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be
combined, operated in parallel or in a different sequence than that shown in the figure without affecting the functions
achieved. The modules in the figure can also be thought of as flowchart steps in a method. A module also need not
necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other
parts of the code with code from other modules or other functions disposed in between.
The interconnections of the modules of environment 100 are now described. The network(s) 114
couples the processing engines and the databases, all in communication with each other (indicated by solid double-arrowed
lines). The actual communication path can be point-to-point over public and/or private networks. The
communications can occur over a variety of networks, e.g., private networks, VPN, MPLS circuit, or Internet, and
can use appropriate application programming interfaces (APIs) and data interchange formats, e.g., Representational
State Transfer (REST), JavaScript Object Notation (JSON), Extensible Markup Language (XML), Simple Object
Access Protocol (SOAP), Java Message Service (JMS), and/or Java Platform Module System. All of the
communications can be encrypted. The communication is generally over a network such as a LAN (local area
network), WAN (wide area network), telephone network (Public Switched Telephone Network (PSTN)), Session
Initiation Protocol (SIP), wireless network, point-to-point network, star network, token ring network, hub network,
Internet, inclusive of the mobile Internet, via protocols such as EDGE, 3G, 4G LTE, Wi-Fi, and WiMAX.
Additionally, a variety of authorization and authentication techniques, such as username/password, Open
Authorization (OAuth), Kerberos, SecureID, digital certificates and more, can be used to secure the
communications.
Sequencing Process
Implementations set forth herein may be applicable to analyzing nucleic acid sequences to identify
sequence variations. Implementations may be used to analyze potential variants/alleles of a genetic position/locus
and determine a genotype of the genetic locus or, in other words, provide a genotype call for the locus. By way of
example, nucleic acid sequences may be analyzed in accordance with the methods and systems described in US
Patent Application Publication No. 2016/0085910 and US Patent Application Publication No. 2013/0296175, the
complete subject matter of which are expressly incorporated by reference herein in their entirety.
In one implementation, a sequencing process includes receiving a sample that includes or is suspected
of including nucleic acids, such as DNA. The sample may be from a known or unknown source, such as an animal
(e.g., human), plant, bacteria, or fungus. The sample may be taken directly from the source. For instance, blood or
saliva may be taken directly from an individual. Alternatively, the sample may not be obtained directly from the
source. Then, one or more processors direct the system to prepare the sample for sequencing. The preparation may
include removing extraneous material and/or isolating certain material (e.g., DNA). The biological sample may be
prepared to include features for a particular assay. For example, the biological sample may be prepared for
sequencing-by-synthesis (SBS). In certain implementations, the preparing may include amplification of certain
regions of a genome. For instance, the preparing may include amplifying predetermined genetic loci that are known
to include STRs and/or SNPs. The genetic loci may be amplified using predetermined primer sequences.
Next, the one or more processors direct the system to sequence the sample. The sequencing may be
performed through a variety of known sequencing protocols. In particular implementations, the sequencing includes
SBS. In SBS, a plurality of fluorescently-labeled nucleotides are used to sequence a plurality of clusters of amplified
DNA (possibly millions of clusters) present on the surface of an optical substrate (e.g., a surface that at least
partially defines a channel in a flow cell). The flow cells may contain nucleic acid samples for sequencing where the
flow cells are placed within the appropriate flow cell holders.
The nucleic acids can be prepared such that they comprise a known primer sequence that is adjacent to
an unknown target sequence. To initiate the first SBS sequencing cycle, one or more differently labeled nucleotides,
and DNA polymerase, etc., can be flowed into/through the flow cell by a fluid flow subsystem. Either a single type
of nucleotide can be added at a time, or the nucleotides used in the sequencing procedure can be specially designed
to possess a reversible termination property, thus allowing each cycle of the sequencing reaction to occur
simultaneously in the presence of several types of labeled nucleotides (e.g., A, C, T, G). The nucleotides can include
detectable label moieties such as fluorophores. Where the four nucleotides are mixed together, the polymerase is
able to select the correct base to incorporate and each sequence is extended by a single base. Non-incorporated
nucleotides can be washed away by flowing a wash solution through the flow cell. One or more lasers may excite
the nucleic acids and induce fluorescence. The fluorescence emitted from the nucleic acids is based upon the
fluorophores of the incorporated base, and different fluorophores may emit different wavelengths of emission light.
A deblocking reagent can be added to the flow cell to remove reversible terminator groups from the DNA strands
that were extended and detected. The deblocking reagent can then be washed away by flowing a wash solution
through the flow cell. The flow cell is then ready for a further cycle of sequencing starting with introduction of a
labeled nucleotide as set forth above. The fluidic and detection operations can be repeated several times to complete
a sequencing run. Example sequencing methods are described, for example, in Bentley et al., Nature 456:53-59
(2008), International Publication No. WO 04/018497; U.S. Pat. No. 7,057,026; International Publication No. WO
91/06678; International Publication No. WO 07/123744; U.S. Pat. No. 7,329,492; U.S. Patent No. 7,211,414; U.S.
Patent No. 019; U.S. Patent No. 7,405,281, and U.S. Patent Application Publication No. 2008/0108082, each
of which is incorporated herein by reference.
In some implementations, nucleic acids can be attached to a surface and amplified prior to or during
sequencing. For example, amplification can be carried out using bridge amplification to form nucleic acid clusters
on a surface. Useful bridge amplification methods are described, for example, in U.S. Patent No. 5,641,658; U.S.
Patent Application Publication No. 2002/0055100; U.S. Patent No. 7,115,400; U.S. Patent Application Publication
No. 2004/0096853; U.S. Patent Application Publication No. 2004/0002090; U.S. Patent Application Publication No.
2007/0128624; and U.S. Patent Application Publication No. 2008/0009420, each of which is incorporated herein by
reference in its entirety. Another useful method for amplifying nucleic acids on a surface is rolling circle
amplification (RCA), for example, as described in Lizardi et al., Nat. Genet. 19:225-232 (1998) and U.S. Patent
Application Publication No. 2007/0099208 A1, each of which is incorporated herein by reference.
One example SBS protocol exploits modified nucleotides having removable 3’ blocks, for example, as
described in International Publication No. WO 04/018497, U.S. Patent Application Publication No.
2007/0166705A1, and U.S. Patent No. 7,057,026, each of which is incorporated herein by reference. For example,
repeated cycles of SBS reagents can be delivered to a flow cell having target nucleic acids attached thereto, for
example, as a result of the bridge amplification protocol. The nucleic acid clusters can be converted to single
stranded form using a linearization solution. The linearization solution can contain, for example, a restriction
endonuclease capable of cleaving one strand of each cluster. Other methods of cleavage can be used as an
alternative to restriction enzymes or nicking enzymes, including inter alia chemical cleavage (e.g., cleavage of a diol
linkage with periodate), cleavage of abasic sites by cleavage with endonuclease (for example ‘USER’, as supplied
by NEB, Ipswich, Mass., USA, part number M5505S), by exposure to heat or alkali, cleavage of ribonucleotides
incorporated into amplification products otherwise comprised of deoxyribonucleotides, photochemical cleavage or
cleavage of a peptide linker. After the linearization operation a sequencing primer can be delivered to the flow cell
under conditions for hybridization of the sequencing primer to the target nucleic acids that are to be sequenced.
A flow cell can then be contacted with an SBS extension reagent having modified nucleotides with
removable 3’ blocks and fluorescent labels under conditions to extend a primer hybridized to each target nucleic
acid by a single nucleotide addition. Only a single nucleotide is added to each primer because once the modified
nucleotide has been incorporated into the growing polynucleotide chain complementary to the region of the template
being sequenced there is no free 3’-OH group available to direct further sequence extension and therefore the
polymerase cannot add further nucleotides. The SBS extension reagent can be removed and replaced with scan
reagent containing components that protect the sample under excitation with radiation. Example components for
scan reagent are described in U.S. Patent Application Publication No. 2008/0280773 A1 and U.S. Patent Application
No. 13/018,255, each of which is incorporated herein by reference. The extended nucleic acids can then be
fluorescently detected in the presence of scan reagent. Once the fluorescence has been detected, the 3’ block may be
removed using a deblock reagent that is appropriate to the blocking group used. Example deblock reagents that are
useful for respective blocking groups are described in WO004018497, US 2007/0166705A1 and U.S. Patent No.
7,057,026, each of which is incorporated herein by reference. The deblock reagent can be washed away leaving
target nucleic acids hybridized to extended primers having 3’-OH groups that are now competent for addition of a
further nucleotide. Accordingly the cycles of adding extension reagent, scan reagent, and deblock reagent, with
optional washes between one or more of the operations, can be repeated until a desired sequence is obtained. The
above cycles can be carried out using a single extension reagent delivery operation per cycle when each of the
modified nucleotides has a different label attached thereto, known to correspond to the particular base. The different
labels facilitate discrimination between the nucleotides added during each incorporation operation. Alternatively,
each cycle can include separate operations of extension reagent delivery followed by separate operations of scan
reagent delivery and detection, in which case two or more of the nucleotides can have the same label and can be
distinguished based on the known order of delivery.
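The extend, scan, and deblock cycle described above can be mimicked with a toy simulation in which each cycle adds exactly one labeled, complementary nucleotide and the label is decoded into a base call. The label names, the function name, and the absence of any error model are simplifying assumptions; strand orientation is also ignored.

```python
# Base pairing used to extend the growing strand against the template.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical one-label-per-base scheme (the first option described above).
LABELS = {"A": "label_1", "C": "label_2", "G": "label_3", "T": "label_4"}
BASE_FOR_LABEL = {label: base for base, label in LABELS.items()}


def sequence_by_synthesis(template, cycles):
    """Simulate SBS cycles and return the bases called, one per cycle."""
    calls = []
    for position in range(min(cycles, len(template))):
        # Extension: the 3' block limits incorporation to one nucleotide.
        incorporated = COMPLEMENT[template[position]]
        # Scan: detect the label and decode it into a base call.
        signal = LABELS[incorporated]
        calls.append(BASE_FOR_LABEL[signal])
        # Deblock: the 3' block is removed, so the next cycle can extend.
    return "".join(calls)


read = sequence_by_synthesis("ACGT", cycles=3)
```

Because only one base is incorporated per cycle, the read length equals the number of cycles run, up to the template length.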
Although the sequencing operation has been discussed above with respect to a particular SBS protocol,
it will be understood that other protocols for sequencing, or any of a variety of other molecular analyses, can be carried
out as desired.
Then, the one or more processors of the system e the sequencing data for subsequent analysis.
The sequencing data may be formatted in various manners, such as in a .BAM file. The sequencing data may
include, for example, a number of sample reads. The sequencing data may e a plurality of sample reads that
have corresponding sample sequences of the nucleotides. Although only one sample read is discussed, it should be
understood that the sequencing data may include, for example, hundreds, thousands, hundreds of thousands, or
millions of sample reads. Different sample reads may have different numbers of nucleotides. For example, a sample
read may range from 10 nucleotides to about 500 nucleotides or more. The sample reads may span the entire
genome of the source(s). As one example, the sample reads are directed toward predetermined genetic loci, such as
those genetic loci having suspected STRs or suspected SNPs.
Each sample read may include a sequence of nucleotides, which may be referred to as a sample
sequence, sample fragment or a target sequence. The sample sequence may include, for example, primer sequences,
flanking sequences, and a target sequence. The number of nucleotides within the sample sequence may include 30,
40, 50, 60, 70, 80, 90, 100 or more. In some implementations, one or more of the sample reads (or sample sequences)
includes at least 150 nucleotides, 200 nucleotides, 300 nucleotides, 400 nucleotides, 500 nucleotides, or more. In
some implementations, the sample reads may include more than 1000 nucleotides, 2000 nucleotides, or more. The
sample reads (or the sample sequences) may include primer sequences at one or both ends.
Next, the one or more processors analyze the sequencing data to obtain potential variant call(s) and a
sample variant frequency of the sample variant call(s). The operation may also be referred to as a variant call
application or variant caller. Thus, the variant caller identifies or detects variants and the variant classifier classifies
the detected variants as somatic or germline. Alternative variant callers may be utilized in accordance with
implementations herein, wherein different variant callers may be used based on the type of sequencing operation
being performed, based on features of the sample that are of interest and the like. One non-limiting example of a
variant call application is the Pisces™ application by Illumina Inc. (San Diego, CA) hosted at
https://github.com/Illumina/Pisces and described in the article Dunn, Tamsen & Berry, Gwenn & Emig-Agius,
Dorothea & Jiang, Yu & Iyer, Anita & Udar, Nitin & Strömberg, Michael. (2017). Pisces: An Accurate and
Versatile Single Sample Somatic and Germline Variant Caller. 595-595. 10.1145/3107411.3108203, the complete
subject matter of which is expressly incorporated herein by reference in its entirety.
Such a variant call application can comprise four sequentially executed modules:
(1) Pisces Read Stitcher: Reduces noise by stitching paired reads in a BAM (read one and read two of
the same molecule) into consensus reads. The output is a stitched BAM.
(2) Pisces Variant Caller: Calls small SNVs, insertions and deletions. Pisces includes a variant-collapsing
algorithm to coalesce variants broken up by read boundaries, basic filtering algorithms, and a simple
Poisson-based variant confidence-scoring algorithm. The output is a VCF.
(3) Pisces Variant Quality Recalibrator (VQR): In the event that the variant calls overwhelmingly
follow a pattern associated with thermal damage or FFPE deamination, the VQR step will downgrade the variant Q
score of the suspect variant calls. The output is an adjusted VCF.
(4) Pisces Variant Phaser (Scylla): Uses a read-backed greedy clustering method to assemble small
variants into complex alleles from clonal subpopulations. This allows for the more accurate determination of
functional consequence by downstream tools. The output is an adjusted VCF.
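By way of illustration only, the Poisson-based confidence scoring named in module (2) can be sketched as follows. This is not the Pisces implementation. The function name, the assumed per-base error rate, and the cap on the Q score are hypothetical choices made for the example, which simply asks how unlikely the observed number of supporting reads would be if they arose from sequencing noise alone.

```python
import math

def poisson_variant_qscore(alt_count, depth, error_rate=0.01, max_q=100):
    """Phred-scaled confidence that alt_count supporting reads are not noise.

    Assumes errors follow a Poisson distribution with mean depth * error_rate
    (an illustrative noise model, not the one used by Pisces).
    """
    if alt_count <= 0:
        return 0
    lam = depth * error_rate
    # Accumulate P(X < alt_count) for X ~ Poisson(lam), term by term.
    term = math.exp(-lam)
    cdf = term
    for k in range(1, alt_count):
        term *= lam / k
        cdf += term
    p_noise = max(1.0 - cdf, 0.0)  # P(X >= alt_count)
    if p_noise <= 0.0:
        return max_q
    return min(max_q, int(round(-10.0 * math.log10(p_noise))))
```

Under these assumptions, a call supported by ten reads at depth 100 scores far higher than a call supported by a single read, which is the behavior a confidence filter relies on.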
Additionally or alternatively, the operation may utilize the variant call application Strelka™
application by Illumina Inc. hosted at https://github.com/Illumina/strelka and described in the article T Saunders,
Christopher & Wong, Wendy & Swamy, Sajani & Becq, Jennifer & J Murray, Lisa & Cheetham, Keira. (2012).
Strelka: Accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics (Oxford,
England). 28. 1811-7. 10.1093/bioinformatics/bts271, the complete subject matter of which is expressly
incorporated herein by reference in its entirety. Furthermore, additionally or alternatively, the operation may utilize
the variant call application Strelka2™ application by Illumina Inc. hosted at https://github.com/Illumina/strelka and
described in the article Kim, S., Scheffler, K., Halpern, A.L., Bekritsky, M.A., Noh, E., Källberg, M., Chen, X.,
Beyter, D., Krusche, P., and Saunders, C.T. (2017). Strelka2: Fast and accurate variant calling for clinical
sequencing applications, the complete subject matter of which is expressly incorporated herein by reference in its
entirety. Moreover, additionally or alternatively, the operation may utilize a variant annotation/call tool, such as the
Nirvana™ application by Illumina Inc. hosted at https://github.com/Illumina/Nirvana/wiki and described in the
article Stromberg, Michael & Roy, Rajat & Lajugie, Julien & Jiang, Yu & Li, Haochen & Margulies, Elliott. (2017).
Nirvana: Clinical Grade Variant Annotator. 596-596. 10.1145/3107411.3108204, the complete subject matter of
which is expressly incorporated herein by reference in its entirety.
Such a variant annotation/call tool can apply different algorithmic techniques such as those disclosed
in Nirvana:
a. Identifying all overlapping transcripts with Interval Array: For functional annotation, we can
identify all transcripts overlapping a variant and an interval tree can be used. However, since a set of intervals can be
static, we were able to further optimize it to an Interval Array. An interval tree returns all overlapping transcripts in
O(min(n,k lg n)) time, where n is the number of intervals in the tree and k is the number of overlapping intervals. In
practice, since k is really small compared to n for most variants, the effective runtime on interval tree would be O(k
lg n). We improved to O(lg n + k) by creating an interval array where all intervals are stored in a sorted array so
that we only need to find the first overlapping interval and then enumerate through the remaining (k-1).
b. CNVs/SVs (Yu): annotations for Copy Number Variation and Structural Variants can be provided.
Similar to the annotation of small variants, transcripts overlapping with the SV and also previously reported
structural variants can be annotated in online databases. Unlike the small variants, not all overlapping transcripts
need be annotated, since too many transcripts will be overlapped with a large SV. Instead, all overlapping
transcripts can be annotated that belong to a partial overlapping gene. Specifically, for these transcripts, the
impacted introns, exons and the consequences caused by the structural variants can be reported. An option to allow
output all overlapping transcripts is available, but the basic information for these transcripts can be reported, such as
gene symbol, flag whether it is canonical overlap or partial overlapped with the transcripts. For each SV/CNV, it is
also of interest to know if these variants have been studied and their frequencies in different populations. Hence, we
reported overlapping SVs in external databases, such as 1000 genomes, DGV and ClinGen. To avoid using an
arbitrary cutoff to determine which SV is overlapped, instead all overlapping transcripts can be used and the
reciprocal overlap can be calculated, i.e. the overlapping length divided by the minimum of the length of these two
SVs.
c. Reporting supplementary annotations: Supplementary annotations are of two types: small and
structural variants (SVs). SVs can be modeled as intervals and use the interval array discussed above to identify
overlapping SVs. Small variants are modeled as points and matched by position and (optionally) allele. As such,
they are searched using a binary-search-like algorithm. Since the supplementary annotation database can be quite
large, a much smaller index is created to map chromosome positions to file locations where the supplementary
annotation resides. The index is a sorted array of objects (made up of chromosome position and file location) that
can be binary searched using position. To keep the index size small, multiple positions (up to a certain max count)
are compressed to one object that stores the values for the first position and only deltas for subsequent positions.
Since we use Binary search, the runtime is O(lg n) , where n is the number of items in the database.
d. VEP cache files
e. Transcript Database : The Transcript Cache (cache) and Supplementary database (SAdb) files are
serialized dump of data objects such as transcripts and supplementary annotations. We use Ensembl VEP cache as
our data source for cache. To create the cache, all transcripts are inserted in an interval array and the final state of
the array is stored in the cache files. Thus, during annotation, we only need to load a pre-computed interval array
and perform searches on it. Since the cache is loaded up in memory and searching is very fast (described above),
finding overlapping transcripts is extremely quick in Nirvana (profiled to less than 1% of total runtime?).
f. Supplementary Database : The data sources for SAdb are listed under supplementary material. The
SAdb for small variants is produced by a k -way merge of all data sources such that each object in the database
(identified by reference name and position) holds all relevant supplementary annotations. Issues encountered during
parsing data source files have been documented in detail in Nirvana’s home page. To limit memory usage, only the
SA index is loaded up in memory. This index allows a quick lookup of the file location for a supplementary
annotation. However, since the data has to be fetched from disk, adding supplementary annotation has been
identified as Nirvana’s largest bottleneck (profiled at ~30% of total runtime.)
g. Consequence and Sequence Ontology : Nirvana’s functional annotation (when provided) follows the
Sequence Ontology (SO) (http://www.sequenceontology.org/ ) guidelines. On occasions, we had the opportunity to
identify issues in the current SO and collaborate with the SO team to improve the state of annotation.
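The interval array of item (a) and the reciprocal overlap of item (b) can be sketched as follows. This is a minimal illustration and not Nirvana's code: the class and function names are hypothetical, intervals are assumed to be closed [start, end] ranges, and the reciprocal overlap follows the definition given above (overlapping length divided by the shorter of the two lengths). The lookup binary-searches a running maximum of end coordinates to find the first candidate and then scans forward. The scan filters out non-overlapping entries, so the O(lg n + k) bound holds only when few such entries are interleaved.

```python
from bisect import bisect_right

class IntervalArray:
    """Static set of closed intervals [start, end] held in a sorted array."""

    def __init__(self, intervals):
        # intervals: iterable of (start, end, label) tuples
        self.items = sorted(intervals, key=lambda iv: iv[0])
        self.starts = [iv[0] for iv in self.items]
        # Running maximum of end coordinates; non-decreasing, so searchable.
        self.max_end = []
        running = float("-inf")
        for _, end, _ in self.items:
            running = max(running, end)
            self.max_end.append(running)

    def overlapping(self, q_start, q_end):
        # Binary search for the first index whose running max end reaches q_start.
        lo, hi = 0, len(self.items)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.max_end[mid] >= q_start:
                hi = mid
            else:
                lo = mid + 1
        # Intervals that start after q_end cannot overlap the query.
        stop = bisect_right(self.starts, q_end)
        return [iv for iv in self.items[lo:stop] if iv[1] >= q_start]


def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Overlapping length divided by the length of the shorter interval."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    return overlap / min(a_end - a_start + 1, b_end - b_start + 1)
```

Because the transcript set is static, the sorted array and running maxima can be built once and serialized, which matches the pre-computed cache described in item (e).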
Such a variant annotation tool can include pre-processing. For example, Nirvana included a large
number of annotations from External data sources, like ExAC, EVS, 1000 Genomes project, dbSNP, ClinVar,
Cosmic, DGV and ClinGen. To make full use of these databases, we have to sanitize the information from them. We
implemented different strategies to deal with different conflicts that exist from different data sources. For example, in
case of multiple dbSNP entries for the same position and alternate allele, we join all ids into a comma separated list
of ids; if there are multiple entries with different CAF values for the same allele, we use the first CAF value. For
conflicting ExAC and EVS entries, we consider the number of sample counts and the entry with higher sample
count is used. In 1000 Genome Projects, we removed the allele frequency of the conflicting allele. Another issue is
inaccurate information. We mainly extracted the allele frequencies information from 1000 Genome Projects,
however, we noticed that for GRCh38, the allele frequency reported in the info field did not exclude samples with
genotype not available, leading to deflated frequencies for variants which are not available for all samples. To
guarantee the accuracy of our annotation, we use all of the individual level genotype to compute the true allele
frequencies. As we know, the same variants can have different representations based on different alignments. To
make sure we can accurately report the information for already identified variants, we have to preprocess the
variants from different resources to make them have consistent representation. For all external data sources, we
trimmed alleles to remove duplicated nucleotides in both reference allele and alternative allele. For ClinVar, we
directly parsed the xml file and performed a five-prime alignment for all variants, which is often used in vcf file.
Different databases can contain the same set of information. To avoid unnecessary duplicates, we removed some
duplicated information. For example, we removed variants in DGV which have data source as 1000 genome projects,
since we already reported these variants in 1000 genomes with more detailed information.
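Two of the pre-processing steps above lend themselves to a short sketch: trimming nucleotides shared by the reference and alternative alleles, and recomputing an allele frequency from individual-level genotypes while excluding samples whose genotype is not available. The function names and the genotype representation (a pair of allele indices per sample, with None for a missing call) are assumptions made for this example and are not taken from Nirvana.

```python
def trim_alleles(pos, ref, alt):
    """Strip nucleotides shared by ref and alt; return (pos, ref, alt)."""
    # Trim the shared suffix first, then the shared prefix.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # each trimmed leading base shifts the variant start
    return pos, ref, alt


def allele_frequency(genotypes, allele=1):
    """Frequency of `allele` over called alleles only.

    genotypes: list of (a, b) allele-index pairs; None marks a missing call.
    Returns None when no genotype was called at all.
    """
    called = [a for gt in genotypes for a in gt if a is not None]
    if not called:
        return None
    return sum(1 for a in called if a == allele) / len(called)
```

Excluding missing genotypes from the denominator is what avoids the deflated frequencies noted above for GRCh38.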
In accordance with at least some implementations, the variant call application provides calls for low
frequency variants, germline calling and the like. As a non-limiting example, the variant call application may run on
tumor-only samples and/or tumor-normal paired samples. The variant call application may search for single
nucleotide variations (SNV), multiple nucleotide variations (MNV), indels and the like. The variant call application
identifies variants, while filtering for mismatches due to sequencing or sample preparation errors. For each variant,
the variant caller identifies the reference sequence, a position of the variant, and the potential variant sequence(s)
(e.g., A to C SNV, or AG to A deletion). The variant call application identifies the sample sequence (or sample
fragment), a reference sequence/fragment, and a variant call as an indication that a variant is present. The variant
call application may identify raw fragments, and output a designation of the raw fragments, a count of the number of
raw fragments that verify the potential variant call, the position within the raw fragment at which a supporting
variant occurred and other relevant information. Non-limiting examples of raw fragments include a duplex stitched
fragment, a simplex stitched fragment, a duplex un-stitched fragment and a simplex un-stitched fragment.
The variant call application may output the calls in various formats, such as in a .VCF or .GVCF file.
By way of example only, the variant call application may be included in a MiSeqReporter pipeline (e.g., when
implemented on the MiSeq® sequencer instrument). Optionally, the application may be implemented with various
workflows. The analysis may include a single protocol or a combination of protocols that analyze the sample reads
in a designated manner to obtain desired information.
Then, the one or more processors perform a validation operation in connection with the potential
variant call. The validation operation may be based on a quality score, and/or a hierarchy of tiered tests, as explained
hereafter. When the validation operation authenticates or verifies the potential variant call, the validation
operation passes the variant call information (from the variant call application) to the sample report generator.
Alternatively, when the validation operation invalidates or disqualifies the potential variant call, the validation
operation passes a corresponding indication (e.g., a negative indicator, a no call indicator, an in-valid call indicator)
to the sample report generator. The validation operation also may pass a confidence score related to a degree of
confidence that the variant call is correct or the in-valid call designation is correct.
Next, the one or more processors generate and store a sample report. The sample report may include,
for example, information regarding a plurality of genetic loci with respect to the sample. For example, for each
genetic locus of a predetermined set of genetic loci, the sample report may at least one of provide a genotype call;
indicate that a genotype call cannot be made; provide a confidence score on a certainty of the genotype call; or
indicate potential problems with an assay regarding one or more genetic loci. The sample report may also indicate a
gender of an individual that provided a sample and/or indicate that the sample includes multiple sources. As used
herein, a “sample report” may include digital data (e.g., a data file) of a genetic locus or predetermined set of genetic
loci and/or a printed report of the genetic locus or the set of genetic loci. Thus, generating or providing may
include creating a data file and/or printing the sample report, or displaying the sample report.
The sample report may indicate that a variant call was determined, but was not validated. When a
variant call is determined invalid, the sample report may indicate additional information regarding the basis for the
determination to not validate the variant call. For example, the additional information in the report may include a
description of the raw fragments and an extent (e.g., a count) to which the raw fragments support or contradicted the
variant call. Additionally or alternatively, the additional information in the report may include the quality score
obtained in accordance with implementations described herein.
Variant Call Application
Implementations disclosed herein include analyzing sequencing data to identify potential variant calls.
Variant calling may be performed upon stored data for a previously performed sequencing operation. Additionally
or alternatively, it may be performed in real time while a sequencing operation is being performed. Each of the
sample reads is assigned to corresponding genetic loci. The sample reads may be assigned to corresponding genetic
loci based on the sequence of the nucleotides of the sample read or, in other words, the order of nucleotides within
the sample read (e.g., A, C, G, T). Based on this analysis, the sample read may be designated as including a possible
variant/allele of a particular genetic locus. The sample read may be collected (or aggregated or binned) with other
sample reads that have been designated as including possible variants/alleles of the genetic locus. The assigning
operation may also be referred to as a calling operation in which the sample read is identified as being possibly
associated with a particular genetic position/locus. The sample reads may be analyzed to locate one or more
identifying sequences (e.g., primer sequences) of nucleotides that differentiate the sample read from other sample
reads. More specifically, the identifying sequence(s) may identify the sample read from other sample reads as being
associated with a particular genetic locus.
The assigning operation may include analyzing the series of n nucleotides of the identifying sequence
to determine if the series of n nucleotides of the identifying sequence effectively matches with one or more of the
select sequences. In particular implementations, the assigning operation may include analyzing the first n
nucleotides of the sample sequence to determine if the first n nucleotides of the sample sequence effectively matches
with one or more of the select sequences. The number n may have a variety of values, which may be programmed
into the protocol or entered by a user. For example, the number n may be defined as the number of nucleotides of the
shortest select sequence within the database. The number n may be a predetermined number. The predetermined
number may be, for example, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30
nucleotides. However, fewer or more nucleotides may be used in other implementations. The number n may also be
selected by an individual, such as a user of the system. The number n may be based on one or more conditions. For
instance, the number n may be defined as the number of nucleotides of the shortest primer sequence within the
database or a designated number, whichever is the smaller number. In some implementations, a minimum value for
n may be used, such as 15, such that any primer sequence that is less than 15 nucleotides may be designated as an
exception.
In some cases, the series of n nucleotides of an identifying sequence may not precisely match the
nucleotides of the select sequence. Nonetheless, the identifying sequence may effectively match the select sequence
if the identifying sequence is nearly identical to the select sequence. For example, the sample read may be called for
a genetic locus if the series of n nucleotides (e.g., the first n nucleotides) of the identifying sequence match a select
sequence with no more than a designated number of mismatches (e.g., 3) and/or a designated number of shifts (e.g.,
2). Rules may be established such that each mismatch or shift may count as a difference between the sample read
and the primer sequence. If the number of differences is less than a designated number, then the sample read may be
called for the corresponding genetic locus (i.e., assigned to the corresponding genetic locus). In some
implementations, a matching score may be determined that is based on the number of differences between the
identifying sequence of the sample read and the select sequence associated with a genetic locus. If the matching
score passes a designated matching threshold, then the genetic locus that corresponds to the select sequence may be
designated as a potential locus for the sample read. In some implementations, subsequent analysis may be performed
to determine whether the sample read is called for the genetic locus.
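The effective-match rule described above can be sketched as follows. This is an illustrative reading of the text and not the claimed method. The function names are hypothetical, a shift is modeled as sliding the comparison window along the read by one base, each shifted base and each mismatch counts as one difference, and a read is called for a locus when its differences do not exceed a designated maximum.

```python
def count_differences(read, select, n, max_shifts=2):
    """Fewest differences between the first n bases of `select` and an
    equally long window of `read`, trying shifts of 0..max_shifts bases."""
    target = select[:n]
    best = None
    for shift in range(max_shifts + 1):
        window = read[shift:shift + len(target)]
        if len(window) < len(target):
            break  # read too short to compare at this shift
        mismatches = sum(1 for a, b in zip(window, target) if a != b)
        diffs = mismatches + shift  # each shifted base counts as a difference
        if best is None or diffs < best:
            best = diffs
    return best


def call_loci(read, select_sequences, n, max_differences=3):
    """Return (locus, differences) pairs that effectively match the read,
    best match first. select_sequences maps locus name -> select sequence."""
    hits = []
    for locus, select in select_sequences.items():
        diffs = count_differences(read, select, n)
        if diffs is not None and diffs <= max_differences:
            hits.append((locus, diffs))
    return sorted(hits, key=lambda hit: hit[1])
```

When more than one locus is returned, the read has only been provisionally called, and the further analysis described in the following paragraph would select among the candidates.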
If the sample read effectively matches one of the select sequences in the database (i.e., exactly matches
or nearly matches as described above), then the sample read is assigned or designated to the genetic locus that
correlates to the select sequence. This may be referred to as locus calling or provisional-locus calling, wherein the
sample read is called for the genetic locus that correlates to the select sequence. However, as discussed above, a
sample read may be called for more than one genetic locus. In such implementations, further analysis may be
performed to call or assign the sample read for only one of the potential genetic loci. In some implementations, the
sample read that is compared to the database of reference sequences is the first read from paired-end sequencing.
When performing paired-end sequencing, a second read (representing a raw fragment) is obtained that correlates to
the sample read. After assigning, the subsequent analysis that is performed with the assigned reads may be based on
the type of genetic locus that has been called for the assigned read.
Next, the sample reads are analyzed to identify potential variant calls. Among other things, the results
of the analysis identify the potential variant call, a sample variant frequency, a reference sequence and a position
within the genomic sequence of interest at which the variant occurred. For example, if a genetic locus is known for
including SNPs, then the assigned reads that have been called for the genetic locus may undergo analysis to identify
the SNPs of the assigned reads. If the genetic locus is known for including polymorphic repetitive DNA elements,
then the assigned reads may be analyzed to identify or characterize the polymorphic repetitive DNA elements within
the sample reads. In some implementations, if an assigned read effectively matches with an STR locus and an SNP
locus, a warning or flag may be assigned to the sample read. The sample read may be designated as both an STR
locus and an SNP locus. The analyzing may include aligning the assigned reads in accordance with an alignment
protocol to determine sequences and/or lengths of the assigned reads. The alignment protocol may include the
method described in International Patent Application No. cation No.
filed on March 15, 2013, which is herein incorporated by reference in its entirety.
Then, the one or more processors analyze raw fragments to determine whether supporting variants
exist at corresponding positions within the raw fragments. Various types of raw fragments may be identified. For
example, the variant caller may identify a type of raw fragment that exhibits a variant that validates the original
variant call. For example, the type of raw fragment may represent a duplex stitched fragment, a simplex stitched
fragment, a duplex un-stitched fragment or a simplex un-stitched fragment. Optionally other raw fragments may be
identified instead of or in addition to the foregoing examples. In connection with identifying each type of raw
fragment, the variant caller also identifies the position, within the raw fragment, at which the supporting variant
occurred, as well as a count of the number of raw fragments that exhibited the supporting variant. For example, the
variant caller may output an indication that 10 reads of raw fragments were identified to represent duplex stitched
fragments having a supporting variant at a particular position X. The variant caller may also output an indication that
five reads of raw fragments were identified to represent simplex un-stitched fragments having a supporting variant at
a particular position Y. The variant caller may also output a number of raw fragments that corresponded to reference
sequences and thus did not include a supporting variant that would otherwise provide evidence validating the
potential variant call at the genomic sequence of interest.
Next, a count is maintained of the raw fragments that include supporting variants, as well as the
position at which the supporting variant occurred. Additionally or alternatively, a count may be maintained of the
raw fragments that did not include supporting variants at the position of interest (relative to the position of the
potential variant call in the sample read or sample fragment). Additionally or alternatively, a count may be
maintained of raw fragments that correspond to a reference sequence and do not authenticate or confirm the
potential variant call. The information determined is output to the variant call validation application, including a
count and type of the raw fragments that support the potential variant call, positions of the supporting variants in the
raw fragments, a count of the raw fragments that do not support the potential variant call and the like.
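The bookkeeping described above, namely counts of supporting raw fragments by type and position together with a count of non-supporting fragments, can be sketched as follows. The fragment representation (a dictionary with 'type', 'position' and 'allele' keys) and the function name are hypothetical and serve only to make the example self-contained.

```python
from collections import Counter

def tally_fragments(fragments, variant_allele):
    """Count raw fragments that support or do not support a potential call.

    fragments: iterable of dicts with hypothetical keys
        'type'     e.g. 'duplex_stitched' or 'simplex_unstitched'
        'position' position of the observed base within the raw fragment
        'allele'   base observed at the position of interest
    Returns (supporting counts keyed by (type, position), non-supporting count).
    """
    supporting = Counter()
    non_supporting = 0
    for frag in fragments:
        if frag["allele"] == variant_allele:
            supporting[(frag["type"], frag["position"])] += 1
        else:
            non_supporting += 1
    return supporting, non_supporting
```

The two return values correspond to the information passed on to the variant call validation application.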
When a potential variant call is identified, the process outputs an indication of the potential variant
call, the variant sequence, the variant position and a reference sequence associated therewith. The variant call is
designated to represent a “potential” variant as errors may cause the call process to identify a false variant. In
accordance with implementations herein, the potential variant call is analyzed to reduce and eliminate false variants
or false positives. Additionally or alternatively, the process analyzes one or more raw fragments associated with a
sample read and outputs a corresponding variant call associated with the raw fragments.
Data Structures
Database 124 includes variants that have not yet been classified as somatic or germline. These variants
are detected by the sequencing process and the variant annotation/call applications described above. The DNA
segments, spanning the variants, can be derived from tumor samples or tumor-normal pair samples. The variants can
be single-nucleotide polymorphisms (SNPs), insertions, or deletions. The variants can also be crawled from publicly
available databases such as The Cancer Genome Atlas (TCGA), International Cancer Genome Consortium (ICGC),
database of short genetic variants (dbSNP), Catalog of Somatic Mutations in Cancer (COSMIC), 1000 Genomes
Project (1000Genomes), Exome Aggregation Consortium (ExAC), and Exome Variant Server (EVS). Prior to being
added to the database 124, the variants can be filtered based on criteria such as cancer association, cancer type (e.g.,
lung adenocarcinoma (LUAD)), variant allele frequency (VAF), and coding region (exonic/intronic).
Database 102 includes input sequences that are one-hot encodings of DNA segments containing the
variants. illustrates an example input sequence 200 with a variant at a target position flanked by upstream
(left) and downstream (right) bases. shows the one-hot encoding scheme 300 used to encode the input
sequence. The following is an example of a one-hot encoding scheme (A, G, C, T, N) that is used to encode the
DNA segments: A = (1 0 0 0 0), G = (0 1 0 0 0), C = (0 0 1 0 0), T = (0 0 0 1 0), and N = (0 0 0 0 1). Each input
sequence includes at least one variant, preferably located at the center (target position) of the sequence. An input
sequence can be 21 bases long, with the variant flanked by 10 downstream and upstream bases, or it can also be 41
bases long, with the variant flanked by 20 downstream and upstream bases. It will be appreciated that input
sequences of varying lengths can be constructed. In contrast to being based on naturally occurring DNA, the input
sequences can be simulated by selecting a variant from the database 124 and flanking it with randomly generated
downstream and upstream bases.
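The one-hot encoding scheme and the simulated input sequences described above can be sketched as follows. The encoding table follows the (A, G, C, T, N) order given in the text. The function names, the mapping of any unrecognized character to N, and the use of a seeded random generator are assumptions made for this example.

```python
import random

# Encoding order (A, G, C, T, N) as given in the text.
ONE_HOT = {
    "A": (1, 0, 0, 0, 0),
    "G": (0, 1, 0, 0, 0),
    "C": (0, 0, 1, 0, 0),
    "T": (0, 0, 0, 1, 0),
    "N": (0, 0, 0, 0, 1),
}

def one_hot_encode(sequence):
    """Encode a DNA string as a list of 5-element rows, one row per base.
    Characters outside A/G/C/T/N are treated as N (an assumption)."""
    return [list(ONE_HOT.get(base, ONE_HOT["N"])) for base in sequence.upper()]

def simulate_input_sequence(variant_base, flank=10, rng=None):
    """Place the variant at the center of randomly generated flanking bases."""
    rng = rng or random.Random()
    upstream = "".join(rng.choice("AGCT") for _ in range(flank))
    downstream = "".join(rng.choice("AGCT") for _ in range(flank))
    return upstream + variant_base + downstream
```

With flank=10 the simulated sequence is 21 bases long with the variant at index 10, and with flank=20 it is 41 bases long, matching the two lengths mentioned above.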
Data Correlation Model
shows one implementation of the metadata correlator 116 that correlates each unclassified
variant in the database 124 with respective values of mutation characteristics, read mapping statistics, and
occurrence frequency. In implementation, the metadata correlator 116 includes the Nirvana™ clinical-grade variant
annotation application discussed above along with one or more ethnicity detection applications. The metadata
correlator 116 encodes the correlations in so-called metadata features that are stored in the database 126. Correlation
400 is performed on a variant-by-variant basis and includes identifying attributes of a particular variant in the
databases 402, 412, and 422 and associating/linking/appending the found attributes with or to the variant.
Database 402 includes mutation characteristics of the variant, such as whether the variant is a SNP, an
insertion, or deletion; whether the variant is nonsynonymous or not; what was the base(s) in the reference sequence
that the variant mutated; what is the clinical significance of the variant as determined from clinical tests (e.g.,
clinical effect, drug sensitivity, and histocompatibility); evolutionary conservation of the variant position across
multiple species (e.g., mammals, birds), what is the ethnic makeup of the individual that provided the tumor sample
associated with the variant, and what is the functional impact of the variant on resulting proteins. Database 402
represents one or more publicly available databases and tools such as ClinVar, Polymorphism Phenotyping
(PolyPhen), Sorting Intolerant from Tolerant (SIFT), and phylop. Database 402 can also be populated by data from
the sequencing process and the variant annotation/call applications described above (e.g., from the .BAM file, the
.VCF or .GVCF file, the sample report, and/or the count). For example, whether the variant is a SNP, an insertion, or
deletion and whether the variant is nonsynonymous or not is determined from the .VCF file, according to one
implementation.
Database 412 includes read mapping statistics of the variant, such as variant allele frequency (VAF),
read depth, base call quality score (Q score), variant reads (variant read number), variant quality scores (QUAL),
mapping quality scores, and Fisher strand bias. Database 412 is populated by data from the sequencing process and
the variant annotation/call applications described above (e.g., from the .BAM file, the .VCF or .GVCF file, the
sample report, and/or the count).
Database 422 includes occurrence frequency of the variant, such as allele frequencies of the variant in
sequenced populations, allele frequencies of the variant in ethnic sub-populations stratified from sequenced
populations, frequency of the variant in sequenced cancerous tumors. Database 422 represents one or more publicly
available databases such as database of short genetic variants (dbSNP), 1000 Genomes Project (1000Genomes),
Exome Aggregation Consortium (ExAC), Exome Variant Server (EVS), Genome Aggregation Database (gnomAD),
and Catalog of Somatic Mutations in Cancer (COSMIC). Database 422 can also be populated by data from the
sequencing process and the variant annotation/call applications described above (e.g., from the .BAM file, the .VCF
or .GVCF file, the sample report, and/or the count).
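The variant-by-variant correlation described above can be sketched as a lookup of the same variant key in three attribute sources followed by a merge of whatever is found. The function name, the dictionary-based stand-ins for databases 402, 412 and 422, and the (chromosome, position, reference, alternative) key are hypothetical and are not the interface of the metadata correlator 116.

```python
def correlate_metadata(variant_key, mutation_db, read_stats_db, frequency_db):
    """Collect the attributes found for one variant across three sources.

    Each *_db argument is a dict mapping a variant key to a dict of
    attributes; a variant missing from a source contributes nothing.
    """
    features = {}
    for source in (mutation_db, read_stats_db, frequency_db):
        features.update(source.get(variant_key, {}))
    return features


# Usage with placeholder attribute names.
key = ("chr1", 11205058, "C", "T")
mutation_db = {key: {"type": "snv", "nonsynonymous": False}}
read_stats_db = {key: {"VAF": 1.0}}
frequency_db = {key: {"dbsnp": 0.4525, "exac": 0}}
metadata = correlate_metadata(key, mutation_db, read_stats_db, frequency_db)
```

The merged dictionary plays the role of the metadata features stored for the variant.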
Metadata Samples
The following are two samples of metadata features A to Q produced by the metadata correlator 116.
As discussed above, some of the metadata features are encoded using categorical data such as one-hot or Boolean
values, while others are encoded using continuous data such as percentage and probability values. In
implementations, only a subset of the metadata features are provided as input to the variant caller. For example, in
some implementations, the chromosome feature, the reference sequence feature, and the coordinate position feature
are not included in the metadata features that are provided as input.
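A minimal sketch of the categorical encoding and the feature subsetting follows. The +1.0/-1.0 indicator convention and the key names (alt_A, alt_Other, chr, ref, pos) are taken from the samples in this section. The function names and the treatment of any non-A/C/G/T allele as "Other" are assumptions made for the example.

```python
ALT_BASES = ("A", "C", "G", "T")
EXCLUDED_FEATURES = {"chr", "ref", "pos"}  # omitted in some implementations

def encode_alt_allele(alt):
    """Encode the alternative allele as +1.0 / -1.0 indicator features."""
    features = {f"alt_{base}": (1.0 if alt == base else -1.0) for base in ALT_BASES}
    # Anything that is not a single A/C/G/T base is flagged as Other (assumption).
    features["alt_Other"] = 1.0 if alt not in ALT_BASES else -1.0
    return features

def select_input_features(features):
    """Drop the features that are not provided as input."""
    return {name: value for name, value in features.items()
            if name not in EXCLUDED_FEATURES}
```

For an alternative allele of T, encode_alt_allele reproduces the values listed under feature D of the first sample.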
First sample:
A. Name: chromosome feature
Description: specifies the chromosome on which the DNA segment
spanning the variant occurs.
Type: mutation characteristic
1. chr chr1
B. Name: reference sequence feature
Description: specifies the reference sequence mutated by the
variant.
Type: mutation characteristic
1. ref C
C. Name: coordinate position feature
Description: specifies the coordinate position of the variant on the
chromosome.
Type: mutation characteristic
1. pos 11205058
D. Name: alternative allele feature
Description: specifies at least one base mutated by the variant at the
target position in the reference sequence.
Type: mutation characteristic
1. alt_A -1.0
2. alt_C -1.0
3. alt_G -1.0
4. alt_T 1.0
5. alt_Other -1.0
E. Name: variant allele frequency feature
Description: specifies variant allele frequency (VAF) of the variant.
Type: read mapping statistic
1. VAF 1.0
F. Name: read depth feature
Description: specifies read depth of the variant.
Type: read mapping statistic
1. dp 1.07
G. Name: mutation type feature
Description: specifies whether the variant is a single-nucleotide
variant (SNV), insertion, or deletion.
Type: mutation characteristic
1. type_snv 1.0
2. type_insertion -1.0
3. type_deletion -1.0
H. Name: population frequency feature
Description: specifies allele frequencies of the variant in sequenced
populations such as database of short genetic variants (dbSNP),
1000 Genomes Project (1000Genomes), Exome Aggregation
Consortium (ExAC), and Exome Variant Server (EVS).
Type: occurrence frequency
1. dbsnp 0.4525
2. oneKg 0.547524
3. exac 0
4. evs 0
I. Name: amino acid impact feature
Description: specifies whether the variant is a nonsynonymous
variant that changes a codon so as to produce a new codon which
codes for a different amino acid.
Type: mutation characteristic
1. nonsyn_true -1.0
2. nonsyn_false 1.0
J. Name: evolutionary conservation feature
Description: specifies conservativeness of the variant position
across multiple species, as determined from phylop.
Type: mutation characteristic
1. phylop 0.078
K. Name: evolutionary conservation data availability feature
Description: specifies whether any phylop data is available.
Type: mutation characteristic
1. phylop_NA 1
L. Name: clinical significance feature
Description: specifies the variant’s clinical effect, drug sensitivity,
and histocompatibility as determined from clinical test results
submitted on ClinVar.
Type: mutation characteristic
1. clinvarSig_drug response -1.0
2. clinvarSig_uncertain significance -1.0
3. clinvarSig_likely pathogenic -1.0
4. clinvarSig_pathogenic -1.0
5. clinvarSig_not provided -1.0
6. clinvarSig_nan 1.0
7. clinvarSig_likely benign -1.0
8. clinvarSig_benign -1.0
9. clinvarSig_other -1.0
M. Name: functional impact feature
Description: specifies the variant’s impact on functionality of a
protein resulting from an amino acid substitution caused by the
variant as determined from Polymorphism Phenotyping (PolyPhen).
Type: mutation characteristic
1. polyPhen_benign -1.0
2. polyPhen_possibly damaging -1.0
3. polyPhen_nan 1.0
4. polyPhen_probably damaging -1.0
5. polyPhen_unknown -1.0
N. Name: functional impact feature
Description: specifies the variant’s impact on functionality of a
protein resulting from an amino acid substitution caused by the
variant as determined from Sorting Intolerant from Tolerant (SIFT).
Type: mutation characteristic
1. sift_tolerated -1.0
2. sift_deleterious - low confidence -1.0
3. sift_nan 1.0
4. sift_deleterious -1.0
5. sift_tolerated - low confidence -1.0
O. Name: tumor frequency feature
Description: specifies frequency of the variant in sequenced
cancerous tumors as determined from Catalog of Somatic Mutations
in Cancer (COSMIC) database.
Type: occurrence frequency
1. CNT 2.09217
P. Name: sub-population frequency feature
Description: specifies allele frequencies of the variant in ethnic sub-populations
stratified from sequenced populations as determined
from Genome Aggregation Database (gnomAD) database.
Type: occurrence frequency
1. gnomadExomeAf 0.04
2. gnomadExome_afrAf 0.686792
3. gnomadExome_asmrAf 0.14098000000000002
4. gnomadExome_easAf 00.8134640000000001
5. gnomadExome_finAf 0.7214389999999999
6. gnomadExome_nfeAf 0.7409239999999999
7. gnomadExome_asjAf 0.5827749999999999
8. gnomadExome_sasAf 54
9. gnomadExome_othAf 0.684902
10. gnomadAf 0.5688719999999999
11. gnomad_afrAf 0.15348399999999998
12. gnomad_asmrAf 0
13. gnomad_easAf 0.8003709999999999
14. gnomad_finAf 0.709336
15. gnomad_nfeAf 0.737876
16. gnomad_asjAf 0.55298
17. gnomad_sasAf 0
18. gnomad_othAf 69
Q. Name: ethnicity prediction feature
Description: specifies likelihoods identifying ethnic makeup of the
individual that provided the tumor sample associated with the
variant.
Type: occurrence frequency
1. ethno_P_AFR 4.137788205335579e-49
2. ethno_P_AMR 0.00484825490847577
3. ethno_P_EAS 058155646697e-55
4. ethno_P_EUR 0.9951517345697741
5. ethno_P_SAS 763446561e-08
Second sample:
A. Name: chromosome feature
Description: specifies the chromosome on which the DNA segment spanning the variant occurs.
Type: mutation characteristic
1. chr chr1
B. Name: reference sequence feature
Description: specifies the reference sequence mutated by the variant.
Type: mutation characteristic
1. ref A
C. Name: coordinate position feature
Description: specifies the coordinate position of the variant on the chromosome.
Type: mutation characteristic
1. pos 2488153
D. Name: alternative allele feature
Description: specifies at least one base mutated by the variant at the target position in the reference
sequence.
Type: mutation characteristic
1. alt_A -1.0
2. alt_C -1.0
3. alt_G 1.0
4. alt_T -1.0
5. alt_Other -1.0
E. Name: variant allele frequency feature
Description: specifies variant allele frequency (VAF) of the variant.
Type: read mapping statistic
1. VAF 0.9974
F. Name: read depth feature
Description: specifies read depth of the variant.
Type: read mapping statistic
1. dp 3.82
G. Name: mutation type feature
Description: specifies whether the variant is a single-nucleotide variant (SNV), insertion, or
deletion.
Type: mutation characteristic
1. type_snv 1.0
2. type_insertion -1.0
3. type_deletion -1.0
H. Name: population frequency feature
Description: specifies allele frequencies of the variant in sequenced populations such as database of
short genetic variants (dbSNP), 1000 Genomes Project (1000Genomes), Exome Aggregation
Consortium (ExAC), and Exome Variant Server (EVS).
Type: occurrence frequency
1. dbsnp 0.3852
2. oneKg 0.6148159999999999
3. exac 0
4. evs 0
I. Name: amino acid impact feature
Description: specifies whether the variant is a nonsynonymous variant that changes a codon so as
to produce a new codon which codes for a different amino acid.
Type: mutation characteristic
1. nonsyn_true 1.0
2. nonsyn_false -1.0
J. Name: evolutionary conservation feature
Description: specifies conservativeness of the variant position across multiple species, as
determined from phylop.
Type: mutation characteristic
1. phylop -0.17600000000000002
K. Name: evolutionary conservation data availability feature
Description: specifies whether any phylop data is available.
Type: mutation characteristic
1. phylop_NA 1
L. Name: clinical significance feature
Description: specifies the variant’s clinical effect, drug sensitivity, and histocompatibility as
determined from clinical test results submitted on ClinVar.
Type: mutation characteristic
1. clinvarSig_drug response -1.0
2. clinvarSig_uncertain significance -1.0
3. clinvarSig_likely pathogenic -1.0
4. clinvarSig_pathogenic -1.0
5. clinvarSig_not provided -1.0
6. clinvarSig_nan 1.0
7. clinvarSig_likely benign -1.0
8. clinvarSig_benign -1.0
9. clinvarSig_other -1.0
M. Name: functional impact feature
Description: specifies the variant’s impact on functionality of a protein resulting from an amino
acid substitution caused by the variant as determined from Polymorphism Phenotyping (PolyPhen).
Type: mutation characteristic
1. polyPhen_benign 1.0
2. polyPhen_possibly damaging -1.0
3. polyPhen_nan -1.0
4. polyPhen_probably damaging -1.0
5. polyPhen_unknown -1.0
N. Name: functional impact feature
Description: specifies the variant’s impact on functionality of a protein resulting from an amino
acid substitution caused by the variant as determined from Sorting Intolerant from Tolerant (SIFT).
Type: mutation characteristic
1. sift_tolerated 1.0
2. sift_deleterious - low confidence -1.0
3. sift_nan -1.0
4. sift_deleterious -1.0
5. sift_tolerated - low confidence -1.0
O. Name: tumor frequency feature
Description: specifies frequency of the variant in sequenced cancerous tumors as determined from
Catalog of Somatic Mutations in Cancer (COSMIC) database.
Type: occurrence frequency
1. CNT 3.46492
P. Name: sub-population frequency feature
Description: specifies allele frequencies of the variant in ethnic sub-populations stratified from
sequenced populations as determined from Genome Aggregation Database (gnomAD) database.
Type: occurrence frequency
1. gnomadExomeAf 0.04
2. gnomadExome_afrAf 0.512886
3. gnomadExome_asmrAf 0.727304
4. gnomadExome_easAf 00.48744
5. gnomadExome_finAf 0.48818900000000004
6. gnomadExome_nfeAf 0.466213
7. gnomadExome_asjAf 0.443545
8. gnomadExome_sasAf 93
9. gnomadExome_othAf 0.499022
10. gnomadAf 0.5445989999999999
11. gnomad_afrAf 0.7156319999999999
12. gnomad_asmrAf 0
13. gnomad_easAf 0.46091800000000005
14. gnomad_finAf 0.48421400000000003
15. gnomad_nfeAf 0.473486
16. gnomad_asjAf 0.446667
17. gnomad_sasAf 0
18. gnomad_othAf 0.515369
Q. Name: ethnicity prediction feature
Description: specifies likelihoods identifying ethnic makeup of the individual that provided the
tumor sample associated with the variant.
Type: occurrence frequency
1. ethno_P_AFR 4.137788205335579e-49
2. ethno_P_AMR 0.00484825490847577
3. ethno_P_EAS 2.4537058155646697e-55
4. ethno_P_EUR 0.9951517345697741
5. ethno_P_SAS 1.0521763446561e-08
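The categorical features in the two samples use 1.0 for the observed category and -1.0 for every other category. The following is a minimal Python sketch of that encoding, not the disclosed implementation; the helper name `encode_categorical` and the category tuples are illustrative assumptions.

```python
# Hypothetical helper: encode one categorical metadata feature the way the
# samples do, with 1.0 for the observed category and -1.0 for the others.
ALT_CATEGORIES = ("alt_A", "alt_C", "alt_G", "alt_T", "alt_Other")
TYPE_CATEGORIES = ("type_snv", "type_insertion", "type_deletion")


def encode_categorical(observed, categories):
    """Return a dict mapping each category to 1.0 (observed) or -1.0."""
    if observed not in categories:
        raise ValueError(f"unknown category: {observed!r}")
    return {name: (1.0 if name == observed else -1.0) for name in categories}


# First sample: alternative allele T, single-nucleotide variant.
alt_feature = encode_categorical("alt_T", ALT_CATEGORIES)
type_feature = encode_categorical("type_snv", TYPE_CATEGORIES)
print(alt_feature)
print(type_feature)
```

Continuous features such as VAF or the allele frequencies would be passed through as numbers rather than encoded this way.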
highlights some examples of context metadata features 500A correlated with the variant. The
context metadata features 500A collectively represent the alternative allele feature and the mutation type feature
discussed above.
highlights some examples of sequencing metadata features 500B correlated with the variant.
The sequencing metadata features 500B collectively represent the variant allele frequency feature and the read depth
feature discussed above.
highlights some examples of functional metadata features 500C correlated with the variant.
The functional metadata features 500C collectively represent the amino acid impact feature, the evolutionary
conservation feature, the evolutionary conservation data availability feature, the clinical significance feature, the
functional impact features, and the tumor frequency feature discussed above.
highlights some examples of population metadata features 500D correlated with the variant.
The population metadata features 500D collectively represent the population frequency feature and the sub-population
frequency feature discussed above.
highlights one example of an ethnicity metadata feature 500E correlated with the variant. The
ethnicity metadata feature 500E represents the ethnicity prediction feature discussed above.
Variant Classification
The task of the variant classifier 104 is to classify each variant in the database 124 as somatic or
germline. shows an architectural example 600 of variant classification performed by the variant classifier
104. An input sequence 602, with a variant at a target position flanked by at least ten bases on each side, is fed as
input to the convolutional neural network (CNN) 612. Convolutional neural network 612 comprises convolution
layers which perform the convolution operation between the input values and convolution filters (matrix of weights)
that are learned over many gradient update iterations during the training.
Let m be the filter size and W be the matrix of weights. A convolution layer then performs a
convolution of W with the input X by calculating the dot product W • x + b, where x is an instance of X and b is
the bias. The step size by which the convolution filters slide across the input is called the stride, and the filter width
m is called the receptive field. The same convolution filter is applied across different positions of the input, which
reduces the number of weights learned. It also allows location-invariant learning, i.e., if an important pattern exists
in the input, the convolution filters learn it no matter where it is in the sequence. Additional details about the
convolutional neural network 612 can be found in I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and
Y. Bengio, “CONVOLUTIONAL NETWORKS,” Deep Learning, MIT Press, 2016; J. Wu, “INTRODUCTION TO
CONVOLUTIONAL NEURAL NETWORKS,” Nanjing University, 2017; and N. ten DIJKE, “Convolutional
Neural Networks for Regulatory Genomics,” Master’s Thesis, Universiteit Leiden Opleiding Informatica, 17 June
2017, the complete subject matter of which is expressly incorporated herein by reference in its entirety.
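The sliding dot product and the location-invariance property described above can be illustrated with a short Python sketch. It is a simplified one-channel example with hand-picked weights, not the trained filters of the convolutional neural network 612; the function name `conv1d` and all values are illustrative assumptions.

```python
def conv1d(x, w, b, stride=1):
    """Slide filter w (width m) over x and compute w . window + b at each step."""
    m = len(w)
    out = []
    for start in range(0, len(x) - m + 1, stride):
        window = x[start:start + m]
        out.append(sum(wi * xi for wi, xi in zip(w, window)) + b)
    return out


w = [1.0, 2.0, 1.0]  # one filter, so only three weights are shared by all positions
b = 0.0

# The same pattern (1, 2, 1) placed at two different positions.
early = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

early_response = conv1d(early, w, b)
late_response = conv1d(late, w, b)

# The peak response is identical; only its position moves with the pattern.
print(max(early_response), early_response.index(max(early_response)))
print(max(late_response), late_response.index(max(late_response)))
```

Because one set of weights is reused at every position, the filter responds equally strongly to the pattern wherever it occurs, which is the behavior the paragraph above attributes to convolution layers.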
After processing the input sequence 602, the convolutional neural network 612 produces an
intermediate convolved feature 622 as output. The concatenator 112 concatenates (*) the intermediate convolved
feature 622 with one or more metadata features 626 discussed above. Concatenation can occur across the row
dimension or the column dimension. The result of the concatenation is a feature sequence 634, which is stored in the
database 122.
The feature sequence 634 is fed as input to the fully-connected neural network (FCNN) 674. The fully-connected
neural network 674 comprises fully-connected layers, in which each neuron receives input from all the previous
layer’s neurons and sends its output to every neuron in the next layer. This contrasts with how convolutional layers
work, where the neurons send their output to only some of the neurons in the next layer. The neurons of the fully-connected
layers are optimized over many gradient update iterations during the training. Additional details about the
fully-connected neural network 674 can be found in I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and
Y. Bengio, “CONVOLUTIONAL NETWORKS,” Deep Learning, MIT Press, 2016; J. Wu, “INTRODUCTION TO
CONVOLUTIONAL NEURAL NETWORKS,” Nanjing University, 2017; and N. ten DIJKE, “Convolutional
Neural Networks for Regulatory Genomics,” Master’s Thesis, Universiteit Leiden Opleiding Informatica, 17 June
2017, the complete subject matter of which is expressly incorporated herein by reference in its entirety.
A classification layer 684 of the fully-connected neural network 674 outputs classification scores 694
for likelihood that the variant is a somatic variant, a germline variant, or noise. The classification layer 684 can be a
softmax layer or a sigmoid layer. The number of classes and their type can be modified, depending on the
implementation. As discussed above, having the noise category improves classification along the somatic and
germline categories.
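For the softmax option, the three classification scores are obtained by exponentiating the final-layer outputs and normalizing them to sum to one. The Python sketch below illustrates this; the input values in `logits` are hypothetical, and the function and variable names are assumptions for illustration only.

```python
import math

CLASSES = ("somatic", "germline", "noise")


def softmax(logits):
    """Convert raw final-layer outputs into scores that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # hypothetical outputs of the last fully-connected layer
scores = softmax(logits)
predicted = CLASSES[scores.index(max(scores))]
print(dict(zip(CLASSES, scores)), predicted)
```

A sigmoid layer would instead score each class independently, so its outputs need not sum to one.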
In other implementations, the metadata features 626 can be fed directly to the convolutional neural
network 612 and encoded into the input sequence 602 or fed separately, but simultaneously with the input sequence
602 or fed separately, but before/after the input sequence 602.
shows an algorithmic example 700 of variant classification performed by the variant classifier
104. In the illustrated implementation, the convolution neural network (CNN) 612 has two convolution layers and
the fully-connected neural network (FCNN) 674 has three fully-connected layers. In other implementations, the
variant classifier 104, and its convolution neural network 612 and fully-connected neural network 674, can have
additional, fewer, or different parameters and hyperparameters. Some examples of parameters are number of
convolution layers, number of batch normalization and ReLU layers, number of fully-connected layers, number of
convolution filters in respective convolution layers, number of neurons in respective fully-connected layers, number
of outputs produced by the final classification layer, and residual connectivity. Some examples of hyperparameters
are window size of the convolution filters, stride length of the convolution filters, padding, and dilation. In the
discussion below, the term “layer” refers to an algorithm implemented in code as a software logic or module. Some
examples of layers can be found in Keras™ documentation available at https://keras.io/layers/about-keras-layers/,
the complete subject matter of which is expressly incorporated herein by reference in its entirety.
A one-hot encoded input sequence 702 is fed to a first convolution layer 704 of the convolutional
neural network (CNN) 612. The dimensionality of the input sequence 702 is 41, 5, where 41 represents the 41 bases
in the input sequence 702 with a particular variant at a center target position flanked by 20 bases on each side, and 5
represents the 5 symbols A, T, C, G, N used to encode the input sequence 702 and illustrated in
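A one-hot encoding with dimensionality 41, 5 can be sketched in Python as follows. The column order A, T, C, G, N follows the order in which the symbols are listed above; the flanking sequences, the treatment of unrecognized characters as N, and the function name are assumptions made for illustration.

```python
SYMBOLS = "ATCGN"  # column order assumed from the order listed in the text


def one_hot_encode(sequence):
    """Encode each base as a 5-element row with a single 1."""
    rows = []
    for base in sequence.upper():
        if base not in SYMBOLS:
            base = "N"  # assumption: anything unrecognized is treated as N
        rows.append([1 if base == symbol else 0 for symbol in SYMBOLS])
    return rows


# Hypothetical 41-base window: 20 flanking bases, the variant base, 20 flanking bases.
left_flank = "ACGT" * 5
right_flank = "TGCA" * 5
variant_base = "T"
sequence = left_flank + variant_base + right_flank

encoded = one_hot_encode(sequence)
print(len(encoded), len(encoded[0]))  # rows (bases), columns (symbols)
print(encoded[20])                    # row at the center target position
```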
The first convolution layer 704 has 25 filters, each of which convolves over the input sequence 702
with a window size of 7 and stride length of 1. The convolution is followed by batch normalization and ReLU
nonlinearity layers 712. What results is an output (feature map) 714 of dimensionality 25, 35. Output 714 can be
regarded as the first intermediate convolved feature.
Output 714 is fed as input to a second convolution layer 722 of the convolutional neural network 612.
The second convolution layer 722 has 15 filters, each of which convolves over the output 714 with a window size of
5 and stride length of 1. The convolution is followed by batch normalization and ReLU nonlinearity layers 724.
What results is an output (feature map) 732 of dimensionality 15, 31. Output 732 can be regarded as the second
intermediate convolved feature and also the final output of the convolutional neural network 612.
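The feature-map lengths 35 and 31 follow from the usual output-length formula for a convolution without padding. The Python sketch below checks that arithmetic; it assumes no padding in either layer, and the second-layer window size of 5 is the value that takes 35 positions to 31 at a stride of 1.

```python
def conv_output_length(input_length, window, stride=1, padding=0):
    """Number of positions a filter visits along the sequence axis."""
    return (input_length + 2 * padding - window) // stride + 1


first_length = conv_output_length(41, window=7, stride=1)             # first layer
second_length = conv_output_length(first_length, window=5, stride=1)  # second layer

print((25, first_length))   # 25 filters over the 41-base input
print((15, second_length))  # 15 filters over the first feature map
```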
In order to concatenate the output 732 with the metadata features 742 and also to allow downstream
processing by the fully-connected neural network (FCNN) 674, the output 732 is flattened by a flattening layer 734.
Flattening includes vectorizing the output 732 to have either one row or one column. That is, by way of example,
converting the output 732 of dimensionality 15, 31 into a flattened vector of dimensionality 1, 465 (1 row and 15 x 31
= 465 columns).
The metadata features 742, correlated with the particular variant, have a dimensionality of 49, 1. A
concatenation layer 744 concatenates the metadata features 742 with the flattened vector derived from the output
732. What results is an output 752 of dimensionality 1, 49. Output 752 can be regarded as the feature sequence.
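Flattening and concatenation can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the disclosed concatenation layer 744: it uses zero-filled placeholder values and simple end-to-end concatenation, under which the 465 flattened values and the 49 metadata values yield a single row of 465 + 49 = 514 entries. It does not attempt to reproduce the dimensionality stated above for output 752.

```python
def flatten(feature_map):
    """Vectorize a 2-D feature map row by row into a single list."""
    return [value for row in feature_map for value in row]


# Placeholder values only; the shapes are what matter here.
feature_map = [[0.0] * 31 for _ in range(15)]  # output 732: 15 filters x 31 positions
metadata = [0.0] * 49                          # metadata features 742

flattened = flatten(feature_map)
concatenated = flattened + metadata  # assumption: plain end-to-end concatenation

print(len(flattened))     # 15 * 31
print(len(concatenated))  # flattened length plus metadata length
```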
The output 752 is then fed as input to the fully-connected neural network (FCNN) 674. The fully-connected
neural network 674 has three fully-connected layers 754, 764, and 774, each succeeded by pairs 762, 772,
and 782 of batch normalization and ReLU nonlinearity layers. The first fully-connected layer 754 has 512 neurons,
which are fully connected to 512 neurons in the second fully-connected layer 764. The 512 neurons in the second
fully-connected layer 764 are fully connected to 256 neurons in the third fully-connected layer 774.
The classification layer 784 (e.g., softmax) has 3 neurons which output the 3 classification scores or
probabilities 792 for the particular variant being somatic, germline, or noise.
In other implementations, the metadata features 742 can be fed directly to the convolutional neural
network 612 and encoded into the input sequence 702 or fed separately, but simultaneously with the input sequence
702 or fed separately, but before/after the input sequence 702.
Transfer Learning
depicts one implementation of training the variant classifier 104 according to a transfer
learning strategy 800, followed by evaluation and testing of the trained variant classifier 104. Transfer learning
strategy 800 involves pre-training 802 the variant classifier 104 on a base dataset 812 (e.g., TCGA) and task (variant
classification), and then repurposing or transferring the learned weights (filters, neurons) of the convolutional neural
network (CNN) 612 and the fully-connected neural network 674 for training 822 on a target dataset 832 (e.g., TST)
and task (variant classification). This process works well because the TCGA dataset 812 and the TST dataset 832
share common features.
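The idea of reusing pre-trained weights as the starting point for training on a second dataset can be sketched in Python with a deliberately small stand-in model. The sketch uses a logistic model trained by stochastic gradient descent rather than the convolutional and fully-connected networks described here, and the toy datasets, learning rates, and epoch counts are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(weights, dataset, epochs, learning_rate):
    """Run stochastic gradient descent on a logistic model from the given weights."""
    w = list(weights)  # copy, so the caller's weights are not modified
    for _ in range(epochs):
        for features, label in dataset:
            prediction = sigmoid(sum(wi * xi for wi, xi in zip(w, features)))
            error = prediction - label
            w = [wi - learning_rate * error * xi for wi, xi in zip(w, features)]
    return w


# Hypothetical toy data: (features, label) pairs; the first feature is a bias term.
base_dataset = [([1.0, 0.9, 0.1], 1), ([1.0, 0.8, 0.3], 1),
                ([1.0, 0.2, 0.9], 0), ([1.0, 0.1, 0.7], 0)]
target_dataset = [([1.0, 0.7, 0.2], 1), ([1.0, 0.3, 0.8], 0)]

initial = [0.0, 0.0, 0.0]
# Pre-training on the larger base dataset.
pretrained = train(initial, base_dataset, epochs=50, learning_rate=0.5)
# Training on the target dataset starts from the pre-trained weights, not from scratch.
fine_tuned = train(pretrained, target_dataset, epochs=10, learning_rate=0.1)

print(pretrained)
print(fine_tuned)
```

The only point of the sketch is the hand-off: the second call to `train` receives `pretrained` instead of `initial`.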
Evaluation 842 includes iteratively checking the variant classification performance of the variant
classifier 104 on validation data 852 held-out from the TST dataset 832. After a convergence condition has been met
(e.g., meeting a certain benchmark like F-measure or minimizing error below a threshold), the trained variant
classifier 104 is deployed for inference or testing 862. Deployment 856 can include hosting the trained variant
classifier 104 on a cloud-based environment like Illumina’s BaseSpace™ for use by the research community,
making the trained classifier 104 runnable on a memory chip or GPU for incorporation in mobile computing
devices, and/or making the variant classifier 104 available for download from the web. During inference 862, the
trained variant classifier 104 can receive input sequences in the form of inference data 872 and perform variant
classification as discussed above.
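A convergence check based on the F-measure can be sketched in Python as follows. The F-measure here is the standard harmonic mean of precision and recall; the counts, the benchmark value, and the function names are hypothetical and chosen only to illustrate the check.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (F1)."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def has_converged(true_positives, false_positives, false_negatives, benchmark):
    """True once the validation F-measure reaches the chosen benchmark."""
    return f_measure(true_positives, false_positives, false_negatives) >= benchmark


# Hypothetical validation counts from two evaluation rounds.
early_round = has_converged(50, 50, 50, benchmark=0.85)
later_round = has_converged(90, 10, 10, benchmark=0.85)
print(early_round, later_round)
```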
Performance Results
shows performance results 900 of the variant caller (also referred to herein as Sojourner) on
exonic data. These results, quantified by sensitivity and specificity, establish Sojourner’s advantages and superiority
over a non-deep neural network classifier.
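Sensitivity and specificity are computed from confusion-matrix counts. The Python sketch below shows the standard definitions; which class is treated as positive and the counts used are hypothetical and are not taken from the reported results.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual positives that were classified as positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of actual negatives that were classified as negative."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts, treating one class (for example, somatic) as positive.
sens = sensitivity(true_positives=95, false_negatives=5)
spec = specificity(true_negatives=980, false_positives=20)
false_positive_rate = 1.0 - spec
print(sens, spec, false_positive_rate)
```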
shows the improvement in false positive rate 1000 using Sojourner versus the non-deep neural
network classifier when classifying variants over exons.
shows the mean absolute tumor mutational burden (TMB) error 1100 using Sojourner versus
the non-deep neural network classifier when classifying variants over exons.
shows the improvement in mean absolute TMB error 1200 using Sojourner versus the non-
deep neural network classifier when classifying variants over exons.
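Mean absolute error is the average of the absolute differences between estimated and reference values. The Python sketch below applies that definition to per-sample TMB values; the numbers are hypothetical and are not taken from the reported results.

```python
def mean_absolute_error(estimated, reference):
    """Average absolute difference between paired values."""
    if not estimated or len(estimated) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)


# Hypothetical per-sample TMB values.
estimated_tmb = [10.0, 4.0, 22.0]
reference_tmb = [12.0, 4.0, 19.0]
tmb_error = mean_absolute_error(estimated_tmb, reference_tmb)
print(tmb_error)
```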
shows performance results 1300 of Sojourner on CDS (coding DNA sequence) data. These
results, quantified by sensitivity and specificity, establish Sojourner’s advantages and superiority over the non-deep
neural network classifier.
shows lower false positive rate 1400 using Sojourner versus the non-deep neural network
classifier when classifying variants over coding regions.
shows the mean absolute TMB error 1500 using Sojourner versus the non-deep neural
network classifier when classifying variants over coding regions.
shows lower mean absolute TMB error 1600 using Sojourner versus the non-deep neural
network classifier when classifying variants over exons.
Computer System
shows a computer system 1700 that can be used to implement the variant classifier 104.
Computer system 1700 includes at least one central processing unit (CPU) 1772 that communicates with a number
of peripheral devices via bus subsystem 1755. These peripheral devices can include a storage subsystem 1710
including, for example, memory devices and a file storage subsystem 1736, user interface input devices 1738, user
interface output devices 1776, and a network interface subsystem 1774. The input and output devices allow user
interaction with computer system 1700. Network interface subsystem 1774 provides an interface to outside
networks, including an interface to corresponding interface devices in other computer systems.
In one implementation, the variant classifier 104 is communicably linked to the storage subsystem
1710 and the user interface input devices 1738.
User interface input devices 1738 can include a keyboard; pointing devices such as a mouse, trackball,
touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as
voice recognition systems and microphones; and other types of input devices. In general, use of the term “input
device” is intended to include all possible types of devices and ways to input information into computer system
1700.
User interface output devices 1776 can include a display subsystem, a printer, a fax machine, or non-visual
displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube
(CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for
creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices.
In general, use of the term “output device” is intended to include all possible types of devices and ways to output
information from computer system 1700 to the user or to another machine or computer system.
Storage subsystem 1710 stores programming and data constructs that provide the functionality of some
or all of the modules and methods described herein. These software modules are generally executed by deep
learning processors 1778.
Deep learning processors 1778 can be graphics processing units (GPUs) or field-programmable gate
arrays (FPGAs). Deep learning processors 1778 can be hosted by a deep learning cloud platform such as Google
Cloud Platform™, Xilinx™, and Cirrascale™. Examples of deep learning processors 1778 include Google’s Tensor
Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX17 Rackmount Series™,
NVIDIA DGX-1™, Microsoft’s Stratix V FPGA™, Graphcore’s Intelligent Processor Unit (IPU)™, Qualcomm’s
Zeroth Platform™ with Snapdragon processors™, NVIDIA’s Volta™, NVIDIA’s DRIVE PX™, NVIDIA’s
JETSON TX1/TX2 MODULE™, Intel’s Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM’s DynamicIQ™, IBM
TrueNorth™, and others.
Memory subsystem 1722 used in the storage subsystem 1710 can include a number of memories
including a main random access memory (RAM) 1732 for storage of instructions and data during program execution
and a read only memory (ROM) 1734 in which fixed instructions are stored. A file storage subsystem 1736 can
provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along
with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules
implementing the functionality of certain implementations can be stored by file storage subsystem 1736 in the
storage subsystem 1710, or in other machines accessible by the processor.
Bus subsystem 1755 provides a mechanism for letting the various components and subsystems of
computer system 1700 communicate with each other as intended. Although bus subsystem 1755 is shown
schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
Computer system 1700 itself can be of varying types including a personal computer, a portable
computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a
widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to
the ever-changing nature of computers and networks, the description of computer system 1700 depicted in
is intended only as a specific example for purposes of illustrating the preferred embodiments of the present
invention. Many other configurations of computer system 1700 are possible having more or fewer components than
the computer system depicted in .
Particular Implementations
We describe a system and various implementations of a variant classifier that uses trained deep neural
networks to predict whether a given variant is somatic or germline. One or more features of an implementation can
be combined with the base implementation. Implementations that are not mutually exclusive are taught to be
combinable. One or more features of an implementation can be combined with other implementations. This
disclosure periodically reminds the user of these options. Omission from some implementations of recitations that
repeat these options should not be taken as limiting the combinations taught in the preceding sections – these
recitations are hereby incorporated forward by reference into each of the following implementations.
In one implementation, the technology disclosed presents a neural network-implemented system. The
system comprises a variant classifier which runs on one or more processors operating in parallel and coupled to
memory.
The variant classifier has: (i) a convolutional neural network and (ii) a fully-connected neural network.
The convolutional neural network has at least two convolution layers and each of the convolution layers has at least
five convolution filters trained over one thousand to millions of gradient update iterations to: (a) process an input
sequence with a variant at a target position flanked by at least ten bases on each side, and (b) produce an
intermediate convolved feature. In some implementations, each of the convolution layers has at least six convolution
filters.
A metadata correlator correlates the variant with a set of metadata features which represent: (i)
mutation characteristics of the variant, (ii) read mapping statistics of the variant, and (iii) occurrence frequency of
the variant.
The fully-connected neural network has at least two fully-connected layers trained over the one
thousand to millions of gradient update iterations to: (a) process a feature sequence derived from a combination of
the intermediate convolved feature and the metadata features, and (b) output classification scores for likelihood that
the variant is a somatic variant, a germline variant, or noise.
This system implementation and other systems disclosed optionally include one or more of the
following features. The system can also include features described in connection with methods disclosed. In the interest
of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to
systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The
reader will understand how features identified in this section can readily be combined with base features in other
statutory classes.
The metadata correlator can be further configured to correlate the variant with an amino acid impact
feature that specifies whether the variant is a nonsynonymous variant that changes a codon so as to produce a new
codon which codes for a different amino acid.
The metadata correlator can be further configured to correlate the variant with a variant type feature
that specifies whether the variant is a single-nucleotide polymorphism, an insertion, or a deletion.
The metadata correlator can be further configured to correlate the variant with a read mapping statistic
feature that specifies quality parameters of read mapping that identified the variant.
The metadata correlator can be further configured to correlate the variant with a population frequency
feature that specifies allele frequencies of the variant in sequenced populations.
The metadata correlator can be further configured to correlate the variant with a sub-population
frequency feature that specifies allele frequencies of the variant in ethnic sub-populations stratified from sequenced
populations.
The metadata correlator can be further configured to correlate the variant with an evolutionary
conservation feature that specifies conservativeness of the target position across multiple species.
The metadata correlator can be further configured to correlate the variant with a clinical significance
feature that specifies the variant’s clinical effect, drug sensitivity, and histocompatibility as determined from clinical
tests.
The metadata correlator can be further configured to correlate the variant with a functional impact
feature that specifies the variant’s impact on functionality of a protein resulting from an amino acid substitution
caused by the variant.
The metadata correlator can be further configured to correlate the variant with an ethnicity prediction
feature that specifies likelihoods identifying ethnic makeup of an individual that provided a tumor sample associated
with the variant.
The metadata correlator can be further configured to correlate the variant with a tumor frequency
feature that specifies frequency of the variant in sequenced cancerous tumors.
The metadata correlator can be further configured to correlate the variant with an alternative allele
feature that ies at least one base mutated by the variant at the target position in a reference sequence.
The convolutional neural k and the fully-connected neural network of the variant classifier can
be trained together -end on five hundred thousand training examples from a first dataset of -causing
mutations, followed by training on fifty thousand training examples from a second dataset of cancer-causing
mutations.
The convolutional neural network and the fully-connected neural network of the variant classifier can
be tested together end-to-end on validation data held-out only from the second dataset.
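The two-stage training and the held-out validation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the datasets are small stand-ins, the 20% hold-out fraction is arbitrary, and `hold_out` is a hypothetical helper; the disclosure does not state how the validation data are selected from the second dataset.

```python
import random


def hold_out(examples, fraction, seed=0):
    """Shuffle a copy of `examples` and split it into (training, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Small stand-ins for the first (500,000 example) and second (50,000 example)
# datasets of cancer-causing mutations.
first_dataset = [("first", i) for i in range(500)]
second_dataset = [("second", i) for i in range(50)]

# Validation data are held out only from the second dataset.
second_train, validation = hold_out(second_dataset, fraction=0.2)

# Stage 1 trains on the first dataset, stage 2 on the remaining second dataset.
training_schedule = [first_dataset, second_train]
```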
Each of the convolution layers and the fully-connected layers can be followed by at least one rectified
linear unit layer. Each of the convolution layers and the fully-connected layers can be followed by at least one batch
normalization layer.
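For illustration, a rectified linear unit and a batch normalization step can be written as below. The ordering (activation first, then normalization), the scale and shift defaults, and the epsilon value are assumptions of this sketch; the disclosure states only that each layer can be followed by at least one layer of each kind.

```python
import math


def relu(values):
    """Rectified linear unit: negative activations are clamped to zero."""
    return [max(0.0, v) for v in values]


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature's activations across a batch, then scale and shift.

    `values` holds the same feature for every example in the batch.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]


# Hypothetical activations of one feature across a batch of four examples.
activations = [-1.0, 0.5, 2.0, 4.5]
normalized = batch_norm(relu(activations))
```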
The variant can be flanked by at least 19 bases on each side. In another implementation, the variant can
be flanked by at least 20 bases on each side.
The system can be further configured to comprise a concatenator that derives the feature sequence by
concatenating the intermediate feature with the metadata features.
{00691484.DOCX }
Atty. Docket No.: ILLM 1007-3WO/IPPCT
The metadata features can be encoded in a one-dimensional array. The input sequence can be encoded
in an n-dimensional array, where n≥2.
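One possible encoding consistent with the above is sketched here: the input sequence is one-hot encoded into a two-dimensional array (positions by bases), and the metadata features are held in a one-dimensional array. The base ordering, the all-zero encoding of unknown bases, the example flanks, and the metadata values are assumptions made for illustration only.

```python
BASE_ORDER = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Encode a base string as rows of 0/1 values, one row per position.

    Assumption: any base outside A, C, G, T (for example N) encodes as all zeros.
    """
    return [[1 if base == b else 0 for b in BASE_ORDER] for base in sequence.upper()]


def build_input(left_flank, variant_base, right_flank):
    """Place the variant at the target position between its flanking bases."""
    return one_hot_encode(left_flank + variant_base + right_flank)


left = "ACTACCCTAG"   # ten left flanking bases (hypothetical)
right = "TACGATACAG"  # ten right flanking bases (hypothetical)
encoded = build_input(left, "G", right)  # 21 positions x 4 bases (n = 2)
target_index = len(left)                 # row holding the variant

# Hypothetical one-dimensional metadata array:
# [amino acid impact, population allele frequency, mapping quality]
metadata = [1.0, 0.0015, 60.0]
```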
Other implementations may include a non-transitory computer readable storage medium storing
instructions executable by a processor to perform actions of the system described above. Each of the features
discussed in the particular implementation section for other implementations applies equally to this implementation.
As indicated above, all the other features are not repeated here and should be considered repeated by reference.
In another implementation, the technology disclosed presents a neural network-implemented method
of variant classification.
The method includes processing an input sequence through a convolutional neural network to produce
an intermediate convolved feature. The convolutional neural network has at least two convolution layers and each of
the convolution layers has at least five convolution filters trained over one thousand to millions of gradient update
iterations. In some implementations, each of the convolution layers has at least six convolution filters.
The input sequence has a variant at a target position flanked by at least ten bases on each side.
The method includes correlating the variant with a set of metadata features which represent: (i)
mutation characteristics of the variant, (ii) read mapping statistics of the variant, and (iii) occurrence frequency of
the variant.
The method includes processing a feature sequence through a fully-connected neural network to output
classification scores for likelihood that the variant is a somatic variant, a germline variant, or noise. The
fully-connected neural network has at least two fully-connected layers trained over the one thousand to millions of
gradient update iterations. The feature sequence is derived from a combination of the intermediate convolved feature
and the metadata features.
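The method above can be sketched end to end as a forward pass. This is a minimal sketch, not the disclosed implementation: the weights are random and untrained, the filter width, hidden layer sizes, and softmax output are assumptions, batch normalization is omitted for brevity, and all function names are hypothetical.

```python
import math
import random


def relu(xs):
    return [max(0.0, x) for x in xs]


def conv1d(inputs, filters):
    """Valid 1D convolution.

    inputs: [length][in_channels]; filters: [n_filters][width][in_channels].
    Returns [length - width + 1][n_filters].
    """
    width = len(filters[0])
    out = []
    for start in range(len(inputs) - width + 1):
        window = inputs[start:start + width]
        out.append([
            sum(w * x for w_row, x_row in zip(f, window) for w, x in zip(w_row, x_row))
            for f in filters
        ])
    return out


def dense(vector, weights, biases):
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(rng, seq_len=21, n_meta=3, n_filters=5, width=3, hidden=(16, 8)):
    """Random placeholder weights: two convolution layers, two FC layers."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    conv, channels, length = [], 4, seq_len
    for _ in range(2):
        conv.append([matrix(width, channels) for _ in range(n_filters)])
        channels, length = n_filters, length - width + 1
    fc, size = [], length * channels + n_meta
    for units in hidden:
        fc.append((matrix(units, size), [0.0] * units))
        size = units
    return {"conv": conv, "fc": fc, "out": (matrix(3, size), [0.0] * 3)}


def classify(encoded_sequence, metadata, params):
    x = encoded_sequence
    for filters in params["conv"]:
        x = [relu(row) for row in conv1d(x, filters)]
    intermediate = [v for row in x for v in row]      # intermediate convolved feature
    feature_sequence = intermediate + list(metadata)  # combination by concatenation
    h = feature_sequence
    for weights, biases in params["fc"]:
        h = relu(dense(h, weights, biases))
    w_out, b_out = params["out"]
    return softmax(dense(h, w_out, b_out))            # [somatic, germline, noise]


sequence = "ACTACCCTAG" + "G" + "TACGATACAG"  # variant flanked by ten bases per side
encoded = [[1.0 if base == b else 0.0 for b in "ACGT"] for base in sequence]
scores = classify(encoded, [1.0, 0.0015, 60.0], init_params(random.Random(0)))
```

Because the weights are untrained, the three scores are not meaningful predictions; the sketch only shows how the convolved feature and the metadata features meet in a single feature sequence before the fully-connected layers.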
Other implementations may include a non-transitory computer readable storage medium (CRM)
storing instructions executable by a processor to perform the method described above. Yet another implementation
may include a system including memory and one or more processors operable to execute instructions, stored in the
memory, to perform the method described above. Each of the features discussed in the particular implementation
section for other implementations applies equally to this implementation. As indicated above, all the other features are
not repeated here and should be considered repeated by reference.
In yet another implementation, the technology disclosed presents a neural network-implemented
system. The system comprises a variant classifier which runs on one or more processors operating in parallel and
coupled to memory.
The variant classifier has: (i) a convolutional neural network and (ii) a fully-connected neural network.
The convolutional neural network is trained to process an input sequence and produce an intermediate convolved
feature. The convolutional neural network has at least two convolution layers and each of the convolution layers has
at least five convolution filters trained over one thousand to millions of gradient update iterations. In some
implementations, each of the convolution layers has at least six convolution filters.
The input sequence has a variant at a target position flanked by at least ten bases on each side and has a
set of metadata features correlated with the variant.
The metadata features represent: (i) mutation characteristics of the variant, (ii) read mapping statistics
of the variant, and (iii) occurrence frequency of the variant.
The fully-connected neural network is trained to process the intermediate convolved feature and output
classification scores for likelihood that the variant is a somatic variant, a germline variant, or noise. The
fully-connected neural network has at least two fully-connected layers trained over the one thousand to millions of
gradient update iterations.
The system can be further configured to comprise a metadata correlator that correlates the variant with
the metadata features.
Other implementations may include a non-transitory computer readable storage medium storing
instructions executable by a processor to perform actions of the system described above. Each of the features
discussed in the particular implementation section for other implementations applies equally to this implementation.
As indicated above, all the other features are not repeated here and should be considered repeated by reference.
In a yet further implementation, the technology disclosed presents a neural network-implemented
method of variant classification.
The method includes processing an input sequence through a convolutional neural network to produce
an intermediate convolved feature. The convolutional neural network has at least two convolution layers and each of
the convolution layers has at least five convolution filters trained over one thousand to millions of gradient update
iterations.
The input sequence has a variant at a target position flanked by at least ten bases on each side and has a
set of metadata features correlated with the variant.
The metadata features represent: (i) mutation characteristics of the variant, (ii) read mapping statistics
of the variant, and (iii) occurrence frequency of the variant.
The method includes processing the intermediate convolved feature through a fully-connected neural
network to output classification scores for likelihood that the variant is a somatic variant, a germline variant, or
noise. The fully-connected neural network has at least two fully-connected layers trained over the one thousand to
millions of gradient update iterations.
Other implementations may include a non-transitory computer readable storage medium (CRM)
storing instructions executable by a processor to perform the method described above. Yet another implementation
may include a system including memory and one or more processors operable to execute instructions, stored in the
memory, to perform the method described above. Each of the features discussed in the particular implementation
section for other implementations applies equally to this implementation. As indicated above, all the other features are
not repeated here and should be considered repeated by reference.
While the technology disclosed is disclosed by reference to the preferred embodiments and examples
detailed above, it is to be understood that these examples are intended in an illustrative rather than in a limiting
sense. It is contemplated that modifications and combinations will readily occur to those skilled in the art, which
modifications and combinations will be within the spirit of the innovation and the scope of the following claims.
The disclosure also includes the following clauses:
1. A neural network-implemented system, comprising:
a variant classifier, running on one or more processors operating in parallel and coupled to memory, that has
a convolutional neural network having at least two convolution layers and each of the convolution layers
having at least five convolution filters trained over one thousand to millions of gradient update iterations to
process an input sequence with a variant at a target position flanked by at least ten bases on each
side, and
produce an intermediate convolved feature;
a metadata correlator that correlates the variant with a set of metadata features which represent
mutation characteristics of the variant,
read mapping statistics of the variant, and
occurrence frequency of the variant; and
a fully-connected neural network having at least two fully-connected layers trained over the one thousand
to millions of gradient update iterations to
process a feature sequence derived from a combination of the intermediate convolved feature and
the metadata features, and
output classification scores for likelihood that the variant is a somatic variant, a germline variant,
or noise.
2. The neural network-implemented system of clause 1, wherein the metadata correlator is further configured to
correlate the variant with an amino acid impact feature that specifies whether the variant is a nonsynonymous
variant that changes a codon so as to produce a new codon which codes for a different amino acid.
3. The neural network-implemented system of any of clauses 1-2, wherein the metadata correlator is further
configured to correlate the variant with a variant type feature that specifies whether the variant is a single-nucleotide
polymorphism, an insertion, or a deletion.
4. The neural network-implemented system of any of clauses 1-3, wherein the metadata correlator is further
configured to correlate the variant with a read mapping statistic feature that specifies quality parameters of read
mapping that identified the variant.
5. The neural network-implemented system of any of clauses 1-4, wherein the metadata correlator is further
configured to correlate the variant with a population frequency feature that specifies allele frequencies of the variant
in sequenced populations.
6. The neural network-implemented system of any of clauses 1-5, wherein the metadata correlator is further
configured to correlate the variant with a sub-population frequency feature that specifies allele frequencies of the
variant in ethnic sub-populations stratified from sequenced populations.
7. The neural network-implemented system of any of clauses 1-6, wherein the metadata correlator is further
configured to correlate the variant with an evolutionary conservation feature that specifies conservativeness of the
target position across multiple species.
8. The neural network-implemented system of any of clauses 1-7, wherein the metadata correlator is further
configured to correlate the variant with a clinical significance feature that specifies the variant’s clinical effect, drug
sensitivity, and histocompatibility as determined from clinical tests.
9. The neural network-implemented system of any of clauses 1-8, wherein the metadata correlator is further
configured to correlate the variant with a functional impact feature that specifies the variant’s impact on
functionality of a protein resulting from an amino acid substitution caused by the variant.
10. The neural network-implemented system of any of clauses 1-9, wherein the metadata correlator is further
configured to correlate the variant with an ethnicity prediction feature that specifies likelihoods identifying ethnic
makeup of an individual that provided a tumor sample associated with the variant.
11. The neural network-implemented system of any of clauses 1-10, wherein the metadata correlator is further
configured to correlate the variant with a tumor frequency feature that specifies frequency of the variant in
sequenced cancerous tumors.
12. The neural network-implemented system of any of clauses 1-11, wherein the metadata correlator is further
configured to correlate the variant with an alternative allele feature that specifies at least one base mutated by the
variant at the target position in a reference sequence.
13. The neural network-implemented system of any of clauses 1-12, wherein the convolutional neural network and
the fully-connected neural network of the variant classifier are trained together end-to-end on five hundred thousand
training examples from a first dataset of cancer-causing mutations, followed by training on fifty thousand training
examples from a second dataset of cancer-causing mutations.
14. The neural network-implemented system of any of clauses 1-13, wherein the convolutional neural network and
the fully-connected neural network of the variant classifier are tested together end-to-end on validation data held-out
only from the second dataset.
15. The neural network-implemented system of any of clauses 1-14, wherein each of the convolution layers and
the fully-connected layers is followed by at least one rectified linear unit layer.
16. The neural network-implemented system of any of clauses 1-15, wherein each of the convolution layers and
the fully-connected layers is followed by at least one batch normalization layer.
17. The neural network-implemented system of any of clauses 1-16, wherein the variant is flanked by at least 19
bases on each side.
18. The neural network-implemented system of any of clauses 1-17, further configured to comprise a concatenator
that derives the feature sequence by concatenating the intermediate feature with the metadata features.
19. The neural network-implemented system of any of clauses 1-18, wherein the metadata features are encoded in
a one-dimensional array.
20. The neural network-implemented system of any of clauses 1-19, wherein the input sequence is encoded in an
n-dimensional array, where n≥2.
21. The neural network-implemented system of any of clauses 1-20, wherein each of the convolution layers has at
least six convolution filters.
22. A neural network-implemented method of variant classification, including:
processing an input sequence through a convolutional neural network to produce an intermediate convolved
feature, wherein
the convolutional neural network has at least two convolution layers and each of the convolution
layers has at least five convolution filters trained over one thousand to millions of gradient update
iterations, and
the input sequence has a variant at a target position flanked by at least ten bases on each side;
correlating the variant with a set of metadata features which represent
mutation characteristics of the variant,
read mapping statistics of the variant, and
occurrence frequency of the variant; and
processing a feature sequence through a fully-connected neural network to output classification scores for
likelihood that the variant is a somatic variant, a germline variant, or noise, wherein
the fully-connected neural network has at least two fully-connected layers trained over the one
thousand to millions of gradient update iterations, and
the feature sequence is derived from a combination of the intermediate convolved feature and the
metadata features.
23. The neural network-implemented method of clause 22, implementing each of the clauses which ultimately
depend from clause 1.
24. A non-transitory computer readable storage medium impressed with computer program instructions to classify
variants, the instructions, when executed on a processor, implement a method comprising:
processing an input sequence through a convolutional neural network to produce an intermediate convolved
feature, wherein
the convolutional neural network has at least two convolution layers and each of the convolution
layers has at least five convolution filters trained over one thousand to millions of gradient update
iterations, and
the input sequence has a variant at a target position flanked by at least ten bases on each side;
correlating the variant with a set of metadata features which represent
mutation characteristics of the variant,
read mapping statistics of the variant, and
occurrence frequency of the variant; and
processing a feature sequence through a fully-connected neural network to output classification scores for
likelihood that the variant is a somatic variant, a germline variant, or noise, wherein
the fully-connected neural network has at least two fully-connected layers trained over the one
thousand to millions of gradient update iterations, and
the feature sequence is derived from a combination of the intermediate convolved feature and the
metadata features.
25. The non-transitory computer readable storage medium of clause 24, implementing each of the clauses which
ultimately depend from clause 1.
26. A neural network-implemented system, comprising:
a variant classifier, running on one or more processors operating in parallel and coupled to memory, that has
a convolutional neural network trained to process an input sequence and produce an intermediate
convolved feature, wherein
the convolutional neural network has at least two convolution layers and each of the convolution
layers has at least five convolution filters trained over one thousand to millions of gradient update
iterations,
the input sequence has a variant at a target position flanked by at least ten bases on each side and
has a set of metadata features correlated with the variant, and
the metadata features represent mutation characteristics of the variant, read mapping statistics of
the variant, and occurrence frequency of the variant; and
a fully-connected neural network trained to process the intermediate convolved feature and output
classification scores for likelihood that the variant is a somatic variant, a germline variant, or noise, wherein
the fully-connected neural network has at least two fully-connected layers trained over the one
thousand to millions of gradient update iterations.
27. The neural network-implemented system of clause 26, further configured to comprise a metadata correlator
that correlates the variant with the metadata features.
28. The neural network-implemented system of any of clauses 26-27, implementing each of the clauses 1-17.
29. A neural network-implemented method of variant classification, including:
processing an input sequence through a convolutional neural network to produce an intermediate convolved
feature, wherein
the convolutional neural network has at least two convolution layers and each of the convolution
layers has at least five convolution filters trained over one thousand to millions of gradient update
iterations,
the input sequence has a variant at a target position flanked by at least ten bases on each side and
has a set of metadata features correlated with the variant, and
the metadata features represent mutation characteristics of the variant, read mapping statistics of
the variant, and occurrence frequency of the variant; and
processing the intermediate convolved feature through a fully-connected neural network to output
classification scores for likelihood that the variant is a somatic variant, a germline variant, or noise, wherein
the fully-connected neural network has at least two fully-connected layers trained over the one
thousand to millions of gradient update iterations.
30. The neural network-implemented method of clause 29, implementing each of the clauses 22-23.
31. A non-transitory computer readable storage medium impressed with computer program instructions to classify
variants, the instructions, when executed on a processor, implement a method comprising:
processing an input sequence through a convolutional neural network to produce an intermediate convolved
feature, wherein
the convolutional neural network has at least two convolution layers and each of the convolution
layers has at least five convolution filters trained over one thousand to millions of gradient update
iterations,
the input sequence has a variant at a target position flanked by at least ten bases on each side and
has a set of metadata features correlated with the variant, and
the metadata features represent mutation characteristics of the variant, read mapping statistics of
the variant, and occurrence frequency of the variant; and
processing the intermediate convolved feature through a fully-connected neural network to output
classification scores for likelihood that the variant is a somatic variant, a germline variant, or noise, wherein
the fully-connected neural network has at least two fully-connected layers trained over the one
thousand to millions of gradient update iterations.
32. The non-transitory computer readable storage medium of clause 31, implementing the method according to
one or more of the clauses 22, 23, 29-30.
Claims (26)
1. A neural network-implemented system for classifying variants, comprising: a variant classifier, running on one or more processors coupled to memory, that has a convolutional neural network trained to process an input sequence and produce an intermediate convolved feature, wherein the convolutional neural network has at least two convolution layers, and the input sequence has a variant at a target position flanked by a plurality of bases on each side; a fully-connected neural network having at least two fully-connected layers trained to process a feature sequence derived from a combination of the intermediate convolved feature and a set of metadata features correlated with the variant representing mutation characteristics of the variant, read mapping statistics of the variant, and occurrence frequency of the variant; and an output layer that inputs results from the fully-connected neural network and outputs classification scores for likelihood that the variant is a somatic variant, a germline variant, or noise.
2. The neural network-implemented system of claim 1, further comprising a metadata correlator configured to correlate the variant with the metadata features.
3. The neural network-implemented system of claim 2, wherein the metadata correlator is further configured to correlate the variant with an amino acid impact feature that specifies whether the variant is a nonsynonymous variant that changes a codon so as to produce a new codon which codes for a different amino acid.
4. The neural network-implemented system of any of claims 2-3, wherein the metadata correlator is further configured to correlate the variant with a variant type feature that specifies whether the variant is a single-nucleotide polymorphism, an insertion, or a deletion.
5. The neural network-implemented system of any of claims 2-4, wherein the metadata correlator is further configured to correlate the variant with a read mapping statistic feature that specifies quality parameters of read mapping that identified the variant.
6. The neural network-implemented system of any of claims 2-5, wherein the metadata correlator is further configured to correlate the variant with a population frequency feature that specifies allele frequencies of the variant in sequenced populations.
7. The neural network-implemented system of any of claims 2-6, wherein the metadata correlator is further configured to correlate the variant with a sub-population frequency feature that specifies allele frequencies of the variant in ethnic sub-populations stratified from sequenced populations.
8. The neural network-implemented system of any of claims 2-7, wherein the metadata correlator is further configured to correlate the variant with an evolutionary conservation feature that specifies conservativeness of the target position across multiple species.
9. The neural network-implemented system of any of claims 2-8, wherein the metadata correlator is further configured to correlate the variant with a clinical significance feature that specifies the variant’s clinical effect, drug sensitivity, and histocompatibility as determined from clinical tests.
10. The neural network-implemented system of any of claims 2-9, wherein the metadata correlator is further configured to correlate the variant with a functional impact feature that specifies the variant’s impact on functionality of a protein resulting from an amino acid substitution caused by the variant.
11. The neural network-implemented system of any of claims 2-10, wherein the metadata correlator is further configured to correlate the variant with an ethnicity prediction feature that specifies likelihoods identifying ethnic makeup of an individual that provided a tumor sample associated with the variant.
12. The neural network-implemented system of any of claims 2-11, wherein the metadata correlator is further configured to correlate the variant with a tumor frequency feature that specifies frequency of the variant in sequenced cancerous tumors.
13. The neural network-implemented system of any of claims 2-12, wherein the metadata correlator is further configured to correlate the variant with an alternative allele feature that specifies at least one base mutated by the variant at the target position in a reference sequence.
14. The neural network-implemented system of any of claims 2-13, wherein the convolutional neural network and the fully-connected neural network are trained together end-to-end on five hundred thousand training examples from a first dataset of cancer-causing mutations, followed by training on fifty thousand training examples from a second dataset of cancer-causing mutations.
15. The neural network-implemented system of claim 14, wherein the convolutional neural network and the fully-connected neural network are tested together end-to-end on validation data held-out only from the second dataset.
16. The neural network-implemented system of any of claims 1-15, wherein each of the convolution layers and the fully-connected layers is followed by at least one rectified linear unit layer.
17. The neural network-implemented system of any of claims 1-16, wherein each of the convolution layers and the fully-connected layers is followed by at least one batch normalization layer.
18. The neural network-implemented system of any of claims 1-17, wherein the variant is flanked by at least 19 bases on each side.
19. The neural network-implemented system of any of claims 1-18, further comprising a concatenator that derives the feature sequence by concatenating the intermediate convolved feature with the metadata features.
20. The neural network-implemented system of any of claims 1-19, wherein the metadata features are encoded in a one-dimensional array.
21. The neural network-implemented system of any of claims 1-20, wherein the input sequence is encoded in an n-dimensional array, where n≥2.
22. The neural network-implemented system of any of claims 1-21, wherein each of the convolution layers has at least six convolution filters.
23. A neural network-implemented method of variant classification, comprising: processing an input sequence through a convolutional neural network to produce an intermediate convolved feature, wherein the convolutional neural network has at least two convolution layers, and the input sequence has a variant at a target position flanked by a plurality of bases on each side; processing a feature sequence derived from a combination of the intermediate convolved feature and a set of metadata features correlated with the variant through a fully-connected neural network having at least two fully-connected layers, wherein the set of metadata features represents mutation characteristics of the variant, read mapping statistics of the variant, and occurrence frequency of the variant; and processing results from the fully-connected neural network and outputting classification scores for likelihood that the variant is a somatic variant, a germline variant, or noise.
24. The neural network-implemented method of claim 23, implementing each of the claims which ultimately depend from claim 1.
25. A non-transitory computer readable storage medium impressed with computer program instructions to classify variants, when executed on a processor, implement a method comprising: processing an input sequence through a convolutional neural network to produce an intermediate convolved feature, wherein the convolutional neural network has at least two convolution layers, and the input sequence has a variant at a target position flanked by a plurality of bases on each side; processing a feature sequence derived from a combination of the intermediate convolved feature and a set of metadata features correlated with the variant through a fully-connected neural network having at least two fully-connected layers, wherein the set of metadata features represents mutation characteristics of the variant, read mapping statistics of the variant, and occurrence frequency of the variant; and processing results from the fully-connected neural network and outputting classification scores for likelihood that the variant is a somatic variant, a germline variant, or noise.
26. The non-transitory computer readable storage medium of claim 25, implementing each of the claims which ultimately depend from claim 1.
[Drawing sheets 1/21 to 4/21 not reproduced. Recoverable labels: Unclassified Variants, Input Sequences, Metadata Correlator, Metadata Features, Concatenator, Feature Sequences, Variant Classifier, Network(s), Classified Variants; Left Flanking Bases, Target Position, Right Flanking Bases; One Hot Encoding; Mutation Characteristics, Read Mapping Statistics, Occurrence Frequency.]
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US62/656,741 | 2018-04-12 | ||
NL2020861 | 2018-05-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
NZ791625A true NZ791625A (en) | 2022-08-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7143486B2 (en) | Variant Classifier Based on Deep Neural Networks | |
AU2021282469B2 (en) | Deep learning-based variant classifier | |
US20190318806A1 (en) | Variant Classifier Based on Deep Neural Networks | |
US20200251183A1 (en) | Deep Learning-Based Framework for Identifying Sequence Patterns that Cause Sequence-Specific Errors (SSEs) | |
CA3064226C (en) | Deep learning-based framework for identifying sequence patterns that cause sequence-specific errors (sses) | |
NL2021473B1 (en) | DEEP LEARNING-BASED FRAMEWORK FOR IDENTIFYING SEQUENCE PATTERNS THAT CAUSE SEQUENCE-SPECIFIC ERRORS (SSEs) | |
NZ791625A (en) | Variant classifier based on deep neural networks | |
NZ789499A (en) | Deep learning-based variant classifier |