CN113053460A

CN113053460A - Systems and methods for genomic and genetic analysis

Info

Publication number: CN113053460A
Application number: CN201911374963.7A
Authority: CN
Inventors: M·斯坦恩; R·博内特; N·里贝尔
Original assignee: Molecular Health Co ltd
Current assignee: Molecular Health Co ltd; Molecular Health GmbH
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2021-06-29

Abstract

The present invention relates to a method for genomic and/or genetic analysis of a human nucleic acid sample, said method comprising the steps of: providing a set of human reference genomes; testing the gender and/or ancestry of the human nucleic acid sample; selecting one or more population-specific human reference genomes (PHREGs) from the set of human reference genomes based on the results of the gender and/or ancestry test; aligning the human nucleic acid sample with the selected PHREG; and performing variant identification with respect to the selected PHREG. The invention also provides a corresponding computer system and a computer program.

Description

Systems and methods for genomic and genetic analysis

The present invention relates to a system and method for genomic and genetic analysis of human nucleic acid samples.

Background

Next Generation Sequencing (NGS)

Next generation sequencing, also known as high throughput sequencing, is a routine method for high throughput parallel sequencing of nucleic acid fragments that is well known to those skilled in the art. Apparatus and methodology for next generation sequencing are commercially available from a variety of suppliers (see, e.g., www.illumina.com).

Next generation sequencing is a generic term used to describe several different modern sequencing technologies, including:

Illumina (Solexa) sequencing;

Ion Torrent: Proton/PGM sequencing;

SOLiD sequencing.

NGS technology produces high-quality DNA sequences ("reads"). These reads are much shorter than those generated by capillary-based Sanger sequencing (650-1000 bp). Sanger sequencing was developed in 1977 by Frederick Sanger and co-workers and has been the most widely used sequencing method over the last 30 years. Sanger reads are produced in a low-throughput, high-cost format, whereas NGS methods produce much shorter reads (25-500 bases) at moderate cost, but the total number of base pairs sequenced in one NGS run is orders of magnitude higher. These two factors pose many new informatics challenges, including the need to handle millions or even billions of short NGS reads. The reads are typically handled in one of two ways: either they are mapped back to their correct positions in an existing backbone/reference sequence, building a sequence that is similar but not necessarily identical to the backbone sequence ("read mapping"), or they are assembled into a new sequence ("de novo assembly").

The main advantage of mapping reads to a reference genome over de novo assembly is that it greatly simplifies genome inference. De novo assembly requires inferring the entire genomic sequence and creates much ambiguity, whereas reference-based resequencing only requires inferring the differences between the sample and the reference sequence. De novo assembly is also orders of magnitude slower and more memory-intensive than mapping-based assembly.

Read mapping is the first and most basic step in the NGS analysis pipeline, which aims to discover variants of a newly sequenced human genome (or fragments thereof, such as a small panel of targeted genes or an exome) relative to the previously sequenced human reference genome.

Read mapping can also be used to align these millions or billions of short NGS reads in order to determine coverage (the number of reads at a particular position/site), which is a key quality parameter for NGS experiments and all subsequent conclusions.
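By way of non-limiting illustration, coverage as defined above can be computed from the start positions and lengths of aligned reads. The following is a minimal sketch only; the function name `per_base_coverage` and the toy alignments are assumptions made for this example and are not taken from the description.

```python
from collections import Counter


def per_base_coverage(alignments, region_start, region_end):
    """Count how many aligned reads overlap each position in [region_start, region_end)."""
    depth = Counter()
    for start, length in alignments:
        # A read aligned at `start` covers positions start .. start + length - 1.
        for pos in range(max(start, region_start), min(start + length, region_end)):
            depth[pos] += 1
    return [depth[pos] for pos in range(region_start, region_end)]


# Three hypothetical 4-base reads aligned at reference positions 0, 2 and 3.
reads = [(0, 4), (2, 4), (3, 4)]
coverage = per_base_coverage(reads, 0, 8)
print(coverage)
```

Positions with low or zero depth, such as the last position in this toy region, are those at which subsequent conclusions would rest on little or no evidence.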

Human Reference Genome (HRG)

In February 2001, the U.S. federally funded Human Genome Project, together with the private company Celera Genomics, completed a draft of the entire human genome, which was followed by several revisions [ Lander et al, 2001; Venter et al, 2001; Church et al, 2001 ]. The genome assembly has steadily improved over the years and new versions ("builds") have been released, so that the current Genome Reference Consortium (GRC) human genome assembly GRCh38 [ Schneider et al, 2017 ] is arguably the best assembled mammalian genome in existence, with only 875 assembly gaps and fewer than 160 million unspecified "N" nucleotides remaining (as of GRCh38.p8), whereas the first version had about 150,000 gaps [ Editorial (October 2010), "E pluribus unum". Nature Methods. 7(5): 331. doi:10.1038/nmeth0510-331 ].

The HRG is the single most important resource used in human genetics and genomics today. As a universal coordinate system, it is also the space in which annotations (genes, promoters, etc.) and genetic variants are described [ Harrow et al, 2012; ENCODE, 2012; 1000 Genomes Project Consortium, 2012 ]. The HRG also serves as the reference in the read alignment step of next generation sequencing analysis pipelines. Downstream of this mapping, it is used in functional assays and variant identification pipelines [ Li H & Durbin 2009; DePristo et al, 2011 ].

The initial version of the HRG was assembled from a small pool of DNA sequences from 13 anonymous DNA donors of predominantly European ancestry who volunteered in Buffalo, New York [ Snyder et al ]. Donors were recruited by an advertisement in The Buffalo News on Sunday, March 23, 1997. The first ten male and ten female volunteers were invited to make an appointment with the project's genetic counselors and to donate blood from which DNA could be extracted. Because of the way these DNA samples were processed, approximately 80% of the reference genome comes from eight individuals, and one male, labeled RP11, accounts for 66% of the total.

To identify and resolve large assembly problems, for example in complex regions containing large-scale repeats and structural variants, sequence data from new genome mapping techniques and single-haplotype resources derived from new donors have been incorporated into the latest builds. At the time of filing this application, GRCh38 contains sequences from approximately 50 different individuals, see http://www.bio-itworld.com/2013/4/22/church-on-reference-genes-past-present-future.html.

Limitations of HRG

1. HRG is linear

Human DNA is packaged in physically separate units called chromosomes. Humans are diploid organisms that carry two sets of genetic information, one inherited from the mother and the other from the father. Each somatic cell therefore has 22 pairs of autosomes (the two members of each pair coming from one parent each) and 2 sex chromosomes (an X and a Y chromosome in males and two X chromosomes in females). Each chromosome contains one very long linear DNA molecule. The smallest human chromosome consists of approximately 50 million nucleotide pairs; the largest contains approximately 250 million nucleotide pairs.

It follows that the diploid human genome consists of 46 individual DNA molecules of 24 different types. Because human chromosomes exist in pairs and the two DNA molecules of each pair are almost identical, a complete representative human genome requires sequencing only about 3 billion nucleotide pairs (the haploid genome). The human genome is therefore often said to contain 3 billion nucleotide pairs, although most human cells contain 6 billion nucleotide pairs. The haploid human genome consists of the 22 autosomes plus the X and Y chromosomes.

Each chromosome is a single DNA molecule, i.e., a sequence of millions of nucleotide bases. These molecules are linear, so one would expect each chromosome to be represented by a single, contiguous linear nucleic acid sequence. This is not the case, however, for two main reasons: 1) owing to the nature of genomic DNA and the limitations of sequencing methods, some parts of the genome have not yet been sequenced, and 2) certain regions of the genome vary so much between individuals that they cannot be represented by a single contiguous sequence. The HRG is therefore represented by 24 linear DNA sequences consisting of the normal bases (A, C, T or G), with gaps represented by runs of "N" that clearly mark the positions of the gaps within the assembly.
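As an illustration of how gaps marked by runs of "N" can be located in a linear reference sequence, the following minimal sketch scans a sequence string for such runs. The function name `find_gaps` and the toy sequence are assumptions made for this example only.

```python
import re


def find_gaps(sequence):
    """Return (start, end) half-open intervals of every run of 'N' in the sequence."""
    return [(match.start(), match.end()) for match in re.finditer(r"N+", sequence)]


# Hypothetical toy contig with two gaps.
toy_contig = "ACGTNNNNACGNNT"
gaps = find_gaps(toy_contig)
total_unspecified = sum(end - start for start, end in gaps)
print(gaps, total_unspecified)
```

Counting the intervals and summing their lengths yields the two figures commonly quoted for an assembly: the number of gaps and the number of unspecified nucleotides.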

The main goal of the Human Genome Project was to generate a single representative sequence, i.e., a single "scaffold", for each physical chromosome, even if it contained undefined regions. Although the assembly also included a few alternate scaffolds representing allelic variants (the different versions of the DNA bases present at a SNP site are called alleles), these had no formal relationship with the primary scaffold. Once it was recognized that certain highly polymorphic regions of the genome are particularly difficult to represent with a single reference sequence, a formal model was added starting with GRCh37 [ Church et al, 2011 ] to introduce alternate representations of highly variable regions. Sequences in the form of "alternate locus scaffolds", ranging from kilobases to multiple megabases in length, are described relative to the "primary" (haploid) assembly and anchored at positions along the primary scaffold. The assembly at the time of filing this application (GRCh38.p9) comprised 178 such regions and a total of 261 linear sequences [ Paten et al, 2011 ].

Another complicating factor is that the HRG was derived from a pool of DNA from multiple anonymous individuals in the original international genome sequencing project. The resulting HRG is therefore a randomly mixed composite, a mosaic of haplotypes from different DNA sequences, which in some cases may not be correctly represented as a single linear sequence.

2. HRG is evidently not disease-free

Chen & Butte (2011) identified 3,556 disease-susceptibility variants in the HRG, 15 of which were rare variants (< 1% minor allele frequency). Using a curated, high-quality, quantitative human disease-SNP association database, the authors evaluated the likelihood ratio of increased risk of the reference genome for 104 diseases relative to a healthy population and found a high risk for type 1 diabetes, hypertension and other diseases. This clearly demonstrates that the HRG does not represent an average person and is clearly not disease-free. Although the HRG has greatly accelerated the analysis of personal genome sequences, focusing only on variants relative to this reference genome may miss many pathogenic variants, including rare variants [ Chen & Butte, 2011 ].

3. Reference allele bias towards European ancestry

The major problem with using the HRG assembly in prior art NGS analysis pipelines is that it is derived from DNA samples of relatively few anonymous donors who are biased towards European ancestry, and it therefore represents only a small sample of human genetic variation.

Although this reference genome is reasonably valid and universal as a coordinate system for most genomes, there is growing concern that studying all other human genomes through the lens of the HRG excludes a large amount of common human variation and introduces a pervasive reference allele bias [ Petrovski et al, 2016; Paten et al, 2017 ]. Reference allele bias is the tendency to over-report alleles present in the reference genome while under-reporting other alleles whose underlying DNA does not match the reference allele [ Degner et al, 2009; Brandt et al, 2015 ].

This bias arises primarily in the read mapping and alignment steps of sequencing experiments. For a read to map correctly, the genomic sequence from which it originates must both be represented in the reference sequence and be sufficiently similar to the reference sequence to be recognized as the same genomic element. When these conditions are not met, mapping errors can produce systematic blindness to the true sequence [ Paten et al, 2017 ]. Depending on the ancestral history of the reference genome at each site, reference allele bias may also affect certain genetic subpopulations and certain regions of the genome more strongly than others [ Petrovski et al, 2016; Paten et al, 2017 ]. Highly polymorphic regions (e.g., the HLA genes) are particularly susceptible to reference allele bias [ Nielsen et al, 2011 ], especially when a single reference genome is used as the reference for NGS read alignment. In this case, many true variants cannot be identified, because the reads generated from these regions differ too much from the single haploid reference to be aligned, resulting in a loss of information [ Brandt et al, 2015 ].
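The similarity condition described above can be illustrated with a deliberately naive mapper that places a read at the reference position with the fewest mismatches and discards the read if too many mismatches remain. This is a toy sketch under stated assumptions, not how production aligners work; the function name `map_read`, the `max_mismatches` threshold and the sequences are all hypothetical.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped placement, or None if unmappable."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        # Keep the placement only if it is within tolerance and better than any seen so far.
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


reference = "ACGTTGCAAGCT"
similar_read = map_read("TGCTAG", reference)    # one difference from the reference
divergent_read = map_read("GGGGGG", reference)  # too different to be placed
print(similar_read, divergent_read)
```

The divergent read is silently lost, so any variant it carries can never be reported: this is the mechanism by which samples that differ strongly from the reference are systematically under-reported.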

As described above, reference bias is a known problem for variant identification against the HRG in human genome sequencing, and modifying the reference can improve the accuracy and interpretability of variant identification [ Fakhro et al, 2016 ]. One way to alleviate this problem is to incorporate variant prevalence information early in the genome interpretation process by modifying the reference genome such that the variants found in a genome are minor alleles in the population [ Dewey et al, 2011 ]. Modifying the reference may also simplify the analysis workflow, as it can reduce the number of false positives and leaves fewer variants to be interpreted [ Fakhro et al, 2016 ].
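A minimal sketch of such a reference modification, restricted to single-base substitutions, is given below. It assumes that population frequencies of alternate alleles are available per position; the function name `major_allele_reference`, the data layout and the example values are assumptions made for illustration and do not reproduce any published procedure in detail.

```python
def major_allele_reference(reference, alt_frequencies):
    """Return a copy of `reference` in which a site takes the alternate base
    whenever that base is the major allele (frequency > 0.5) in the population."""
    bases = list(reference)
    for position, (alt_base, frequency) in alt_frequencies.items():
        if frequency > 0.5:
            bases[position] = alt_base
    return "".join(bases)


# Hypothetical toy data: 0-based position -> (alternate base, population frequency).
toy_reference = "ACGTACGT"
population_alts = {1: ("T", 0.82), 5: ("G", 0.10)}
modified_reference = major_allele_reference(toy_reference, population_alts)
print(modified_reference)
```

Only the site at which the alternate allele is in the majority is changed, so a sample carrying that common allele no longer produces a variant call there, while the rare alternate allele at the other site remains a variant relative to the modified reference.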

In the future: graph-based reference structures/genome graphs

It is increasingly recognized that a single haploid reference genome is a poor universal reference structure for human genetics and genomics, because it represents only a small fraction of human variation: there are variants and annotations that cannot easily be described with respect to the reference genome [ Horton et al, 2008; Pei et al, 2012 ]. Furthermore, as described above, it introduces a reference allele bias when used as the target for read mapping and interpretation. To alleviate these problems, recent versions of the reference genome assembly (e.g., the human genome assembly GRCh38.p9 at the time of filing the present application) include "alternate loci" sequences ("alts"): additional sequences representing human genomic regions that are considered highly polymorphic, whose ends are anchored to positions within the "primary" (haploid) reference assembly. This structure, which contains multiple partially overlapping sequence paths, can be regarded as a form of mathematical graph: a genome graph [ Novak et al, 2017 ].

Graphs have a long history in biological sequence analysis, where they are often used to compactly represent a set of possible sequences. Typically, the sequences themselves are implicitly encoded as walks in the graph. This makes graphs a natural fit for representing reference collections, which are essentially sets of related sequences [ Paten et al, 2017 ]. The graph contains not only the approximate sequence of a sample but also many of its specific variants.
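The idea that sequences are encoded as walks can be illustrated with a very small acyclic sequence graph in which one site has two alternative bases. The sketch below is an assumption-laden toy: the node labels, the edge layout and the function name `spelled_sequences` are hypothetical and do not correspond to any particular genome graph format.

```python
def spelled_sequences(node_labels, edges, source, sink):
    """Enumerate the sequences spelled by every walk from source to sink
    in an acyclic sequence graph."""

    def walk(node):
        if node == sink:
            yield node_labels[node]
            return
        for successor in edges.get(node, []):
            for suffix in walk(successor):
                yield node_labels[node] + suffix

    return sorted(walk(source))


# Hypothetical graph: a shared prefix, a single-base bubble (T or C), a shared suffix.
node_labels = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
edges = {1: [2, 3], 2: [4], 3: [4]}
sequences = spelled_sequences(node_labels, edges, 1, 4)
print(sequences)
```

Both haplotypes are stored once in a shared structure, so a read carrying either base at the variable site has an exact counterpart in the reference.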

Genome graphs are expected to improve read mapping, variant calling and haplotype determination. Graph-based references are expected to replace linear references in humans and in other applications where a set of individuals has been sequenced [ Novak et al, 2017 ]. Many projects are constructing and using such genome graphs. Genome graphs can now be constructed from collections of common variants, and although the field is still experimental, some tools demonstrate the great potential of graph-based approaches.

Despite these theoretical advantages, the use of genome graphs for variant identification remains relatively new, and many problems remain to be solved. How should duplications and repeats be represented? What is the best way to map to a graph? How should short variants with unclear homology be resolved? How can a graph be used to achieve a more comprehensive taxonomy of variants? These questions open the way for future research.

To be useful in practice, genome graphs must translate the reduction in reference bias that they promise into a measurable improvement in variant identification compared to existing methods. Accordingly, the development of variant identification algorithms for genome graphs is currently an important research frontier.

Qatar genome (QTRG)

Qatar is a small peninsula on the Persian Gulf with a total population of about 300,000 Qataris. Qatar is among the countries with the highest rates of consanguineous marriage in the world, and the rate is still increasing, while the rate of endogamy within Qatar approaches 100%. All these factors, together with large family sizes, are the main causes of the high incidence of indigenous genetic diseases, which place a financial burden on the Qatari budget. These factors have prompted the Qatari government to seek ways to protect its citizens from genetic diseases [ Zayed 2016 ].

In 2013, government officials decided to launch the Qatar Genome Project (QGP) (http://www.gulf-times.com/store/374345/Qatarlaunches-Genome-Project). The aim of this program is to sequence the genome of every Qatari citizen in order to protect Qataris from the high incidence of indigenous genetic diseases by mapping causative/rare variants and establishing a Qatari reference genome as a path to personalized medicine. The ultimate goal of the program is to apply this information in clinical practice and to make this approach a routine component of the Qatari medical system [ Zayed 2016 ]. To realize the prospective clinical applications of the QGP, several serious challenges must be addressed, including achieving higher sensitivity and accuracy of variant identification [ Koboldt 2010 ].

To facilitate the development of precision medicine in the Middle East and North Africa, a population-specific genome tailored to disease studies in the indigenous Arab population of Qatar was constructed by integrating allele frequency data from whole genome sequencing of 1,161 Qataris (representing 0.4% of the population). In total, 20.9 million single nucleotide polymorphisms (SNPs) and 3.1 million insertions and deletions (indels) were observed in Qataris, with on average 1.79% novel variants per genome [ Fakhro et al, 2016 ].

1000 Genomes Project (1kG)

The 1000 Genomes Project was established in 2008 and aims to sequence and catalog human genetic variants (relative to the HRG GRCh37) and haplotypes in at least 1000 human genomes from around the world (hence the name). The current phase 3 analysis of the project comprises 2,504 individuals from 26 populations and defines 5 so-called super populations, each consisting of a union of 4 to 7 populations [ 1000 Genomes Project Consortium et al, 2015 ]. This more detailed haplotype resource helps in understanding genetic variation at both the genomic and geographic levels [ bayer, 2011 ].

Object of the Invention

Recent advances in NGS technology have made DNA and RNA sequencing faster and less expensive, thus revolutionizing the study of genomics and molecular biology. Genome sequencing programs for healthy and diseased populations have detected many functional or disease-associated genomic variants that can provide clues to therapeutic targets or genomic markers for novel clinical applications.

Genetic variant identification is mainly based on aligning raw sequence reads to a reference genome (read mapping). This alignment-based approach has many limitations, including imperfections in the genome assembly [ Meyer, L.R et al, 2013 ], structural variants present in the genomes of normal individuals [ Sudmant et al, 2015 ], sequencing errors in the reads, and interference of single nucleotide polymorphisms (SNPs) with read mapping [ Iqbal, Z. et al, 2012 ].

At the time of filing this application, read mapping against the linear HRG is the only standard method, and it will continue to be the standard in clinical NGS analysis pipelines and in the sequencing of human individuals, because the HRG is reasonably valid and universal as a coordinate system for most genomes. Moreover (unlike the emerging genome inference with genome graphs), many methodologies have been published for successful variant identification using linear reference genomes [ Nielsen et al, 2011 ].

However, as mentioned above, the main problem with the HRG is its bias, which ignores prior information about intraspecies genetic variation. Currently, this problem is often addressed by modifying the reference genome such that the variants identified relative to the modified reference genome are minor alleles in the population.

The success of clinical genomics using NGS technology requires accurate and consistent identification of individual genomic variants. These goals presuppose accurate read mapping (alignment) and subsequent variant identification.

It is an object of the present invention to detect new biomarkers, in particular genetic variants such as single nucleotide variants (SNVs), insertions and deletions (inDels), copy number variants (CNVs) and structural variants (SVs) (e.g., chromosomal translocations, inversions, duplications, and large insertions and deletions), using next generation sequencing techniques in human genome research.

Another objective is to improve the accuracy and confidence of existing NGS-based biomarkers (e.g., for cancer treatment, where the technique is used to analyze DNA of tumor cells and their damage).

Disclosure of Invention

According to a first aspect of the present invention there is provided a method of performing genomic and/or genetic analysis of a human nucleic acid sample comprising the steps of:

a) providing a set of human reference genomes;

b) testing the human nucleic acid sample for gender and/or ancestry;

c) selecting one or more population-specific human reference genomes (PHREGs) from the set of human reference genomes based on the results of the gender and/or ancestry test in step b); and

d) aligning the human nucleic acid sample with the PHREG selected in step c).

Population-specific human reference genomes (PHREGs) are hereinafter understood to be ancestry-specific reference genomes and gender-specific reference genomes. PHREGs greatly reduce reference bias and improve alignment accuracy, and they also improve the accuracy of variant identification if this is performed subsequently. Advantageously, the present invention improves not only the accuracy of the alignment but also the computation speed, the number of correctly aligned reads, and the number of computation steps needed for alignment. Genomic and/or genetic analysis of human nucleic acid samples using PHREGs can also improve read coverage depth, and the benefit of using PHREGs can be assessed by the improvement in the sensitivity of variant identification.

In the context of the present invention, the term "human nucleic acid sample" generally refers to any nucleic acid sample isolated from a human sample. In particular, the human nucleic acid sample may include NGS reads, which are defined in more detail below.

The human nucleic acid sample may generally comprise samples from various standard biochemical, molecular and/or cellular biological procedures suitable for preparing human nucleic acids. Such procedures include aspiration, biopsy, liquid biopsy, cell-free DNA isolation kits, and the like. The human nucleic acid sample may be, or be derived from, a variety of suitable sources, including, but not limited to, bodily fluids, mucosa, tissue extracts or cells, or any combination thereof. The human nucleic acid sample may also be a control sample from a variety of suitable sources. The human nucleic acid sample may comprise, for example, a blood sample, a plasma sample, a urine sample, or a tumor sample, which may include unwanted artifacts caused by the fixation process of the FFPE (formalin-fixed, paraffin-embedded) tissue processing procedure.

In particular, the human nucleic acid sample may comprise DNA, RNA and/or size-fractionated total DNA or RNA. Providing DNA from a sample of interest may include one or more biochemical purification steps, such as centrifugation, lysis and/or stratification, and cell lysis by mechanical or chemical disruption steps, including but not limited to multiple freeze and/or thaw cycles, salt treatment, phenol-chloroform extraction, sodium dodecyl sulfate (SDS) treatment, and proteinase K digestion. Optionally, providing DNA from the target sample may further comprise a step of removing large RNAs (e.g., abundant ribosomal RNA) by precipitation in the presence of polyethylene or salt, or a step of removing interfering sodium dodecyl sulfate (SDS) by precipitation in the presence of salt (preferably potassium chloride solution). Methods for purifying total DNA or RNA from cells and/or tissues are well known to those skilled in the art and include, for example, standard procedures such as extraction using guanidine thiocyanate-acidic phenol-chloroform (e.g., Invitrogen, USA). However, the DNA in the target sample may preferably also be provided without any of the biochemical precipitation and/or purification steps described herein.

In the context of the present invention, the term "nucleic acid" generally refers to any kind of single-or double-stranded oligonucleotide molecule consisting of deoxyribonucleotides or ribonucleotides or both, including genomic DNA, nuclear DNA, somatic DNA, germline DNA, synthetically designed and/or prepared DNA, including, but not limited to, DNA generated in vitro from messenger RNA profiles, preferably in cDNA form. The term "nucleic acid" generally refers to a single-or double-stranded oligonucleotide molecule having the same or similar length, e.g., consisting of the same or similar number of nucleotides.

The human nucleic acid sample may comprise genomic sequences that can be used to assess, analyze, align, index, and/or map specific mutations at the genomic, transcriptional, or post-transcriptional level. Thus, a human nucleic acid sample according to the present invention may refer to and include, but is not limited to, any coding region, non-coding region, exon, intron, chromosomal and/or intrachromosomal region, promoter region, enhancer region, region encoding small and/or long regulatory RNAs, active transcribed region and/or non-transcribed region, transposon, hot spot mutation region, frame shift mutated region, and the like.

The "set of human reference genomes" comprises at least two human reference genomes, preferably a plurality of human reference genomes. The gender and/or ancestry test in step b) makes it possible to select, in step c), the one or more most suitable human reference genomes from the set of human reference genomes. Preferably, the gender and/or ancestry test in step b) classifies the gender and/or ancestry automatically and allows one PHREG to be selected from the set of human reference genomes for the alignment in the subsequent step d), although one or more additional PHREGs may also be selected for subsequent analysis.
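The selection in step c) can be pictured as a lookup in a labeled set of references. The following is a minimal sketch under stated assumptions: the `ReferenceGenome` record, the function `select_phregs`, the reference names and the ancestry labels are all hypothetical and serve only to illustrate selection by test result.

```python
from typing import NamedTuple


class ReferenceGenome(NamedTuple):
    name: str
    sex: str
    ancestry: str


def select_phregs(reference_set, sex=None, ancestry=None):
    """Return the references matching the tested sex and/or ancestry (step c).
    A criterion left as None was not tested and therefore does not filter."""
    return [
        ref for ref in reference_set
        if (sex is None or ref.sex == sex)
        and (ancestry is None or ref.ancestry == ancestry)
    ]


# Hypothetical set of human reference genomes with illustrative labels.
reference_set = [
    ReferenceGenome("ref_eur_female", "female", "EUR"),
    ReferenceGenome("ref_eur_male", "male", "EUR"),
    ReferenceGenome("ref_eas_female", "female", "EAS"),
    ReferenceGenome("ref_eas_male", "male", "EAS"),
]
selected = select_phregs(reference_set, sex="female", ancestry="EAS")
male_only = select_phregs(reference_set, sex="male")
print([ref.name for ref in selected], [ref.name for ref in male_only])
```

When only one of the two tests is performed, several references remain, which corresponds to the case in which more than one PHREG is selected.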

Preferably, the gender and/or ancestry test in step b) is based on a subset of ancestry-specific and/or sex-specific sequence variants extracted from a curated database. Preferably, these sequence variants are single nucleotide polymorphisms (SNPs) and/or single nucleotide variants (SNVs). This subset of sequence variants used for gender and/or ancestry testing is also referred to as Population-dependent Human Ancestry and Sex Patterns (PHASPs). Preferably, the curated database comprises all known sequence variants in all populations. The PHASP dataset is an extract of the curated database; it is much smaller than the PHREG dataset and is the most discriminative subset for classification. PHASPs are generated by computational methods from machine learning (including feature reduction, where the features are genotypes). The results can be compared and tested against the results of a standard classification.

Preferably, the gender and/or ancestry test comprises an initial alignment step to detect the individual sequence variant pattern of the sample, in which the human nucleic acid sample is aligned with a single human reference genome (e.g., GRCh37 or GRCh38). Such a single human reference genome used for testing in step b) is not ancestry-specific or gender-specific. By comparing the sequence variant pattern of the sample with the PHASP dataset, the ancestry and gender of the patient are determined.
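The comparison of a sample's variant pattern with population patterns can be sketched as a simple agreement score. The description states only that the patterns are derived by machine learning, so the nearest-pattern rule below is a simplified stand-in chosen for illustration, not the disclosed classifier; the function name `classify_by_pattern`, the population labels and the marker sites are hypothetical.

```python
def classify_by_pattern(sample_genotypes, population_patterns):
    """Return the population whose marker pattern best agrees with the sample,
    together with the agreement score of every population."""

    def agreement(pattern):
        # Only compare sites that were actually observed in the sample.
        shared = [site for site in pattern if site in sample_genotypes]
        if not shared:
            return 0.0
        matches = sum(sample_genotypes[site] == pattern[site] for site in shared)
        return matches / len(shared)

    scores = {name: agreement(pattern) for name, pattern in population_patterns.items()}
    return max(scores, key=scores.get), scores


# Hypothetical marker patterns keyed by (chromosome, position); not real PHASP data.
patterns = {
    "POP_A": {("chr1", 1000): "A", ("chr2", 2000): "G", ("chr3", 3000): "T"},
    "POP_B": {("chr1", 1000): "C", ("chr2", 2000): "G", ("chr3", 3000): "C"},
}
sample = {("chr1", 1000): "C", ("chr2", 2000): "G", ("chr3", 3000): "C"}
best_population, scores = classify_by_pattern(sample, patterns)
print(best_population, scores)
```

Because only a small set of discriminative sites is compared, such a test is far cheaper than aligning the sample against every candidate reference.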

According to one embodiment, the test may comprise a gender test. According to another embodiment, the test may comprise an ancestry test. According to another embodiment, the test may comprise a gender test and an ancestry test.

In an exemplary embodiment, the set of human reference genomes includes male and female reference genomes. If the gender test in step b) determines that the human nucleic acid sample originates from a male or from a female, then in step c) the corresponding male or female reference genome or genomes are selected as representative PHREGs for the subsequent alignment step d).

Since the sex chromosomes contain homologous sequences, the use of a sex-adjusted reference genome (male with X and Y chromosomes, female without a Y chromosome) prevents read misalignments. The use of a sex-specific reference genome thus reduces false positives and false negatives in subsequent variant identification.
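A sex-adjusted reference can be sketched as a choice of which chromosome sequences are offered to the aligner. The contig names used below ("chr1" to "chr22", "chrX", "chrY") and the function name `contigs_for_sex` are assumptions made for this example.

```python
AUTOSOMES = [f"chr{number}" for number in range(1, 23)]


def contigs_for_sex(sex):
    """Return the chromosome names included in a sex-adjusted reference."""
    if sex == "female":
        # No Y chromosome, so reads cannot be misplaced onto Y-homologous sequence.
        return AUTOSOMES + ["chrX"]
    if sex == "male":
        return AUTOSOMES + ["chrX", "chrY"]
    raise ValueError(f"unsupported sex label: {sex!r}")


female_contigs = contigs_for_sex("female")
male_contigs = contigs_for_sex("male")
print(len(female_contigs), len(male_contigs))
```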

In another exemplary embodiment, the set of human reference genomes comprises a number of ancestry-specific reference genomes. The ancestry test in step b) identifies the best-matching one or more of the ancestry-specific reference genomes. In step c), the closest reference genome or genomes are then selected as the PHREG or PHREGs for the subsequent alignment step d).

Selection of the wrong ancestry can result in a large number of false positive and false negative variant identifications. The use of ancestry-specific reference genomes can effectively increase the number of correctly aligned reads and reduce false positives and false negatives.

Likewise, the combination of gender and ancestry testing is decisive when a set of human reference genomes comprises an ancestry-specific male reference genome and an ancestry-specific female reference genome.

The term "testing" in step b) is to be understood as including at least one genetic and/or genomic test of the human nucleic acid sample. Genetic and/or genomic testing is more reliable than self-reported information. Self-reported and investigator-assigned ancestry often relies on a subjective interpretation of a complex combination of genetic and non-genetic information, including behavior, culture, social norms, skin color, and other influencing factors. Study participants or patients rarely report their race without error. Errors in self-reported ethnicity can arise for a number of reasons: some people may not be fully aware of their true ancestry, or may know only their recent ancestry (or its geographical origin), while others may identify with only one ethnicity even though they have a mixed background [ Mersha & Abebe, 2015 ]. The literature shows that self-reported ancestry and gender are often incorrect [ Ainsworth, 2015; Mersha & Abebe, 2015 ]. Indeed, Ainsworth explains that as many as one in 100 people is affected by a disorder of sex development, resulting in an appearance that is not consistent with their genome.

Advantageously, the method can also use gender and ancestry to detect whether samples have been swapped, as an additional quality check. Mismatches between the self-declared gender and ancestry and those predicted from the sequencing run can reveal, for example, sample mix-ups and other errors in laboratory processing.
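Such a quality check amounts to comparing two sets of labels per sample. The following minimal sketch assumes that declared and predicted labels are available as (gender, ancestry) pairs keyed by a sample identifier; the function name `flag_mismatches`, the identifiers and the labels are hypothetical.

```python
def flag_mismatches(declared, predicted):
    """Return the IDs of samples whose declared (gender, ancestry) labels
    differ from the labels predicted from the sequencing data."""
    return sorted(
        sample_id
        for sample_id, labels in declared.items()
        if sample_id in predicted and predicted[sample_id] != labels
    )


# Hypothetical sample sheet versus sequencing-based predictions.
declared = {"S1": ("female", "EUR"), "S2": ("male", "EAS"), "S3": ("male", "AFR")}
predicted = {"S1": ("female", "EUR"), "S2": ("female", "EUR"), "S3": ("male", "AFR")}
suspects = flag_mismatches(declared, predicted)
print(suspects)
```

A flagged sample is not necessarily swapped, since self-reported information can itself be wrong, but it marks the sample for review before any results are interpreted.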

The term "alignment" generally refers to a computer step in which a sample being sequenced is compared and fitted to a reference sequence. To do this, it is necessary to find for each read length in the generated sequencing data its corresponding portion in the sequence. In other words, alignment or read length mapping is the process by which the most likely source of an observed nucleic acid sequencing read length is determined in a genomic sequence. In typical embodiments, the read length is an NGS read length, but it should be understood that read lengths from other sequencing technologies are also included in the teachings of the present invention.

Aligned reads from a human nucleic acid sample can be displayed, stored, printed, sent over a communications network, or otherwise further processed. In particular, further applications or uses of the aligned human nucleic acid samples may include one or more of the following:

1) Local realignment around insertions and deletions (inDels)

The term "inDel" denotes the insertion or deletion of base pairs in the genome, and typically includes small genetic variants from 1 to 1000 bp in length. Realignment around inDels improves subsequent data analysis, particularly subsequent variant identification.

2) Base Quality Score Recalibration (BQSR)

The term "base quality score" describes the per-base error estimate, which represents the confidence with which the sequencing instrument called the base. The score may be used, for example, to weight evidence in subsequent variant identification. BQSR allows quality scores to be adjusted by taking into account systematic technical errors caused by the physics or chemistry of the sequencing process.

3) Separation of true segregating variants from the machine artifacts common in next generation sequencing technologies, by machine learning.

4) Variant discovery and genotyping, also referred to herein as variant calling, in which all potential variants are discovered.
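Base quality scores are commonly expressed on the Phred scale, Q = -10 log10(p), where p is the estimated probability that the base call is wrong; FASTQ files usually store Q as a single character with an ASCII offset of 33. The text above does not name a scale, so the following sketch assumes this common convention, and the function names are illustrative only.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the estimated error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]
```

Under this convention a score of 30 corresponds to an estimated error rate of one in a thousand, which is the kind of per-base weight a variant caller can use when accumulating evidence.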

Variant discovery may include discovery of SNPs/SNVs, InDel, CNV and SVs (chromosomal translocations, inversions, duplications, large InDel).

5) Evolutionary analysis studies

Evolutionary analysis studies may include tools to measure nucleotide diversity, population divergence, linkage disequilibrium, and the frequency spectrum of mutations from one or more populations. Evolutionary analysis may generally include computational tools for computing evolutionary sequence statistics. The computational tool may be adapted to perform the analysis in a sliding window across a chromosome or scaffold. The computational tool may, for example, generate a phylogenetic tree of the samples.

Such evolutionary analysis may be performed, for example, using the "POPBAM" software, as described in detail at, e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3767577/.

6) Testing wild-type biomarkers

Further, aligned human genome samples can be tested for the presence of wild-type biomarkers, i.e., biomarkers that are not detected during variant identification because they are already contained in the PHREG. Thus, the computational step after alignment may include testing, for each known biomarker, whether it is present in the aligned human genome sample, regardless of the content of the PHREG at that position.

According to one embodiment, the method comprises the additional step of performing variant identification on the human nucleic acid sample aligned to the selected PHREG. Advantageously, the present invention improves the accuracy of variant identification by introducing initial gender and/or ancestry testing to determine the correct PHREG for the subsequent alignment and variant identification steps.

Thus, aligned human nucleic acid samples, more specifically aligned NGS reads from human nucleic acid samples, can also be further processed by one or more so-called variant callers, which are computer modules comprising different variant calling algorithms that can detect variants of any kind (e.g., SNVs, inDels, copy number variations, and structural variants). Subsequent method steps may include variant interpretation. The results of the variant identification and/or variant interpretation may be displayed, stored, printed, transmitted over a communications network, or otherwise further processed. Advantageously, by removing bias from the reference genome used, the method is able to detect biomarkers not previously found (e.g., in cancer or other diseases). In particular, the method according to the invention allows a variety of gene mutations to be distinguished, including but not limited to SNVs, multi-nucleotide variants (MNVs), complex events and large structural variants, in particular hot-spot mutations, frame-shift mutations, non-silent mutations, stop codon mutations, nucleotide insertions, nucleotide deletions, copy number variations, copy number alterations and/or splice-site mutations.
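The wild-type biomarker test can be sketched as a direct lookup of each known biomarker allele in the bases observed in the sample, without consulting the PHREG at all. The data structures and the name `find_present_biomarkers` below are hypothetical; a real pipeline would derive the observed bases from the aligned reads.

```python
from typing import Dict, List, Set, Tuple

Position = Tuple[str, int]  # (chromosome, 1-based position); hypothetical convention


def find_present_biomarkers(
    biomarkers: Dict[str, Tuple[Position, str]],
    sample_bases: Dict[Position, Set[str]],
) -> List[str]:
    """Return the names of biomarkers whose allele is observed in the sample.

    The reference (PHREG) base is deliberately not consulted, so a biomarker
    that is identical to the reference is still reported.
    """
    present = []
    for name, (position, allele) in biomarkers.items():
        if allele in sample_bases.get(position, set()):
            present.append(name)
    return sorted(present)
```

Because the check is independent of the reference, a biomarker allele that a variant caller would silently treat as "no variant" is still surfaced.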

The donor of the human nucleic acid sample may be a patient, i.e., a human having a disease or suspected of having a disease. However, the use of this method should not be construed as being limited to only patients.

Variant recognition or interpretation may comprise genomic sequence analysis indicating the presence or absence of a certain disease. Based on the variant interpretation, patients may be divided into a first group comprising patients not indicated for a certain treatment and a second group comprising patients indicated for a certain treatment. Thus, the present invention may be advantageously used as part of a disease screening procedure to assess the presence or absence of disease in a patient.

Additionally or alternatively, the method may comprise the step of retrieving a disease indication associated with the human nucleic acid sample. The disease indication may be retrieved from, e.g., an electronic health record, or entered manually by the patient or the attending physician via a computer device. Disease indications may be identified from a disease classification database (disease ontology), such as ICD-10, MeSH or MedDRA. For certain categories of indications, there may also be specialized classification databases that offer advantages such as a more accurate classification of indications. In oncology, it may be beneficial to use the ICD-O-3 and/or TNM staging systems.

Based on the results of the variant identification and interpretation, and in view of the patient's disease, the method may involve providing a treatment plan for the patient. In this context, a treatment plan may in particular be a personalized treatment plan for a patient, wherein the personalized treatment plan comprises treatment options tailored to the patient's genetic data, in particular to his/her clinical, molecular and/or genetic condition.

To identify a promising treatment for a patient, the method can include examining whether any variants, e.g., mutations found in the patient (such as in a tumor or in normal control tissue of the patient), are indicative of the patient's outcome under any treatment. The method may further comprise identifying all treatments associated with any of the variants found. The method may include scoring the identified treatments and ranking them by score, to give the patient a prioritized list of treatment options or treatment contraindications.

In the context of the present invention, the term "treatment" includes the administration of a therapeutically effective drug or pharmaceutically active compound in the form of a pharmaceutical composition for preventing, ameliorating or treating the symptoms associated with the indication. The term "treatment" also includes any kind of surgery, radiation therapy and/or chemotherapy or any combination thereof.

For both options, i.e., in the context of screening methods or of personalized treatment plans, the invention may improve the diagnostic capabilities of the physician, e.g., by enabling better treatment decisions owing to the improved accuracy of the alignment and variant identification.

According to one embodiment, the alignment is performed with a PHREG at the majority allele level. The reference sequence is adjusted to a population at the majority allele level using the unique nucleotide codes (A, C, G, T) in the PHREG. The single nucleotide most commonly observed at a particular site in the population is selected. In the case of a tie in allele frequency, the allele present in the base reference sequence (e.g., GRCh37 or GRCh38) can be used.

According to another embodiment, the alignment is performed with a PHREG at the non-rare allele level. The non-rare allele level uses ambiguous nucleotide codes according to the established IUPAC nomenclature [Cornish-Bowden, 1985], e.g., "R" stands for "A" or "G". The non-rare allele level may encode up to two or three, preferably two, alleles that have a high frequency in the population. A high frequency can be defined as greater than or equal to 30%, 20%, 15%, 10%, 5%, 3%, 1% or 0.1%, in particular greater than or equal to 5%. Since each genomic position of the PHREG can integrate more than one variant allele, read alignment can be expected to be more accurate. In one embodiment, only single nucleotide variants (SNVs) are considered at the non-rare allele level. In other embodiments, insertions and deletions (inDels) and other structural variations are also contemplated.

According to one embodiment, variant identification is performed relative to a PHREG at the majority allele level. In some embodiments, the alignment is performed at the non-rare allele level and the variant identification is performed at the majority allele level. Alternatively, variant identification is performed at the non-rare allele level.
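To illustrate why an ambiguity-coded reference can reduce mismatches, the sketch below compares a read against a reference in which each position is an IUPAC code and counts a mismatch only when the read base is not among the bases the code stands for. The code table follows the standard IUPAC nomenclature; the function names are illustrative, and the comparison is ungapped, which is a simplification relative to a real read mapper.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def base_matches(read_base: str, reference_code: str) -> bool:
    """True if the read base is one of the bases encoded by the reference code."""
    return read_base in IUPAC[reference_code]


def count_mismatches(read: str, reference: str) -> int:
    """Count ambiguity-aware mismatches between equally long sequences."""
    if len(read) != len(reference):
        raise ValueError("read and reference window must have the same length")
    return sum(not base_matches(r, c) for r, c in zip(read, reference))
```

For example, reads carrying either "A" or "G" at a position encoded as "R" both align without penalty, whereas a read carrying "T" there still counts as a mismatch.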

According to one embodiment, the human reference genome in step a) is a published human reference genome. In particular, the published human reference genome may comprise internal versions of HRGs, in particular internal versions of GRCh37 and GRCh38. Additionally or alternatively, the published human reference genome may comprise a QTRG. Additionally or alternatively, a published human reference genome can include a genome from the 1000 Genomes (1kG) project. For the 1kG project, the latest version of the VCF files for all chromosomes can be downloaded from the 1kG FTP site ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. If datasets covering more individuals and ethnicities are published (e.g., the 1000 Arab Genome project [Al-Ali, M. et al., 2018], which studies the Emirati population), these can also be used in the methods of the invention.

Additionally or alternatively, the human reference genome in step a) is derived from a published human reference genome. The term "derived from" may specifically encompass error correction and/or adjustment of the human reference genome to a majority allele encoding level or to a non-rare allele level.

Error correction can be performed such that reference nucleotides observed in zero individuals of a given population are replaced by the corresponding majority nucleotides.

In one embodiment, step a) comprises adjusting the human reference genome to a coding level comprising a unique nucleotide code or an ambiguous nucleotide code. In particular, the encoding levels comprising unique nucleotide codes can be used to define PHREG at the level of the majority of alleles. In particular, the encoding levels comprising ambiguous nucleotide codes can be used to define PHREGs at the non-rare allele level.

In one embodiment, single nucleotide variants are considered for the adjustment to the coding level. For each population (or super population), all reported SNVs and their allele frequencies are used. In other embodiments, inDels, CNVs, and/or SVs are also considered.

According to one embodiment, it is proposed to adjust the reference sequence to a population at four different levels, two of which are limited to unique nucleotide codes (A, C, G, T) and two of which utilize ambiguous nucleotide codes according to the IUPAC nomenclature [Cornish-Bowden, 1985], e.g., "R" stands for "A" or "G". These PHREG coding levels are defined as follows:

1. Maximally conservative error correction: reference nucleotides observed in zero individuals of the population are replaced by the corresponding majority nucleotide (e.g., the corresponding 1kG majority nucleotide).

2. Majority allele: the single nucleotide most commonly observed at a given site in the population is selected (in the case of a tie in allele frequency, the allele present in the underlying reference sequence, e.g., GRCh37 or GRCh38, is used).

3. Non-rare alleles: up to two alleles that occur at a high frequency (e.g., ≥ 5%) in the population are encoded, using IUPAC codes where necessary.

4. Full modeling of observed alleles: at each position, all (up to four) alleles that are reported in at least one individual of the population are encoded.

However, the complete representation of the 1kG variants in a level 4 PHREG comes at the cost of a disproportionately large number of genome modifications that introduce ambiguity, which may severely hinder seed finding by the read mapper. Thus, in one embodiment, the alignment is performed with level 3, using an IUPAC ambiguity-aware alignment algorithm. Since the best-performing variant callers currently available cannot process ambiguity codes, subsequent variant identification uses the level 2 PHREG unless a better IUPAC ambiguity-aware algorithm is available.

Thus, advantageously, the method may allow the PHREG to make user-defined level adjustments to population genetic variations based on the target population and downstream analysis.
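The four coding levels can be illustrated for a single genomic site, given the allele counts observed in a population and the base of the underlying reference. This is a sketch under stated assumptions, not the construction procedure of the disclosure: the function names are hypothetical, only SNVs are handled, and when the reference base is not among tied alleles the alphabetically first tied allele is chosen, a rule the text does not specify.

```python
from typing import Dict

# Reverse IUPAC lookup: alphabetically sorted allele set -> code.
IUPAC_CODE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def majority_allele(counts: Dict[str, int], ref_base: str) -> str:
    """Most frequent allele; a tie is resolved in favour of the reference base."""
    top = max(counts.values())
    tied = sorted(b for b, c in counts.items() if c == top)
    # Assumption: if the reference base is not among the tied alleles,
    # fall back to the alphabetically first one.
    return ref_base if ref_base in tied else tied[0]


def encode_site(counts: Dict[str, int], ref_base: str, level: int,
                min_freq: float = 0.05) -> str:
    """Return the PHREG symbol for one site at coding level 1, 2, 3 or 4."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observed alleles at this site")
    observed = sorted(b for b, c in counts.items() if c > 0)
    major = majority_allele(counts, ref_base)
    if level == 1:  # replace only reference bases seen in zero individuals
        return ref_base if ref_base in observed else major
    if level == 2:  # majority allele
        return major
    if level == 3:  # up to two non-rare alleles, IUPAC-coded
        ranked = sorted(observed, key=lambda b: (-counts[b], b != ref_base, b))
        kept = [b for b in ranked if counts[b] / total >= min_freq][:2]
        return IUPAC_CODE["".join(sorted(kept))] if kept else major
    if level == 4:  # every allele observed at least once
        return IUPAC_CODE["".join(observed)]
    raise ValueError("level must be 1, 2, 3 or 4")
```

With counts of A=60, G=38, T=2 and a reference base of C that nobody carries, levels 1 and 2 both yield "A", level 3 yields "R" (A or G), and level 4 yields "D" (A, G or T), which shows how ambiguity grows from level to level.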

According to one embodiment, the human reference genome in step a) is PHREG. Thus, step a) may comprise, for example, downloading the PHREG from a common resource.

As defined above, a PHREG is understood to be an ancestry-specific and/or gender-specific reference genome. In one embodiment, the human reference genomes provided in step a) are already population-specific in that they contain metadata indicative of their ancestry and/or gender. For example, at the time of filing, the current phase 3 analysis of the 1kG project contained 2,504 individuals from 26 populations and 5 so-called super populations, each of which is a union of 4 to 7 populations. The 26 populations of phase 3 of the 1kG project and their associated 5 super populations (AFR, African; AMR, Ad Mixed American; EAS, East Asian; EUR, European; SAS, South Asian) can be found at http://www.internationalgenome.org/faq/which-populations-are-part-your-study.

In one embodiment, an optimized population-specific human reference genome is constructed, using data from the 1kG project, for each of the 31 (super) populations and for a further super population that contains all other populations.

When the human reference genome provided in step a) is a PHREG, public metadata of the PHREG may also be provided (e.g., by downloading it from a public resource). The metadata may provide quality control for the method. Quality control may be considered successful if the metadata is consistent with the gender and ancestry classification data. If it is not, the software may generate a warning or alarm to be displayed to the user; additionally or alternatively, the software may, for example, stop the program before the alignment step.
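The metadata-based quality control can be sketched as a comparison between the PHREG metadata and the classification results, producing either a warning or a hard stop. The field names "sex" and "ancestry", the exception class and the function name are assumptions made for illustration.

```python
import warnings
from typing import Dict


class MetadataMismatchError(RuntimeError):
    """Raised to stop the pipeline before alignment (hypothetical)."""


def check_phreg_metadata(metadata: Dict[str, str],
                         classification: Dict[str, str],
                         stop_on_mismatch: bool = False) -> bool:
    """Return True if the PHREG metadata agrees with the classified sample.

    On disagreement, either warn and return False or raise, depending on
    stop_on_mismatch.
    """
    mismatched = [
        field for field in ("sex", "ancestry")
        if field in metadata and metadata[field] != classification.get(field)
    ]
    if not mismatched:
        return True
    message = "PHREG metadata disagrees with sample for: " + ", ".join(mismatched)
    if stop_on_mismatch:
        raise MetadataMismatchError(message)
    warnings.warn(message)
    return False
```

Fields absent from the metadata are skipped, so a reference that is only ancestry-specific is not flagged for lacking a sex annotation.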

According to one embodiment, the gender testing comprises at least one of the following steps: testing at least one position in a sex-specific gene on the X chromosome and/or the Y chromosome; using differences in the alignment of the human genome sample on the X chromosome and/or the Y chromosome; cytogenetic testing; FISH analysis; CGH analysis; or any other assay that can directly or indirectly determine the sex of a human nucleic acid sample.

Thus, the sex test may also be the result or a by-product of a FISH analysis (fluorescence in situ hybridization analysis) [Gall J.G., 1969] of a human nucleic acid sample. Likewise, the sex test may be the result or a by-product of a CGH analysis (comparative genomic hybridization) of a human nucleic acid sample [Kallioniemi A. et al., 1992].

Gender testing can effectively and reliably distinguish between male or female human nucleic acid samples.
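One way to use alignment differences on the sex chromosomes is to compare the number of reads mapped to the Y chromosome with the number mapped to the X chromosome. The sketch below is illustrative only: the function name is hypothetical and the threshold of 0.05 is an uncalibrated placeholder that would have to be tuned for a given assay and reference.

```python
def infer_sex_from_read_counts(x_reads: int, y_reads: int,
                               y_fraction_threshold: float = 0.05) -> str:
    """Classify a sample from reads mapped to chromosomes X and Y.

    The threshold is a hypothetical placeholder, not a validated cut-off.
    """
    total = x_reads + y_reads
    if total == 0:
        raise ValueError("no reads mapped to the sex chromosomes")
    y_fraction = y_reads / total
    return "male" if y_fraction >= y_fraction_threshold else "female"
```

A small but non-zero Y fraction can occur in female samples because of mis-mapped reads, which is why a threshold is used rather than a test for zero.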

Since individuals of one ancestry or race share many SNPs that distinguish them from other ancestries or races, it is possible to determine the PHREG that best fits read alignment and variant identification by examining a series of ancestry-informative SNPs. Thus, PHREGs can be selected from a set of human reference genomes based on the results of the ancestry test.

Different experimental settings can be used in upstream steps of the genomic analysis pipeline to determine the ancestry of an individual before performing the alignment, in order to determine the best-matching PHREG and avoid errors:

1) The ancestry determination may be based on a machine learning algorithm applied to the human nucleic acid sample, or on another classification scheme that utilizes ancestry-specific variants. In particular, ancestry testing methods may be based on machine learning that makes use of the genotypes at exonic positions, e.g., at over 100, 500, 1000, 2000 or preferably over 5000 exonic positions.

2) The relevant genotypes can be determined from NGS data or by another experimental approach, such as SNP arrays, as is done in forensic studies [Fondevila et al., 2013]. Here, the use of non-coding SNPs can help determine ethnicity.

3) The same non-coding SNPs (plus flanking regions) as tested in the forensic SNP array of option 2) can be added to an existing targeted NGS panel to determine the relevant genotypes.

In particular, the ancestry test may comprise using the genotype of at least one genomic position.

In a particular embodiment, the ancestry test comprises testing at least one gene selected from the sequence listing in the appendix. To generate accurate results, 249 genes from the sequence listing in the appendix are shown.

Additionally or alternatively, ancestry testing may include testing SNP arrays and/or SNP chips and/or testing markers from Sanger sequencing or mass spectrometry, or any other experimental method for determining the relevant genotypes.
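As an example of a classification scheme that uses ancestry-specific variants, the sketch below assigns a sample to the population under which its genotypes are most likely, assuming Hardy-Weinberg proportions and independent sites. This simple likelihood classifier is swapped in for illustration and is not the machine learning classifier of the disclosure; the population labels and allele frequencies used with it would be hypothetical inputs.

```python
import math
from typing import Dict, List


def log_likelihood(genotypes: List[int], alt_freqs: List[float]) -> float:
    """Log-likelihood of genotypes (0, 1 or 2 alternate alleles per site)."""
    total = 0.0
    for g, p in zip(genotypes, alt_freqs):
        p = min(max(p, 1e-3), 1 - 1e-3)  # clamp to avoid log(0)
        total += (math.log(math.comb(2, g))
                  + g * math.log(p)
                  + (2 - g) * math.log(1 - p))
    return total


def classify_ancestry(genotypes: List[int],
                      population_freqs: Dict[str, List[float]]) -> str:
    """Return the population label with the highest genotype likelihood."""
    return max(
        population_freqs,
        key=lambda pop: log_likelihood(genotypes, population_freqs[pop]),
    )
```

With more sites, the log-likelihoods of the candidate populations separate more clearly, which is consistent with the preference for a large number of positions stated above.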

In a particular embodiment, the pedigree test comprises testing at least one gene selected from the group consisting of: ABL2, ATP1A3, CIC, CYP2C8, CYP2C9, EPHA3, EPHA7, ERBB3, ERG, ETV 3, F3, FAS, HFE, IL11 3, IL 23, ITGB3, KIF 3, KIT, KLK3, LRP 3, MDM 3, NAT 3, NTRK 3, PDGFB, PIK3R 3, PLA2G3, PLAU, PRKCB, RICTOR, SLC7A 3, STAT3, VCAM 3, VDR, VEGFB, ACVRL 3, AXL, CA 3, CACR, CASP 3, ENG, EPHB 3, ERBB3, ESR 3, HPS 3, HSP 3690, 36K 3, GARPEST 3, EPRCS 3, EPRCH 3, EPTC 3, EPTC 3, EPTC 3, EPTC 3636363672, 3, EPTC 3, EPTC 3, ROCK2, SLC6A2, TET2, TGM2, TH, ABCB1, CD22, CD40, CD44, CDH20, CYP11B2, ERCC5, GPR124, IL7R, ITGB3, ITGB5, NCL, NOD2, NR4A1, PGR, PLCG1, PPP2R1A, PRAME, PTCH2, RET, SETD2, XPC, ASXL1, EPHB4, PLA2G 4, SYK, TET 4, EP300, FLT4, ITGA 4, LOCSF 4, PDGFRB, PIK 34, SSTR 4, TEC, APC, ATRCE, CROP, BBP, CYP2D 4, EML4, MMP 4, PARP 4, SPE CSF, FRA 4, TRPC 4, TR.

In a more particular embodiment, the ancestry test comprises testing at least one genomic coordinate selected from the genomic coordinates listed in appendix 1. Appendix 1 lists the GRCh37-based genomic coordinates of the features of the ancestry classifier. The first 3 columns are formatted according to the BED file standard (https://www.ensembl.org/info/website/upload/bed.html) and correspond (from left to right) to the chromosome, the 0-based start of the feature, and the 0-based end of the feature (i.e., the first position after the end of the feature). Column 4 shows the base relevant to the classifier at this position, and column 5 shows the corresponding gene name.

The gene names are those approved by the HUGO Gene Nomenclature Committee (HGNC, https://www.genenames.org/). HGNC is responsible for approving unique symbols and names for human gene loci, including protein-coding genes, ncRNA genes, and pseudogenes, to allow unambiguous scientific communication. The gene names used in this application were retrieved in August 2013.

In another particular embodiment, the ancestry test comprises at least one SNP listed in appendix 2 [Fondevila et al., 2013]. Appendix 2 indicates the chromosome on which the SNP is located (left column), its exact chromosomal position (middle column) and the corresponding rs number (right column). The rs number is the accession number that NCBI (National Center for Biotechnology Information) assigns in its SNP database (dbSNP, https://www.ncbi.nlm.nih.gov/projects/SNP/), and it is widely used across genomic databases to refer to a specific SNP. When researchers identify a SNP, they submit a report to the dbSNP database (the report includes the sequence immediately flanking the SNP). If overlapping reports are submitted, they are merged into a single non-redundant reference SNP cluster, which is assigned a unique rs ID. For more information, see http://www.ncbi.nlm.nih.gov/sites/books/NBK44406/.

Such ancestry tests may include genetic and/or genomic tests to differentiate between ancestry categories. Following the 1kG project, such ancestry categories may be defined as AFR, AMR, EAS, EUR and SAS. However, the method is not limited to 1kG project data; e.g., if more comprehensive datasets with more individuals or ethnicities become available, these could be used for the same purpose.
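The BED convention described above (0-based start, end pointing one past the feature) is a frequent source of off-by-one errors, so a short sketch of reading one five-column line and converting it to 1-based positions may help. The whitespace-separated layout, the class name and the function names are assumptions; the actual appendix format may differ.

```python
from typing import List, NamedTuple


class ClassifierFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive (first position after the feature)
    base: str
    gene: str


def parse_feature_line(line: str) -> ClassifierFeature:
    """Parse one whitespace-separated line: chrom, start, end, base, gene."""
    chrom, start, end, base, gene = line.split()[:5]
    return ClassifierFeature(chrom, int(start), int(end), base, gene)


def one_based_positions(feature: ClassifierFeature) -> List[int]:
    """Return the 1-based genomic positions covered by the feature."""
    return list(range(feature.start + 1, feature.end + 1))
```

A single-base feature with start 100 and end 101 therefore covers exactly one position, 101 in 1-based coordinates, which is the numbering used by VCF files.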

According to one embodiment, the human nucleic acid sample comprises a set of reads from a next generation sequencing procedure, and the aligning comprises the step of mapping the reads to the selected PHREGs. Additionally or alternatively, the human nucleic acid sample comprises a set of reads from a targeted sequencing procedure, e.g., from panel sequencing.

Advantageously, the method can be seamlessly integrated into any existing NGS analysis workflow that is based on read mapping to HRGs.

Aligning the human nucleic acid sample with the selected PHREG by mapping reads to it can presuppose a ready-made sequencing library, prepared by random fragmentation of a DNA or cDNA sample followed by 5' and 3' adaptor ligation. In some embodiments, the fragmentation and ligation reactions are combined into one step, followed by PCR amplification of the adaptor-ligated fragments.

Aligning the human nucleic acid sample with the selected PHREG by mapping reads to it also presupposes that the set of DNA fragments has been sequenced, resulting in reads of between about 28 base pairs (bp) and 1000 bp [Goodwin S. et al., 2016]. The set includes a sufficient number of reads to achieve a predetermined coverage of the target region (typically between a few-fold and several thousand-fold) suitable for answering the experimental question asked.

In one embodiment, the next generation sequencing procedure involves whole exome sequencing. In another embodiment, the next generation sequencing procedure involves whole genome sequencing. The term "whole exome sequencing" generally refers to a technique for sequencing all of the protein-coding regions of the genes in a genome (referred to as the exome). It involves first selecting only the subset of the DNA that encodes proteins (the exons) and then sequencing that DNA using any high throughput DNA sequencing technique. Humans have about 180,000 exons, accounting for about 1.5% of the human genome, or about 30 million base pairs. In particular, exome sequencing can be performed by next generation sequencing. "Whole genome sequencing" (WGS; also known as full genome sequencing, complete genome sequencing or entire genome sequencing) is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time. This requires sequencing all chromosomal DNA of the organism as well as the DNA contained in mitochondria.
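The relation between the number of reads, the read length and the coverage of the target region can be made concrete with the usual first-order estimate, coverage = (number of reads x read length) / target size. The sketch below assumes that every read maps on target and ignores duplicates, so it gives an upper bound; the function names are illustrative.

```python
import math


def mean_coverage(n_reads: int, read_length: int, target_size: int) -> float:
    """Idealized mean coverage: total sequenced bases divided by target size."""
    return n_reads * read_length / target_size


def reads_needed(coverage: float, read_length: int, target_size: int) -> int:
    """Minimum number of reads for a desired mean coverage (idealized)."""
    return math.ceil(coverage * target_size / read_length)
```

For a target of 30 million base pairs and 150 bp reads, one million reads give a mean coverage of 5-fold, and 100-fold coverage needs 20 million reads under these idealized assumptions.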

According to another aspect of the present invention, a computer system for performing genetic analysis on a human nucleic acid sample comprises:

a) a first module comprising computer instructions for providing a set of human reference genomes;

b) a second module for testing the gender and/or ancestry of a human nucleic acid sample;

c) a third module comprising computer instructions for selecting one or more population-specific human reference genomes (PHREGs) from the set of human reference genomes based on results of the gender and/or ancestry tests; and

d) a fourth module comprising computer instructions for aligning the human nucleic acid sample with the determined PHREG.

In particular, the computer system may be adapted or may be configured to perform any of the methods disclosed above. Thus, it is to be understood that features which have been described in the context of a method also apply to the computer system, and vice versa, features which will be described in the context of a computer system also apply to the above method.

The modules may be software modules, software routines or software subroutines stored on a machine-readable storage medium, such as a permanent or rewritable storage device, or a storage medium assigned to a computer device, such as a removable storage medium (e.g., a CD-ROM, DVD, blu-ray disc, memory stick, or memory card). Additionally or alternatively, the module may be provided on a computer device, such as a server or cloud server, for downloading, for example over a data network such as the internet or over a communication line such as a telephone line or wireless line.

Any modules disclosed herein may be functional units that are not necessarily physically separate from each other. For example, if multiple functions are implemented in a software package, several elements of the module may be implemented as a single physical element.

The computer modules disclosed herein are not necessarily part of an integrated system, but may be distributed over several separate systems interacting with each other over a communications network.

According to one embodiment, the second module for testing the gender and/or ancestry of a human nucleic acid sample is a computer module comprising computer instructions. Additionally or alternatively, the second module may comprise a wet laboratory experiment, such as an experiment that performs a FISH test. The results of the FISH test can be analyzed electronically or visually to determine the sex of the sample.

According to another aspect of the invention, a computer program comprises the following instructions: when executed by a computer, causes the computer to perform steps a), b), c) and d) according to any of the methods described above.

According to another aspect of the invention, a computer readable storage medium contains instructions for: when executed by a computer, cause the computer to perform steps a), b), c) and d) according to any of the methods described above.

As already discussed above, the method of the invention is particularly suitable for identifying changes in the genome of a patient which are indicative for a given disease or which are specifically indicative for the sensitivity of the patient to a given therapy.

As used herein, the term "disease" includes any disease characterized by one or more genomic changes. This includes cancer, autoimmune diseases, cardiovascular diseases and any genetic diseases. The patient may be of any species, but is preferably a mammal, more preferably a human.

Depending on the individual disease and therapy, one skilled in the art will be able to select an individual mode of treatment that is beneficial to the patient.

Therefore, in a further aspect of the invention, the invention relates to a method of diagnosing a disease in a patient, comprising:

obtaining identifying information of disease indications of the patient;

obtaining a nucleic acid sample from the patient; and

performing genomic and/or genetic analysis on the nucleic acid sample using the method of genomic and/or genetic analysis of a human nucleic acid sample as described herein, to determine the disease state of the patient.

The identifying information for the disease indication can be retrieved by any method known in the art, such as user input, an electronic health record or electronic medical record or a patient database containing medical records.

In the context of this aspect of the invention, the term "disease state" refers in one embodiment to the identification of a condition in the patient. In another embodiment, the term means that the disease is diagnosed more precisely, i.e., that individual subtypes of the disease are identified.

The present invention also relates to a method of treating a disease in a patient comprising:

obtaining identifying information of disease indications of the patient;

obtaining a nucleic acid sample from the patient; and

performing genomic and/or genetic analysis on the nucleic acid sample based on the methods described herein, thereby determining the disease state of the patient, and treating the patient.

In another aspect of the invention, the invention relates to a method of determining whether a patient has a susceptibility to treatment with a drug, comprising:

obtaining identifying information of disease indications of the patient;

obtaining a nucleic acid sample from the patient;

performing genomic and/or genetic analysis on a human nucleic acid sample based on the methods for performing genomic and/or genetic analysis on said nucleic acid sample as described herein;

retrieving possible therapies for the disease indication of the patient;

performing variant identification and interpretation; and

classifying the retrieved possible therapies based on the variant interpretation, wherein each therapy is classified as indicated or contraindicated for the patient.

In this way it is possible to decide for the patient which therapy is available or which therapy is advantageous. For example, it may be possible to determine whether a patient is susceptible to a given therapy or whether only acceptable side effects of a given therapy can be expected.

The identifying information for the disease indication can again be retrieved by any method known in the art, such as user input, an electronic health record or electronic medical record or a patient database containing medical records.

Possible therapies for an indication of a patient's disease can be retrieved by any method known in the art, for example from a database.

The present invention also relates to a method of treating a patient comprising:

obtaining identifying information indicative of a disease of the patient,

obtaining a nucleic acid sample from said patient,

performing genomic and/or genetic analysis on a human nucleic acid sample based on the methods for performing genomic and/or genetic analysis on said nucleic acid sample as described herein,

retrieving possible therapies for the disease indication of the patient,

performing variant identification and interpretation,

classifying the retrieved possible therapies based on the variant interpretation, wherein each therapy is classified as indicated or not indicated for the patient,

selecting an indicated therapy, and

treating the patient according to the selected therapy.

Again, possible therapies for the disease indication of the patient may be retrieved by any method known in the art, for example from a database.

Drawings

The foregoing and other objects, aspects, features, and advantages of the disclosure will become more apparent and better understood by reference to the following description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a flow chart depicting a method for genomic and/or genetic analysis of a human nucleic acid sample according to the present invention;

FIG. 2 is a flow chart depicting a method of data analysis in accordance with the present invention;

FIG. 3 shows a read mapping step;

FIG. 4 is a flow chart depicting a method for genomic and/or genetic analysis of a human nucleic acid sample according to the present invention;

FIG. 5 is a diagram of the per-class distributions of the features selected for the gender classifier, calculated on MH panel data; and

FIG. 6 is a box plot of memory usage and runtime for the two AnSextry classifiers (gender classifier and ancestry classifier) and EthSEQ.

Detailed description of the drawings

Figure 1 shows a general workflow for genomic and/or genetic analysis of human nucleic acid samples, including the process of extracting human nucleic acid samples, preparing sequencing libraries, sequencing and subsequent data analysis. In the context of the present invention, the processes of extracting a human nucleic acid sample, preparing a sequencing library and sequencing may involve well-known standard processes and will therefore not be explained in more detail. Figure 2 shows the data analysis portion of the present invention in more detail.

Fig. 2 shows the data analysis steps of fig. 1, including a gender and ancestry testing step (first step), an alignment (or read mapping) step, a variant identification step, and an annotation step. The input to the read mapping computation module is raw sequence data (e.g., in the form of a FASTQ file). The output of the read mapping computation module is, for example, a BAM file, which serves as the input to the variant identification computation module. The output of the variant identification computation module is, for example, a VCF file. A subsequent annotation computation module may annotate the data from the VCF file and output it in a desired format (e.g., PDF, HTML, or another format). The file formats are exemplary only; for instance, a SAM or CRAM file may be used instead of a BAM file. The data analysis pipeline of fig. 2 may also include a computer module that converts input or output files from one file format to another.
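As a rough illustration of how the modules chain together, the sketch below models each stage by its input and output file format and checks that every output format matches the next stage's input. The `Stage` type and `is_consistent` helper are hypothetical names, and the formats are only the exemplary ones mentioned above.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    input_format: str
    output_format: str


# Exemplary formats only; SAM or CRAM could replace BAM, HTML could replace PDF.
PIPELINE = [
    Stage("read mapping", "FASTQ", "BAM"),
    Stage("variant identification", "BAM", "VCF"),
    Stage("annotation", "VCF", "PDF"),
]


def is_consistent(stages):
    """True if each stage consumes the format produced by the previous stage."""
    return all(a.output_format == b.input_format
               for a, b in zip(stages, stages[1:]))
```

Where two adjacent stages disagree on the format, a conversion module of the kind mentioned above would be placed between them.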

Fig. 2 also compares the prior art with the present invention. The prior art method (denoted by "A" in fig. 2) does not provide gender and ancestry testing, so its alignment and variant identification are performed relative to the standard HRG. The method according to the present invention (denoted by "B" in fig. 2) provides gender and ancestry testing, allowing the selection of one or more PHREGs. The subsequent alignment and variant identification are performed with respect to the selected PHREG.

FIG. 3 is a diagram of an exemplary read mapping step. In this example, the NGS read carries the ancestry-specific SNP "A". The ancestry-specific SNP "A" is located in the vicinity of a previously undiscovered biomarker variant "G". "Vicinity" here means a distance of up to one read length.

When the NGS read is aligned to the standard HRG, the alignment yields two mismatches: the ancestry-specific SNP and the biomarker variant. When the same read is aligned to the corresponding PHREG, it yields only one mismatch, the biomarker variant, because the PHREG has been modified at ancestry-specific positions to be consistent with the ancestry-specific SNPs.
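The mismatch counts can be reproduced with a toy example. The sequences below are invented for illustration; only the positions of the ancestry-specific SNP ("A") and the biomarker variant ("G") mirror the situation described for FIG. 3.

```python
def count_mismatches(read, reference_window):
    """Count positions at which the read differs from an equally long reference window."""
    if len(read) != len(reference_window):
        raise ValueError("read and reference window must have the same length")
    return sum(1 for r, g in zip(read, reference_window) if r != g)


# Made-up 12 bp window of the standard reference (HRG).
hrg_window = "ACGTCCGTTAGC"
# The PHREG carries the ancestry-specific allele A at index 3 (T -> A).
phreg_window = "ACGACCGTTAGC"
# The read carries the ancestry-specific A at index 3 and the biomarker variant G at index 8.
read = "ACGACCGTGAGC"

mismatches_hrg = count_mismatches(read, hrg_window)      # ancestry SNP + biomarker variant
mismatches_phreg = count_mismatches(read, phreg_window)  # biomarker variant only
```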

The alignment algorithm uses a scoring system that applies penalties for each mismatch and/or gap between a sequencing read and the selected reference genome. A read is then aligned to its best-scoring position, or not aligned at all if the total score is too low or too many genomic positions have the same alignment score. Because of the mismatch penalties, a read is less likely to align to the HRG than to the PHREG, especially if a variant that is yet to be discovered lies within the read. The read is therefore discarded or, worse, aligned at the wrong position of the HRG.
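The effect of the scoring system can be sketched with a deliberately simplified, ungapped aligner. The match reward, mismatch penalty and minimum score below are arbitrary illustrative values, not those of any particular aligner, and gaps are ignored entirely.

```python
def best_alignment(read, reference, match=1, mismatch=-4, min_score=0):
    """Slide the read along the reference and return (position, score) of the
    unique best-scoring placement, or None if the best score is below
    min_score or is shared by several positions."""
    scored = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        score = sum(match if r == g else mismatch for r, g in zip(read, window))
        scored.append((score, pos))
    if not scored:
        return None
    best_score = max(score for score, _ in scored)
    best_positions = [pos for score, pos in scored if score == best_score]
    if best_score < min_score or len(best_positions) > 1:
        return None  # read is left unaligned
    return best_positions[0], best_score


# Made-up sequences: a 12 bp locus with 6 bp flanks on either side.
hrg = "TTGACA" + "ACGTCCGTTAGC" + "GGATCC"
phreg = "TTGACA" + "ACGACCGTTAGC" + "GGATCC"  # ancestry-specific allele inserted
read = "ACGACCGTGAGC"                          # ancestry SNP plus biomarker variant

hit_hrg = best_alignment(read, hrg, min_score=5)      # two mismatches: score 10 - 8 = 2
hit_phreg = best_alignment(read, phreg, min_score=5)  # one mismatch: score 11 - 4 = 7
```

With these settings the read falls below the threshold against the HRG and is dropped, while it aligns at its true position against the PHREG.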

The PHREG therefore has the effect of rescuing reads derived from regions with ancestry-specific variants, especially reads that carry additional variants (e.g., pathogenic variants). This enables the detection of previously undiscovered biomarkers.

Fig. 4 shows a flow diagram depicting a method for genomic and/or genetic analysis of a human nucleic acid sample according to the invention.

In a first step, a set of human reference genomes is provided to a system comprising a processing unit. To this end, a first computer module of the system may download the reference genomes from a remote facility (e.g., an internet database). The processing unit may be any programmable computer device that comprises at least one processor with an internal memory, such as a RAM (random access memory), that allows instructions to be stored and executed. The processing unit may access non-volatile storage that may store data sets and computer files (e.g., human reference genomes and the clinical data and genetic profiles of patients). The system may have access to a communication network, such as a LAN or the internet.

In a second step, the human reference genome is adjusted to an encoding level by a computer module of the system, preferably operated by a user of the system. The encoding level may use unique nucleotide codes or ambiguous nucleotide codes. In some embodiments, four different levels are proposed for adjusting the human reference genome to a population, two of which are limited to unique nucleotide codes (A, C, G, T) and two of which utilize ambiguous nucleotide codes according to IUPAC nomenclature: maximum conservative error correction, the majority allele level, the non-rare allele level, and complete modeling of all observed alleles.
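To make the difference between unique and ambiguous codes concrete, the following sketch encodes a single position from population allele frequencies. The IUPAC table is standard; the mapping of the helper functions to the named levels, the 5% rarity cut-off and the example frequencies are assumptions made for illustration. Maximum conservative error correction is not sketched because its rule is not spelled out here.

```python
# Standard IUPAC nucleotide codes, keyed by the set of alleles they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def majority_allele(freqs):
    """Unique code: the most frequent allele in the population."""
    return max(freqs, key=freqs.get)


def ambiguity_code(freqs, min_freq=0.0):
    """Ambiguous code covering every allele with frequency above min_freq."""
    alleles = frozenset(a for a, f in freqs.items() if f > min_freq)
    return IUPAC[alleles]


# Made-up allele frequencies at one position in one population.
freqs = {"A": 0.70, "G": 0.29, "T": 0.01}

majority = majority_allele(freqs)                # most frequent allele
non_rare = ambiguity_code(freqs, min_freq=0.05)  # alleles above an assumed 5% cut-off
all_observed = ambiguity_code(freqs)             # every observed allele
```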

In a third step, a sample of the patient's human nucleic acids is provided. To this end, another computer module of the system may download raw sequence data (e.g., in the form of FASTQ files) from a sequencing laboratory that sequences a sample of interest on a remote platform. In an alternative embodiment, sequencing is performed locally and result transfer is performed internally. In the context of the third step, the system may also receive other clinical data of the patient from other input sources (e.g., information about the disease from which the patient is suffering, information about the current treatment, etc.). The clinical patient data may, for example, be received directly from the patient, may, for example, be typed in on a keyboard or may be derived from free text typed in on a keyboard, or may be received from a multiple selection element in the GUI. Clinical patient data may also be retrieved from Electronic Health Records (EHRs) or Electronic Medical Records (EMRs), possibly on a chip card or in a database retrievable over a communication network.

In a fourth step, a sex and/or ancestry test is performed on the human nucleic acid sample. Again, the test may be performed locally, or another computer module of the system may obtain the test results from an external service provider over a communications network. The sex and/or ancestry test may be performed by a second computing module or by a separate wet-lab experiment.

In a fifth step, one or more PHREGs are selected from a set of human reference genomes based on the results of the gender and/or ancestry tests. The selection may be performed by a third computing module.
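A minimal sketch of the selection step, assuming that the test results arrive as ancestry probabilities plus a sex label and that the reference genomes are indexed by (ancestry, sex). The catalogue layout, the file names and the fallback to a standard reference are hypothetical.

```python
def select_phreg(ancestry_probs, sex, catalogue, fallback):
    """Pick the reference for the most probable ancestry and the given sex.

    Falls back to a default reference when no matching PHREG is available
    (an assumption of this sketch)."""
    top_ancestry = max(ancestry_probs, key=ancestry_probs.get)
    return catalogue.get((top_ancestry, sex), fallback)


# Hypothetical catalogue of population-specific reference genomes.
catalogue = {
    ("AFR", "F"): "phreg_AFR_female.fa",
    ("AFR", "M"): "phreg_AFR_male.fa",
    ("EUR", "F"): "phreg_EUR_female.fa",
    ("EUR", "M"): "phreg_EUR_male.fa",
}

chosen = select_phreg({"AFR": 0.91, "EUR": 0.06, "AMR": 0.03}, "F",
                      catalogue, fallback="standard_HRG.fa")
```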

In the sixth step, the human nucleic acid sample is aligned to the selected PHREG. The alignment includes mapping the set of reads from the NGS run to the selected PHREG. The alignment may be performed by a fourth computing module, and the output file may be a BAM file.

In the seventh step, variant identification is performed on the aligned human nucleic acid sample relative to the selected PHREG. Before variant identification, a computing module of the system may again adjust the human reference genome to an encoding level, preferably operated by the system user. This encoding level may use unique nucleotide codes or ambiguous nucleotide codes and is different from the encoding level used in the alignment step. The most appropriate up-to-date algorithm can be used to identify variants. Variant identification may be performed by a fifth computing module, and the output may be a file in variant call format (VCF file) describing the variants of the sample with respect to the PHREG.

In the eighth step, variant interpretation is performed. Thus, the system may comprise a further post-processing calculation module adapted to perform the analysis of the identified variants. In one embodiment, the post-processing module can analyze a set of genes and/or variants that suggest the presence or absence of a disease in the patient. Additionally or alternatively, the post-processing module may determine a set of therapies for the disease of the patient, and in view of other clinical patient data, may determine the most suitable individualized treatment regimen for the patient based on the genetic data of the patient, in particular based on the identified genetic variants. In another embodiment, the post-processing module can perform statistical analysis and determine mutation history, nucleotide substitution rates, and hot spot mutations from the identified variants.

The found variants can also be used as input to a classifier that can predict therapeutic efficacy or therapeutic safety or for diagnostic and/or therapeutic purposes.

In the ninth step, diagnostic and/or therapeutic implications may be generated and provided. To this end, the system may comprise an output interface functionally connected to any of the third, fourth, fifth and post-processing modules, such that their results may be output. The output interface may be coupled to any display device or printer so that information computed by the processing unit may be presented. Furthermore, there may be links to communication systems for intranets and/or the internet, for example to enable e-mails to be sent and received via the output interface.

FIG. 5 shows the per-class distributions of the features selected for the gender classifier, calculated from MH panel data (F: female; M: male; the colored vertical bars represent the class medians). Top: ratio of reads aligned to chrX and chrY; middle: fraction of majority allele frequencies in the bin 0.8-1.0 for 500 common SNP positions on chrX; bottom: percentage of properly paired reads on chrY. FIG. 5 should be viewed in the context of the embodiment described below.

Fig. 6 shows box plots of memory usage (top, in GB) and run time (bottom, in minutes) for the two AnSextry classifiers and EthSEQ on 300 TCGA whole-exome samples. Fig. 6 should be viewed in the context of the embodiments described below.

Example 1

AnSextry is a machine learning-based tool that derives the gender and ancestry of a sample from read alignments of whole-exome sequencing data. Self-declaration of both characteristics is known to be unreliable, and AnSextry's predictions are useful both for detecting sample swaps and for the unbiased interpretation of genomic variants, especially in large cohort studies. Benchmarking of AnSextry on more than 1,300 samples showed high accuracy and low runtime and memory requirements.

1. Introduction

Over the past decade, with the dramatic drop in cost, next generation sequencing of large cohorts has become commonplace [Cancer Genome Atlas Research Network et al., 2013; Rand et al., 2016]. Whole-exome approaches play an important role in large-scale research, especially in precision medicine and the comprehensive characterization of diseases. In this setting, reliable knowledge of the ancestry and gender of a sample provides a number of benefits. First, it can serve as a simple quality control that helps identify sample swaps caused by the complex protocols and manual operations involved in sample processing. Second, ancestry is critical for interpreting variant effects, for circumventing the strong European bias present in most genomic studies and in the human reference genome, and for improving clinical care for people of diverse ancestry [Petrovski et al., 2016; Mersha et al., 2015; Fakhro et al., 2016]. Finally, ancestry is widely used in genetic association studies to avoid spurious associations with disease due to population stratification [Wu et al., 2011]. Self-declared gender and ancestry are often unreliable [Mersha et al., 2015; Ainsworth, 2015], which is a further reason to use genomic information to identify gender and ancestry.

AnSextry, a machine learning method based on logistic regression, was developed to characterize gender and ancestry rapidly and reliably from whole-exome sequencing paired-end read alignments. The algorithm relies on standard file formats and is easily integrated into existing next generation sequencing analysis workflows. It provides ready-made models that require only a single BAM file as input. Furthermore, the low memory requirement of AnSextry makes it possible to run it on a desktop computer. Benchmarks show that AnSextry has advantages in accuracy, runtime, and memory usage over EthSEQ [Romanel et al., 2017], which is the only known alternative whole-exome ancestry inference tool based on BAM files. To date, there are no other published methods for gender prediction.

2. Methods

2.1 Algorithm

Based on paired-end read alignments from whole-exome sequencing, two classifiers were built that infer the most likely gender and ancestry of an individual. The tool exploits differences in read mapping and in genotypes between individuals for prediction.

The gender and ancestry classifiers are based on logistic regression and were implemented using Python and the scikit-learn machine learning library. The features of both classifiers are derived from a single input BAM file. Paired-end reads were aligned with BWA 0.7.15 using default alignment settings, without post-processing steps such as local realignment or duplicate handling. GRCh37 was used as the reference genome, without the non-chromosomal supercontigs but with the pseudo-autosomal regions PAR1 and PAR2 masked to prevent alignment distortions on the X and Y chromosomes. In the context of the present invention, the term "supercontig" is generally understood as an ordered set of contigs, a contig being a contiguous length of genomic sequence whose base order is known with high confidence.

The gender classifier is a two-class logistic regression with L1 regularization that returns the probability of each class. 5-fold cross-validation was used to determine the appropriate regularization strength. The model with the largest area under the precision-recall curve on the training data was selected as the best model and evaluated on the test data set.

The ancestry classifier is based on multinomial logistic regression with L2 regularization and principal component analysis (PCA), and returns the probability of each of the five continental ancestries defined in the 1000 Genomes Project: African (AFR), Admixed American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS) [The 1000 Genomes Project Consortium et al., 2015]. 5-fold cross-validation was used to determine the appropriate parameters. The model with the highest F1 score on the training data was selected as the best model and evaluated on the test data set.

2.2 Features
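How a multinomial logistic regression turns feature values into one probability per ancestry class can be sketched in plain Python. The weights, biases and the two-dimensional feature vector below are invented for illustration; the PCA projection, the regularized training and the cross-validation described above are left out.

```python
import math

CLASSES = ["AFR", "AMR", "EAS", "EUR", "SAS"]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, weights, biases):
    """Linear score per class followed by softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return dict(zip(CLASSES, softmax(scores)))


# Made-up model parameters: one weight row and one bias per class.
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [1.0, 1.0], [0.5, -0.5]]
biases = [0.0, 0.0, 0.0, 0.0, 0.0]

probs = predict_proba([1.5, -0.5], weights, biases)
predicted = max(probs, key=probs.get)
```

In practice a library such as scikit-learn, which the text names as the implementation basis, would supply both the fitting and the probability output.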

The features of the gender classifier are based on alignment differences between the X and Y chromosomes (FIG. 5). The ratio of reads aligned to chrX and chrY and the percentage of properly paired reads on chrY were used. In addition, the majority allele frequencies at 500 common exonic SNP positions on chrX were computed. To avoid population bias, SNPs that are common across the different major ancestries were selected.

For the ancestry classifier, the genotypes of all autosomal SNPs that lie in the intersection of the target regions of the commonly used Agilent All Exon kits (V5, V6, V6+COSMIC) and the Molecular Health Pan-Cancer gene panel (target size 2.9 Mbp) were determined from the 1000 Genomes data described in section 2.3. Feature selection was used to retain informative SNPs that vary between ancestries, which yielded 10,000 genotypes at 5,040 genomic positions that serve as the features of this classifier. The corresponding BED file can be found in appendix 1 and can be used to determine the overlap with any targeted sequencing kit.

2.3 Data
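The three gender features can be sketched from summary counts. The function assumes that the read counts and per-SNP majority allele frequencies have already been extracted from the BAM file; the function name, argument names and example numbers are hypothetical. The 0.8-1.0 bin follows the description of FIG. 5.

```python
def gender_features(reads_chrx, reads_chry, properly_paired_chry, major_allele_freqs):
    """Compute the three features used to separate male and female samples."""
    # Ratio of reads aligned to chrX versus chrY.
    ratio = reads_chrx / reads_chry if reads_chry else float("inf")
    # Percentage of chrY reads that are properly paired.
    pct_paired = 100.0 * properly_paired_chry / reads_chry if reads_chry else 0.0
    # Fraction of chrX SNP positions whose majority allele frequency lies in 0.8-1.0.
    in_bin = sum(1 for f in major_allele_freqs if 0.8 <= f <= 1.0)
    fraction = in_bin / len(major_allele_freqs) if major_allele_freqs else 0.0
    return {
        "chrx_chry_read_ratio": ratio,
        "pct_properly_paired_chry": pct_paired,
        "fraction_major_af_0.8_1.0": fraction,
    }


# Made-up counts for one sample.
features = gender_features(
    reads_chrx=1000,
    reads_chry=50,
    properly_paired_chry=40,
    major_allele_freqs=[1.0, 0.95, 0.55, 0.5],
)
```

The third feature rests on the fact that a sample with a single X chromosome shows only one allele at chrX SNPs outside the pseudo-autosomal regions, so its majority allele frequencies cluster near 1.0, whereas heterozygous positions in a sample with two X chromosomes fall near 0.5.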
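Determining the overlap between the feature positions and a sequencing kit amounts to an interval intersection on BED records. The sketch below uses tiny made-up BED contents and relies only on the BED convention of 0-based, half-open intervals; the helper names are hypothetical.

```python
def parse_bed(text):
    """Parse BED text into (chrom, start, end) tuples, skipping header lines."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def covered_by_targets(snps, targets):
    """Return the SNP intervals that overlap at least one target interval."""
    covered = []
    for chrom, start, end in snps:
        # Half-open intervals overlap when each starts before the other ends.
        if any(c == chrom and t_start < end and start < t_end
               for c, t_start, t_end in targets):
            covered.append((chrom, start, end))
    return covered


# Made-up feature positions and kit target regions.
snp_bed = "chr1\t100\t101\nchr1\t500\t501\nchr2\t50\t51\n"
kit_bed = "chr1\t90\t200\nchr2\t10\t50\n"

snps = parse_bed(snp_bed)
covered = covered_by_targets(snps, parse_bed(kit_bed))
fraction_covered = len(covered) / len(snps)
```

The linear scan is adequate for a few thousand positions; for larger inputs a sorted sweep or an interval tree would be the usual choice.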

To obtain data from different ancestries, genotype data of 1,735 individuals from phase 3 of the 1000 Genomes Project were used to train and test the ancestry classifier. The continental ancestries (AFR, AMR, EAS, EUR, SAS) were used as classes, and the individuals of each ancestry were randomly selected to obtain balanced classes. 694 individuals formed the test set.

Primary whole-exome control data of 300 individuals with self-reported ethnicity and gender were downloaded from TCGA (cancer.nih.gov) as a test set, covering three cancer types (urothelial bladder cancer, lung adenocarcinoma/squamous cell lung cancer, gastric adenocarcinoma). All samples were sequenced with the Agilent SureSelect Human All Exon 50Mb kit. Records were randomly selected to obtain balanced classes according to the TCGA categories: 150 male and 150 female subjects, and 100 white, 100 Asian and 100 black or African American subjects.

Targeted sequencing data of 988 cancer patients with self-reported gender, sequenced with the Molecular Health Pan-Cancer gene panel, were used to train and test the gender classifier. Individuals were randomly selected to achieve balanced male and female classes. 396 samples were randomly selected as the test set for the gender classifier. The 300 TCGA cases described above were used as an additional test set.

3. Results

3.1 Gender classifier

The gender classifier was trained on 592 data sets sequenced with the Molecular Health Pan-Cancer gene panel. Paired-end reads were aligned and features were calculated as described in the Methods section. After tuning the model by cross-validation, its performance was evaluated on two test data sets: 396 individuals sequenced with the above gene panel, and 300 TCGA individuals with whole-exome data.
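Balanced random selection followed by a hold-out split can be sketched as below. The 40% test fraction matches the ratio of 694 test individuals to 1,735 in total; the per-class handling, the seed, the helper name and the toy sample identifiers are assumptions of this sketch.

```python
import random
from collections import defaultdict


def balanced_split(labels, n_per_class, test_fraction, seed=0):
    """Draw n_per_class samples per class at random, then hold out a test share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)
    train, test = [], []
    for label in sorted(by_class):
        # rng.sample raises ValueError if a class has fewer than n_per_class members.
        chosen = rng.sample(sorted(by_class[label]), n_per_class)
        n_test = round(n_per_class * test_fraction)
        test.extend(chosen[:n_test])
        train.extend(chosen[n_test:])
    return train, test


# Toy data: 20 hypothetical samples per ancestry class.
classes = ["AFR", "AMR", "EAS", "EUR", "SAS"]
labels = {f"{c}_{i}": c for c in classes for i in range(20)}

train_ids, test_ids = balanced_split(labels, n_per_class=10, test_fraction=0.4)
```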

On the panel test data, the gender classifier achieved an accuracy of 97.5%, with 10 individuals (5 males and 5 females) misclassified (see Table 1). The misclassifications are not associated with lower coverage.

Self-declared gender	Predicted gender	Predicted probability of true gender [%]	Average coverage (x)
F	M	39.0	2579
F	M	15.3	2099
F	M	33.8	1656
F	M	17.1	1787
F	M	0.8	1797
M	F	0.0	6016
M	F	28.0	2401
M	F	0.3	3603
M	F	0.0	1606
M	F	0.0	1705

Table 1: Details of the MH panel-sequenced individuals whose predicted gender did not match the self-declared gender. The median coverage of all samples used was 2116x. The average coverage of each misclassified sample was close to or above this median, indicating that the mispredictions appear to be unrelated to low coverage.

Since the prevalence of disorders of sex development in the general population is about 1% [Ainsworth, 2015], some of the apparent misclassifications may in fact be correct classifications of samples with an incorrect self-declared gender.

On the TCGA test data, the accuracy of the gender classifier reached 100%. All 300 individuals were correctly classified. In terms of runtime and memory usage, gender prediction took less than one minute in all cases, with an average memory usage of 526MB (fig. 6).

3.2 Ancestry classifier

The ancestry classifier was trained on 1,041 data sets from the 1000 Genomes Project. As described in section 2.2, genotypes were used as features. The best-performing model was evaluated on two test data sets: the remaining 694 individuals from the 1000 Genomes Project, and 300 TCGA individuals with whole-exome sequencing data.

On the 1000 Genomes test data, the ancestry classifier reached a mean accuracy of 99%, performing best for the Asian ancestries (100% accuracy for both South Asian and East Asian), followed by the African and Admixed American ancestries (99% accuracy) and the European ancestry (98%). Only 5 of the 694 individuals were misclassified.

On the 300 TCGA exome test data sets, the accuracy of the ancestry classifier was slightly lower at 96.33%, with a total of 11 individuals misclassified. These results were compared with EthSEQ [Romanel et al., 2017], currently the only other known ancestry prediction method that provides a suitable pre-computed model and can be applied directly to a single whole-exome BAM file. The results of the two methods are highly consistent, with EthSEQ achieving a slightly lower accuracy (94%) and a total of 18 individuals misclassified. However, EthSEQ requires considerably more runtime and memory: the ancestry classifier had an average runtime of 28 seconds and an average memory usage of 540 MB, whereas EthSEQ required on average 4.8 minutes and 14.7 GB, even with multithreading (4 cores) (fig. 6).

An important observation is that the two algorithms also agree closely on the misclassified data sets: 10 of the 11 individuals whose AnSextry ancestry prediction did not match the ethnicity provided by TCGA were also classified differently from TCGA by EthSEQ, and for 8 of these 10 both methods predicted the same ancestry. Since the TCGA ethnicity information is based on self-declaration, this suggests that at least some of these individuals may be mislabeled in TCGA. Of the 10 concordant cases, 6 were predicted to be AFR or AMR, consistent with the observation of Mersha et al. that self-declaration errors are most prevalent among African Americans and Hispanics. Table 2 shows the misclassified individuals.
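The reported accuracies follow directly from the misclassification counts, as the short check below shows. It uses only figures stated in the text; the helper name is hypothetical.

```python
def accuracy_percent(n_total, n_misclassified):
    """Share of correctly classified samples, in percent."""
    return 100.0 * (n_total - n_misclassified) / n_total


gender_panel = accuracy_percent(396, 10)    # gender classifier, panel test set
ancestry_1000g = accuracy_percent(694, 5)   # ancestry classifier, 1000 Genomes test set
ancestry_tcga = accuracy_percent(300, 11)   # ancestry classifier, TCGA test set
ethseq_tcga = accuracy_percent(300, 18)     # EthSEQ, TCGA test set

# Average runtime of EthSEQ (4.8 minutes) relative to the ancestry classifier (28 seconds).
runtime_ratio = (4.8 * 60) / 28
```

Rounded, these give 97.5%, 99.28%, 96.33% and 94.0%, and a runtime ratio of roughly 10.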

Table 2: Details of the TCGA individuals whose predicted ancestry (by AnSextry, EthSEQ, or both) does not match the TCGA self-declared ethnicity. The TCGA ethnicity classes include "black or African American", "white" and "Asian". White rows are samples for which neither the AnSextry nor the EthSEQ prediction matched the TCGA ethnicity. Light grey rows are samples for which only the EthSEQ prediction does not match TCGA, and dark grey rows are samples for which only the AnSextry prediction does not match TCGA. When a locus does not have sufficient coverage, its genotype is inferred from the reference for the AnSextry prediction. The median coverage of all samples was 91x; most mispredicted samples had coverage at or above this median, so the mispredictions appear to be unrelated to low coverage. Likewise, the median number of inferred genotypes for the AnSextry classification across all samples was 390, close to the median for the samples mispredicted by AnSextry (393). The number of inferred genotypes ranged from 227 (minimum) to 690 (maximum) across all 300 TCGA samples, indicating that 10-15% inferred genotypes do not appear to have a negative impact on the AnSextry prediction.

Interestingly, the only individual misclassified by AnSextry but not by EthSEQ, classified as white by TCGA but predicted as AMR by the ancestry classifier, was in fact predicted to be of mixed ancestry, with probabilities of 54.7% AMR and 45.1% EUR.
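Because the classifier returns a probability for every class, a near-tie between the two leading classes can be surfaced instead of reporting a single label. The sketch below assumes a hypothetical margin of 0.2 for calling a prediction "possibly admixed"; the AMR and EUR probabilities are those quoted above, while the split of the remaining 0.2% across the other classes is invented.

```python
def describe_prediction(probs, admixture_margin=0.2):
    """Return the top class, or the top two classes if their probabilities are close."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    (top, p_top), (second, p_second) = ranked[0], ranked[1]
    if p_top - p_second < admixture_margin:
        return f"{top}/{second} (possibly admixed)"
    return top


# AMR and EUR as reported; the remaining classes are made-up filler values.
probs = {"AMR": 0.547, "EUR": 0.451, "AFR": 0.001, "EAS": 0.001, "SAS": 0.0}
call = describe_prediction(probs)
```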

4. Conclusion

AnSextry is a novel method that reliably and easily determines the gender and ancestry of an individual from aligned paired-end reads from whole-exome sequencing or, where the target size allows, from targeted sequencing experiments. The tool provides two Python-based logistic regression classifiers, and its ancestry prediction represents an alternative to the predominant PCA-based approaches in population genetics. AnSextry provides ready-made reference models and requires minimal user input. It is fast, accurate and easy to use.

Disclaimer

In this document, the terms "ancestry-specific"/"ethnicity-specific"/"population-specific" are used interchangeably, as different authors use different terms for the same purpose.

References

1.Lander,E.S.et al.Initial sequencing and analysis of the human genome.Nature 409:860–921(2001).[PMID:11237011]

2.Church,D.M.et al.Modernizing reference genome assemblies.PLoS Biol.9:e1001091(2011).[PMID:21750661]

3.Harrow,J.et al.GENCODE:the reference human genome annotation for The ENCODE Project.Genome Res.22:1760-1774(2012).[PMID:22955987]

4.ENCODE Project Consortium.An integrated encyclopedia of DNA elements in the human genome.Nature 489:57-74(2012).[PMID:22955616]

5.1000 Genomes Project Consortium et al.A global reference for human genetic variation.Nature 526:68-74(2015).[PMID:26432245]

6.Li H&Durbin R.Fast and accurate short read alignment with Burrows-Wheeler transform.Bioinformatics 25:1754-1760(2009).[PMID:19451168]

7.DePristo,M.A.et al.A framework for variation discovery and genotyping using next-generation DNA sequencing data.Nat.Genet.43:491-498(2011).[PMID:21478889]

8.Horton,R.et al.Variation analysis and gene annotation of eight MHC haplotypes:the MHC Haplotype Project.Immunogenetics 60:1-18(2008).[PMID:18193213]

9.Pei,B.et al.The GENCODE pseudogene resource.Genome Biol.13:R51(2012).[PMID:22951037]

10.Degner,J.F.et al.Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data.Bioinformatics 25:3207-3212(2009).[PMID:19808877]

11.Brandt,D.Y.C.et al.Mapping Bias Overestimates Reference Allele Frequencies at the HLA Genes in the 1000 Genomes Project Phase I Data.G3 5:931-941(2015).[PMID:25787242]

12.Novak A.; Hickey G.; Garrison E.; Blum S.; Connelly A.; Dilthey A.; Eizenga J.; Elmohamed M.; Guthrie S.; Kahles A.; Keenan S.; Kelleher J.; Kural D.; Li H.; Lin M.; Miga K.; Ouyang N.; Rakocevic G.; Smuga-Otto M.; Zaranek A.; Durbin R.; McVean G.; Haussler D. (https://www.biorxiv.org/content/biorxiv/early/2017/01/18/101378.full.pdf)

13.Paten B,Novak AM,Eizenga JM,Garrison E.Genome graphs and the evolution of genome inference.Genome Res.5:665-676(2017)[PMID:28360232]

14.Snyder M.,et al.Personal genome sequencing:current approaches and challenges.Genes Dev.5,423-431(2010)[PMID:20194435]

15.Young,A.L.et al.A new strategy for genome assembly using short sequence reads and reduced representation libraries.Genome Res 2:249-256(2010)[PMID:20123915]

16.Flicek,P&Birney,E.Sense from sequence reads:methods for alignment and assembly.Nat Methods.6:S6-S12(2009)[PMID:19844229]

17.Chen R.&Butte A.J.The reference human genome demonstrates high risk of type 1 diabetes and other disorders.Pac Symp Biocomput.2011:231-242(2011)[PMID:21121051]

18.International Human Genome Sequencing Consortium.2001.Initial sequencing and analysis of the human genome.Nature 409:860-921(2001)[PMID:11237011]

19.International Human Genome Sequencing Consortium.2004.Finishing the euchromatic sequence of the human genome.Nature 431:931-945(2004)[PMID:15496913]

20.Schneider V.A.et al.Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly.Genome Res.5:849-864.(2017)[PMID:28396521]

21.Editorial(October 2010)."E pluribus unum".Nature Methods.5:331.doi:10.1038/nmeth0510-331.(2010)[PMID:20440876]

22.Nielsen R.,Paul J.S.,Albrechtsen A.,Song Y.S.Genotype and SNP calling from next-generation sequencing data.Nat.Rev.Genet.12:443-45.(2011)[PMID:21587300]

23.Fakhro,K.A.,Staudt M.R.,Ramstetter M.D.,Robay A.,Malek J.A.,Badii R.,et al.The Qatar genome:a population-specific tool for precision medicine in the Middle East.Hum.Genome Var.3:16016 Human Genome Variation(2016)3,16016 doi:10.1038/hgv.2016.16；published online 30 June 2016(2016)[PMID:27408750]

24.Zayed H.The Qatar genome project:translation of whole-genome sequencing into clinical practice.Int J Clin Pract.10:832-834 doi:10.1111/ijcp.12871.Epub 2016 Sep (2016)[PMID:27586018]

25.Sanger F.,et al.DNA sequencing with chain-terminating inhibitors.Proc Natl Acad Sci USA.74:5463-5467.(1977)[PMID:271968]

26.Venter,J.C.et al.The Sequence of the Human Genome.Science 291:1304-1351.(2001)[PMID:11181995]

27.Petrovski S&Goldstein D.B.Unequal representation of genetic variation across ancestry groups creates healthcare inequality in the application of precision medicine.Genome Biol 2016；17:157.doi:10.1186/s13059-016-1016-y.(2016)[PMID:27418169]

28.Koboldt DC,Ding L,Mardis ER,Wilson RK.Challenges of sequencing human genomes.Brief Bioinform.11:484-498.(2010)[PMID:20519329]

29.Dewey F.E.,Chen R.,Cordero S.P.,Ormond K.E.,Caleshu C.,Karczewski K.J.et al.Phased whole-genome genetic risk in a family quartet using a major allele reference sequence.PLoS Genet.2011 Sep；7(9):e1002280.doi:10.1371/journal.pgen.1002280.Epub 2011 Sep 15.(2011)[PMID:21935354]

30.Cao H,Wu H,Luo R,Huang S,Sun Y,Tong X et al.De novo assembly of a haplotype-resolved human genome.Nat Biotechnol 33:617-622.(2015)[PMID:26006006]

31.Wu L.,Yavas G.,Hong H.,et al.Direct comparison of performance of single nucleotide variant calling in human genome with alignment-based and assembly-based approaches.Sci Rep.2017 Sep 8；7(1):10963.doi:10.1038/s41598-017-10826-9.(2017)[PMID:28887485]

32.Meyer,L.R.et al.The UCSC Genome Browser database:extensions and updates 2013.Nucleic acids research 41:D64-D69(2013).[PMID:23155063]

33.Sudmant,P.H.et al.An integrated map of structural variation in 2,504 human genomes.Nature 526:75-81(2015).[PMID:26432246]

34.Iqbal,Z.,Caccamo,M.,Turner,I.,Flicek,P.&McVean,G.De novo assembly and genotyping of variants using colored de Bruijn graphs.Nature genetics 44:226-232(2012).[PMID:22231483]

35.Cornish-Bowden A.(1985).Nomenclature for incompletely specified bases in nucleic acid sequences:recommendations 1984.Nucleic Acids Res.13:3021-3030.(1985)[PMID:2582368]

36.Mersha T.B.,&Abebe T.Self-reported race/ethnicity in the age of genomic research:its potential impact on understanding health disparities.Hum.Genomics 9:1.(2015)[PMID:25563503]

37.Baye T.M.Inter-chromosomal variation in the pattern of human population genetic structure.Hum Genomics 5:220-240.(2011)[PMID:21712187]

38.Fondevila M.et al.Revision of the SNPforID 34-plex forensic ancestry test:Assay enhancements,standard reference sample genotypes and extended population studies.Forensic Sci Int Genet 7:63-74.(2013)[PMID:22749789]

39.Ainsworth C.Sex redefined.Nature 518:288-291.doi:10.1038/518288a.(2015)[PMID:25693544]

40.Gall J.G.,Pardue M.L.Formation and detection of RNA-DNA hybrid molecules in cytological preparations.Proc.Natl.Acad.Sci.USA 63(2):378-383(1969).[PMID:4895535]

41.Kallioniemi A.et al.Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors.Science 258(5083):818-821(1992).

42.Goodwin S.,McPherson JD,McCombie WR.Coming of age:ten years of next-generation sequencing technologies.Nat.Rev.Genet.17(6):333-351(2016).

43.Al-Ali M,Osman W.,Tay G.K.,AlSafar H.S.A 1000 Arab genome project to study the Emirati population.J.Hum.Genet.63(4):533-536(2018).[PMID:29410509]

44.Cancer Genome Atlas Research Network et al.The Cancer Genome Atlas Pan-Cancer analysis project.Nat.Genet.,45(10),1113-1120(2013).

45.Rand,K.A.et al.Whole-exome sequencing of over 4100 men of African ancestry and prostate cancer risk.Hum.Mol.Genet.,25(2),371-381(2016).

46.Wu,C.et al.A Comparison of Association Methods Correcting for Population Stratification in Case-Control Studies.Ann.Hum.Genet.,75(3),418-427(2011).

47.Romanel,A.et al.EthSEQ:ethnicity annotation from whole exome sequencing data.Bioinformatics,33(15),2402-2404(2017).

Example 2

Using PHREG as a reference for NGS read mapping increases coverage of clinically relevant biomarkers

We used 741 germline samples from GDC/TCGA [1] that had been sequenced by whole-exome capture Illumina sequencing. The data set contained 155 samples of African (AFR) ancestry, 33 samples of Latino/Admixed American (AMR) ancestry, 179 samples of East Asian (EAS) ancestry, 354 samples of European (EUR) ancestry, and 20 samples of South Asian (SAS) ancestry. Each sample was aligned with Novoalign 4.00.01 [2] to three references: the standard Human Reference Genome (HRG) GRCh37 [3], the PHREG assigned by our ancestry classifier, and the HSA PHREG. The HSA PHREG was generated by aggregating the variant data of all gnomAD v2.1 ancestries [4] (including AFR, AMR, EAS, EUR and SAS).
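
The text does not specify how the variant data of the individual ancestries were aggregated for the HSA PHREG. As a minimal sketch, assuming that allele counts are simply pooled across ancestries and the most frequent pooled allele is taken at each site, the step could look as follows. The function name `pooled_major_alleles` and the toy counts are hypothetical and are not taken from gnomAD.

```python
from collections import defaultdict


def pooled_major_alleles(counts):
    """Return the pooled major allele per site.

    counts: {(contig, position): {ancestry: {allele: allele_count}}}
    Assumption: pooling is a plain sum of allele counts over ancestries.
    """
    major = {}
    for site, by_ancestry in counts.items():
        totals = defaultdict(int)
        for allele_counts in by_ancestry.values():
            for allele, n in allele_counts.items():
                totals[allele] += n
        # Highest pooled count wins; ties resolve to the alphabetically first allele.
        major[site] = max(sorted(totals), key=totals.get)
    return major


# Toy allele counts (illustrative only).
toy_counts = {
    ("chr1", 100): {
        "AFR": {"A": 90, "G": 10},
        "EUR": {"A": 20, "G": 80},
        "EAS": {"A": 30, "G": 70},
    },
    ("chr2", 200): {
        "AFR": {"C": 60, "T": 40},
        "EUR": {"C": 55, "T": 45},
    },
}

if __name__ == "__main__":
    print(pooled_major_alleles(toy_counts))
```

Note that at the first toy site the pooled major allele (G) differs from the AFR major allele (A), which is why an ancestry-specific PHREG and the HSA PHREG can differ at the same position.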

For these read mapping strategies, we compared the coverage of 15,483 pathogenic ClinVar biomarkers (version 2019-12) [5] located in Gencode v31 CDS exons [6], covering 1,288 genes. When reads were aligned to a PHREG instead of the HRG, ClinVar biomarker coverage increased as follows: 211 (AFR), 147 (AMR), 121 (EAS), 173 (EUR), 105 (SAS) and 162 (HSA) (see Table 3). Most variants with increased coverage are located near sites where population-specific nucleotides were implanted into the PHREG. When the reads of a sample are mapped to its closest PHREG, the number of mismatches during alignment is reduced and coverage therefore increases, avoiding the loss of coverage seen when aligning to the HRG.
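
The relationship between implanted population-specific nucleotides and alignment mismatches can be illustrated with a toy example. The sketch below is not the Novoalign scoring model; it only counts mismatching bases for an ungapped read placement. The sequences, positions and the helper names `implant` and `mismatches` are hypothetical.

```python
def implant(reference, substitutions):
    """Return a copy of `reference` with bases replaced at 0-based positions."""
    seq = list(reference)
    for pos, base in substitutions.items():
        seq[pos] = base
    return "".join(seq)


def mismatches(read, reference, offset):
    """Count mismatching bases for an ungapped placement of `read` at `offset`."""
    window = reference[offset:offset + len(read)]
    return sum(1 for r, w in zip(read, window) if r != w)


# Toy standard reference and a toy PHREG with two population-specific bases.
hrg = "ACGTACGTACGTACGT"
phreg = implant(hrg, {5: "T", 9: "A"})

# A read from a sample that carries both population-specific alleles.
read = "ATGTAAGT"

if __name__ == "__main__":
    print("mismatches vs HRG:  ", mismatches(read, hrg, 4))
    print("mismatches vs PHREG:", mismatches(read, phreg, 4))
```

A read with fewer mismatches is less likely to be discarded or misplaced by an aligner, which is consistent with the coverage gain described above for positions close to implanted nucleotides.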

In summary, our analysis shows that using the correct PHREG can increase coverage, thereby improving the detection of clinically relevant biomarkers.

Table 3 (ClinVar_PHREG_coverage_diff_relative.xlsx) legend:

List of ClinVar biomarkers (gene name | contig | start | end) in Gencode CDS exons, showing the difference in coverage when aligned to a PHREG compared with alignment to the HRG. For each PHREG (AFR, AMR, EAS, EUR, SAS, HSA), the difference relative to the HRG alignment was calculated as the median over the cases of the corresponding ancestry, and over all 741 cases for HSA. A positive number indicates an increase in coverage and a negative number indicates a decrease in coverage.
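
The per-biomarker median described in this legend can be sketched as follows. This is an illustration under stated assumptions: the difference is taken as PHREG coverage minus HRG coverage per case, whereas the file name suggests a relative difference whose exact formula is not given in the text. The function name `median_coverage_diff`, the sample identifiers and the coverage values are hypothetical.

```python
from statistics import median


def median_coverage_diff(hrg_cov, phreg_cov):
    """Median over cases of (PHREG coverage - HRG coverage) per biomarker.

    Both arguments: {sample_id: {biomarker_key: coverage}}
    """
    diffs = {}
    for sample, markers in hrg_cov.items():
        for marker, cov in markers.items():
            diffs.setdefault(marker, []).append(phreg_cov[sample][marker] - cov)
    return {marker: median(values) for marker, values in diffs.items()}


# Toy coverage values for three cases of one ancestry (illustrative only).
hrg_cov = {
    "s1": {"GENE1|chr1|100|101": 20, "GENE2|chr2|200|201": 35},
    "s2": {"GENE1|chr1|100|101": 22, "GENE2|chr2|200|201": 30},
    "s3": {"GENE1|chr1|100|101": 18, "GENE2|chr2|200|201": 33},
}
phreg_cov = {
    "s1": {"GENE1|chr1|100|101": 25, "GENE2|chr2|200|201": 35},
    "s2": {"GENE1|chr1|100|101": 30, "GENE2|chr2|200|201": 29},
    "s3": {"GENE1|chr1|100|101": 21, "GENE2|chr2|200|201": 33},
}

if __name__ == "__main__":
    for marker, diff in median_coverage_diff(hrg_cov, phreg_cov).items():
        print(marker, diff)
```

Running the same function once per ancestry group, and once over all cases, would yield one column per PHREG in the layout the legend describes.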

TABLE 3

References for Example 2

[1]https://portal.gdc.cancer.gov

[2]http://www.novocraft.com/products/novoalign

[3]https://www.ncbi.nlm.nih.gov/grc/human

[4]https://gnomad.broadinstitute.org/faq

[5]https://www.ncbi.nlm.nih.gov/clinvar

[6]https://www.gencodegenes.org/human/release_31lift37.html

Appendix 1

Appendix 2

chr1 36768200 rs1573020

chr1 159174683 rs2814778

chr1 204790977 rs2065160

chr2 7149155 rs896788

chr2 109513601 rs3827760

chr2 136616754 rs182549

chr3 168645035 rs1498444

chr4 38803255 rs4540055

chr4 159181963 rs2026721

chr5 33951693 rs16891982

chr7 4457003 rs917118

chr10 17064992 rs7897550

chr10 34755348 rs1978806

chr11 32424389 rs5030240

chr12 29369871 rs10843344

chr12 56603834 rs773658

chr13 20901724 rs1335873

chr13 22374700 rs1886510

chr13 34864240 rs2065982

chr14 36170607 rs10141763

chr14 101142890 rs730570

chr15 28365618 rs12913832

chr15 48426484 rs1426654

chr16 31079371 rs881929

chr16 90105333 rs3785181

chr17 75551667 rs2304925

chr18 75432386 rs1024116

chr19 42410331 rs2303798

chr20 38849642 rs1321333

chr21 16685598 rs722098

chr21 17710424 rs239031

chr21 25672460 rs2572307

chr22 26350103 rs5997008

chr22 47836412 rs2040411
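
Each Appendix 2 entry consists of a contig, a genomic position and an rs identifier. A minimal sketch for reading such entries into structured records is given below. The pattern tolerates a missing space between the position and the identifier. The names `MARKER_RE` and `parse_marker` are hypothetical.

```python
import re

# contig, whitespace, position, optional whitespace, rs identifier
MARKER_RE = re.compile(r"^(chr[0-9XY]+)\s+(\d+)\s*(rs\d+)$")


def parse_marker(line):
    """Parse one entry into (contig, position, rsid)."""
    match = MARKER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized marker line: {line!r}")
    contig, position, rsid = match.groups()
    return contig, int(position), rsid


if __name__ == "__main__":
    for entry in ("chr1 159174683 rs2814778", "chr22 47836412 rs2040411"):
        print(parse_marker(entry))
```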

Claims

1. A method of performing genomic and/or genetic analysis of a human nucleic acid sample comprising the steps of:

a) providing a set of human reference genomes;

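b) testing the sex and/or ancestry of the human nucleic acid sample;

c) selecting one or more population-specific human reference genomes (PHREGs) from the set of human reference genomes based on the results of the sex and/or ancestry test; and

d) aligning the human nucleic acid sample with the selected PHREG.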

2. The method of claim 1, wherein the alignment is performed at the major allele level or the non-rare allele level.

3. The method according to claim 1 or 2, comprising the additional step of:

e) performing variant identification on the human nucleic acid sample aligned to the selected PHREG.

4. The method of claim 3, wherein the variant identification is performed at the major allele level or the non-rare allele level.

5. The method according to any one of the preceding claims, wherein the human reference genome provided in step a) is or is derived from a published human reference genome.

6. The method of any one of the preceding claims, wherein step a) comprises adjusting the human reference genome to a coding level comprising a unique nucleotide code or an ambiguous nucleotide code.

7. The method according to any one of the preceding claims, wherein the human reference genome provided in step a) is PHREG.

8. The method according to any one of the preceding claims, wherein the sex test comprises one or more of the following steps:

testing at least one position in a sex-specific gene on the X chromosome and/or the Y chromosome; using alignment differences of the human nucleic acid sample on the X chromosome and/or the Y chromosome; cytogenetic testing; FISH analysis; and CGH analysis.

9. The method of any one of the preceding claims, wherein the ancestry test is based on a machine learning algorithm applied to the human nucleic acid sample, or on another classification scheme utilizing ancestry-specific variants.

10. The method according to any one of the preceding claims, wherein the ancestry test comprises using the genotype of at least one genomic position, and/or testing a SNP array or SNP chip, and/or testing markers from Sanger sequencing or mass spectrometry.

11. The method of any one of the preceding claims, wherein the ancestry test comprises testing at least one gene selected from the group consisting of: ABL2, ATP1A3, CIC, CYP2C8, CYP2C9, EPHA3, EPHA7, ERBB3, ERG, ETV 3, F3, FAS, HFE, IL11 3, IL 23, ITGB3, KIF 3, KIT, KLK3, LRP 3, MDM 3, NAT 3, NTRK 3, PDGFB, PIK3R 3, PLA2G3, PLAU, PRKCB, RICTOR, SLC7A 3, STAT3, VCAM 3, VDR, VEGFB, ACVRL 3, AXL, CA 3, CACR, CASP 3, ENG, EPHB 3, ERBB3, ESR 3, HPS 3, HSP 3690, 36K 3, GARPEST 3, EPRCS 3, EPRCH 3, EPTC 3, EPTC 3, EPTC 3, EPTC 3636363672, 3, EPTC 3, EPTC 3, ROCK2, SLC6A2, TET2, TGM2, TH, ABCB1, CD22, CD40, CD44, CDH20, CYP11B2, ERCC5, GPR124, IL7R, ITGB3, ITGB5, NCL, NOD2, NR4A1, PGR, PLCG1, PPP2R1A, PRAME, PTCH2, RET, SETD2, XPC, ASXL1, EPHB4, PLA2G 4, SYK, TET 4, EP300, FLT4, ITGA 4, LOCSF 4, PDGFRB, PIK 34, SSTR 4, TEC, APC, ATRCE, CROP, BBP, CYP2D 4, EML4, MMP 4, PARP 4, SPE CSF, FRA 4, TRPC 4, TR.

12. The method of any one of the preceding claims, wherein the human nucleic acid sample comprises a set of reads from a Next Generation Sequencing (NGS) program, and wherein the aligning comprises the step of mapping the reads to selected PHREGs.

13. A computer system for genomic and/or genetic analysis of a human nucleic acid sample, the computer system comprising:

14. A computer program comprising instructions which, when executed by a computer, cause the computer to perform steps a), b), c) and d) of the method of any one of claims 1 to 12.

15. A computer readable storage medium containing instructions which, when executed by a computer, cause the computer to perform steps a), b), c) and d) of the method of any one of claims 1 to 12.

16. A method of treating a patient comprising:

-obtaining identification information of a disease indication of the patient;

-obtaining a nucleic acid sample of the patient;

-performing genomic and/or genetic analysis on said nucleic acid sample based on the method of claim 1;

-retrieving possible therapies for the disease indication of the patient;

-performing variant identification and interpretation;

-classifying the retrieved possible therapies based on the variant interpretation, wherein a therapy is classified as applicable or not applicable to the patient;

-selecting an appropriate therapy;

-treating the patient according to the selected therapy.