WO2019234247A1 - A method for analysis of real-time amplification data - Google Patents

A method for analysis of real-time amplification data

Info

Publication number: WO2019234247A1
Application number: PCT/EP2019/065039
Authority: WO (WIPO/PCT)
Prior art keywords: optionally, multidimensional, data, features, curve
Other languages: French (fr)
Inventors: Pantelis Georgiou, Ahmad Moniri, Jesus Rodriguez-Manzano
Original assignee: Imperial College Of Science, Technology And Medicine
Application filed by Imperial College Of Science, Technology And Medicine
Priority to US16/973,410 (published as US20210257051A1)
Priority to EP19731893.4A (published as EP3803880A1)
Priority to CN201980052907.3A (published as CN112997255A)
Publication of WO2019234247A1


Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 - ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 - Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 - Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C - CHEMISTRY; METALLURGY
    • C12 - BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q - MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 - Nucleic acid amplification reactions
    • C12Q1/6851 - Quantitative amplification

Abstract

This disclosure relates to methods, systems, computer programs and computer-readable media for the multidimensional analysis of real-time amplification data. A framework is presented that shows that the benefits of standard curves extend beyond absolute quantification when observed in a multidimensional environment. Relating to the field of Machine Learning, the disclosed method combines multiple extracted features (e.g. linear features) in order to analyse real-time amplification data using a multidimensional view. The method involves two new concepts: the multidimensional standard curve and its 'home', the feature space. Together they expand the capabilities of standard curves, allowing for simultaneous absolute quantification and outlier detection, and providing insights into amplification kinetics. The new methodology thus enables enhanced quantification of nucleic acids, single-channel multiplexing, outlier detection, characteristic patterns in the multidimensional space related to amplification kinetics and increased robustness for sample identification and quantification.

Description

A Method for analysis of real-time amplification data
This disclosure relates to methods, systems, computer programs and computer-readable media for the multidimensional analysis of real-time amplification data.
Background
Since its inception, the real-time polymerase chain reaction (qPCR) has become a routine technique in molecular biology for detecting and quantifying nucleic acids. This is predominantly due to its large dynamic range (7-8 orders of magnitude), desirable sensitivity (5-10 molecules) and reproducible quantification results. New methods to improve the analysis of qPCR data are invaluable to a number of analytical fields, including environmental monitoring and clinical diagnostics. Absolute quantification of nucleic acids in real-time PCR using standard curves is undoubtedly important and significant in various fields of biomedicine, although research in this area has saturated in recent years.
The current "gold standard" for absolute quantification of a specific target sequence is the cycle-threshold (Ct) method. The Ct value is a feature of the amplification curve defined as the number of cycles in the exponential region at which there is a detectable increase in fluorescence. Since this method was proposed, several alternative methods have been developed in the hope of improving absolute quantification in terms of accuracy, precision and robustness. The focus of existing research has been the computation of single features, such as Cy and -log10(F0), that are linearly related to initial concentration. This provides a simple approach for absolute quantification; however, data analysis based on such single features has been limited. Thus, research into improving methods for absolute quantification of nucleic acids using standard curves has plateaued, with only incremental improvements.
Rutledge et al. 2004 proposed sigmoidal curve-fitting (SCF) for quantification based on three kinetic parameters (Fc, Fmax and F0). Sisti et al. 2010 developed the "shape-based outlier detection" method, which is not based on amplification efficiency and uses a non-linear fitting to parametrise PCR amplification profiles. The shape-based outlier detection method takes a multidimensional approach in order to define a similarity measure between amplification curves, but relies on a specific model of amplification, namely the 5-parameter sigmoid, and is not a general method. Furthermore, the shape-based outlier detection method is typically used as an add-on, and only uses a multidimensional approach for outlier detection, such that quantification is only considered using a unidimensional approach. Guescini et al. 2013 proposed the Cy0 method, which is similar to the Ct method but takes into account the kinetic parameters of the amplification curve and may compensate for small variations among the samples being compared. Bar et al. 2013 proposed a method (KOD) based on amplification efficiency calculation for the early detection of non-optimal assay conditions.
The present disclosure aims to at least partially overcome the problems inherent in existing techniques.
Summary
The invention is defined by the appended claims. The supporting disclosure herein presents a framework that shows that the benefits of standard curves extend beyond absolute quantification when observed in a multidimensional environment. The focus of existing research has been on the computation of a single value, referred to herein as a "feature", that is linearly related to target concentration, and thus there has been a gap in existing approaches in terms of taking advantage of multiple features. It has now been realised that the benefits of combining linear features are non-trivial. Previous methods have been restricted to the simplicity of conventional standard curves such as the gold standard cycle-threshold (Ct) method. This new methodology enables enhanced quantification of nucleic acids, single-channel multiplexing, outlier detection, characteristic patterns in the multidimensional space related to amplification kinetics and increased robustness for sample identification and quantification.
Relating to the field of Machine Learning, the presently disclosed method takes a multidimensional view, combining multiple features (e.g. linear features) in order to take advantage of, and improve on, the information and principles behind existing methods to analyse real-time amplification data. The disclosed method involves two new concepts: the multidimensional standard curve and its 'home', the feature space. Together they expand the capabilities of standard curves, allowing for simultaneous absolute quantification, outlier detection and providing insights into amplification kinetics. This disclosure describes a general method which, for the first time, presents a multi-dimensional standard curve, increasing the degrees of freedom in data analysis and thereby becoming capable of uncovering trends and patterns in real-time amplification data obtained by existing qPCR instruments (such as the LightCycler 96 System from Roche Life Science). It is believed that this disclosure redefines the foundations of analysing real-time nucleic acid amplification data and enables new applications in the field of nucleic acid research.
In a first aspect of the disclosure there is provided a method for use in quantifying a sample comprising a target nucleic acid, the method comprising: obtaining a set of first real-time amplification data for each of a plurality of target concentrations; extracting a plurality of N features from the set of first data, wherein each feature relates the set of first data to the concentration of the target; and fitting a line to a plurality of points defined in an N-dimensional space by the features, each point relating to one of the plurality of target concentrations, wherein the line defines a multidimensional standard curve specific to the nucleic acid target which can be used for quantification of target concentration.
Optionally the method further comprises: obtaining second real-time amplification data relating to an unknown sample; extracting a corresponding plurality of N features from the second data; and calculating a distance measure between the line in N-dimensional space and a point defined in N-dimensional space by the corresponding plurality of N features. Optionally, the method further comprises computing a similarity measure between amplification curves from the distance measure, which can optionally be used to identify outliers or classify targets.
Optionally each feature is different to each of the other features, and optionally wherein each feature is linearly related to the concentration of the target, and optionally wherein one or more of the features comprises one of Ct, Cy and -log10(F0).
Optionally the method further comprises mapping the line in N-dimensional space to a unidimensional function, M0, which is related to target concentration, and optionally wherein the unidimensional function is linearly related to target concentration, and/or optionally wherein the unidimensional function defines a standard curve for quantifying target concentration. Optionally, the mapping is performed using a dimensionality reduction technique, and optionally wherein the dimensionality reduction technique comprises at least one of: principal component analysis; random sample consensus; partial-least squares regression; and projecting onto a single feature. Optionally, the mapping comprises applying a respective scalar feature weight to each of the features, and optionally wherein the respective feature weights are determined by an optimisation algorithm which optimises an objective function, and optionally wherein the objective function is arranged for optimisation of quantification performance.
Optionally, calculating the distance measure comprises projecting the point in N-dimensional space onto a plane which is normal to the line in N-dimensional space, and optionally wherein calculating the distance measure further comprises calculating, based on the projected point, a Euclidean distance and/or a Mahalanobis distance. Optionally, the method further comprises calculating a similarity measure based on the distance measure, and optionally wherein calculating a similarity measure comprises applying a threshold to the similarity measure. Optionally, the method further comprises determining whether the point in N-dimensional space is an inlier or an outlier based on the similarity measure. Optionally, the method further comprises: if the point in N-dimensional space is determined to be an outlier then excluding the point from training data upon which the step of fitting a line to a plurality of points defined in N-dimensional space is based, and if the point in N-dimensional space is not determined to be an outlier then re-fitting the line in N-dimensional space based additionally on the point in N-dimensional space.
Optionally, the method further comprises determining a target concentration based on the multidimensional standard curve, and optionally further based on the distance measure, and optionally when dependent on claim 4 based on the unidimensional function which defines the standard curve. Optionally, the method further includes displaying the target concentration on a display.
Optionally, the method further comprises a step of fitting a curve to the set of first data, wherein the feature extraction is based on the curve-fitted first data, and optionally wherein the curve fitting is performed using one or more of a 5-parameter sigmoid, an exponential model, and linear interpolation. Optionally, the set of first data relating to the melting temperatures is pre-processed, and the curve fitting is carried out on the processed set of first data, and optionally wherein the pre-processing comprises one or more of: subtracting a baseline; and normalisation.
Optionally, the data relating to the melting temperature is derived from one or more physical measurements taken versus sample temperature, and optionally wherein the one or more physical measurements comprise fluorescence readings.
In a second aspect there is provided a system comprising at least one processor and/or at least one integrated circuit, the system arranged to carry out a method according to the first aspect.
In a third aspect there is provided a computer program comprising instructions which, when executed by one or more processors, cause the one or more processors to perform a method according to the first aspect.
In a fourth aspect there is provided a computer-readable medium storing instructions which when executed by at least one processor, cause the at least one processor to carry out a method according to the first aspect.
In a fifth aspect there is provided a method according to the first aspect, used for detection of genomic material, and optionally wherein the genomic material comprises one or more pathogens, and optionally wherein the pathogens comprise one or more carbapenemase-producing enterobacteria, and optionally wherein the pathogens comprise one or more carbapenemase genes from the set comprising blaOXA-48, blaVIM, blaNDM and blaKPC.
In a sixth aspect there is provided a method for diagnosis of an infection by detection of one or more pathogens according to the method of the first aspect, and optionally wherein the pathogens comprise one or more carbapenemase-producing enterobacteria, and optionally wherein the pathogens comprise one or more carbapenemase genes from the set comprising blaOXA-48, blaVIM, blaNDM and blaKPC.
In a seventh aspect there is provided a method for point-of-care diagnosis of an infectious disease by detection of one or more pathogens according to the method of the first aspect, and optionally wherein the pathogens comprise one or more carbapenemase-producing enterobacteria, and optionally wherein the pathogens comprise one or more carbapenemase genes from the set comprising blaOXA-48, blaVIM, blaNDM and blaKPC. The methods disclosed herein, if used for diagnosis, can be performed in vitro or ex vivo. Embodiments can be used for single-channel multiplexing without post-PCR manipulations.
It will be appreciated in the light of the present disclosure that certain features of certain aspects and/or embodiments described herein can be advantageously combined with those of other aspects and/or embodiments. The following description of specific embodiments should not therefore be interpreted as indicating that all of the described steps and/or features are essential. Instead, it will be understood that certain steps and/or features are optional by virtue of their function or purpose, even where those steps or features are not explicitly described as being optional. The above aspects are thus not intended to limit the invention, and instead the invention is defined by the appended claims.
Brief description of the Figures
In order that the disclosure may be understood, preferred embodiments are described below, by way of example, with reference to the Figures in which like features are provided with like reference numerals. Figures are not necessarily drawn to scale.
Figure 1 is a representation of training and testing in an existing unidimensional approach, compared with the proposed multidimensional framework.
Figures 2a-2c illustrate the process of training using the multidimensional approach described herein.
Figures 2d-2f illustrate the process of testing using the multidimensional approach described herein.
Figure 3 is a representation of an algorithm for optimising feature weights.
Figure 4a is a representation of a multidimensional standard curve.
Figure 4b is a representation of a resulting quantification curve obtained after dimensionality reduction through principal component regression.
Figure 5 shows the mean of outliers in the feature space, and an orthogonal projection of the mean of the outliers onto the standard curve.
Figure 6a is a representation of a view of the feature space along an axis of the multidimensional standard curve, by projecting onto a plane that is perpendicular to the standard curve.
Figure 6b is a representation of the resulting projected points according to Figure 6a.
Figure 6c is a representation of a transformation of the orthogonal view of the feature space of Figure 6b into a new space where the Euclidean distance is equivalent to the Mahalanobis distance in the original space.
Figure 7 shows a histogram of the Mahalanobis distance squared, for an entire training set, superimposed with a χ²-distribution with 2 degrees of freedom.
Figure 8a shows a multidimensional pattern associated with temperature.
Figure 8b shows a multidimensional pattern associated with primer mix concentration.
Figure 8c shows a variation of training data points along the axis of the multidimensional standard curve, for low concentrations of nucleic acids.
Figure 9 is an illustration of experimental workflow and comparison of real-time uni dimensional vs multi-dimensional standard curves.
Figure 10 shows multidimensional standard curves constructed using a single primer mix (by multiplex real-time PCR) for four target genes using Ct, Cy and -log10(F0).
Figure 11 shows real-time amplification data and melting curve analysis (for validation purposes) for the training samples.
Figure 12 shows a Mahalanobis space for each of four multidimensional standard curves.

Figure 13 is a representation of an example networked computer system in which embodiments of the disclosure can be implemented.
Figure 14 is a representation of an example computing device such as the ones shown in Figure 13.
Figures 15a-15d show melting curve analysis for the training data (15a), outliers (15b), the primer concentration experiment (15c) and the temperature variation experiment (15d), according to an example.
Figure 16 shows the average Mahalanobis distance from standard points to test samples in an example, which is used to classify the samples into blaOXA-48, blaNDM, blaVIM and blaKPC genes based only on real-time amplification curves obtained by the multiplex PCR assay.
Detailed description
The structure of the disclosure is as follows. In order to understand the proposed framework, it is useful to have an overall picture of the conventional approach, described in the same language. First the conventional approach, and then the proposed multidimensional framework, are presented. For easier comprehension, the theory and benefits of the disclosed method are explained and discussed. Further, by way of example, an example instance of this new method is given, with a set of real-time data using lambda DNA as a template, and specific applications of the disclosed methods are explored.
Figure 1 is a block diagram showing the disclosed multi-dimensional method (bottom branch) compared to a conventional method (top branch) for absolute quantification of a target based on serial dilution of a known target.
Conventional Approach
In a conventional method, raw amplification data for several known concentrations of the target is typically pre-processed and fitted with an appropriate curve. A single feature such as the cycle threshold, Ct, is extracted from each curve. A line is fitted to the feature vs concentration such that unknown sample concentrations can be extrapolated. Here, two terms, namely training and testing (as used in the field of Machine Learning), are used to describe the construction of a standard curve 110 and the quantification of unknown samples respectively. Within the conventional approach for quantification, training using a first set of data relating to melting temperatures of samples having known characteristics is achieved through 4 stages: pre-processing 101, curve fitting 102, single linear feature extraction 103 and line fitting 104, as illustrated in the upper branch of Figure 1.
Pre-processing 101 can be optionally performed to reduce factors such as background noise such that a more accurate comparison amongst samples can be achieved.
Curve fitting 102 (e.g. using a 5-parameter sigmoid, an exponential model, and/or linear interpolation) is optional, and beneficial given that amplification curves are discrete in time/temperature and most techniques require fluorescence readings that are not explicitly measured at a given time/temperature instance.
Feature extraction 103 involves selecting and determining a feature (or "characteristic", e.g. Ct, Cy, -log10(F0), FDM, SDM) of the target data.
Line (or curve) fitting 104 involves fitting a line (or curve) 110 to the determined feature data versus target concentration.
Examples of pre-processing 101 include baseline subtraction and normalisation. Examples of curve fitting 102 include using a 5-parameter sigmoid, an exponential model, and linear interpolation. Examples of features extracted in the feature extraction 103 step include Ct, Cy or -log10(F0). Examples of line fitting 104 techniques include principal component analysis and random sample consensus (RANSAC).
Testing of unknown samples (i.e. quantifying target concentration in unknown samples, based on second data relating to the melting temperature of a target comprised in the unknown sample) is accomplished by using the same first 3 blocks (pre-processing 101, curve fitting 102, linear feature extraction 103) as training, and using the line 110 generated from the final line fitting 104 step during training in order to quantify the samples.
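By way of illustration only, the conventional training and testing branch described above can be sketched in a few lines of Python; the Ct values, concentrations and variable names below are purely illustrative and are not taken from the examples herein:

    import numpy as np

    # Conventional (unidimensional) standard curve: fit Ct against
    # log10 concentration (line fitting 104), then extrapolate unknowns.
    log_conc = np.log10([1e2, 1e3, 1e4, 1e5, 1e6])   # known standards (illustrative)
    ct = np.array([33.1, 29.8, 26.4, 23.0, 19.7])    # extracted Ct values (illustrative)

    slope, intercept = np.polyfit(log_conc, ct, 1)

    # Testing: quantify an unknown sample from its extracted Ct value.
    ct_unknown = 24.9
    estimated_concentration = 10 ** ((ct_unknown - intercept) / slope)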
Proposed Method
The proposed method builds on the conventional techniques described above, by increasing the dimensionality of the standard curve (against which data is compared in the testing phase) in order to explore, research and take advantage of using multiple features together. This new framework is presented in the lower branch of Figure 1.
For training, in this example embodiment there are 6 stages: pre-processing 101, curve fitting 102, multi-feature extraction 113, high-dimensional line fitting 114, multidimensional analysis 115, and dimensionality reduction 116. Testing follows a similar process: pre-processing 101, curve fitting 102, multi-feature extraction 113, multidimensional analysis 115, and dimensionality reduction 116. As for the conventional approach, pre-processing 101 and curve fitting 102 are optional, and with suitable multidimensional analysis techniques an explicit step of dimensionality reduction may also be rendered optional.
Again, examples of pre-processing 101 include baseline subtraction and normalisation, and examples of curve fitting 102 include using a 5-parameter sigmoid, an exponential model, and linear interpolation. Examples of features extracted in the multi-feature extraction 113 step include Ct, Cy, -log10(F0), FDM and SDM. Examples of high-dimensional line fitting 114 techniques include principal component analysis and random sample consensus (RANSAC). Examples of multidimensional analysis 115 techniques include calculating a Euclidean distance, calculating confidence bounds, and weighting features using scalars αᵢ, as further described below. Examples of dimensionality reduction 116 techniques include principal component regression, calculating partial least-squares, and projecting onto original features, as further described below.
Figures 2a-2c illustrate the process of training and Figures 2d-2f show testing using the multidimensional approach. Starting with training, Figure 2a shows processed and curve-fitted real-time nucleic acid amplification curves obtained from a conventional qPCR instrument by serially diluting a known nucleic acid target to known concentrations. In contrast with conventional training, instead of extracting a single linear feature, multiple features denoted using the dummy labels X, Y and Z are extracted from the processed amplification curves. Therefore, each amplification curve has been reduced to a number of sets of 3 values (e.g. X1, Y1 and Z1) and, consequently, can be viewed as a number of points plotted against each other in 3-dimensional space as shown in Figure 2b. It is important to stress that although this is a 3-D example (in order to visualise the process), optionally any number of features can be chosen. Given that all the features in this example have been chosen such that they are linearly related to initial concentration, the training data forms a 1-D line in 3-D space, and this line is then approximated using high-dimensional line fitting 114 to generate what is termed the multidimensional standard curve 130. Although the data forms a line, it is important to understand that data points do not necessarily lie exactly on the line. Consequently, there is considerable room for exploring this multidimensional space, referred to as the feature space, which will be discussed herein. Although in this example only linear features (i.e. features linearly related to target concentration) are considered, the disclosed method can be applied to non-linear features by making appropriate changes.
For quantification purposes, the multidimensional standard curve is mapped into a single dimension, M0, by a function that is linearly related to the initial concentration of the target. In order to distinguish the curve described by such a function from conventional standard curves, it is referred to here as the quantification curve 150. This is achieved using dimensionality reduction techniques (DRTs) as illustrated in Figure 2c. Mathematically, this means that DRTs are multivariate functions of the form M0 = φ(X, Y, Z), where φ(·): ℝ³ → ℝ. In fact, given that scaling features does not affect linearity, M0 can be mathematically expressed as M0 = φ(α₁X, α₂Y, α₃Z), where αᵢ, i ∈ {1, 2, 3}, are scalar constants.
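By way of illustration only, one simple realisation of such a DRT φ (anticipating the projection defined in equations (2) and (3) below) is to scale the features by the weights αᵢ and project the resulting point onto the fitted line. A minimal Python sketch, in which all names are illustrative and features are assumed to be supplied as NumPy vectors, is:

    import numpy as np

    def m0(p, q1, q2, alpha):
        """Map a feature vector p to the scalar M0 by weighting the
        features elementwise by alpha and projecting onto the line
        through the (equally weighted) points q1 and q2 that lie on
        the multidimensional standard curve."""
        p, q1, q2 = alpha * p, alpha * q1, alpha * q2
        d = q2 - q1
        return float((p - q1) @ d / (d @ d))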
Once training is complete, at least one further (e.g. unknown) sample can then be analysed (e.g. quantified and/or classified) through testing as follows. Similar to training, processed amplification data (Figure 2d) and the corresponding points in the feature space (Figure 2e) are shown. Given that test points may lie anywhere in the feature space, it is necessary to project them onto the multidimensional standard curve 130 generated in training. Using the DRT function, φ, which was produced in training, M0 values for each test sample can be obtained. Subsequently, absolute quantification is achieved by extrapolating the initial concentration based on the quantification curve 150 in Figure 2f. It will be noted that data relating to these further samples can be used to refine the multidimensional standard curve 130 (e.g. by re-fitting a line to a plurality of points defined in N-dimensional space by the extracted features, including both the original set of training data and the data relating to the further sample).
Given that this higher-dimensional space has not previously been disclosed, it is worth highlighting the degrees of freedom within this new framework that were non-existent when observing the quantification process through the conventional lens. The following advantages arise:
Advantage 1. The weight of each extracted feature can be controlled by the scalars α₁, ..., αₙ. There are two main observations of this degree of freedom. The first observation is that features that have poor quantification performance can be suppressed by setting the associated α to a small value. This introduces a very useful property of the framework which is referred to as the separation principle. The separation principle means that including features to enhance multidimensional analyses does not have a negative impact on quantification performance if the α's are chosen appropriately. Optimisation algorithms can be used to set the α's based on an objective function. Therefore, the performance of the quantification using the proposed framework is lower bounded by the performance of the best single feature for a given objective. The second observation is that no upper bound exists on the performance of using several scaled features. Thus, there is a potential to outperform single features, as shown herein.
Advantage 2. The versatility of this multidimensional way of thinking means that there are multiple methods for dimensionality reduction, such as principal component regression, partial-least squares regression, and even projecting onto a single feature (e.g. using the standard curve 110 used in conventional methods). Given that DRTs can be nonlinear and take advantage of multiple features, predictive performance may be improved.
Advantage 3. Training and testing data points do not necessarily lie perfectly on a straight line as they did in the conventional technique. This property is the backbone behind why there is more information in higher dimensions. For example, the closer two points are in the feature space, the more likely it is that their amplification curves are similar (resembling a reproducing kernel Hilbert space). Therefore, a distance measure in the feature space can provide a means of computing a similarity measure between amplification curves. It is important to understand that the distance measure is not necessarily, and in reality unlikely to be, linearly related to the similarity measure. For example, it is not necessarily true that a point twice as far from the multidimensional standard curve is twice as unlikely to occur. This relationship can be approximated using the training data itself. In the case of training, a similarity measure is useful to identify and remove outliers that may skew quantification performance. As for testing, the similarity measure can give a probability that the unknown data is an outlier of the standard curve, i.e. non-specific or due to a qPCR artefact, without the need for post-PCR analyses such as melting curves or agarose gels.
Advantage 4. The effect of changes in reaction conditions, such as annealing temperature or primer mix concentration, can be captured by patterns in the feature space. Uncovering these trends and patterns can be very insightful in understanding the data. This is also possible in the conventional case, e.g. how Ct varies with temperature; however, since reaction conditions affect different features differently, in the proposed multidimensional technique conclusions can be drawn with higher confidence, e.g. if a pattern is observed in multidimensional space. For example, consider the following: a change in temperature, ΔT, causes a different change for different features, e.g. ΔX, ΔY and ΔZ. Therefore, if (as in the conventional technique) only a single feature, X, is used and a variation ΔX is observed, then it is unlikely that the source of the variation, i.e. ΔT, can be captured with high confidence. In contrast, considering multiple features (as in the proposed multidimensional technique) and observing ΔX, ΔY and ΔZ simultaneously can provide more confidence that the source is due to ΔT.
An extension of advantage 4 is related to the effect of variations in target concentration. Clearly, the pattern for varying target concentration is known: along the axis of the multidimensional standard curve 130. Therefore, the data itself is sufficient to suggest if a particular sample is at a different concentration than another. This is significant, since it allows variations amongst replicates (which are possible due to experimental errors such as dilution and mixing) to be identified and potentially compensated for. This is of particular importance for low concentrations, wherein such errors are typically more significant. It is interesting to observe that if multiple features are used, and the DRT is chosen such that the multidimensional curve is projected onto a single feature, e.g. Ct, then the quantification performance is similar to that of the conventional process (e.g. a special instance of the proposed framework, wherein only a single feature is used), yet the opportunities and insights obtained as a result of employing a multidimensional space still remain.
Example Method
It has been established that each step in the proposed method, as seen in the lower branch of Figure 1, can be implemented using several different techniques, given as examples in the Figure. The specific techniques used for each block can be application dependent; however, specific example methods are described herein to illustrate the power and versatility of this method. It will nevertheless be understood that the described method is not limited to those specific examples.
Pre-processing 101
The only pre-processing 101 performed in this example is background subtraction. This is accomplished using baseline subtraction: removing the mean of the first 5 fluorescence readings from every amplification curve. In other embodiments, however, pre-processing can be omitted, or other or additional pre-processing steps such as normalisation can be carried out, and more advanced pre-processing steps can optionally be carried out to improve performance and/or accuracy.
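A minimal sketch of this baseline subtraction step, assuming NumPy and an array of raw curves (all names are illustrative):

    import numpy as np

    def subtract_baseline(curves, n_background=5):
        """Remove the mean of the first n_background fluorescence
        readings from every amplification curve, as described above.
        `curves` has shape (number of curves, number of cycles)."""
        curves = np.asarray(curves, dtype=float)
        return curves - curves[:, :n_background].mean(axis=1, keepdims=True)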
Curve fitting 102
An example model for curve fitting is the 5-parameter sigmoid (Richards Curve) given by:
F(x) = Fb + Fmax / (1 + e^(-(x - c)/b))^d    (1)

where x is the cycle number, F(x) is the fluorescence at cycle x, Fb is the background fluorescence, Fmax is the maximum fluorescence, c is the fractional cycle of the inflection point, b is related to the slope of the curve, and d allows for an asymmetric shape (Richard's coefficient).
An example optimisation algorithm used to fit the curve to the data is the trust-region method, based on the interior reflective Newton method. Here, the trust-region method is chosen over the Levenberg-Marquardt algorithm since bounds for the 5 parameters can be chosen in order to encourage a unique and realistic solution. Example lower and upper bounds for the 5 parameters, [Fb, Fmax, c, b, d], are given as [-0.5, -0.5, 0, 0, 0.7] and [0.5, 0.5, 50, 100, 10] respectively.
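A minimal sketch of this curve-fitting step is given below, using SciPy's bounded 'trf' (trust-region reflective) least-squares solver as a stand-in for the interior reflective Newton method described above; the initial guess, the small positive lower bound on b, and the helper names are illustrative:

    import numpy as np
    from scipy.optimize import curve_fit

    def sigmoid5(x, Fb, Fmax, c, b, d):
        # 5-parameter sigmoid (Richards curve) from equation (1).
        return Fb + Fmax / (1.0 + np.exp(-(x - c) / b)) ** d

    def fit_amplification_curve(fluorescence):
        x = np.arange(1, len(fluorescence) + 1)       # cycle numbers
        lower = [-0.5, -0.5, 0.0, 1e-6, 0.7]          # bounds from the text
        upper = [0.5, 0.5, 50.0, 100.0, 10.0]         # (b > 0 avoids division by zero)
        p0 = [0.0, 0.4, 25.0, 1.0, 1.0]               # illustrative initial guess
        popt, _ = curve_fit(sigmoid5, x, fluorescence, p0=p0,
                            bounds=(lower, upper), method='trf')
        return popt                                    # [Fb, Fmax, c, b, d]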
Multi-feature extraction 113

The number of features, n, that can be extracted is arbitrary; however, 3 features have been chosen in this example in order to enhance visualisation of each step of the framework: Ct, Cy and -log10(F0), for ease of explanation. As a result, in this example, each point in the feature space is a vector in 3-dimensional space, e.g.

p = [Ct, Cy, -log10(F0)]ᵀ

where [·]ᵀ denotes the transpose operator.
Note that by convention, vectors are columns and are bold lowercase letters. Matrices are bold uppercase. The details of these features are not the focus of this disclosure, and so will not be described further herein, it being assumed that the reader is familiar with said details.
High-Dimensional Line fitting 114
When constructing a multidimensional standard curve, a line must be fitted in n-dimensional space. This can be achieved in multiple ways, such as using the first principal component in principal component analysis (PCA) or techniques robust to outliers, such as random sample consensus (RANSAC), if there is sufficient data. This example uses the former (PCA), since a relatively small number of training points are used to construct the standard curve.
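A minimal sketch of this line-fitting step via the first principal component, assuming NumPy (the feature matrix layout and names are illustrative):

    import numpy as np

    def fit_line_pca(F):
        """Fit a line to the rows of F (training points x features) using
        the first principal component; returns two points q1, q2 on the
        line, as used in the projection formulas below."""
        mu = F.mean(axis=0)
        _, _, Vt = np.linalg.svd(F - mu)   # first right-singular vector is
        v = Vt[0]                          # the first principal direction
        return mu, mu + v                  # q1, q2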
Distance and Similarity measure (Multi-dimensional analysis 115)
There are two distance measures given as examples in this disclosure: Euclidean and Mahalanobis distance, although it will be appreciated that other distance measures can be used.
The Euclidean distance between a point, p, and the multidimensional standard curve can be calculated by orthogonally projecting the point onto the multidimensional standard curve 130 and then using simple geometry to calculate the Euclidean distance, e:

P = Φ(p, q1, q2) = (p - q1)ᵀ(q2 - q1) / ((q2 - q1)ᵀ(q2 - q1))    (2)

e = ‖p - (q1 + P(q2 - q1))‖    (3)

where Φ computes the projection of the point p ∈ ℝⁿ onto the multidimensional standard curve, the points q1, q2 ∈ ℝⁿ are any two distinct points that lie on the standard curve, and ‖·‖ denotes the Euclidean norm.
The Mahalanobis distance is defined as the distance between a point, p, and a distribution, D, in multidimensional space. Similar to the Euclidean distance, the point is first projected onto the multidimensional standard curve 130 and the following formula is applied to compute the Mahalanobis distance, d:

d = √( (p - q1 - P(q2 - q1))ᵀ Σ⁻¹ (p - q1 - P(q2 - q1)) )    (4)

where p, P, q1 and q2 are given in equation (2), and Σ is the covariance matrix of the training data used to approximate the distribution D.
In order to convert the distance measure into a similarity measure, it can be shown that if the data is approximately normally distributed then the Mahalanobis distance squared, i.e. d², follows a χ²-distribution. Therefore, a χ²-distribution table can be used to translate a specific p-value into a distance threshold. For instance, for a χ²-distribution with 2 degrees of freedom, p-values of 0.05 and 0.01 correspond to a squared Mahalanobis distance of 5.991 and 9.210 respectively.
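A minimal sketch of these distance and similarity computations (equations (2)-(4)), assuming NumPy/SciPy; the covariance estimate and helper names are illustrative, and a pseudo-inverse is used in case the residual covariance is singular:

    import numpy as np
    from scipy.stats import chi2

    def distances(p, q1, q2, Sigma):
        d = q2 - q1
        P = float((p - q1) @ d / (d @ d))    # projection coefficient, eq. (2)
        r = (p - q1) - P * d                 # residual orthogonal to the line
        e = np.linalg.norm(r)                # Euclidean distance, eq. (3)
        m = np.sqrt(r @ np.linalg.pinv(Sigma) @ r)   # Mahalanobis, eq. (4)
        return e, m

    # d^2 follows a chi-squared distribution, so a p-value maps to a
    # distance threshold; with 2 degrees of freedom, chi2.ppf reproduces
    # the values quoted above: 5.991 (p = 0.05) and 9.210 (p = 0.01).
    threshold_squared = chi2.ppf(1 - 0.05, df=2)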
Feature weights.
As mentioned previously, different weights, α, can be assigned to each feature. In order to accomplish this, a simple optimisation algorithm can be implemented. Equivalently, an error measure can be minimised. Figure 3 is an illustration of how an optimisation algorithm can be used to find optimal parameters, α, for the disclosed method. In this example, the error measure to minimise is the figure of merit described in the following subsection. By way of example, a suitable optimisation algorithm is the Nelder-Mead simplex algorithm with weights initialised to unity, i.e. beginning with no assumption on how good the features are for quantification. This is a basic algorithm and only 20 iterations are used to find the weights, so that there is little computational overhead.
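A minimal sketch of this weight optimisation, assuming SciPy; the objective is any callable returning the figure of merit for a candidate weight vector, and all names are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    def optimise_weights(objective, n_features=3, iterations=20):
        """Nelder-Mead search for the feature weights alpha, initialised
        to unity and limited to a small number of iterations."""
        result = minimize(objective, x0=np.ones(n_features),
                          method='Nelder-Mead',
                          options={'maxiter': iterations})
        return result.x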
Dimensionality reduction 116
In this example, principal component regression is used, e.g. M0 = P from equation (2), and it is compared with projecting the standard curve onto all three dimensions, i.e. Ct, Cy and -log10(F0).
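Continuing the illustrative sketches above, the quantification curve 150 can then be obtained by regressing M0 against log10 concentration over the training set and inverting the fit for unknown samples; F_train, conc_train, q1, q2 and alpha are assumed from the earlier sketches, and all names remain illustrative:

    import numpy as np

    # Training: M0 value for each training point, then a linear fit of
    # M0 against log10 concentration (the quantification curve 150).
    M0_train = np.array([m0(p, q1, q2, alpha) for p in F_train])
    slope, intercept = np.polyfit(np.log10(conc_train), M0_train, 1)

    def quantify(p_test):
        # Estimate the initial target concentration of a test sample
        # from its feature vector, via the quantification curve.
        return 10 ** ((m0(p_test, q1, q2, alpha) - intercept) / slope)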
Evaluating standard curves
Consistent with the existing literature on evaluating standard curves, relative error (RE) and average coefficient of variation (CV) can, by way of example, be used to measure accuracy and precision respectively. The CV for each concentration can be calculated after normalising the standard curves, such that a fair comparison across standard curves is achieved. The formulae for the two measures are given by:
RE = (1/n) Σᵢ |x̂ᵢ - xᵢ| / xᵢ × 100    (5)

where n is the number of training points, i is the index of a given training point, xᵢ is the true concentration of the ith training data, and x̂ᵢ is the estimate of xᵢ using the standard curve.

CV = (1/m) Σⱼ ( std(x̂ⱼ) / mean(x̂ⱼ) ) × 100    (6)

where m is the number of concentrations, j is the index of a given concentration and x̂ⱼ is a vector of estimated concentrations for a given concentration indexed by j. The functions std(·) and mean(·) compute the standard deviation and mean of their vector arguments respectively.
Referring to the field of Statistics, this example also uses the "leave-one-out cross-validation" (LOOCV) error as a measure of stability and overall predictive performance. Stability refers to the predictive performance when training points are removed. The equation for calculating the LOOCV is given as:
LOOCV = (1/n) Σᵢ ( ‖ẑᵢ - zᵢ‖ / ‖zᵢ‖ ) × 100    (7)

where n is the number of training points, i is the index of a given training point, zᵢ is a vector of the true concentrations for all training points except the ith training point, and ẑᵢ is the estimate of zᵢ generated by the standard curve constructed without the ith training point.
In order for the optimisation algorithm for computing α to simultaneously minimise the three aforementioned measures, it is convenient to introduce a figure of merit, Q, to capture all of the desired properties. Therefore, Q is defined as the product of all three errors and can be used to heuristically compare the performance across quantification methods.
Q = RE × CV × LOOCV    (8)
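Minimal sketches of these error measures (equations (5)-(8)) follow; all helper names are illustrative, and predict_without(i) stands for any routine that re-trains the standard curve without the ith training point and returns estimates for the remaining points:

    import numpy as np

    def relative_error(x_true, x_est):
        # RE, eq. (5): mean absolute relative error over training points (%).
        x_true, x_est = np.asarray(x_true, float), np.asarray(x_est, float)
        return 100.0 * np.mean(np.abs(x_est - x_true) / x_true)

    def coefficient_of_variation(est_by_conc):
        # CV, eq. (6): std/mean of the (normalised) estimates at each
        # concentration, averaged over the m concentrations (%).
        return float(np.mean([100.0 * np.std(x) / np.mean(x)
                              for x in est_by_conc]))

    def loocv_error(x_true, predict_without):
        # LOOCV, eq. (7): leave each training point out in turn and
        # compare the re-trained estimates with the true values.
        x_true = np.asarray(x_true, float)
        errs = []
        for i in range(len(x_true)):
            z = np.delete(x_true, i)
            z_hat = np.asarray(predict_without(i), float)
            errs.append(100.0 * np.linalg.norm(z_hat - z) / np.linalg.norm(z))
        return float(np.mean(errs))

    # Figure of merit, eq. (8):
    # Q = relative_error(...) * coefficient_of_variation(...) * loocv_error(...)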
Example Fluorescence Datasets

Several DNA targets were used for qPCR amplification by way of example:
(i) Synthetic double-stranded DNA (gBlocks Gene Fragments, Integrated DNA Technologies) containing a phage lambda DNA sequence was used to construct and evaluate the standard curves (DNA concentration ranging from 10² to 10⁸ copies per reaction). See Appendix A.
(ii) Genomic DNA isolated from pure cultures of carbapenem-resistant (A) Klebsiella pneumoniae carrying blaOXA-48, (B) Escherichia coli carrying blaNDM, and (C) Klebsiella pneumoniae carrying blaKPC was used for the outlier detection experiments. See Appendix B.
(iii) Phage lambda DNA (New England Biolabs, Catalog #N3011S) was used for the primer variation experiment (final primer concentration ranging from 25 nM each to 850 nM each) and temperature variation experiments (annealing temperature ranging from 52°C to 72°C).
All oligonucleotides used in this example were synthesised by IDT (Integrated DNA Technologies, Germany) and are shown in Table 1. The specific PCR primers for lambda phage were designed in-house using Primer3 (http://biotools.umassmed.edu/bioapps/primer3_www.cgi), whereas the primer pairs used for the specific detection of carbapenem resistance genes were taken from Monteiro et al. 2012. Real-time PCR amplifications were conducted using FastStart Essential DNA Green Master (Roche) according to the manufacturer's instructions, with variable primer concentrations and a variable amount of DNA in a 5 μL final reaction volume. Thermocycling was performed using a LightCycler 96 (Roche), initiated by a 10 min incubation at 95°C, followed by 40 cycles of: 95°C for 20 sec; 62°C (for lambda) or 68°C (for carbapenem resistance genes) for 45 sec; and 72°C for 30 sec, with a single fluorescent reading taken at the end of each cycle. Each reaction combination, starting DNA and specific PCR amplification mix, was conducted in octuplicate. All runs were completed with a melting curve analysis to confirm the specificity of amplification and lack of primer dimer. The concentrations of all DNA solutions were determined using a Qubit 3.0 fluorometer (Life Technologies). Appropriate negative controls were included in each experiment.
Table 1. Specific PCR primers used in this example
Target name | Primer sequence (5'-3') | Amplicon size (bp)

[Table content provided as an image in the original document.]
Results
The following example results illustrate the aforementioned advantages of the proposed framework using an example instance of the method as described above. Given that there is a separation principle between quantification performance and insights in the feature space, this section is split into two parts: quantification performance and multidimensional analysis. The first part shows the results that arose from the two degrees of freedom introduced in advantages 1 and 2, and the latter explores advantages 3 and 4 regarding interesting observations in multidimensional space.
Figure 4 shows the multidimensional standard curve 130 and quantification using information from all features. In Figure 4a, a multidimensional standard curve 130 is constructed using Ct, Cy and -log10(F0) for lambda DNA with concentration values ranging from 10² to 10⁸ (top right to bottom left). Each concentration was repeated 8 times. The line fitting was achieved using principal component analysis. In Figure 4b, the quantification curves 150 were obtained by dimensionality reduction of the multidimensional standard curve using principal component regression.
Quantification Performance
In this example, synthetic double-stranded DNA was used to construct a multidimensional standard curve 130 and evaluate its quantification performance relative to single-feature methods. The resulting multidimensional standard curve 130, constructed using the features Ct, Cy and -log10(F0), is visualised in Figure 4a. The computed features and curve fitting parameters for each amplification curve, grouped by concentration ranging from 10² to 10⁸, are presented in Appendix C. Figure 4b shows the resulting unidimensional quantification curve 150 obtained after dimensionality reduction 116 through principal component regression. For comparison, the standard curves for the conventional examples are computed by projecting the multidimensional standard curve onto each feature, as listed in Appendix D.
In this example, the optimal feature weights, α, to control the contribution of each feature to quantification, after 20 iterations of the optimisation algorithm, converged to α = [1.6807, 1.0474, 0.0134], where the weights correspond to Ct, Cy and -log10(F0) respectively. This result is readily interpretable, and it suggests that -log10(F0) exhibits the poorest quantification performance amongst the three features, consistent with existing knowledge. It is important to stress again that although the weight of -log10(F0) is suppressed relative to the other features to improve quantification, there is still a lot of value in keeping it, as it can uncover trends in multidimensional space, as will become apparent later.
The performance measures and figure of merit, Q, for this particular instance of the proposed framework against the conventional instance are given in Table 2. A breakdown of each calculated error, grouped by concentration, is provided in Appendix D. It can be observed that Ct offers the smallest RE, i.e. accuracy, whereas M0 outperforms the other methods in CV and LOOCV, i.e. precision and overall prediction. In terms of the figure of merit, combining all of the errors, this arbitrary realisation of the framework enhanced quantification by 6.8%, 25.6% and 99.3% compared to Ct, Cy and -log10(F0) respectively.
Table 2. Performance measures for quantification methods used in this example along with a heuristic figure of merit, Q.

Method        RE (%)          CV (%)          LOOCV (%)      Fig. of Merit, Q
Ct            7.70 ± 5.87     0.97 ± 0.77     9.52 ± 8.20    71.1 ± 37.22
Cy            8.01 ± 6.5      1.11 ± 1.28     9.47 ± 8.61    84.6 ± 71.46
-log10(F0)    21.86 ± 7.50    7.76 ± 12.78    26.3 ± 9.39    4460 ± 903.08
M0            7.76 ± 6.06     0.90 ± 0.74     9.42 ± 8.34    65.8 ± 37.37

RE = relative error, CV = coefficient of variation, LOOCV = leave-one-out cross-validation.

Multidimensional Analysis
Given that the feature space is a new concept, there is room to explore what can be achieved. In this section the concept of distance in the feature space is explored and is demonstrated through an example of outlier detection. Furthermore, it is shown that in this example a pattern exists in the feature space when altering reaction conditions.
Figure 5 shows outliers in the feature space, specifically the multidimensional standard curve 130 for lambda DNA along with three carbapenemase outliers: blaOXA, blaNDM and blaKPC. On the right of Figure 5 is shown a zoomed view into the region of the feature space with the mean of the replicates and the projection of the outliers onto the standard curve.
In this example, genomic DNA carrying carbapenemase genes, namely blaOXA, blaNDM and blaKPC, is used to provide deliberate outliers for the multidimensional standard curve 130. Figure 5 shows the mean of the outliers in the feature space. The computed features and curve-fitting parameters for outlier amplification curves in this example are shown in Appendix E, and the specificity of the outliers is confirmed using a melting curve analysis, as presented in Appendix F and Figures 15a-15d. Given that the outlier test points do not lie exactly on the multidimensional standard curve 130, Figure 5 also shows the orthogonal projection of the mean of the outliers onto the multidimensional standard curve 130, as described in the proposed framework.
In order to fully capture the position of the outliers in the feature space, it is convenient to view the feature space along the axis of the multidimensional standard curve 130. This is possible by projecting data points in the feature space onto the plane perpendicular to the multidimensional standard curve 130 as illustrated in Figure 6a. The resulting projected points are shown in Figure 6b.
Figure 6 shows a multidimensional analysis using the feature space for clustering and detecting outliers. In particular, Figure 6a shows a multidimensional standard curve 130 using Ct, Cy and -log10(F0) for lambda DNA with concentration values ranging from 10² to 10⁸ (top right to bottom left). An arbitrary hyperplane orthogonal to the standard curve is shown in grey. Figure 6b shows a view of the feature space when all the data points have been projected onto the aforementioned hyperplane. The data points consist of training standard points and outliers corresponding to blaOXA, blaNDM and blaKPC. Errors corresponding to the Euclidean distance, e, from the multidimensional standard curve to the mean of the outliers are given by eOXA = 1.16, eNDM = 0.77 and eKPC = 1.41. The 99.9% confidence bound corresponding to a p-value of 0.001 is shown with a solid black line. Figure 6c shows a transformed space where the Euclidean distance, d, is equivalent to the Mahalanobis distance in the orthogonal view. The black circle corresponds to a p-value of 0.001.
It can be observed that all three outliers 601, 602, 603 can be clustered and clearly distinguished from the training data 610. Furthermore, in this example, the Euclidean distance, e, from the multidimensional standard curve 130 to the mean of the outliers is given by eOXA = 1.16, eNDM = 0.77 and eKPC = 1.41. Given that in this example the furthest training point from the multidimensional standard curve 130 in terms of Euclidean distance is 0.22, the ratios between eOXA, eNDM, eKPC and 0.22 are given by 5.27, 3.5 and 6.41 respectively. Therefore, this ratio can be used as a similarity measure, and the three clusters could be classified as outliers. However, this similarity measure has two implicit assumptions: (i) the data follows a uniform probability distribution, that is, a point twice as far is twice as likely to be an outlier; this assumption is typically made when there is not enough information to infer a distribution. (ii) Distances in different directions (e.g. along different axes) are equally likely. This is intuitively untrue in the feature space, because a change along one direction, e.g. Ct, does not impact the amplification curve as much as a change in another direction, e.g. -log10(F0). It is important to emphasise that directions in the feature space contain information regarding how much amplification kinetics change, and therefore direct comparisons between amplification reactions should be made along the same direction. This information is not captured in the aforementioned previous (unidimensional) data analysis.
In order to tackle the two aforementioned assumptions, the Mahalanobis distance, d, can be used. Clearly, by observing Figure 6b, the data predominantly varies in a given direction. The Mahalanobis distance can be computed directly using equation (4). In order to visualise the Mahalanobis distance, the orthogonal view of the feature space (Figure 6b) can be transformed into a new space ("Transformed space" in Figure 6c) wherein the Euclidean distance, e, is equivalent to the Mahalanobis distance, d, in the original space (i.e. the space illustrated in Figure 6b). It can be seen from Figure 6c that data in all directions are equiprobable, i.e. the training data 610 forms a circular distribution. The Mahalanobis distance, d, from the multidimensional standard curve 130 to the mean of the outliers 601, 602, 603 is given by dOXA = 12.65, dNDM = 18.87 and dKPC = 19.36. In comparison to the Euclidean distances, it is observed that when the distribution of the data is considered, the relative positions of the outliers change significantly. As an example, based on Euclidean distance, blaNDM 601 is the closest outlier, whereas using the Mahalanobis distance suggests blaOXA 603.
A useful property of the Mahalanobis distance is that its squared value follows a χ²-distribution if the data is approximately normally distributed. Therefore, the distance can be converted into a probability in order to capture the non-uniform distribution. Figure 7 shows a histogram of the Mahalanobis distance squared, d², for the entire training set, superimposed with a χ²-distribution with 2 degrees of freedom. In this example, based on the χ²-distribution table, any point further than about 3.717 is 99.9% likely (p-value < 0.001) to be an outlier. Figure 7 thus shows the data distribution, in terms of a histogram of the Mahalanobis distance squared of all training data points used in constructing the multidimensional standard curve, superimposed with a χ²-distribution with 2 degrees of freedom. Since all the outliers have a Mahalanobis distance significantly greater than about 3.717, they can be detected as outliers. Other distances (greater or smaller) can be chosen as a criterion for testing against the Mahalanobis distance, depending on the level of confidence required as to whether points are inliers or outliers. A distance of 3.717 has been illustrated since it corresponds to a probability of 99.9%, but distances corresponding to other probabilities such as 80%, 95% or 99% can also be chosen.
A second example multidimensional analysis (as shown in Figure 8) is concerned with observing patterns with respect to reaction conditions. Figure 8 shows patterns associated with changing reaction conditions. The multidimensional standard curve in all plots uses Ct, Cy and -log10(F0) for lambda DNA with concentration values ranging from 10² to 10⁸ copies/reaction (top right to bottom left). In Figure 8a, the magnified image shows the effect of changing the reaction temperature from 52°C to 72°C for lambda DNA at 5×10⁶ copies/reaction. In Figure 8b, the magnified image shows the effect of changing the primer mix concentration from 25 nM to 850 nM for each primer for lambda DNA at 5×10⁶ copies/reaction. In Figure 8c, the magnified image shows the individual training sample locations in the feature space for a given low concentration: 10² copies/reaction.
In the illustrated example, annealing temperature and primer mix concentration have been chosen to illustrate the idea. Specificity of the qPCR is not affected, as shown with melting curve analyses (see Appendix F and Figures 15a-15d). Figure 8a shows the effect of annealing temperature on the standard curve. Temperatures ranging from 52.0°C to 69.9°C only affect -log10(F0), whereas changes from 69.9°C to 72.0°C affect mostly Ct and Cy (see Appendix G). Similarly, Figure 8b shows there is a pattern associated with primer mix concentration: the variation from 25 to 850 nM for each primer is observed predominantly along the -log10(F0) direction (see Appendix H). Both experiments show that Ct and Cy are more robust to changes in annealing temperature and primer mix concentration, which is good for quantification performance. Furthermore, the patterns are observed in the feature space predominantly due to -log10(F0).
Based on this finding, the previous (unidimensional) way of proceeding would indicate the use of Ct or Cy for subsequent experiments. However, it has been realised that this implies a loss of the information contained in the patterns generated by -log10(F0). Therefore, the proposed multidimensional approach combines features that are beneficial for quantification performance and pattern recognition: preserving all information without compromising quantification performance.
Finally, a further interesting observation is that for low concentrations of nucleic acids there is a variation of training data points along the axis of the multidimensional standard curve 130, as seen in Figure 8c. Thus, it can be hypothesised that the variation is due to fluctuations in concentration as opposed to changes in reaction kinetics. There are two implications of this assumption: (i) all the points are inliers and thus likely to be specific, without the need for resource-consuming post-PCR analyses (specificity is confirmed using a melting curve analysis, as for example given in Appendix F); and (ii) the outcome of absolute quantification is based on 3 features as opposed to a single feature, which implies an increased confidence in the estimated target concentration.
Although the disclosed framework has been described as considering features that are linearly related to initial target concentration, that example design choice was made so as to reduce the complexity of the analysis; other features, such as non-linearly related features, can optionally be used.
Additionally, it will be noted that if two unrelated PCR reactions exhibit a perfectly symmetric sigmoidal amplification curve, their respective standard curves may potentially overlap, and thus a question arises as to whether sufficient information can be captured between amplification curves in order to distinguish them in the feature space. However, such an effect can be mitigated from a molecular perspective by tuning the chemistry in order to sufficiently change the amplification curves without compromising the performance of the reaction (e.g. speed, sensitivity, specificity, etc.).
Conclusion
In conclusion, this disclosure presents a versatile method, the multidimensional standard curve and its feature space, which enables techniques and advantages that were not previously realisable. It has been illustrated that an advantage of using multiple features is improved reliability of quantification. Furthermore, instead of trusting a single feature, e.g. Ct, other features such as Cy and -log10(F0) can be used to check whether a quantification result is similar. The previous, unidimensional way of thinking failed to consider multiple degrees of freedom and the resulting advantages that the versatile framework disclosed herein enables. There are thus four main capabilities that are enabled by the disclosed method:
(i) the ability to select multiple features and weight them based on quantification performance.
(ii) the flexibility of choosing an optimal mathematical method that maps multiple features into a single value representing target concentration. The first two capabilities lead to a separation principle which lower-bounds the quantification performance of the framework by that of the best single feature, while the insights and multidimensional analyses from the multiple features still remain. It is interesting to observe that, for the example dataset used in this proposed approach, the gold standard Ct method outperformed the other single features. This is an example of why there is a technical prejudice against using other features, since the outcome is data dependent. The disclosed framework offers a method of absolute quantification with a guaranteed quantification performance, without the need to select a specific feature. This disclosure shows that by using multiple features it is in fact possible to increase the quantification performance compared with the use of only single features.
(iii) enablement of applications such as outlier detection through the information gain captured by the elements of the feature space (e.g. distance measure, direction, distribution of data) that are typically meaningless or not considered in the previous unidimensional approach.
(iv) the ability to observe specific perturbations in reaction conditions as characteristic patterns in the feature space.

Example Application of the disclosed method
Absolute quantification of nucleic acids and multiplexed detection of several targets in a single reaction both have, in their own right, significant and extensive use in biomedical-related fields, especially in point-of-care applications. With previous approaches, the resources needed to detect several targets using qPCR scale linearly with the number of targets, making this an expensive and time-consuming feat. In the present disclosure, a method is presented, based on multidimensional standard curves, that extends the use of real-time PCR data obtained by common qPCR instruments. By applying the method disclosed herein, simultaneous single-channel multiplexing and robust quantification of multiple targets in a single well is achieved using only real-time amplification data (that is, using bacterial isolates from clinical samples in a single reaction without the need for post-PCR operations such as fluorescent probes, agarose gels, melting curve analysis, or sequencing analysis). Given the importance of and demand for tackling challenges in antimicrobial resistance, the proposed method is shown in this example to simultaneously quantify and multiplex four different carbapenemase genes: blaOXA-48, blaNDM, blaVIM and blaKPC, which account for 97% of the UK's reported carbapenemase-producing Enterobacteriaceae.
Quantitative detection of nucleic acids (DNA and RNA) is used for many applications in the biomedical field, including gene expression analysis, genetic disease predisposition, mutation detection and clinical diagnostics. One such application is in the screening of antibiotic resistance genes in bacteria: the emergence and spread of carbapenemase-producing Enterobacteriaceae (CPE) represents one of the most imminent threats to public health worldwide. Invasive infections with carbapenemase-resistant strains are associated with high mortality rates (up to 40–50%) and represent a major public health concern. Rapid and accurate screening for carriage of CPE is essential for successful infection prevention and control strategies, as well as for bed management. However, routine laboratory detection of CPE based on carbapenem susceptibility is challenging: (i) culture-based methods are convenient due to their ready availability and low cost, but their limited sensitivity and long turnaround time may not always be optimal for infection control practices; (ii) nucleic acid amplification techniques (NAATs), such as qPCR, provide fast results and added sensitivity and specificity compared with culture-based methods, but these methodologies are often too expensive and require sophisticated equipment to be used as a screening tool in healthcare systems; and (iii) multiplexed NAATs have significant sensitivity, cost and turnaround time advantages, increasing the throughput and reliability of results, but the biotechnology industry has been struggling to meet the increasing demand for high-level multiplexing using available technologies. There is thus an unmet clinical need for new molecular tools that can be successfully adopted within existing healthcare settings.
Currently, qPCR is the gold standard for rapid detection of CPE and other bacterial infections. This technique is based on fluorescence detection, allowing the kinetics of PCR amplification to be monitored in real time. Different methodologies are used to analyse qPCR data, with the cycle-threshold (Ct) method being the preferred approach for determining the absolute concentration of a specific target sequence. The Ct method assumes that the compared samples have similar PCR efficiency, and Ct is defined as the number of cycles in the log-linear region of the amplification at which there is a significant detectable increase in fluorescence. Alternative methods have been developed to quantify template nucleic acids, including standard curve methods, linear regression and non-linear regression models, but none of them allows simultaneous target discrimination. Multiplex analytical systems allow the detection of multiple nucleic acid targets in one assay and can provide the required speed for sample characterisation while still saving cost and resources. However, in a practical context, multiplex quantitative real-time PCR (qPCR) is limited by the number of detection channels of the real-time thermocycler and commonly relies on melting curve analysis, agarose gels or sequencing for target confirmation. These post-PCR processes increase diagnostic time, limit high-throughput application and can lead to amplicon contamination of laboratory environments. Therefore, there is an urgent need to develop simplified molecular tools which are sensitive, accurate and low-cost.
The disclosed method allows existing technologies to obtain the benefits of multiplex PCR whilst reducing the complexity of CPE screening, resulting in cost reduction. This is due to the fact that the proposed method: (i) enables multi-parameter imaging with a single fluorescent channel; (ii) is compatible with unmodified oligonucleotides; and (iii) does not require post-PCR processing. This is enabled through the use of multidimensional standard curves, which in this example are constructed using Ct, Cy and -log10(F0) features extracted from amplification curves. In this example, we show that the described methodology can be successfully applied to CPE screening. This provides a proof-of-concept that several nucleic acid targets can be multiplexed in a single channel using only real-time amplification data. It will be appreciated nevertheless that the disclosed method can be applied to the detection of any nucleic acid, and to the detection of any pathogenic or non-pathogenic genomic material.
This example application of the disclosed method, as described with reference to Figures 9 to 12 and 16, describes the methodology disclosed herein, applied to generate multidimensional standard curves (MSC) for simultaneous DNA quantification, multiplex target discrimination and outlier detection using only amplification curve shapes. Herein, we propose the MSC for simultaneous nucleic acid quantification, outlier detection and single-channel multiplexing, without requiring melting curve analysis or any other post-PCR manipulation. The methodology disclosed herein combines multiple features of the amplification curve that are linear in the target concentration, such as Ct, Cy and F0, to generate a characteristic fingerprint for each amplification curve. The fingerprint is then plotted in a multidimensional space to generate multivariate standard curves which provide enough information gain for simultaneous quantification, multiplexing and outlier detection. This method has been validated for the rapid screening of the four most clinically relevant carbapenemase genes (blaKPC, blaVIM, blaNDM and blaOXA-48) and has been shown to enhance quantification compared to the current state-of-the-art methods. The proposed method thus has the potential to deliver more comprehensive and actionable diagnostics, leading to improved patient care and reduced healthcare costs.
Figure 9 is an illustration of an example experimental workflow for single-channel multiplex quantitative PCR using the unidimensional and multidimensional analysis approaches. In this example, an unknown DNA sample is amplified by multiplex qPCR for targets 1, 2 and 3. Features such as α, β and γ are extracted from the amplification curve. It is important to stress that any number of targets and features could have been chosen.
In the example conventional unidimensional analysis shown at Figure 9 (A), three conventional standard curves are generated through serial dilution of the known targets using a single feature. For example, the threshold Ct is plotted against the log10 concentration of reference target 1, and a regression line fitting the data is generated to construct Standard 1 (Std 1). Relative values for target abundance in the unknown sample are extrapolated from the unidimensional standard. However, in single-channel qPCR multiplexing assays, the presence of multiple standard curves prevents the identification and quantification of the target within the unknown sample, since it is not possible to attribute a single feature to a specific standard curve. Therefore, post-PCR analyses (such as agarose gels, melting curves or sequencing) are required for target identification and quantification.
In the multidimensional analysis (B) disclosed herein, multidimensional standard curves and the feature space are used to simultaneously quantify and discriminate a target of interest solely based on the amplification curve, eliminating the need for expensive and time-consuming post-PCR manipulations. Similar to conventional standard curves, multidimensional standard curves are generated by using standard solutions with known concentrations under uniform experimental conditions. In this example, multiple features, α, β and γ, are extracted from each amplification curve and plotted against each other. Because each amplification curve has been reduced to three values, it can be represented as a single point in a 3D space (a greater or lesser number of dimensions can be used in embodiments). In this example, amplification curves from each concentration of a given target will thus generate three-dimensional clusters, which can be connected by high-dimensional line fitting to generate the target-specific multidimensional standard curves 130. The multidimensional space in which all the data points are contained is referred to as the feature space, and those data points can be projected onto an arbitrary hyperplane orthogonal to the standard curves for target classification and outlier detection. Unknown samples can be confidently classified through the use of clustering techniques, and enhanced quantification can be achieved by combining all the features into a unified feature called M0. It is important to stress that any number of targets and features could have been chosen; a three-plex assay and three features have been selected in this example to illustrate the concept in a comprehensive manner.
Example Primers and amplification reaction conditions
All oligonucleotides were synthesised by Integrated DNA Technologies (The Netherlands) with no additional purification. Primer names and sequences are shown in Table 3. Each amplification reaction was performed in a 5 μL final volume with 2.5 μL FastStart Essential DNA Green Master 2× concentrate (Roche Diagnostics, Germany), 1 μL PCR-grade water, 0.5 μL of 10× multiplex PCR primer mixture containing the four primer sets (5 μM each primer) and 1 μL of different concentrations of synthetic DNA or bacterial genomic DNA. PCR amplifications consisted of 10 min at 95°C followed by 45 cycles at 95°C for 20 sec, 68°C for 45 sec and 72°C for 30 sec. One melting cycle was performed at 95°C for 10 sec, 65°C for 60 sec and 97°C for 1 sec (continuous reading from 65°C to 97°C) for validation of the specificity of the products. Each experimental condition was run 5 to 8 times, loading the reactions into LightCycler 480 Multiwell Plates 96 (Roche Diagnostics, Germany) and utilising a LightCycler 96 Real-Time PCR System (Roche Diagnostics, Germany).
Table 3. Primers used for the CPE multiplex qPCR assay.
Target | Primer | Sequence | Size (bp)
[The rows of Table 3 are garbled in the text extraction and are reproduced as an image in the original document (Figure imgf000019_0001); the table lists the forward and reverse primers (e.g. OXA-48-F/OXA-48-R and NDM-F/NDM-R) for each of the four targets, together with their sequences and PCR product sizes in bp.]
Sequences are given in the 5' to 3' direction. Size denotes the size of the PCR amplification products.

Synthetic and genomic DNA samples
Four gBlock® Gene Fragments were purchased from Integrated DNA Technologies (The Netherlands) and resuspended in TE buffer to 10 ng/μL stock solutions (stored at -20°C). The synthetic templates contained the DNA sequences from the blaOXA, blaNDM, blaVIM and blaKPC genes required for the multiplex qPCR assay. Eleven pure cultures from clinical isolates were obtained (Table 4). One loop of colonies from each pure culture was suspended in 50 μL digestion buffer (Tris-HCl 10 mmol/L, EDTA 1 mmol/L, pH 8.0, containing 5 U/μL lysozyme) and incubated at 37°C for 30 min in a dry bath. 0.75 μL proteinase K at 20 μg/μL (Sigma) was subsequently added, and the solution was incubated at 56°C for 30 min. After boiling for 10 min, the samples were centrifuged at 10,000 × g for 5 min and the supernatant was transferred to a new tube and stored at -80°C before use. Bacterial isolates included non-CPE-producing Klebsiella pneumoniae and Escherichia coli as control strains.
Table 4. Samples used in this example.
Sample ID | Bacterial Isolate | Carbapenemase genes
[Rows 1–9 are reproduced as an image in the original document (Figure imgf000020_0001).]
10 | Klebsiella pneumoniae | non-producer
11 | Escherichia coli | non-producer
Example of the disclosed method
The data analysis for simultaneous quantification and multiplexing is achieved using the method previously described herein. There are therefore the following stages in data analysis: pre-processing 101, curve fitting 102, multi-feature extraction 113, high-dimensional line fitting 114, similarity measure (multidimensional analysis) 115 and dimensionality reduction 116.
Pre-processing 101: (optional) Background subtraction via baseline correction, in this example. This is accomplished by removing the mean of the first 5 fluorescence readings from each raw amplification curve.
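A minimal sketch of this pre-processing step, assuming the raw curve is held in a NumPy array (the function name is illustrative):

```python
import numpy as np

def subtract_baseline(raw_curve, n_background=5):
    """Background subtraction: remove the mean of the first few
    fluorescence readings from the raw amplification curve."""
    raw_curve = np.asarray(raw_curve, dtype=float)
    return raw_curve - raw_curve[:n_background].mean()
```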
Curve fitting 102: (optional) The 5-parameter sigmoid (Richards' curve) is fitted, in this example, to model the amplification curves:

F(x) = Fb + Fmax / (1 + e^(-(x - c)/b))^d

where x is the cycle number, F(x) is the fluorescence at cycle x, Fb is the background fluorescence, Fmax is the maximum fluorescence, c is the fractional cycle of the inflection point, b is related to the slope of the curve and d allows for an asymmetric shape (Richards' coefficient). The optimisation algorithm used in this example to fit the curve to the data is the trust-region method, based on the interior reflective Newton method. The lower and upper bounds for the 5 parameters, [Fb, Fmax, c, b, d], are given in this example as [-0.5, -0.5, 0, 0, 0.7] and [0.5, 0.5, 50, 100, 10] respectively.

Feature extraction 113: Three features are chosen in this example to construct the multidimensional standard curve: Ct, Cy and -log10(F0). The details of these features are not the focus of this disclosure. It will be appreciated that fewer, or a greater number of, features could be used in other examples.
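A sketch of the curve-fitting step (102) above, assuming SciPy's trust-region reflective least-squares solver as a stand-in for the interior reflective Newton method described; the initial guess p0 is an illustrative assumption, and the fitted parameters would subsequently feed the feature extraction:

```python
import numpy as np
from scipy.optimize import least_squares

def sigmoid5(params, x):
    """5-parameter sigmoid (Richards' curve)."""
    Fb, Fmax, c, b, d = params
    return Fb + Fmax / (1.0 + np.exp(-(x - c) / b)) ** d

def fit_amplification_curve(fluorescence):
    """Fit the sigmoid to one background-subtracted curve."""
    fluorescence = np.asarray(fluorescence, dtype=float)
    x = np.arange(1, len(fluorescence) + 1, dtype=float)
    lower = [-0.5, -0.5, 0.0, 0.0, 0.7]      # bounds from the example
    upper = [0.5, 0.5, 50.0, 100.0, 10.0]
    p0 = [0.0, 0.25, 25.0, 2.0, 1.0]         # assumed initial guess
    fit = least_squares(lambda p: sigmoid5(p, x) - fluorescence,
                        p0, bounds=(lower, upper), method="trf")
    return fit.x                              # [Fb, Fmax, c, b, d]
```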
Line fitting 114: The method of least squares is used for line fitting in this example, i.e. the fitted line is the first principal component from principal component analysis (PCA).
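A minimal line-fitting sketch under the same least-squares interpretation, where the direction of the first principal component is taken from an SVD of the centred feature points:

```python
import numpy as np

def fit_line(points):
    """Least-squares line through N-dimensional feature points:
    returns a point on the line (the mean) and a unit direction
    (the first principal component)."""
    mean = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - mean, full_matrices=False)
    return mean, vt[0]
```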
Similarity measure (multidimensional analysis) 115: The similarity measure used in this example is the Mahalanobis distance, d:
d = √((p - P)ᵀ Σ⁻¹ (p - P))

where p, P, q1 and q2 are given in equation (2), P being the projection of p onto the line through q1 and q2, and Σ is the covariance matrix of the training data used to approximate the distribution D.
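A sketch of this similarity measure, assuming p is an unknown feature point, q1 and q2 are two distinct points on the multidimensional standard curve, and cov is the training-data covariance (the helper name is illustrative):

```python
import numpy as np

def mahalanobis_to_curve(p, q1, q2, cov):
    """Mahalanobis distance between p and its projection P onto the
    line through q1 and q2 (the multidimensional standard curve)."""
    v = q2 - q1
    P = q1 + (np.dot(p - q1, v) / np.dot(v, v)) * v   # projection
    diff = p - P
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)
```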
Feature weights: In order to maximise quantification performance, different weights, α, can be assigned to each feature. In order to accomplish this, a simple optimisation algorithm can be implemented; equivalently, an error measure can be minimised. In this example, the error measure to minimise is the figure of merit described in the following subsection. The optimisation algorithm is the Nelder-Mead simplex algorithm (32,33), with weights initialised to unity, i.e. beginning with no assumption on how good the features are for quantification. This is a basic algorithm, and only 20 iterations are used to find the weights so that there is little computational overhead.
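A sketch of the weight optimisation, assuming `quantification_error` is a user-supplied callable implementing the figure of merit for a given weight vector (an assumption, since the figure of merit is defined elsewhere in this disclosure):

```python
import numpy as np
from scipy.optimize import minimize

def optimise_weights(quantification_error, n_features):
    """Nelder-Mead search for per-feature weights, initialised to
    unity and limited to a few iterations for low overhead."""
    w0 = np.ones(n_features)
    result = minimize(quantification_error, w0,
                      method="Nelder-Mead",
                      options={"maxiter": 20})
    return result.x
```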
Dimensionality reduction 116: Four dimensionality reduction techniques were used in order to compare their performance. The first three are simple projections onto each of the individual features, i.e. Ct, Cy and -log10(F0). The final method uses principal component regression to compute a feature termed M0 using a vector
p = [Ct, Cy, -log10(F0)]ᵀ

where [·]ᵀ denotes the transpose operator.
The general form for calculating M0 for an arbitrary number of features, as shown in equation (2), is given as:

M0 = Φ(p, q1, q2) = ((p - q1)ᵀ (q2 - q1)) / ((q2 - q1)ᵀ (q2 - q1))
where Φ computes the projection of the point p ∈ Rⁿ onto the multidimensional standard curve 130, and the points q1, q2 ∈ Rⁿ are any two distinct points that lie on the standard curve.
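In code, this projection reduces to a one-line computation; the sketch below assumes p, q1 and q2 are NumPy arrays of equal length:

```python
import numpy as np

def m0(p, q1, q2):
    """Scalar projection of feature point p onto the line through q1
    and q2, i.e. the unified feature M0 of equation (2)."""
    v = q2 - q1
    return np.dot(p - q1, v) / np.dot(v, v)
```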
Evaluation of the standard curves is performed as described in the general disclosure above.
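Once M0 has been computed for the training standards, absolute quantification reduces to a conventional unidimensional standard-curve regression; a sketch, assuming M0 is linearly related to log10 concentration as described above:

```python
import numpy as np

def quantify(m0_unknown, m0_train, log10_conc_train):
    """Estimate target concentration from M0 via linear regression
    of training M0 values against log10 concentration."""
    slope, intercept = np.polyfit(m0_train, log10_conc_train, 1)
    return 10.0 ** (slope * m0_unknown + intercept)
```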
Results
In this example, it is shown that simultaneous robust quantification and multiplexed detection of the blaOXA-48, blaNDM, blaVIM and blaKPC β-lactamase genes in bacterial isolates can be achieved by analysing the fluorescent amplification curves in qPCR using multidimensional standard curves. This section is broken into two parts: multiplexing and robust quantification. First, it is proven that single-channel multiplexing can be achieved, which is non-trivial and highly advantageous.
Target Discrimination using Multidimensional Analysis
Figure 11 shows four amplification curves and their respective derived melting curves specific for the blaOXA, blaNDM, blaVIM and blaKPC genes. The four curves have been chosen to have similar Ct (19.4 ± 0.5); thus each reaction has a different target DNA concentration. Using only this information, i.e. in a conventional technique, post-PCR processing such as melting curve analysis would be needed to differentiate the targets. The same argument applies when solely observing Cy or F0.
The multidimensional method disclosed herein shows that considering multiple features gives sufficient information gain to discriminate outliers from a specific target using a multidimensional standard curve 130. Taking advantage of this property, several multidimensional standard curves can be built in order to discriminate multiple specific targets. Figure 10 shows the multidimensional standard curves 130₁, 130₂, 130₃, 130₄, constructed using a single primer mix for the four target genes using Ct, Cy and -log10(F0). It is visually observed that the 4 standards are sufficiently distant in the multidimensional space to distinguish training samples. That is, an unknown DNA sample can potentially be classified as one of a number of specific targets (or an outlier) solely using the extracted features from amplification curves in a single channel.
In order to prove this, the 11 samples given in Table 4 were tested against the multidimensional standards 130₁, 130₂, 130₃, 130₄. The similarity measure used to classify the unknown samples is the Mahalanobis distance, using a p-value of 0.001 as the threshold. In order to fully capture the position of the outliers in the feature space, it is convenient to view the feature space along the axis of the multidimensional standard curves 130₁, 130₂, 130₃, 130₄. Melting curves are provided in Figure 11 to demonstrate that the real-time amplification curves belong to different qPCR products. Until the development of this methodology, it was not possible to associate an amplification curve with a specific assay using a single channel. Therefore, melting curves are used as a confirmation method.
Figure 12 shows the Mahalanobis space for the four standards in this example. This visualisation is constructed by projecting all data points onto an arbitrary hyperplane orthogonal to each standard curve, as described in the general method disclosed above. The first observation is that the training points (synthetic DNA) from each standard are clustered together in their respective Mahalanobis space with a p-value < 0.001. This corroborates the fact that there is sufficient information in the 3 chosen features to distinguish the 4 standard curves, capturing the amplification reaction kinetics.
Figure 12 uses the disclosed multidimensional analysis, using the feature space for clustering and classification of unknown samples. As previously described, for this example arbitrary hyperplanes orthogonal to each multidimensional standard curve have been used to project all the data points, including the replicates of each concentration for the four multidimensional standards (training standard points) and eight unknown samples (test points). Circular callouts are magnified to visualise the location of the samples relative to each standard of interest. The dark circular points within each magnified circular callout represent a standard of interest (5 to 8 replicates per concentration), which is placed by default at the centre (0,0) of the Mahalanobis space; dark grey asterisks represent the other standards; light grey asterisks represent the test points (3 replicates per sample); and the diamonds show the mean value for each sample. Each black circle corresponds to a p-value of 0.001.
The second observation is that the means of the test samples (bacterial isolates) which have a single resistance (samples 1–8) fall within the correct cluster (p-value < 0.001) of training points. Melting curve analysis was used to validate the results, as provided in the Appendices. The results from testing can be succinctly captured within a bar chart, as shown in Figure 16. It is, however, important to visualise the data in order to confirm that the Mahalanobis distance is a suitable similarity measure. When the training data points in the feature space are approximately normally distributed, the distribution of the training data points in the Mahalanobis space is approximately circular, as seen in Figure 6c. Figure 16, in this example, shows the average Mahalanobis distance from standard points to sample tests. The average distance between sample test points and the distribution of standard training points has been used to identify the presence of carbapenemase genes within the unknown samples. When the data is approximately normally distributed, the Mahalanobis distance can be converted into a probability. Sample test points with an average distance relative to the standard of interest smaller than about 3.717 can be classified within that cluster (p-value < about 0.001). Samples 1, 2 and 5 were classified within the blaOXA-48 cluster, samples 4 and 6 within the blaNDM cluster, samples 3 and 7 within the blaVIM cluster and sample 8 within the blaKPC cluster. Sample 9 does not belong to any of the clusters (p-value >= about 0.001). After DNA amplification, melting curve analysis of the samples was also performed in order to determine the specificity of the multiplex qPCR products. The melting curve analysis agrees well with the sample classification based on the Mahalanobis distance.
It can be observed that, using appropriate clustering techniques in each transformed space, it can be determined whether a point belongs to the target or not. Furthermore, if a probability is assigned to each data point, then samples can be classified reliably to a given standard whilst simultaneously being quantified. Given that the training data follow approximately a multivariate normal distribution, the squared Mahalanobis distance can provide a measure of probability.
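A sketch of this classification step, assuming each standard is summarised by two points on its multidimensional standard curve and the covariance of its training data (all names are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def classify(p, standards, p_threshold=1e-3):
    """Assign feature point p to the closest standard whose squared
    Mahalanobis distance lies inside the chi-squared acceptance
    region, otherwise flag it as an outlier. `standards` maps a
    target name to (q1, q2, cov)."""
    # projection plane orthogonal to the curve is (N-1)-dimensional
    cutoff2 = chi2.ppf(1 - p_threshold, df=len(p) - 1)
    best, best_d2 = "outlier", np.inf
    for name, (q1, q2, cov) in standards.items():
        v = q2 - q1
        proj = q1 + (np.dot(p - q1, v) / np.dot(v, v)) * v
        diff = p - proj
        d2 = diff @ np.linalg.inv(cov) @ diff
        if d2 < cutoff2 and d2 < best_d2:
            best, best_d2 = name, d2
    return best, float(np.sqrt(best_d2))
```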
Robust Quantification
Given that multiplexing has been established, quantification can be obtained using any conventional method, such as the gold standard cycle threshold, Ct. However, as shown in the general method disclosed herein, enhanced quantification can be achieved using a feature, M0, that combines all of the features for optimal absolute quantification. The measure of optimality in this study is a figure of merit that combines accuracy, precision, robustness and overall predictive power, as shown in equation X. Table 5 shows the figure of merit for the 3 chosen features (Ct, Cy and -log10(F0)) and for M0 as used in this example. The percentage improvement is also shown. It can be observed that quantification is always improved compared to the best single feature. The improvement is 30.69%, 14.39%, 2.12% and 35.00% for blaOXA-48, blaNDM, blaVIM and blaKPC respectively. This is a result of the multidimensional framework. It is further interesting to observe that, amongst the conventional methods, there is no single method that performs best for all the targets. Thus, M0 is the most robust method in the sense that it will always be the best performing method.
Table 5. Figure of merit comparing conventional features with M0 for absolute quantification.

Feature | blaOXA-48 | blaNDM | blaVIM | blaKPC
Ct | 2.71e+09 | 1.21e+08 | 2.45e+07 | 2.43e+09
Cy | 2.12e+09 | 8.88e+07 | 9.74e+07 | 1.31e+09
F0* | 1.05e+10 | 1.98e+09 | 2.28e+09 | 2.17e+10
M0 | 1.47e+09 | 7.60e+07 | 2.40e+07 | 8.53e+08
% Imp. | 30.69 | 14.39 | 2.12 | 35.00

% Imp. = percentage improvement of M0 over the next best method (both marked in bold in the original table).
* The figure of merit values are calculated using -log10(F0).
Appendix A
Nucleotide sequence for synthetic double-stranded DNA ordered from Integrated DNA Technologies containing the lambda phage DNA target.
Forward lambda PCR primer in bold and reverse lambda primer in italics.
[The nucleotide sequence is reproduced as an image in the original document (Figure imgf000023_0002).]
Appendix B
Template preparation from bacterial isolates for real-time PCR assays.
One loop of colonies from the pure culture was suspended in 50 μL digestion buffer (Tris-HCl 10 mmol/L, EDTA 1 mmol/L, pH 8.0, containing 5 U/μL lysozyme) and incubated at 37°C for 30 min in a dry bath. 0.75 μL proteinase K at 20 μg/μL (Sigma) was subsequently added, and the solution was incubated at 56°C for 30 min. After boiling for 10 min, the samples were centrifuged at 10,000 × g for 5 min and the supernatant was transferred to a new tube and stored at -80°C before use.
Appendix C
Experimental values for construction of lambda DNA standard.
A 242 bp double-stranded DNA molecule (gBlock Gene Fragment, IDT) containing the desired target sequence was used to build the standard curves. Each condition was run in octuplicate.

[The experimental values are reproduced as images in the original document (Figures imgf000024_0001 and imgf000025_0001).]
Appendix D
[Appendix D tabulates per-replicate relative error (RE, %) and coefficient of variation (CV) across seven concentrations (1.00E+08 down to 1.00E+02 copies/reaction; eight replicates per concentration) for four feature blocks whose labels are not recoverable from the text extraction. The recoverable summary values are: average RE of 7.70, 8.01, 21.86 and 7.76, and average CV of 0.97, 1.11, 7.76 and 0.90, for the four blocks respectively.]
Appendix E
Experimental values for outlier detection experiment.
Genomic DNA extracted from pure bacterial cultures. All targets at 1.00E+05 gDNA copies per reaction. Each condition run in octuplicate.
[The experimental values are reproduced as an image in the original document (Figure imgf000028_0001).]
Appendix F
Melting curve analysis for lambda DNA standard experiment, as shown in Figure 15a: This figure shows average melting curve peaks for the synthetic lambda DNA standard experiments using the 242 bp double-stranded DNA molecule (gBlock Gene Fragment ordered from IDT) and in-house lambda primers. Ten-fold dilutions from 10⁸ to 10¹ copies per reaction were used in this experiment, with 8 reactions per tested concentration. The average melting curve peak was 80.49°C (SD = 0.08°C) for all positive reactions, and no secondary melting event was observed at other annealing temperatures.
Melting curve analysis for outlier detection experiment, as shown in Figure 15b: This figure shows average melting curve peaks of 80.66°C (SD = 0.07°C) for blaOXA-48, 83.97°C (SD = 0.10°C) for blaNDM and 90.76°C (SD = 0.10°C) for blaKPC. Octuplicate reactions per gDNA sample were performed, with 10⁶ genomic copies per reaction. No secondary melting event was observed at other annealing temperatures. Specific primer sets were selected from Monteiro et al. 2012.
Melting curve analysis for primer concentration variation experiment, as shown in Figure 15c: This figure shows average melting curve peaks for primer concentration experiments using phage lambda DNA and in-house lambda primers. Observed average melting curve peaks for the tested primer concentrations are: 80.18°C (SD = 0.09°C) for 25 nM; 80.10°C (SD = 0.09°C) for 100 nM; 80.18°C (SD = 0.04°C) for 175 nM; 80.13°C (SD = 0.11°C) for 250 nM; 80.21°C (SD = 0.21°C) for 325 nM; 80.34°C (SD = 0.06°C) for 400 nM; 80.46°C (SD = 0.08°C) for 475 nM; 80.50°C (SD = 0.09°C) for 550 nM; 80.63°C (SD = 0.09°C) for 625 nM; 80.66°C (SD = 0.07°C) for 700 nM; 80.73°C (SD = 0.06°C) for 775 nM; and 80.87°C (SD = 0.07°C) for 850 nM. Octuplicate reactions per primer concentration were performed. No secondary melting event was observed at other annealing temperatures.
Melting curve analysis for temperature variation experiment, as shown in Figure 15d:
This figure shows average melting curve peaks for temperature variation experiments using phage lambda DNA and in-house primers. Observed average melting curve peaks for the tested temperatures are: 80.53°C (SD = 0.10°C) for 52.0°C; 80.52°C (SD = 0.13°C) for 53.0°C; 80.48°C (SD = 0.03°C) for 54.9°C; 80.53°C (SD = 0.07°C) for 57.3°C; 80.53°C (SD = 0.06°C) for 59.9°C; 80.43°C (SD = 0.17°C) for 62.7°C; 80.51°C (SD = 0.09°C) for 65.4°C; 80.51°C (SD = 0.09°C) for 67.8°C; 80.47°C (SD = 0.13°C) for 69.9°C; 80.35°C (SD = 0.09°C) for 71.3°C; 80.35°C (SD = 0.08°C) for 71.9°C; and 80.36°C (SD = 0.08°C) for 72.0°C. Octuplicate reactions per tested temperature were performed. No secondary melting event was observed at other annealing temperatures.
Appendix G
Experimental values for temperature variation experiment.
Lambda DNA as target (NEB, Catalog #N3011S), 10⁶ genomic copies per reaction. Temperature in Celsius. Each experimental condition was run in octuplicate.
[The experimental values are reproduced as images in the original document (Figures imgf000029_0001, imgf000030_0001 and imgf000031_0001).]
Appendix H
Experimental values for primer concentration variation experiment.
Lambda DNA as target (NEB, Catalog #N3011S), 10⁶ genomic copies per reaction. Primer concentration in nanomolar (nM), ranging from 25 to 850 nM for each primer. Each experimental condition was run in octuplicate.
[The experimental values are reproduced as images in the original document (Figures imgf000031_0002, imgf000032_0001 and imgf000033_0001).]
Advantages and technical effects of aspects and embodiments, including those mentioned above, will be apparent to a skilled person from the foregoing description and from the Figures. It will be appreciated that the described methods can be carried out by one or more computers under control of one or more computer programs arranged to carry out said methods, said computer programs being stored in one or more memories and/or other kinds of computer-readable media.
Figure 13 shows an example of a computer system 1300 which can be used to implement the methods described herein, said computer system 1300 comprising one or more servers 1310, one or more databases 1320, and one or more computing devices 1330, said servers 1310, databases 1320 and computing devices 1330 communicatively coupled with each other by a computer network 1340. The network 1340 may comprise one or more of any kinds of computer network suitable for transmitting or communicating data, for example a local area network, a wide area network, a metropolitan area network, the internet, a wireless communications network 1350, a cable network, a digital broadcast network, a satellite communication network, a telephone network, etc. The computing devices 1330 may be mobile devices, personal computers, or other server computers. Data may also be communicated via a physical computer-readable medium (such as a memory stick, CD, DVD, BluRay disc, etc.), in which case all or part of the network may be omitted.
Each of the one or more servers 1310 and/or computing devices 1330 may operate under control of one or more computer programs arranged to carry out all or a subset of method steps described with reference to any embodiment, thereby interacting with another of the one or more servers 1310 and/or computing devices 1330 so as to collectively carry out the described method steps in conjunction with the one or more databases 1320.
Referring to Figure 14, each of the one or more servers 1310 and/or computing devices 1330 in Figure 13 may comprise features as shown therein by way of example. The shown computer system 1400 comprises a processor 1410, memory 1420, computer-readable storage medium 1430, output interface 1440, input interface 1450 and network interface 1460, which can communicate with each other by virtue of one or more data buses 1470. It will be appreciated that one or more of these features may be omitted, depending on the required functionality of said system, and that other computer systems having fewer or additional/alternative components can be used instead, subject to the functionality required for implementing the described methods/systems.
The computer-readable storage medium may be any form of non-volatile and/or non-transitory data storage device such as a magnetic disk (such as a hard drive or a floppy disc) or optical disk (such as a CD-ROM, a DVD-ROM or a BluRay disc), or a memory device (e.g. a ROM, RAM, EEPROM, EPROM, Flash memory or portable/removable memory device) etc., and may store data, application program instructions according to one or more embodiments of the disclosure herein, and/or an operating system. The storage medium may be local to the processor, or may be accessed via a computer network or bus.
The processor may be any apparatus capable of carrying out method steps according to embodiments, and may for example comprise a single data processing unit or multiple data processing units operating in parallel or in cooperation with each other, or may be implemented as a programmable logic array, graphics processor, or digital signal processor, or a combination thereof.
The input interface is arranged to receive input from a user and provide it to the processor, and may comprise, for example, a mouse (or other pointing device), a keyboard and/or a touchscreen device.
The output interface optionally provides a visual, tactile and/or audible output to a user of the system, under control of the processor.
Finally, the network interface provides for the computer to send/receive data over one or more data communication networks.
Embodiments may be carried out on any suitable computing or data processing device, such as a server computer, personal computer, mobile smartphone, set top box, smart television, etc. Such a computing device may contain a suitable operating system such as UNIX, Windows (RTM) or Linux, for example. It will be appreciated that the above-described partitioning of functionality can be altered without affecting the functionality of the methods and systems, or their advantages/technical effects. The above-described functional partitioning is presented as an example in order that the invention can be understood, and is thus conceptual rather than limiting, the invention being defined by the appended claims. The skilled person will also appreciate that the described method steps may be combined or carried out in a different order without affecting the advantages and technical effects resulting from the invention as defined in the claims.
It will be further appreciated that the described functionality can be implemented as hardware (for example, using field programmable gate arrays, ASICs or other hardware logic), firmware and/or software modules, or as a mixture of those modules. It will also be appreciated that a computer-readable storage medium and/or a transmission medium (such as a communications signal, data broadcast, or communications link between two or more computers), carrying a computer program arranged to implement one or more aspects of the invention, may embody aspects of the invention. The term "computer program", as used herein, refers to a sequence of instructions designed for execution on a computer system, and may include source or object code, one or more functions, modules, executable applications, applets, servlets, libraries, and/or other instructions that are executable by a computer processor.
It will be further appreciated that the set of first data (training data) and second data (unknown sample data) can be obtained via the above-mentioned networked computer system components, such as by being retrieved from storage or being inputted by a user via an input device. Results data, such as inlier/outlier determinations and determined sample concentrations, can also be stored using the aforementioned storage elements, and/or outputted to a display or other output device. The multidimensional standard curve 130 and/or the standard curve defined by the unidimensional function can also be stored using such storage elements. The aforementioned processor can process such stored and inputted data, as described herein, and store/output the results accordingly.
As will be appreciated by the skilled person, details of the above embodiment may be varied without departing from the scope of the present invention as defined by the appended claims. Many combinations, modifications, or alterations to the features of the above embodiments will be readily apparent to the skilled person and are intended to form part of the disclosure. Any of the features described specifically relating to one embodiment or example may be used in any other embodiment by making appropriate changes as apparent to the skilled person in the light of the above disclosure.

Claims

1. A method for use in quantifying a sample comprising a target nucleic acid, the method comprising:
obtaining a set of first real-time amplification data for each of a plurality of target concentrations;
extracting a plurality of N features from the set of first data, wherein each feature relates the set of first data to the concentration of the target; and
fitting a line to a plurality of points defined in an N-dimensional space by the features, each point relating to one of the plurality of target concentrations, wherein the line defines a multidimensional standard curve specific to the nucleic acid target which can be used for quantification of target concentration.
2. The method of any preceding claim, further comprising:
obtaining second real-time amplification data relating to an unknown sample; extracting a corresponding plurality of N features from the second data; and calculating a distance measure between the line in N-dimensional space and a point defined in N-dimensional space by the corresponding plurality of N features.
3. The method of claim 2, further comprising computing a similarity measure between amplification curves from the distance measure, and optionally further comprising identifying outliers or classifying targets from the similarity measure.
4. The method of any preceding claim, wherein each feature is different to each of the other features, and optionally wherein each feature is linearly related to the concentration of the target, and optionally wherein one or more of the features comprises one of Ct, Cy and -log10(F0).
5. The method of any preceding claim, further comprising mapping the line in N-dimensional space to a unidimensional function, M0, which is related to target concentration, and optionally wherein the unidimensional function is linearly related to target concentration, and/or optionally wherein the unidimensional function defines a standard curve for quantifying target concentration.
6. The method of claim 5 wherein the mapping is performed using a dimensionality reduction technique, and optionally wherein the dimensionality reduction technique comprises at least one of: principal component analysis; random sample consensus; partial least-squares regression; and projecting onto a single feature.
7. The method of claim 5 or 6 wherein the mapping comprises applying a respective scalar feature weight to each of the features, and optionally wherein the respective feature weights are determined by an optimisation algorithm which optimises an objective function, and optionally wherein the objective function is arranged for optimisation of quantification performance.
8. The method of any of claims 2 to 7 wherein calculating the distance measure comprises projecting the point in N-dimensional space onto a plane which is normal to the line in N-dimensional space, and optionally wherein calculating the distance measure further comprises calculating, based on the projected point, a Euclidean distance and/or a Mahalanobis distance.
9. The method of claim 8, further comprising calculating a similarity measure based on the distance measure, and optionally wherein calculating a similarity measure comprises applying a threshold to the similarity measure.
10. The method of claim 9, further comprising determining whether the point in N- dimensional space is an inlier or an outlier based on the similarity measure.
11. The method of claim 10, comprising: if the point in N-dimensional space is determined to be an outlier then excluding the point from training data upon which the step of fitting a line to a plurality of points defined in N-dimensional space is based, and if the point in N-dimensional space is not determined to be an outlier then re-fitting the line in N-dimensional space based additionally on the point in N-dimensional space.
12. The method of any of claims 2 to 11, further comprising determining a target concentration based on the multidimensional standard curve, and optionally further based on the distance measure, and optionally, when dependent on claim 5, based on the unidimensional function which defines the standard curve.
13. The method of claim 12, further including displaying the target concentration on a display.
14. The method of any preceding claim wherein the method further comprises a step of fitting a curve to the set of first data, wherein the feature extraction is based on the curve-fitted first data, and optionally wherein the curve fitting is performed using one or more of a 5-parameter sigmoid, an exponential model, and linear interpolation, and optionally wherein the set of first data relating to the melting temperatures is pre-processed, and the curve fitting is carried out on the processed set of first data, and optionally wherein the pre-processing comprises one or more of: subtracting a baseline; and normalisation.
15. The method of any preceding claim wherein the data relating to the melting temperature is derived from one or more physical measurements taken versus sample temperature, and optionally wherein the one or more physical measurements comprise fluorescence readings.
16. The method of any preceding claim, used for single-channel multiplexing without post-PCR manipulations.
17. The method of any preceding claim implemented using at least one processor and/or using at least one integrated circuit.
18. A system comprising at least one processor and/or at least one integrated circuit, the system arranged to carry out a method according to any preceding claim.
19. A computer program comprising instructions which, when executed by one or more processors, cause the one or more processors to perform a method according to any one of claims 1 to 16.
20. A computer-readable medium storing instructions which when executed by at least one processor, cause the at least one processor to carry out a method according to any one of claims 1 to 16.
21. The method of any one of claims 1 to 16, used for detection of genomic material.
22. The method of claim 21 wherein the genomic material comprises one or more pathogens.
23. A method for diagnosis of an infection by detection of one or more pathogens according to the method of any of claims 1 to 16.
24. A method for point-of-care diagnosis of an infectious disease by detection of one or more pathogens according to the method of any of claims 1 to 16.
25. The method of any of claims 22 to 24 wherein the pathogens comprise one or more carbapenemase-producing enterobacteria, and optionally wherein the pathogens comprise one or more carbapenemase genes from the set comprising blaOXA-48, blaVIM, blaNDM and blaKPC.