CA2416611A1 - Methods and systems for evaluating a matching between measured data and standards - Google Patents

Methods and systems for evaluating a matching between measured data and standards Download PDF

Info

Publication number
CA2416611A1
CA2416611A1 CA002416611A CA2416611A CA2416611A1 CA 2416611 A1 CA2416611 A1 CA 2416611A1 CA 002416611 A CA002416611 A CA 002416611A CA 2416611 A CA2416611 A CA 2416611A CA 2416611 A1 CA2416611 A1 CA 2416611A1
Authority
CA
Canada
Prior art keywords
matching
standard
measured
sequence
edges
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002416611A
Other languages
French (fr)
Inventor
Heinz Breu
Hugh J. Pasika
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Applied Biosystems Inc
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of CA2416611A1 publication Critical patent/CA2416611A1/en
Abandoned legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N27/00Investigating or analysing materials by the use of electric, electrochemical, or magnetic means
    • G01N27/26Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating electrochemical variables; by using electrolysis or electrophoresis
    • G01N27/416Systems
    • G01N27/447Systems using electrophoresis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02Column chromatography
    • G01N30/86Signal analysis
    • G01N30/8675Evaluation, i.e. decoding of the signal into analytical information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Biotechnology (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Analytical Chemistry (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrochemistry (AREA)
  • Algebra (AREA)
  • Chemical Kinetics & Catalysis (AREA)

Abstract

The present invention proposes an approach to matching measured data points representative of a physical property (for instance the measured peak location or positions of an in-lane size standard which represent DNA fragment length) to a set of standard values of that property (for instance a size standard definition). The matching is evaluated by comparing ratios of intervals between selected pairs of the measured data points with ratios of the intervals between adjacent pairs of the standard values, to find the measured data points that best match the standard values.

Description

Methods and Systems for Evaluating a Matching Between Measured Data and Standards Field of the Invention The present invention relates to methods and systems for evaluating a matching between data, in particular a sequence of measured data points, and a pre-defined standard. It has specific, although not necessarily exclusive application in the field of evaluating biological data.
Background When studying a substance or device for example, it is often the case that one or more physical properties of the sample or device are not directly measurable or at least cannot be measured easily.
In order to evaluate such a property an alternative, indirect measure representative of the property must be found.
This problem can be illustrated by taking the biological example of evaluating the size (i.e. length) of DNA fragments (measured in base pairs for instance). This is not a property that can be measured directly. The alternative, indirect measure of fragment size most commonly used is a measure of the rate at which the fragments are transported through a medium in which their mobility varies with size. Typically this is achieved by electrophoresis through Lanes in a polyacrylmide gel, the details of which will be well known to the skilled person and need not be repeated here.
With this approach, the location of the fragments along the lane, or the point in time at which a particular fragment passes a set point along the lane (typically identified by peaks on an electropherogram), provides an indirect measure of the fragment size.
However, since in any particular run the migration of the fragments through the gel will be affected by factors other than just the size of the fragment, on its own this provides only a measure of the relative sizes of the fragments to one another. What it does not provide is any absolute values for fragment size.

To enable the determination of absolute fragment size, it is possible to run an in-lane size standard along with the fragments being evaluated. This in-lane size standard is a kind of "ruler"
made up of a set of fragments of known size (i.e. length). If the location of these size standard fragments can be accurately determined, it is then possible to evaluate the size of an unknown _ fragment by comparing its location with those of the size standard and interpolating from the known sizes of the size standard fragments.
For this approach to work, it is important that the location of the size standard fragments can be accurately determined. That is to say, one has to match peak locations in the measured results to each of the known fragment sizes of the size standard definition. The presence of artefacts in the results, giving false peaks, often means that this is not a straightforward task. Embodiments of the present invention are concerned with providing efficient approaches to carrying out and evaluating such matching processes.
Although the background to the invention has been described above with reference to evaluating the size of DNA fragments, it will be appreciated that similar problems can be encountered when evaluating the properties of other biological or non-biological samples or devices. The invention has applicability more generally, therefore, in any case where there is a need to match specific points within a measured data set representing a set of standard values of a property to the standard values themselves. The skilled person will be well aware of many exemplary cases where the approach set out below can be usefully applied, including but not limited to applications in the filed of telecommunications, image processing, pattern recognition and the like.
Summary of the Invention The present invention proposes an approach to matching measured data points representative of a physical property (for instance the measured peak location or positions of an in-lane size standard which represent DNA fragment length) to a set of standard values of that property (for instance a size standard definition). The matching is evaluated by comparing ratios of intervals between selected.pairs of the measured data points with ratios of the intervals between adjacent pairs of the standard values, to find the measured data points that best match the standard values.
Other preferred features and aspects of the invention are set out in the appended claims.
Brief Description of the Drawing The invention is described below, by way of example, with reference to the accompanying drawings in which:
Fig. 1 is a flow diagram illustrating the basis matching process according to an embodiment of the present invention;
Fig. 2 Illustrates Example peaks, sizes and a matching;
Fig. 3 shows execution time for the matching process against number of values in the size standard and additional peaks;
Fig. 4 shows execution time versus the number of values in the size standard definition for different numbers of additional peaks;
Fig. 5 shows execution time versus the number of extra peaks for a series of size standard definitions having differing numbers of size values;
Fig. 6, 7, 8 and 9 show exemplary results from matchings;
Fig. 10 is a table of execution times versus the number of size value in the size standard definition and the number of additional peaks; and Fig. 11 is a block diagram illustrating a suitable system architecture for operating embodiments of the invention.
Description of Embodiments Embodiments of the invention are described below in the context of the exemplary application of the described approach to the problem of matching an in-line size standard to a size standard definition for the evaluation of nucleic acid information obtained from an electrophoresis procedure. It will be appreciated that the approach has wider applicability In overview, the method operates by matching data generated with a standard sample to actual sizes that should exist in the standard sample. For example, one may use a standard sample with nucleotide lengths 110, 114, 117, 120, and 125. One runs the standard sample and obtains several data peaks. The size standard matching component predicts the peaks that correspond to the five known nucleotide lengths. Thus, one can subsequently compare data in a sample to those peaks to determine the nucleotide lengths of fragments in a sample.
Referring to Fig. 1, an embodiment of the process operates by first selecting a sequence of peaks from those obtained by running the standard sample, the number of peaks selected being equal to the number of values in the size standard,definition. One peak is matched to each of the size standard sizes in sequence. The matching this produces is then evaluated to determine the total cost for the matching. This process is repeated for a number of different sequences of peaks to establish which sequence of the possible variations from the peaks obtained from running the standard sample provides the best matching.
In certain embodiments, a matching component uses an algorithm that has three parameters: ratio factor (the importance of peak height vs. the importance of the local linearity), min acceptable quality (used for ending dynamic programming iteration), and number of extra peaks (the number of peaks participated in size matching is the number of size standard definition fragments plus the number of extra peaks). In certain embodiments, the component fixes the ratio factor to 0.6 and min acceptable quality to 0.75. In certain embodiments, the component fixes the number of extra peaks to 10 for Applied Biosystems 310/377 instrument data and 25 for Applied Biosystems 3700 instrument data.
In certain embodiments, a statistically based quality value is generated for the matching result.
In certain embodiments, one skilled in the art will be able to adjust the number of extra peaks that may be used with a given instrument.
In certain embodiments, the component ignores the peaks located within the offseale regions in the sample.
Tn certain embodiments, the component fails the size matching process if the size standard definitions are not fully matched in the matching process.
In certain embodiments, the component implements two primer peak detection methods. The first is the primer-peak-height-suppression method. This method replaces the peak heights of the highest peaks with the peak height of the middle peak, assuming that the primer peaks are among the highest. The second is to find the primer peak location. The method assumes that the primer peak locates within the first half of the signal and the size standard fragments locate in the second half of the signal. For example, one takes the mean peak height of all the peaks in the second half and multiples that mean by five to get the potential primer peak height.
The method works backwards in the first half of the signal to find the last primer peak.
In certain embodiments, a size-standard matcher takes as input a list of peaks (e.g., from an electropherogram) and a list of fragment sizes (e.g., in nucleotides). It produces as output a matching, that is, a list of pairs of the form <peak,size>, where each peak and each fragment size appears at most once. In certain embodiments, a size-standard mateher evaluates a matching, and uses an algorithm for finding good matchings.
Certain embodiments employ an algorithm that evaluates a matching by treating its two constituent sequences as sequences of edges between points. A matching is also a correspondence between edges. Two edges, e1 and e2, that share an endpoint define a ratio of lengths r = ~e2~/~el~. Again, a matching is also a correspondence between ratios. Under the assumption that the relation between peak position and fragment size is "more or less" linear, corresponding ratios typically should be equal. In certain embodiments, the component derives a ratio cost to measure this property. In certain embodiments, the component also concentrates on big peaks by deriving a height cost. The total cost of a matching is a weighted sum of these constituent costs, In certain embodiments, the component formulates the size standard matching problem as finding a matching with maximum cost.
In such embodiments, the cost is separable. That is, with some additional mathematics, the component can maximise subsequences independently. In certain embodiments, the cost also enjoys the advantage of being local, thereby compensating for global deviations from linearity. This cost also leads to a quality value between 0 and 1.
A size standard is a set of DNA fragments, each of known size.
The definition of a size standard is simply a list of these sizes.
Note that a size-standard definition typically does not depend on the instrument on which the size standard is run, and therefore not on any particular set of run conditions either.
An in-lane size standard is a set of peaks resulting from running a size standard on an instrument. One determines the positions and the heights of the peaks.
A size-standard matcher takes as input an in-lane size standard and a size standard definition. It produces as output a matching, that is, a list of pairs of the form (peak,size), where each peak and each fragment size appears at most once. A peak has a position (e. g., in scan numbers) and a height (e. g., in fluorescent units).
Fragment sizes are given in nucleotides.
Assume that there are at least as many peaks as sizes.
Furthermore, assume that every size has a corresponding peak, except possibly for some small number from the end of this list. This exception is meant to model the situation where a user may have stopped electrophoresis early, before the larger fragments have had a chance to elute.
In certain embodiments, one employs the following components.
Let P=lpo,p"...,j7~r_~) be a list of nP peak locations, given by increasing scan number, for example. Let H=~j2o,12~...,ltn -~ J be a list v of the corresponding nP peak heights, given in fluorescent units, for example. The size standard definition S=LSO,s~,...,s,~-~, is a list of s ns fragment sizes in increasing nucleotides. By assumption, nP Z ns.
A size-standard matchin is a set of airs f g p M =~~io~0)~~ii~l)~..., (i»,~n,.) where the i~ are in increasing order, that is, where subscript j < k implies i~ < ik.
Example 1 Peaks, Sizes, and a Matching Consider the peaks, sizes, and matching displayed in Figure 2.
The list P contains nP = 11 peak positions:
P = [968, 1029, 1203, 1259, 1412, 1535, 1714, 1751, 1785, 1837, 1928] .
These np peaks have heights H:
H = [2722, 6219, 1060, 5380, 7726, 1082, 7424, 1263, 7335, 7937, 1562] .
The size standard definition has ns = 5 sizes S:
S = [75, 100, 139, 150, 160] .
Finally, M is the matching shown in the figure:
M = { (3, 0) , (4, 1) , (6, 2) , (8, 3) , (9, 4) } .
In certain embodiments, big-Oh notation is used to express algorithm complexity. This notation is ubiquitous in worst-case and average-case resource analysis. Briefly, a function f is said to be on the order of another function g (written f(x) = O(g(x)) if there exists positive constants c and N such that ~f(x)~ _< c~g(x)~ for all x >- N.
Evaluating a Matching Assume there is a matching. In certain embodiments,, the matcher component evaluates a matching by examining its two constituent sequences. It treats the sequence of peaks as a sequence of edges between peaks, and similarly for the sizes. For example, M
- ~(3, 0), (4, 1), (6, 2), (8, 3), (9, 4)} is the matching from Example 1. Its peak (index) sequence is [3, 4, 6, 8, 9], which has four edges, (3, 4), (4, 6), (6, 8), and (8, 9). Similarly, its fragment size definition (index) sequence is [0, 1, 2, 3, 4], which also has four edges: (0, 1), (1, 2), (2, 3), and (3, 4).
A matching is also a correspondence between edges. In this example, peak edge (6, 8) corresponds to definition edge (2, 3).
Assume that two edges are adjacent if they share an endpoint. In this example, (4, 6) and (6, 8) are adjacent since they share peak 6.
Two adjacent edges (i,j) and (j, k) define a ratio risk of lengths:
1'J~ _ ~k p~ . (1) h; - hr In certain embodiments, one can employ a more economical notation for size ratios for matching all sizes:
r = S,f+z -Sf+~ .
I
s f+~ - s.f ( ~ ) Again, a matching is also a correspondence between ratios. In this example, peak ratio r689 corresponds to size ratio rZ.
Under the assumption that the relation between peak position and fragment size is "more or less" lineax, corresponding ratios typically should be equal. More formally, assume that the fragment of size si occurs at position pi. ~If there are coefficients a and b such that pi = asi + b for all i, then p~ -pj - ~as~. +b)-~asj +b~ a~s~ -sj~- s~ -sj (3) pj-p ~ -~ r ~ ~- , as . + b as. + b a s . s. s . s.
To measure the similarity of a corresponding pair of ratios ri~x and rf, one may define their ratio cost cr(i, j, k, f) to be min rt.~ , r ~,.~i~J~k~.f~= ~ ~ jv_ (4) max ~;~~,rf Note that 0 <- cr(i,j,k,f) <_ 1 for all 0 <- i < j < k < np and 0 _<< f <
n5 - 2. Note also that cr(i,j,k,f) = 1 indicates the ideal of equal ratios. The ratio cost of a matching is the sum of its individual costs .
In certain embodiments, one has the matchings concentrate on the big peaks. To this end, one may define the height cost ch(i) of a matched peak i to be its height divided by the maximum peak height 1i. More formally, 3o h= max h. (5) 0_<j<rrr, and cn~i~= h . (6>
Again, 0 < ch(i) <- 1 for all peaks 0 _<< i < nP, and ch(i) = 1, in certain embodiments, corresponds to the ideal of a maximally tall peak.
In order to combine these two types of costs, one may weight and sum them. Since there are only two costs, a single weighting parameter a, where 0 <- a <_ l, will suffice. The total cost c(M) of a matching M is the weighted sum:
c~M~=a ~ c,.~i, j, k, f~+~1-a> ~ c,,~i~ . (7) r>.f )EM O,.f )EM
j,f+1 eM
~h,.f+Z~M
One may now formulate the size standard matching problem as finding a matching with maximum cost. Note that the cost is local in the sense that each element of the summation depends on at most three adjacent points. This property allows the matcher component in certain embodiments. to compensate for global deviations from linearity.
Quality Measures If one divides the cost of a matching by the maximum possible cost over all matchings, one would have a number between 0 and 1 that indicates its quality. What is this maximum possible cost? Every pair of ratios in such a matching would contribute its maximum value, namely axl. There are a total of n - 2 ratio pairs. Similarly, every matched peak would be at the maximum height, so that all n matched peaks (one for each definition size) contribute (1 - a) x 1.
The maximum possible cost c, therefore, is:
c=~n-2~a,+n~l-a~=n-2a ( The quality of a matching M is therefore given by c(M)/c.
Other possible quality measures include the sum of just the ratio costs, and the worst ratio cost in the matching.
An Efficient Algorithm In certain embodiments, an advantage of the above formulation is that the cost is separable. That is, with some additional mathematics, one can maximise subsequences independently. This property leads to an efficient dynamic programming algorithm. In certain embodiments, the algorithm is efficient (runs in low-order polynomial time and space) and guarantees an optimal solution.
Let c : ~~ -. ~. denote the maximum cost of a matching subproblem. In particular, let c(j, k, f) denote the maximum cost of matching the peaks from 0 to k with the definition fragments from 0 to f + 1 in such a way that peak j matches size f and peak k matches size f + 1. The cost of matching all sizes is therefore c~M*~= max c~j,k,ns. -2~ (9) i<h~n,~
where M* is the optimal matching. Note that every definition fragment matches some peak, but only n5 of the peaks need match in this embodiment.
One can now express a maximum cost recursively. For f = 0 there are no ratios to compute, so one need only be concerned with the height cost:
c~j,k,0~= ~1-cz~c,,~j~+ch~k)) . (lo) For f > 0, one can compute the cost recursively by adding the height cost for the newly matched peak k to a new ratio cost and a previous subproblem cost:
c~j,k, ~=(1-a~~c,,~k~ (11) +max~(i, j, f -1~+a~c,.(i, j,k, f -l~~
i<~
Converting these equations to algorithms is straightforward.
In certain embodiments, the matcher component computes the individual elements in a consistent order. Furthermore, one may exploit the fact that one can match every size in the definition by limiting the computation. In certain embodiments, the matcher component only needs to compute c(j, k, f) for k > j Z f since j peaks cannot be made to fit all f sizes if j < f. Similarly, in certain embodiments, the matcher component needs to examine only subproblems c(i,j,f-1) where i ? f - 1 since i peaks could not fit all f-1 sizes if i < f -1.
To this end, Algorithm 2 solves Equation 10 and Algorithm 3 solves Equation 11.
Algorithm 2 Basis of the Recursion (when f = 0) for j .- 0, 1,... , nP - 2 do for k .- j + 1, j + 2, ... , np - 1 do c(j~k~ 0) ~- (l - a) (ch(j) + ch(k) ) Algorithm 3 Compute Cost of Matching (when f > 0) 1. for f ,.- 1,2,... ,ns - 2 do 2. for k .- f + 1, f + 2,... , f + np - ns + l do 3. for j .- f, f + 1,... , k - 1 do 4 . c * .-5. for i .-- f - 1, f,... , j - 1 do 6. , ci .- e(i, j, f - 1) + a ~ cr(i, j,k, f - 1) 7. if ci > c* then 8 , c * .- ci 9. c(j,k,f) ~- (1 - a) ~ ch(k) + c*
As stated, these algorithms compute only the cost of an optimal matching. One will still retrieve a matching from this calculation.
This is often a standard part of dynamic programming algorithms.
When memory requirements are very high, it is often the practice to recompute the path to the optimal cost from the cost matrix. Since certain embodiments have relatively small sequences, one can trade time for memory by keeping an array of backpointers, or predecessors p. It is easy to maintain this array by adding the line p(j,k,f) .-- i after line 8 in Algorithm 3. This assignment indicates that the predecessor to cost c(j,k,f) is c(i,j,f-1). Then, the matcher component may reconstruct an optimal matching from Equation 9 by tracing backwards.
Computational Resources Theoretical Run-time Analysis In certain embodiments, the algorithm's run time complexity is dominated by the number of times it executes Lines 6 and 7. The lines themselves execute in constant time. The inner (the i) loop executes them ~i'r-~ 1 times. Therefore the j loop executes them H_~ i-~
~.-.r-1 1 times. The k loop terminates at k = f + nP - ns + 1 =
f + m + 1, where m = nP - ns is the number of extra peaks.
Continuing in this way, one sees that the lines are executed a total of T(m, ns) times, where ..,.-z n~+.r+i k-~ j T~nZ,hs.~= ~ ~ ~ 1 . (12) .r=~ x=.f+~ i=I r=I-1 to This expression is not as formidable as it looks since the inner three summations are independent of the value f. By a judicious substitution of variables, one sees that:
n,+_r+1 k-1 ;-1 m k 1 IS ~ ~ ~ 1=~ ~ ~ 1. (13) =.r+~ j=.r 1=.r-1 k=~ J=~ i=0 A straightforward calculation shows that:
~~ ~ ~ 1=6~nZ3+6naz+llm+6~=6~m+3~m+2~m+1~.
=o j=o r=o It follows that n -2 m+r+1 k-1 j-1 T ~m~ n~s~ ~ _ ' .r=1 ~=.r+1 j=f i=f-1 = 6(m+3~m+2~m+l~n,s -2) (14) = O~msns That is, the execution time increases only linearly with the number of definition fragments, but it increases as the cube of the number of extra peaks. Note that, when the number of peaks equals the number of definition fragments (that is, when m = 0), Lines 6 and 7 are executed only ns -2 times, which is exactly the number of ratios that need to be compared to evaluate any matching.

Empirical Measurements The theoretical analysis in the previous subsection allows one to understand the asymptotic behaviour of the algorithm. That is, it allows one to predict the trend in the run-time when the inputs are large. For smaller inputs, in certain embodiments, various overhead factors influence the run time.
One can construct several sets of synthetic data and time a C++
implementation of the algorithm. The data includes size standard definitions with ns = 5 to ns = 40 fragment sizes. In every case the ith fragment has size 20a., where i >- 1. The in-lane peaks have positions equal to the definition sizes, but they also have m = 0 to m = 20 additional peaks, where the ith additional peak has position 10+20i, for i >- 0. For each combination of ns and m, a test program executes the matcher component 20 times and divides the elapsed time by 20 also, to give the time for each execution in milliseconds.
Figures 3 to 5 show the results. The execution times themselves, rounded to the nearest millisecond, are provided in Fig.
10.
Memory Full arrays for holding the costs and predecessors may use ~n2 -1- t2s.~2 t2~. = m2ns. -I- 2mYts -1-Yts real values . Initialising these arrays therefore takes asymptotically more time, when YI23 =O(ns ), than the optimisation algorithm. If this is a problem, the arrays can be implemented as sparse arrays, so that they occupy O(m3ns) space as well as time. Another solution is to use full arrays, but to index them not with the peak indices, but rather with the substituted variables in Equation 13. A third possibility is to use and allocate full matrices, but to not initialise them.
Practical Considerations In certain embodiments, one may want to determine a set of candidates peaks that the matcher component should size. One may choose to allow a parameter m specifying the number of extra peaks to consider. The size standard matcher then extracts the np = ns + m tallest peaks from all peaks detected, for example by a previous sizecalling step. One may use m = 4.
One may use a weighting factor a between 1/2 and 3/4.

An analyst typically should choose a size standard definition that corresponds accurately to the in-lane size standard. However, it may be that an analyst terminated a run early, before the longer fragments have had a chance to elute. In this case, the definition is not accurate, strictly speaking. To provide some robustness in this situation, one may test if the optimal matching satisfies a minimum acceptable quality parameter. If not, one may remove the last definition size and try again, repeating this process until the quality is acceptable. Alternatively, if the quality is unacceptable, one may simply report this without returning a matching.
Results Figure 6 shows an example of matching a run of the Applied Biosystems GeneScan 500 size standard on a Manhattan breadboard. The top panel shows the matched electropherogram. Again, peak positions are in green, and peak bases are in red. The middle panel shows the "errors" obtained by fitting this matching to a straight line and subtracting the assigned size from the size predicted from the linear fit. Finally, the bottom panel shows the definition points(in green circles) and the predicted peak sizes (in red circles). The correspondence between the two is shown in blue line segments.
Prior to analysis, the primer peak is removed from the peaks used for the matching, although it is still indicated in the diagram.
The electropherogram contains an extra peak at its end (around scan 4400). In this example, the described approach correctly did not match this peak. However that peak was often assigned 500 nucleotides by prior art matchers.
The electropherogram also shows a spurious "spike" around scan 2200. A peak detector rejected this as a peak since it did not meet the minimum peak width criterion.
The 250-nucleotide fragment of the size standard is known to migrate anomalously on the 310, and seems to do so on Manhattan, also. The 340-nucleotide fragment is also known to migrate anomalously; these two fragments typically are not used fox sizing on the 310.

Figure 7 shows the run from Figure 6 matched with GeneScan 500 less sizes 250 and 340. Note that the matcher correctly ignored the peaks corresponding to these two fragments.
Figure 8 shows the result of matching GeneScan 500, less sizes 35 and 50 with a run that was terminated early. Tl~e matcher correctly ignores the corresponding peaks. The matcher does initially match all sizes, including the unrepresented large ones, to the available peaks, but the matching quality was not high enough. Consequently, The matcher then strips one size at a time off the end of the definition until the resulting match is of sufficient quality.
Figure 9 shows a matching with the GeneScan 400HD size standard. This electropherogram also contains a spike (between size 90 and 100). A peak detector ignored this peak due to a minimum peak width parameter. The primer peak is not visible since the analyst set an analysis range that excludes it. Note from the middle panel that the extremes(sizes 50, 60, 380, and 400) of the standard did not run "as linearly" as the rest. These extremes would cause previous matchers some problems, though their errors are quite small here.
Conclusion It can be seen that embodiments of the present invention provide a simple formulation of the size-standard matching problem which leads to a simple algorithm for its solution. In some embodiments the formulation emphasises corresponding ratios of edge lengths combined with the heights of the matched peaks. In preferred embodiments, the corresponding algorithm employs dynamic programming directly on the input sequences. It runs in O(m3 ns) time, where ns is the number of fragment sizes in the size-standard definition, and m =
nP - ns is the number of extra peaks in the in-lane size standard.
Preferred embodiments of the invention employ an algorithm that requires just a very few (three) parameters:
1. the weight for the ratio cost, 2. the minimum acceptable quality, and 3. the number m of extra peaks to consider.
l5 System Architecture Figure 11 is a block diagram that illustrates a computer system 500, according to certain embodiments, upon which embodiments of the invention may be implemented. Computer system 500 includes a bus 502 or other communication mechanism for communicating information, and a processor 504 coupled with bus 502 for processing information.
Computer system 500 also includes a memory 506, which can be a random access memory (RAM) or other dynamic storage device, coupled to bus 502 for determining size matches, and instructions to be executed by processor 504. Memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504. Computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504. A storage device 510, such as a magnetic disk or optical disk, is provided and coupled to bus 502 for storing information and instructions.
Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 514, including alphanumeric and other keys, is coupled to bus 502 for communicating information and command selections to processor 504. Another type of user input device is cursor control 516, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512. This input device typically has two degrees of freedom in two axes, a first axis (e. g., x) and a second axis (e. g., y), that allows the device to specify positions in a plane.
A computer system 500 evaluates size matchings and provides a quality level fox each matching. Consistent with certain implementations of the invention, a matching and associated quality is provided by computer system 500 in response to processor 504 l6 executing one or more sequences of one or more instructions contained in memory 506. Such instructions may be read into memory 506 from another computer-readable medium, such as storage device 510.
Execution of the sequences of instructions contained in memory 506 causes processor 504 to perform the process states described herein.
Alternatively hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention.
Thus implementations of the invention are not limited to any specific combination of hardware circuitry and software.
The term "computer-readable medium" as used herein refers to any media that participates in providing instructions to processor 504 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510_ Volatile media includes dynamic memory, such as memory 506. Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 502. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, papertape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution. For example, the instructions may initially be carried on magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector coupled to bus 502 can receive the data carried in the infra-red signal and place the data on bus 502. Bus 502 carries the data to memory 506, from which processor 504 retrieves and executes the instructions. The instructions received by memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
It will be appreciated that the above embodiments and results are exemplary only and various modifications to that which have been specifically described are possible within the scope of the invention. In particular it is to be noted that the applicability of the invention is not limited to the described biological data, but is also applicable to~other biological and non-biological data.

Claims (36)

Claims
1. A computer implemented method for evaluating a matching between a plurality of measured data points representative of a physical property and a standard comprising a sequence of pre-defined values of said property, the method comprising:
(a) selecting a sequence of the measured data points corresponding in number to the number of pre-defined values of said property comprised in said standard and matching them in sequence to the values of the standard;
(b) defining a sequence of measured edges as the set of intervals between sequential ones of the measured data points selected in step (a);
(c) calculating the ratio of each pair of adjacent measured edges defined in step (b);
(d) defining a sequence of standard edges as the sequence of intervals between sequential ones of the pre-defined values in the standard, each standard edge corresponding to a respective different measured edge;
(e) calculating the ratio of each pair of adjacent standard edges defined in step (d); and (f) evaluating the matching by comparing the ratios of the measured edges with the corresponding ratios of the standard edges.
2. A method according to claim 1, wherein the step of comparing ratios comprises calculating the ratios of corresponding ones of the measured edge ratios and the standard edge ratios to generate a ratio cost for each corresponding pair of ratios.
3. A method according to claim 2, comprising summing the ratio costs to generate a total cost for the matching.
4. A method according to claim 2, wherein each of said measured data points has an associated magnitude, the method further comprising calculating a magnitude cost for each of the selected points as a ratio of the magnitude of the point to the maximum magnitude across all of the measured data points.
5. A method according to claim 4, comprising generating a total cost for the matching as a weighted sum of the ratio costs and magnitude costs for the selected points.
6. A method according to claim 3 or claim 5, comprising determining a quality for the matching by comparing the total cost of the matching with a maximum possible total cost.
7. A method according to claim 6, wherein a matching is rejected if the quality is not sufficiently high.
8. A method according to any one of the preceding claims, wherein there are more measured data points than there are values in the standard, and the evaluation is repeated for one of more different sub-sequences of measured data points.
9. A method according to claim 8, further comprising calculating total costs for matching each of said sub-sequences of measured data points to said standard, and identifying the matching having the highest cost as the best matching.
10. A method according to any one of the preceding claims, wherein the plurality of measured data points from which the sequence of matched points are selected is itself a sub-set of a larger set of measured data points.
11. A method according to claim 10, wherein the number of data points selected from said larger set equals the number of values in said standard plus a pre-defined number of additional data points.
12. A method according to claim 11 or claim 12, wherein the measured data points of said larger set each have a magnitude, and the plurality of measured points selected from said larger set of data points is selected on the basis of said magnitude.
13. A method according to any one of the preceding claims, wherein the evaluation of the matching comprises employing an algorithm for which the execution time increases no more than substantially linearly with the number of pre-defined values in the standard.
14. A method according to any one of the preceding claims, wherein said plurality of measured data points exceeds the number of pre-determined values in the standard and the evaluation of the matching comprises employing an algorithm for which the execution time increase no more than as the cube of the number of measured data points in excess of the number of values in the standard.
15. A method according to any one of the preceding claims, wherein the matching is evaluated using a dynamic programming algorithm.
16. A computer implemented method for evaluating a matching between a plurality of measured peak locations, the locations representing DNA fragment size, and a size standard definition comprising a sequence of pre-defined DNA fragment sizes, the method comprising:
(a) selecting a sequence of the measured peak locations corresponding in number to the number of pre-defined sizes in the size standard definition;

(b) defining a sequence of measured edges as the set of intervals between sequential ones of the peak locations selected in step (a);
(c) calculating the ratio of each pair of adjacent measured edges defined in step (b);
(d) defining a sequence of standard edges as the sequence of intervals between sequential ones of the sizes of the size standard definition, each standard edge corresponding to a respective different measured edge;
(e) calculating the ratio of each pair of adjacent standard edges defined in step (d); and (f) evaluating the matching by comparing the ratios of the measured edges with the corresponding ratios of the standard edges.
17. A method according to claim 16, wherein the step of comparing ratios comprises calculating the ratios of corresponding ones of the measured edge ratios and the standard edge ratios to generate a ratio cost for each corresponding pair of ratios.
18. A method according to claim 17, comprising summing the ratio costs to generate a total cost for the matching.
19. A method according to claim 17, wherein each of said peak locations has an associated peak height, the method further comprising calculating a height cost for each of the selected peaks as a ratio of the height of the peak to the maximum height across all of the peaks.
20. A method according to claim 19, comprising generating a total cost for the matching as a weighted sum of the ratio costs and height costs for the selected points.
21. A method according to claim 18 or claim 20, comprising determining a quality for the matching by comparing the total cost of the matching with a maximum possible total cost.
22. A method according to claim 21, wherein a matching is rejected if the quality is not sufficiently high.
23. A method according to any one of claims 16 to 22, wherein there axe more measured peak locations than there are sizes in the size standard definition, and the evaluation is repeated for one of more different sub-sequences of measured peak locations.
24. A method according to claim 23, further comprising calculating total costs for matching each of said sub-sequences of measured peak locations to said size standard definition, and identifying the matching having the highest cost as the best matching.
25. A method according to any one of claims 16 to 24, wherein the plurality of peak locations from which the sequence of matched peaks are selected is itself a sub-set of a larger set of peak locations.
26. A method according to claim 25, wherein the number of peak locations selected from said larger set equals the number of sizes in said size standard definition plus a pre-defined number of additional peak locations.
27. A method according to claim 25 or claim 26, wherein the peak locations of said larger set each have an associated height, and the plurality of peak locations selected from said larger set of data points is selected on the basis of said height.
28. A method according to any one of claims 16 to 27, wherein the evaluation of the matching comprises employing an algorithm for which the execution time increases no more than substantially linearly with the number of sizes in the size standard definition.
29. A method according to any one of claims 16 to 27, wherein said plurality of peak locations exceeds the number of sizes in the size standard definition and the evaluation of the matching comprises employing an algorithm for which the execution time increase no more than as the cube of the number of peak locations in excess of the number of sizes in the size standard definition.
30. A method according to any one of claims 16 to 29, wherein the matching is evaluated using a dynamic programming algorithm.
31. A computer system for evaluating a matching between a plurality of measured data points representative of a physical property and a standard comprising a sequence of pre-defined values of said property, the system comprising:
(i) memory for storing values of said measured data points and said pre-defined values of said standard; and (ii) processing means for:
(a) selecting a sequence of the measured data points corresponding in number to the number of pre-defined values of said property comprised in said standard and matching them in sequence to the values of the standard;
(b) defining a sequence of measured edges as the set of intervals between sequential ones of the measured data points selected in step (a);
(c) calculating the ratio of each pair of adjacent measured edges defined in step (b);
(d) defining a sequence of standard edges as the sequence of intervals between sequential ones of the pre-defined values in the standard, each standard edge corresponding to a respective different measured edge;
(e) calculating the ratio of each pair of adjacent standard edges defined in step (d); and (f) evaluating the matching by comparing the ratios of the measured edges with the corresponding ratios of the standard edges; and (iii) output means for outputting the result of said evaluation.
32. A system according to claim 31, wherein said processing means is operable in accordance with any one of the methods of claims 1 to 15.
33. A.computer system for evaluating a matching between a plurality of measured peak locations, the locations representing DNA
fragment size, and a size standard definition comprising a sequence of pre-defined DNA fragment sizes, the system comprising:
(i) memory for storing values of said peak locations and said sizes of said size standard definition; and (ii) processing means for:

(a) selecting a sequence of the measured peak locations corresponding in number to the number of pre-defined sizes in the size standard definition;
(b) defining a sequence of measured edges as the set of intervals between sequential ones of the peak locations selected in step (a);
(c) calculating the ratio of each pair of adjacent measured edges defined in step (b);
(d) defining a sequence of standard edges as the sequence of intervals between sequential ones of the sizes of the size standard definition, each standard edge corresponding to a respective different measured edge;
(e) calculating the ratio of each pair of adjacent standard edges defined in step (d); and (f) evaluating the matching by comparing the ratios of the measured edges with the corresponding ratios of the standard edges; and (iii) output means for outputting the result of said evaluation.
34. A system according to claim 33, wherein said processing means is operable in accordance with any one of the methods of claims 16 to 30.
35. A computer program which is executable on a computer or a distributed network of computers to cause the computer or network of computers to operate in accordance with the method of any one of claims 1 to 30.
36. A computer program according to claim 35 stored on a computer-readable medium.
CA002416611A 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards Abandoned CA2416611A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US21969700P 2000-07-21 2000-07-21
US60/219,697 2000-07-21
PCT/US2001/023630 WO2002008949A2 (en) 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards

Publications (1)

Publication Number Publication Date
CA2416611A1 true CA2416611A1 (en) 2002-01-31

Family

ID=22820389

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002416611A Abandoned CA2416611A1 (en) 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards

Country Status (5)

Country Link
EP (1) EP1350179A2 (en)
JP (1) JP2004505254A (en)
AU (2) AU8300301A (en)
CA (1) CA2416611A1 (en)
WO (1) WO2002008949A2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012002930A1 (en) * 2010-06-29 2012-01-05 Analogic Corporation Internal sizing/lane standard signal verification
GB201511508D0 (en) * 2015-07-01 2015-08-12 Ge Healthcare Bio Sciences Ab Method for determining a size of biomolecules

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1993014224A1 (en) * 1992-01-14 1993-07-22 Applied Biosystems, Inc. Size-calibration dna fragment mixture and method
US5876933A (en) * 1994-09-29 1999-03-02 Perlin; Mark W. Method and system for genotyping
US5580728A (en) * 1994-06-17 1996-12-03 Perlin; Mark W. Method and system for genotyping
US6821402B1 (en) * 1998-09-16 2004-11-23 Applera Corporation Spectral calibration of fluorescent polynucleotide separation apparatus
US6807490B1 (en) * 2000-02-15 2004-10-19 Mark W. Perlin Method for DNA mixture analysis

Also Published As

Publication number Publication date
WO2002008949A2 (en) 2002-01-31
JP2004505254A (en) 2004-02-19
WO2002008949A3 (en) 2003-07-31
AU2001283003B2 (en) 2004-04-08
AU8300301A (en) 2002-02-05
EP1350179A2 (en) 2003-10-08

Similar Documents

Publication Publication Date Title
Straume et al. Least-squares analysis of fluorescence data
US7904270B2 (en) System for estimating and improving test case generation
US10140336B1 (en) Accuracy testing of query optimizers
CN109697049A (en) Data processing method and device, electronic equipment and computer readable storage medium
Lin et al. An empirical method of grouping genotypes based on a linear function of the genotype-environment interaction
Kreutz et al. A blind and independent benchmark study for detecting differentially methylated regions in plants
US6842261B2 (en) Integrated circuit profile value determination
CN102906851B (en) Analyze mass spectrographic method and system
Yan et al. PhyloAcc-GT: A Bayesian method for inferring patterns of substitution rate shifts on targeted lineages accounting for gene tree discordance
US10002232B2 (en) Biological sample analysis system and method
Tian Better reliability assessment and prediction through data clustering
AU2001283003B2 (en) Methods and systems for evaluating a matching between measured data and standards
Wong et al. A comparison study for DNA motif modeling on protein binding microarray
CN106095669B (en) Parallel program testing method based on scheduling sequence reduction
US20080046187A1 (en) Method, system and software arrangement for detecting or determining similarity regions between datasets
Kousathanas et al. A guide to general-purpose ABC software
US20130289890A1 (en) Rank Normalization for Differential Expression Analysis of Transcriptome Sequencing Data
AU2001283003A1 (en) Methods and systems for evaluating a matching between measured data and standards
US11515995B2 (en) Efficient computation of univariate statistical moments for side channel vulnerability evaluation
Rabinowit et al. The approximate distribution of the maximum of a smoothed Poisson random field
Barton Testing strategies for simulation optimization
US8812430B1 (en) Determining a confidence of a measurement signature score
CA2452274A1 (en) System and method of testing and evaluating mathematical functions
CN113449746B (en) Tail gas telemetry data correction method and device
Kousathanas et al. A guide to general-purpose approximate bayesian computation software

Legal Events

Date Code Title Description
EEER Examination request
FZDE Discontinued