WO2002008949A2 - Methods and systems for evaluating a matching between measured data and standards - Google Patents

Methods and systems for evaluating a matching between measured data and standards Download PDF

Info

Publication number
WO2002008949A2
WO2002008949A2 PCT/US2001/023630 US0123630W WO0208949A2 WO 2002008949 A2 WO2002008949 A2 WO 2002008949A2 US 0123630 W US0123630 W US 0123630W WO 0208949 A2 WO0208949 A2 WO 0208949A2
Authority
WO
WIPO (PCT)
Prior art keywords
matching
standard
measured
sequence
edges
Prior art date
Application number
PCT/US2001/023630
Other languages
French (fr)
Other versions
WO2002008949A3 (en
Inventor
Heinz Breu
Hugh J. Pasika
Original Assignee
Applera Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Applera Corporation filed Critical Applera Corporation
Priority to AU2001283003A priority Critical patent/AU2001283003B2/en
Priority to AU8300301A priority patent/AU8300301A/en
Priority to JP2002514583A priority patent/JP2004505254A/en
Priority to CA002416611A priority patent/CA2416611A1/en
Priority to EP01961761A priority patent/EP1350179A2/en
Publication of WO2002008949A2 publication Critical patent/WO2002008949A2/en
Publication of WO2002008949A3 publication Critical patent/WO2002008949A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N27/00Investigating or analysing materials by the use of electric, electrochemical, or magnetic means
    • G01N27/26Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating electrochemical variables; by using electrolysis or electrophoresis
    • G01N27/416Systems
    • G01N27/447Systems using electrophoresis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02Column chromatography
    • G01N30/86Signal analysis
    • G01N30/8675Evaluation, i.e. decoding of the signal into analytical information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Definitions

  • the present invention relates to methods and systems for evaluating a matching between data, in particular a sequence of measured data points, and a pre-defined standard. It has specific, although not necessarily exclusive application in the field of evaluating biological data.
  • This problem can be illustrated by taking the biological example of evaluating the size (i.e. length) of DNA fragments (measured in base pairs for instance) . This is not a property that can be measured directly.
  • the alternative, indirect measure of fragment size most commonly used is a measure of the rate at which the fragments are transported through a medium in which their mobility varies with size. Typically this is achieved by electrophoresis through lanes in a polyacrylmide gel, the details of which will be well known to the skilled person and need not be repeated here .
  • the location of the fragments along the lane, or the point in time at which a particular fragment passes a set point along the lane provides an indirect measure of the fragment size.
  • this provides only a measure of the relative sizes of the fragments to one another. What it does not provide is any absolute values for fragment size.
  • This in-lane size standard is a kind of "ruler" made up of a set of fragments of known size (i.e. length) . If the location of these size standard fragments can be accurately determined, it is then possible to evaluate the size of an unknown fragment by comparing its location with those of the size standard and interpolating from the known sizes of the size standard fragments .
  • Embodiments of the present invention are concerned with providing efficient approaches to carrying out and evaluating such matching processes .
  • the present invention proposes an approach to matching measured data points representative of a physical property (for instance the measured peak location or positions of an in-lane size standard which represent DNA fragment length) to a set of standard values of that property (for instance a size standard definition) .
  • the matching is evaluated by comparing ratios of intervals between selected pairs of the measured data points with ratios of the intervals between adjacent pairs of the standard values, to find the measured data points that best match the standard values.
  • Fig. 1 is a flow diagram illustrating the basis matching process according to an embodiment of the present invention
  • Fig. 2 Illustrates Example peaks, sizes and a matching
  • Fig. 3 shows execution time for the matching process against number of values in the size standard and additional peaks
  • Fig. 4 shows execution time versus the number of values in the size standard definition for different numbers of additional peaks
  • Fig. 5 shows execution time versus the number of extra peaks for a series of size standard definitions having differing numbers of size values
  • Fig. 6, 7, 8 and 9 show exemplary results from matchings
  • Fig. 10 is a table of execution times versus the number of size value in the size standard definition and the number of additional peaks ;
  • Fig. 11 is a block diagram illustrating a suitable system architecture for operating embodiments of the invention.
  • Embodiments of the invention are described below in the context of the exemplary application of the described approach to the problem of matching an in-line size standard to a size standard definition for the evaluation of nucleic acid information obtained from an electrophoresis procedure. It will be appreciated that the approach has wider applicability
  • the method operates by matching data generated with a standard sample to actual sizes that should exist in the standard sample. For example, one may use a standard sample with nucleotide lengths 110, 114, 117, 120, and 125. One runs the standard sample and obtains several data peaks. The size standard matching component predicts the peaks that correspond to the five known nucleotide lengths. Thus, one can subsequently compare data in a sample to those peaks to determine the nucleotide lengths of fragments in a sample.
  • an embodiment of the process operates by first selecting a sequence of peaks from those obtained by running the standard sample, the number of peaks selected being equal to the number of values in the size standard, definition. One peak is matched to each of the size standard sizes in sequence. The matching this produces is then evaluated to determine the total cost for the matching. This process is repeated for a number of different sequences of peaks to establish which sequence of the possible variations from the peaks obtained from running the standard sample provides the best matching.
  • a matching component uses an algorithm that has three parameters: ratio factor (the importance of peak height vs . the importance of the local linearity) , min acceptable quality (used for ending dynamic programming iteration) , and number of extra peaks (the number of peaks participated in size matching is the number of size standard definition fragments plus the number of extra peaks) .
  • the component fixes the ratio factor to 0.6 and min acceptable quality to 0.75.
  • the component fixes the number of extra peaks to 10 for Applied Biosystems 310/377 instrument data and 25 for Applied Biosystems 3700 instrument data.
  • a statistically based quality value is generated for the matching result.
  • one skilled in the art will be able to adjust the number of extra peaks that may be used with a given instrument .
  • the component ignores the peaks located within the offscale regions in the sample.
  • the component fails the size matching process if the size standard definitions are not fully matched in the matching process .
  • the component implements two primer peak detection methods.
  • the first is the primer-peak-height- suppression method. This method replaces the peak heights of the highest peaks with the peak height of the middle peak, assuming that the primer peaks are among the highest.
  • the second is to find the primer peak location. The method assumes that the primer peak locates within the first half of the signal and the size standard fragments locate in the second half of the signal. For example, one takes the mean peak height of all the peaks in the second half and multiples that mean by five to get the potential primer peak height. The method works backwards in the first half of the signal to find the last primer peak.
  • a size-standard matcher takes as input a list of peaks (e.g., from an electropherogram) and a list of fragment sizes (e.g., in nucleotides) . It produces as output a matching, that is, a list of pairs of the form ⁇ peak,size>, where each peak and each fragment size appears at most once.
  • a size-standard matcher evaluates a matching, and uses an algorithm for finding good matchings . Certain embodiments employ an algorithm that evaluates a matching by treating its two constituent sequences as sequences of edges between points. A matching is also a correspondence between edges.
  • a matching is also a correspondence between ratios. Under the assumption that the relation between peak position and fragment size is "more or less" linear, corresponding ratios typically should be equal.
  • the component derives a ratio cost to measure this property. In certain embodiments, the component also concentrates on big peaks by deriving a height cost . The total cost of a matching is a weighted sum of these constituent costs.
  • the component formulates the size standard matching problem as finding a matching with maximum cost.
  • the cost is separable. That is, with some additional mathematics, the component can maximise subsequences independently.
  • the cost also enjoys the advantage of being local, thereby compensating for global deviations from linearity. This cost also leads to a quality value between 0 and 1.
  • a size standard is a set of DNA fragments, each of known size. The defini tion of a size standard is simply a list of these sizes. Note that a size-standard definition typically does not depend on the instrument on which the size standard is run, and therefore not on any particular set of run conditions either.
  • An in-lane size standard is a set of peaks resulting from running a size standard on an instrument. One determines the positions and the heights of the peaks.
  • a size-standard matcher takes as input an in-lane size standard and a size standard definition. It produces as output a matching, that is, a list of pairs of the form (peak, size), where each peak and each fragment size appears at most once.
  • a peak has a position (e.g., in scan numbers) and a height (e.g., in fluorescent units). Fragment sizes are given in nucleotides. Assume that there are at least as many peaks as sizes.
  • one employs the following components.
  • M ⁇ (3, 0), (4, 1), (6, 2), (8, 3), (9, 4) ⁇ .
  • Example 1 Its peak (index) sequence is [3, 4, 6, 8, 9], which has four edges, (3, 4), (4, 6), (6, 8), and (8, 9). Similarly, its fragment size definition (index) sequence is [0, 1, 2, 3, 4] , which also has four edges: (0, 1), (1, 2), (2, 3), and (3, 4). A matching is also a correspondence between edges .
  • peak edge (6, 8) corresponds to definition edge (2, 3). Assume that two edges are adjacent if they share an endpoint. In this example, (4, 6) and (6, 8) are adjacent since they share peak 6.
  • Two adjacent edges [ i , j ) and (j , k) define a ratio r ⁇ jk of lengths:
  • peak ratio r 6 ⁇ 9 corresponds to size ratio r 2 .
  • one may define the height cost c h (i) of a matched peak i to be its height divided by the maximum peak height R. More formally, z max ⁇ , (5)
  • 0 ⁇ c h (i) ⁇ 1 for all peaks 0 ⁇ i ⁇ n p , and c h (i) 1, in certain embodiments, corresponds to the ideal of a maximally tall peak.
  • an advantage of the above formulation is that the cost is separable. That is, with some additional mathematics, one can maximise subsequences independently. This property leads to an efficient dynamic programming algorithm.
  • the algorithm is efficient (runs in low-order polynomial time and space) and guarantees an optimal solution.
  • the matcher component computes the individual elements in a consistent order. Furthermore, one may exploit the fact that one can match every size in the definition by limiting the computation. In certain embodiments, the matcher component only needs to compute c (j , k, f) for k > j ⁇ f since j peaks cannot be made to fit all f sizes if j ⁇ f. Similarly, in certain embodiments, the matcher component needs to examine only subproblems c(i, j',f-l) where i ⁇ f - 1 since i peaks could not fit all f-1 sizes if i ⁇ f 1.
  • Algorithm 2 solves Equation 10 and Algorithm 3 solves Equation 11.
  • n p - 1 do c(j,J, 0) - (1 - ⁇ ) (c h (j) + CaW)
  • Algorithm 3 Compute Cost of Matching (when f > 0)
  • the algorithm's run time complexity is dominated by the number of times it executes Lines 6 and 7. The lines themselves execute in constant time.
  • a test program executes the matcher component 20 times and divides the elapsed time by 20 also, to give the time for each execution in milliseconds.
  • Figures 3 to 5 show the results. The execution times themselves, rounded to the nearest millisecond, are provided in Fig. 10.
  • One may use m 4.
  • An analyst typically should choose a size standard definition that corresponds accurately to the in-lane size standard. However, it may be that an analyst terminated a run early, before the longer fragments have had a chance to elute. In this case, the definition is not accurate, strictly speaking. To provide some robustness in this situation, one may test if the optimal matching satisfies a minimum acceptable quality parameter. If not, one may remove the last definition size and try again, repeating this process until the quality is acceptable. Alternatively, if the quality is unacceptable, one may simply report this without returning a matching.
  • Figure 6 shows an example of matching a run of the Applied Biosystems GeneScan 500 size standard on a Manhattan breadboard.
  • the top panel shows the matched electropherogram. Again, peak positions are in green, and peak bases are in red.
  • the middle panel shows the "errors" obtained by fitting this matching to a straight line and subtracting the assigned size from the size predicted from the linear fit.
  • the bottom panel shows the definition points (in green circles) and the predicted peak sizes (in red circles) . The correspondence between the two is shown in blue line segments.
  • the primer peak Prior to analysis, the primer peak is removed from the peaks used for the matching, although it is still indicated in the diagram.
  • the electropherogram contains an extra peak at its end (around scan 4400) .
  • the described approach correctly did not match this peak.
  • peak was often assigned 500 nucleotides by prior art matchers .
  • the electropherogram also shows a spurious "spike” around scan 2200.
  • a peak detector rejected this as a peak since it did not meet the minimum peak width criterion.
  • the 250-nucleotide fragment of the size standard is known to migrate anomalously on the 310, and seems to do so on Manhattan, also.
  • the 340-nucleotide fragment is also known to migrate anomalously; these two fragments typically are not used for sizing on the 310.
  • Figure 7 shows the run from Figure 6 matched with GeneScan 500 less sizes 250 and 340. Note that the matcher correctly ignored the peaks corresponding to these two f agments .
  • Figure 8 shows the result of matching GeneScan 500, less sizes 35 and 50 with a run that was terminated early.
  • the matcher correctly ignores the corresponding peaks .
  • the matcher does initially match all sizes, including the unrepresented large ones, to the available peaks, but the matching quality was not high enough. Consequently, The matcher then strips one size at a time off the end of the definition until the resulting match is of sufficient quality.
  • Figure 9 shows a matching with the GeneScan 400HD size standard.
  • This electropherogram also contains a spike (between size 90 and 100) .
  • a peak detector ignored this peak due to a minimum peak width parameter.
  • the primer peak is not visible since the analyst set an analysis range that excludes it. Note from the middle panel that the extremes (sizes 50, 60, 380, and 400) of the standard did not run "as linearly" as the rest. These extremes would cause previous matchers some problems, though their errors are quite small here.
  • embodiments of the present invention provide a simple formulation of the size-standard matching problem which leads to a simple algorithm for its solution.
  • the formulation emphasises corresponding ratios of edge lengths combined with the heights of the matched peaks .
  • Preferred embodiments of the invention employ an algorithm that requires just a very few (three) parameters:
  • FIG 11 is a block diagram that illustrates a computer system 500, according to certain embodiments, upon which embodiments of the invention may be implemented.
  • Computer system 500 includes a bus 502 or other communication mechanism for communicating information, and a processor 504 coupled with bus 502 for processing information.
  • Computer system 500 also includes a memory 506, which can be a random access memory (RAM) or other dynamic storage device, coupled to bus 502 for determining size matches, and instructions to be executed by processor 50 .
  • Memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504.
  • Computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504.
  • ROM read only memory
  • a storage device 510 such as a magnetic disk or optical disk, is provided and coupled to bus 502 for storing information and instructions.
  • Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT) or liquid crystal display (LCD) , for displaying information to a computer user.
  • a display 512 such as a cathode ray tube (CRT) or liquid crystal display (LCD)
  • An input device 514 is coupled to bus 502 for communicating information and command selections to processor 504.
  • cursor control 516 is Another type of user input device, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512.
  • This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y) , that allows the device to specify positions in a plane.
  • a computer system 500 evaluates size matchings and provides a quality level for each matching. Consistent with certain implementations of the invention, a matching and associated quality is provided by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in memory 506. Such instructions may be read into memory 506 from another computer-readable medium, such as storage device 510. Execution of the sequences of instructions contained in memory 506 causes processor 504 to perform the process states described herein. Alternatively hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus implementations of the invention are not limited to any specific combination of hardware circuitry and software.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510.
  • Volatile media includes dynamic memory, such as memory 506.
  • Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 502. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, papertape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution.
  • the instructions may initially be carried on magnetic disk of a remote computer.
  • the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
  • a modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
  • An infra-red detector coupled to bus 502 can receive the data carried in the infra-red signal and place the data on bus 502.
  • Bus 502 carries the data to memory 506, from which processor 504 retrieves and executes the instructions.
  • the instructions received by memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Medical Informatics (AREA)
  • Mathematical Analysis (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Epidemiology (AREA)
  • Algebra (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Electrochemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Immunology (AREA)

Abstract

The present invention proposes an approach to matching measured data points representative of a physical property (for instance the measured peak location or positions of an in-lane size standard which represent DNA fragment length) to a set of standard values of that property (for instance a size standard definition). The matching is evaluated by comparing ratios of intervals between selected pairs of the measured data points with ratios of the intervals between adjacent pairs of the standard values, to find the measured data points that best match the standard values.

Description

Methods and Systems for Evaluating a Matching Between Measured Data and Standards
Field of the Invention
The present invention relates to methods and systems for evaluating a matching between data, in particular a sequence of measured data points, and a pre-defined standard. It has specific, although not necessarily exclusive application in the field of evaluating biological data.
Background
When studying a substance or device for example, it is often the case that one or more physical properties of the sample or device are not directly measurable or at least cannot be measured easily. In order to evaluate such a property an alternative, indirect measure representative of the property must be found.
This problem can be illustrated by taking the biological example of evaluating the size (i.e. length) of DNA fragments (measured in base pairs for instance) . This is not a property that can be measured directly. The alternative, indirect measure of fragment size most commonly used is a measure of the rate at which the fragments are transported through a medium in which their mobility varies with size. Typically this is achieved by electrophoresis through lanes in a polyacrylmide gel, the details of which will be well known to the skilled person and need not be repeated here .
With this approach, the location of the fragments along the lane, or the point in time at which a particular fragment passes a set point along the lane (typically identified by peaks on an electropherogram) , provides an indirect measure of the fragment size. However, since in any particular run the migration of the fragments through the gel will be affected by factors other than just the size of the fragment, on its own this provides only a measure of the relative sizes of the fragments to one another. What it does not provide is any absolute values for fragment size. To enable the determination of absolute fragment size, it is possible to run an in-lane size standard along with the fragments being evaluated. This in-lane size standard is a kind of "ruler" made up of a set of fragments of known size (i.e. length) . If the location of these size standard fragments can be accurately determined, it is then possible to evaluate the size of an unknown fragment by comparing its location with those of the size standard and interpolating from the known sizes of the size standard fragments .
For this approach to work, it is important that the location of the size standard fragments can be accurately determined. That is to say, one has to match peak locations in the measured results to each of the known fragment sizes of the size standard definition. The presence of artefacts in the results, giving false peaks, often means that this is not a straightforward task. Embodiments of the present invention are concerned with providing efficient approaches to carrying out and evaluating such matching processes .
Although the background to the invention has been described above with reference to evaluating the size of DNA fragments, it will be appreciated that similar problems can be encountered when evaluating the properties of other biological or non-biological samples or devices. The invention has applicability more generally, therefore, in any case where there is a need to match specific points within a measured data set representing a set of standard values of a property to the standard values themselves . The skilled person will be well aware of many exemplary cases where the approach set out below can be usefully applied, including but not limited to applications in the filed of telecommunications, image processing, pattern recognition and the like.
Summary of the Invention
The present invention proposes an approach to matching measured data points representative of a physical property (for instance the measured peak location or positions of an in-lane size standard which represent DNA fragment length) to a set of standard values of that property (for instance a size standard definition) . The matching is evaluated by comparing ratios of intervals between selected pairs of the measured data points with ratios of the intervals between adjacent pairs of the standard values, to find the measured data points that best match the standard values.
Other preferred features and aspects of the invention are set out in the appended claims .
Brief Description of the Drawing
The invention is described below, by way of example, with reference to the accompanying drawings in which:
Fig. 1 is a flow diagram illustrating the basis matching process according to an embodiment of the present invention;
Fig. 2 Illustrates Example peaks, sizes and a matching;
Fig. 3 shows execution time for the matching process against number of values in the size standard and additional peaks;
Fig. 4 shows execution time versus the number of values in the size standard definition for different numbers of additional peaks;
Fig. 5 shows execution time versus the number of extra peaks for a series of size standard definitions having differing numbers of size values;
Fig. 6, 7, 8 and 9 show exemplary results from matchings;
Fig. 10 is a table of execution times versus the number of size value in the size standard definition and the number of additional peaks ; and
Fig. 11 is a block diagram illustrating a suitable system architecture for operating embodiments of the invention.
Description of Embodiments
Embodiments of the invention are described below in the context of the exemplary application of the described approach to the problem of matching an in-line size standard to a size standard definition for the evaluation of nucleic acid information obtained from an electrophoresis procedure. It will be appreciated that the approach has wider applicability
In overview, the method operates by matching data generated with a standard sample to actual sizes that should exist in the standard sample. For example, one may use a standard sample with nucleotide lengths 110, 114, 117, 120, and 125. One runs the standard sample and obtains several data peaks. The size standard matching component predicts the peaks that correspond to the five known nucleotide lengths. Thus, one can subsequently compare data in a sample to those peaks to determine the nucleotide lengths of fragments in a sample.
Referring to Fig. 1, an embodiment of the process operates by first selecting a sequence of peaks from those obtained by running the standard sample, the number of peaks selected being equal to the number of values in the size standard, definition. One peak is matched to each of the size standard sizes in sequence. The matching this produces is then evaluated to determine the total cost for the matching. This process is repeated for a number of different sequences of peaks to establish which sequence of the possible variations from the peaks obtained from running the standard sample provides the best matching.
In certain embodiments, a matching component uses an algorithm that has three parameters: ratio factor (the importance of peak height vs . the importance of the local linearity) , min acceptable quality (used for ending dynamic programming iteration) , and number of extra peaks (the number of peaks participated in size matching is the number of size standard definition fragments plus the number of extra peaks) . In certain embodiments, the component fixes the ratio factor to 0.6 and min acceptable quality to 0.75. In certain embodiments, the component fixes the number of extra peaks to 10 for Applied Biosystems 310/377 instrument data and 25 for Applied Biosystems 3700 instrument data.
In certain embodiments, a statistically based quality value is generated for the matching result. In certain embodiments, one skilled in the art will be able to adjust the number of extra peaks that may be used with a given instrument . In certain embodiments, the component ignores the peaks located within the offscale regions in the sample.
In certain embodiments, the component fails the size matching process if the size standard definitions are not fully matched in the matching process .
In certain embodiments, the component implements two primer peak detection methods. The first is the primer-peak-height- suppression method. This method replaces the peak heights of the highest peaks with the peak height of the middle peak, assuming that the primer peaks are among the highest. The second is to find the primer peak location. The method assumes that the primer peak locates within the first half of the signal and the size standard fragments locate in the second half of the signal. For example, one takes the mean peak height of all the peaks in the second half and multiples that mean by five to get the potential primer peak height. The method works backwards in the first half of the signal to find the last primer peak.
In certain embodiments, a size-standard matcher takes as input a list of peaks (e.g., from an electropherogram) and a list of fragment sizes (e.g., in nucleotides) . It produces as output a matching, that is, a list of pairs of the form <peak,size>, where each peak and each fragment size appears at most once. In certain embodiments, a size-standard matcher evaluates a matching, and uses an algorithm for finding good matchings . Certain embodiments employ an algorithm that evaluates a matching by treating its two constituent sequences as sequences of edges between points. A matching is also a correspondence between edges. Two edges, βi and e2, that share an endpoint define a ratio of lengths r = | e21 / | ex | . Again, a matching is also a correspondence between ratios. Under the assumption that the relation between peak position and fragment size is "more or less" linear, corresponding ratios typically should be equal. In certain embodiments, the component derives a ratio cost to measure this property. In certain embodiments, the component also concentrates on big peaks by deriving a height cost . The total cost of a matching is a weighted sum of these constituent costs.
In certain embodiments, the component formulates the size standard matching problem as finding a matching with maximum cost. In such embodiments, the cost is separable. That is, with some additional mathematics, the component can maximise subsequences independently. In certain embodiments, the cost also enjoys the advantage of being local, thereby compensating for global deviations from linearity. This cost also leads to a quality value between 0 and 1.
A size standard is a set of DNA fragments, each of known size. The defini tion of a size standard is simply a list of these sizes. Note that a size-standard definition typically does not depend on the instrument on which the size standard is run, and therefore not on any particular set of run conditions either.
An in-lane size standard is a set of peaks resulting from running a size standard on an instrument. One determines the positions and the heights of the peaks.
A size-standard matcher takes as input an in-lane size standard and a size standard definition. It produces as output a matching, that is, a list of pairs of the form (peak, size), where each peak and each fragment size appears at most once. A peak has a position (e.g., in scan numbers) and a height (e.g., in fluorescent units). Fragment sizes are given in nucleotides. Assume that there are at least as many peaks as sizes.
Furthermore, assume that every size has a corresponding peak, except possibly for some small number from the end of this list. This exception is meant to model the situation where a user may have stopped electrophoresis early, before the larger fragments have had a chance to elute.
In certain embodiments, one employs the following components.
Let P
Figure imgf000007_0001
be a list of np peak locations, given by increasing scan number, for example. Let H = ]hQ, hi ..., hn _Λ be a list of the corresponding np peak heights, given in fluorescent units, for example. The size standard definition S
Figure imgf000007_0002
,... ,Sπ _ is a list of nε fragment sizes in increasing nucleotides. By assumption, np ≥ ne . A size-standard matching is a set of pairs M = {(0,0), (,,1),..., (i„t , ns)} where the i-/ are in increasing order, that is, where subscript j < k implies ij < ik. Example 1 Peaks, Sizes, and a Matching
Consider the peaks, sizes, and matching displayed in Figure 2. The list P contains np = 11 peak positions:
P = [968, 1029, 1203, 1259, 1412, 1535, 1714, 1751, 1785, 1837, 1928] .
These np peaks have heights H:
H = [2722, 6219, 1060, 5380, 7726, 1082, 7424, 1263, 7335, 7937,
1562] . The size standard definition has nε = 5 sizes S: S = [75, 100, 139, 150, 160] .
Finally, M is the matching shown in the figure:
M = {(3, 0), (4, 1), (6, 2), (8, 3), (9, 4)}.
In certain embodiments, big-Oh notation is used to express algorithm complexity. This notation is ubiquitous in worst-case and average-case resource analysis. Briefly, a function f is said to be on the order of another function g (written f (x) = 0 {g (x) ) if there exists positive constants c and N such that |f(x) | ≤ c \ g{x) | for all x ≥ N. Evaluating a Matching
Assume there is a matching. In certain embodiments, the matcher component evaluates a matching by examining its two constituent sequences. It treats the sequence of peaks as a sequence of edges between peaks, and similarly for the sizes. For example, M = {(3, 0), (4, 1), (6, 2), (8, 3), (9, 4)} is the matching from
Example 1. Its peak (index) sequence is [3, 4, 6, 8, 9], which has four edges, (3, 4), (4, 6), (6, 8), and (8, 9). Similarly, its fragment size definition (index) sequence is [0, 1, 2, 3, 4] , which also has four edges: (0, 1), (1, 2), (2, 3), and (3, 4). A matching is also a correspondence between edges . In this example, peak edge (6, 8) corresponds to definition edge (2, 3). Assume that two edges are adjacent if they share an endpoint. In this example, (4, 6) and (6, 8) are adjacent since they share peak 6. Two adjacent edges [ i , j ) and (j , k) define a ratio r±jk of lengths:
Figure imgf000008_0001
In certain embodiments, one can employ a more economical notation for size ratios for matching all sizes:
Sf+2 Sf+1 rf = (2)
Sf÷l Sf
Again, a matching is also a correspondence between ratios. In this example, peak ratio r6β9 corresponds to size ratio r2.
Under the assumption that the relation between peak position and fragment size is "more or less" linear, corresponding ratios typically should be equal. More formally, assume that the fragment of size sα occurs at position p . If there are coefficients a and Jb such that i = as^ + b for all i, then
PL ~Pj _ (αsi +b)-(αsj +b)^ α(sk -s _ sk -s,
(3) Pj-P, (αsj+b iαsxb) α(sj-s,) Sj-s,
To measure the similarity of a corresponding pair of ratios r1 k and ^, one may define their ratio cost cr{i, j, k, f) to be
Figure imgf000009_0001
Note that 0 ≤ cr{i,j,k,f) ≤ 1 for all Q < i < j < k < np and 0 ≤ f < ns - 2. Note also that cr(i,j,k,f) = 1 indicates the ideal of equal ratios. The ratio cost of a matching is the sum of its individual costs.
In certain embodiments, one has the matchings concentrate on the big peaks. To this end, one may define the height cost ch(i) of a matched peak i to be its height divided by the maximum peak height R. More formally, z=maxΛ, (5)
0≤J«h, and «.» - (6)
Again, 0 ≤ ch (i) ≤ 1 for all peaks 0 ≤ i < np, and ch (i) = 1, in certain embodiments, corresponds to the ideal of a maximally tall peak.
In order to combine these two types of costs, one may weight and sum them. Since there are only two costs, a single weighting parameter α, where 0 ≤ a ≤ 1, will suffice. The total cost c {M) of a matching M is the weighted sum:
(M) = ∑ ,-(*» J> *>/")+ (l-«) ∑ cA(/) . (7) i,f)eM (i,f)≡M j,f+UεM k,f+2)eM
One may now formulate the size standard matching problem as finding a matching with maximum cost. Note that the cost is local in the sense that each element of the summation depends on at most three adjacent points. This property allows the matcher component in certain embodiments to compensate for global deviations from linearity. Quality Measures If one divides the cost of a matching by the maximum possible cost over all matchings, one would have a number between 0 and 1 that indicates its quality. What is this maximum possible cost? Every pair of ratios in such a matching would contribute its maximum value, namely αxl. There are a total of n - 2 ratio pairs. Similarly, every matched peak would be at the maximum height, so that all n matched peaks (one for each definition size) contribute (1 - α) x 1. The maximum possible cost c, therefore, is:
c = (n -2α +«(l-α)= n-2α (8)
The quality of a matching M is therefore given by c(M) /c.
Other possible quality measures include the sum of just the ratio costs, and the worst ratio cost in the matching. An Efficient Algorithm In certain embodiments, an advantage of the above formulation is that the cost is separable. That is, with some additional mathematics, one can maximise subsequences independently. This property leads to an efficient dynamic programming algorithm. In certain embodiments, the algorithm is efficient (runs in low-order polynomial time and space) and guarantees an optimal solution. Let c : Hs → E denote the maximum cost of a matching subproblem. In particular, let c {j , k, f) denote the maximum cost of matching the peaks from 0 to k with the definition fragments from 0 to f + 1 in such a way that peak j matches size f and peak k matches size f + 1. The cost of matching all sizes is therefore
c(M*)= max c(j, k, ns -2) (9)
]<k≤np
where M* is the optimal matching. Note that every definition fragment matches some peak, but only ns of the peaks need match in this embodiment.
One can now express a maximum cost recursively. For f = 0 there are no ratios to compute, so one need only be concerned with the height cost :
cQ,k,0)= (l-a)(ch(j)+ ch(k)) (10)
For f > 0, one can compute the cost recursively by adding the height cost for the newly matched peak k to a new ratio cost and a previous subproblem cost :
Figure imgf000011_0001
Converting these equations to algorithms is straightforward. In certain embodiments, the matcher component computes the individual elements in a consistent order. Furthermore, one may exploit the fact that one can match every size in the definition by limiting the computation. In certain embodiments, the matcher component only needs to compute c (j , k, f) for k > j ≥ f since j peaks cannot be made to fit all f sizes if j < f. Similarly, in certain embodiments, the matcher component needs to examine only subproblems c(i, j',f-l) where i ≥ f - 1 since i peaks could not fit all f-1 sizes if i < f 1.
To this end, Algorithm 2 solves Equation 10 and Algorithm 3 solves Equation 11.
Algorithm 2 Basis of the Recursion {when f = 0)
for j <- 0,1,... , np - 2 do for k <- j + 1, j + 2, ... , np - 1 do c(j,J, 0) - (1 - α) (ch(j) + CaW)
Algorithm 3 Compute Cost of Matching (when f > 0)
1 for f «- 1,2,... ,ns - 2 do 2 for k <- f + 1, f + 2,... , f + np - ns + 1 do 3 for j <- f, f + 1,... , £ - 1 do 4 c* <- - 5 for i <- f - 1, f,„ ,j - 1 do 6 d <- c(i,j,f - 1) + α cr(i,j ,k,f - 1) 7 if Ci > c* then
C* <- Ci c(j,k,f) <- (1 - α) cή(J) + c*
As stated, these algorithms compute only the cost of an optimal matching. One will still retrieve a matching from this calculation. This is often a standard part of dynamic programming algorithms. When memory requirements are very high, it is often the practice to recompute the path to the optimal cost from the cost matrix. Since certain embodiments have relatively small sequences, one can trade time for memory by keeping an array of backpointers, or predecessors p. It is easy to maintain this array by adding the line p{j,k,f) <- i after line 8 in Algorithm 3. This assignment indicates that the predecessor to cost c(j,k,f) is c(i,j,f-l). Then, the matcher component may reconstruct an optimal matching from Equation 9 by tracing backwards . Computational Resources
Theoretical Run-time Analysis In certain embodiments, the algorithm's run time complexity is dominated by the number of times it executes Lines 6 and 7. The lines themselves execute in constant time. The inner (the i) loop
∑ j-i 1 times. Therefore the j loop executes them
1 times. The k loop terminates at k = f + np - ns + 1 = f + m + 1, where m = np - nB is the number of extra peaks. Continuing in this way, one sees that the lines are executed a total of T{m, ns) times, where
Figure imgf000013_0001
This expression is not as formidable as it looks since the inner three summations are independent of the value f. By a judicious substitution of variables, one sees that:
Figure imgf000013_0002
A straightforward calculation shows that:
∑ ∑ ∑ \ = -{m3 +6m2 +l lm + 6)=-(m + 3Xm + 2)(m + l)
A=0 7=0 ι=0 6 "
It follows that
Figure imgf000013_0003
That is, the execution time increases only linearly with the number of definition fragments, but it increases as the cube of the number of extra peaks. Note that, when the number of peaks equals the number of definition fragments (that is, when = 0) , Lines 6 and 7 are executed only ns -2 times, which is exactly the number of ratios that need to be compared to evaluate any matching. Empirical Measurements The theoretical analysis in the previous subsection allows one to understand the asymptotic behaviour of the algorithm. That is, it allows one to predict the trend in the run-time when the inputs are large. For smaller inputs, in certain embodiments, various overhead factors influence the run time.
One can construct several sets of synthetic data and time a C++ implementation of the algorithm. The data includes size standard definitions with ns = 5 to ns = 40 fragment sizes. In every case the ith fragment has size 20i, where i ≥ 1. The in-lane peaks have positions equal to the definition sizes, but they also have m = 0 to m = 20 additional peaks, where the ith additional peak has position 10+20i, for i ≥ 0. For each combination of ns and m, a test program executes the matcher component 20 times and divides the elapsed time by 20 also, to give the time for each execution in milliseconds. Figures 3 to 5 show the results. The execution times themselves, rounded to the nearest millisecond, are provided in Fig. 10.
Memory
Full arrays for holding the costs and predecessors may m + ns ) tιi. = m ns + 2mns + ns real values. Initialising these arrays therefore takes asymptotically more time, when m = 0 ns ) , than the optimisation algorithm. If this is a problem, the arrays can be implemented as sparse arrays, so that they occupy Oψl II s ) space as well as time. Another solution is to use full arrays, but to index them not with the peak indices, but rather with the substituted variables in Equation 13. A third possibility is to use and allocate full matrices, but to not initialise them. Practical Considerations
In certain embodiments, one may want to determine a set of candidates peaks that the matcher component should size. One may choose to allow a parameter m specifying the number of extra peaks to consider. The size standard matcher then extracts the np = ns + m tallest peaks from all peaks detected, for example by a previous sizecalling step. One may use m = 4.
One may use a weighting factor α between 1/2 and 3/4. An analyst typically should choose a size standard definition that corresponds accurately to the in-lane size standard. However, it may be that an analyst terminated a run early, before the longer fragments have had a chance to elute. In this case, the definition is not accurate, strictly speaking. To provide some robustness in this situation, one may test if the optimal matching satisfies a minimum acceptable quality parameter. If not, one may remove the last definition size and try again, repeating this process until the quality is acceptable. Alternatively, if the quality is unacceptable, one may simply report this without returning a matching.
Results
Figure 6 shows an example of matching a run of the Applied Biosystems GeneScan 500 size standard on a Manhattan breadboard. The top panel shows the matched electropherogram. Again, peak positions are in green, and peak bases are in red. The middle panel shows the "errors" obtained by fitting this matching to a straight line and subtracting the assigned size from the size predicted from the linear fit. Finally, the bottom panel shows the definition points (in green circles) and the predicted peak sizes (in red circles) . The correspondence between the two is shown in blue line segments.
Prior to analysis, the primer peak is removed from the peaks used for the matching, although it is still indicated in the diagram. The electropherogram contains an extra peak at its end (around scan 4400) . In this example, the described approach correctly did not match this peak. However that peak was often assigned 500 nucleotides by prior art matchers .
The electropherogram also shows a spurious "spike" around scan 2200. A peak detector rejected this as a peak since it did not meet the minimum peak width criterion.
The 250-nucleotide fragment of the size standard is known to migrate anomalously on the 310, and seems to do so on Manhattan, also. The 340-nucleotide fragment is also known to migrate anomalously; these two fragments typically are not used for sizing on the 310. Figure 7 shows the run from Figure 6 matched with GeneScan 500 less sizes 250 and 340. Note that the matcher correctly ignored the peaks corresponding to these two f agments .
Figure 8 shows the result of matching GeneScan 500, less sizes 35 and 50 with a run that was terminated early. The matcher correctly ignores the corresponding peaks . The matcher does initially match all sizes, including the unrepresented large ones, to the available peaks, but the matching quality was not high enough. Consequently, The matcher then strips one size at a time off the end of the definition until the resulting match is of sufficient quality.
Figure 9 shows a matching with the GeneScan 400HD size standard. This electropherogram also contains a spike (between size 90 and 100) . A peak detector ignored this peak due to a minimum peak width parameter. The primer peak is not visible since the analyst set an analysis range that excludes it. Note from the middle panel that the extremes (sizes 50, 60, 380, and 400) of the standard did not run "as linearly" as the rest. These extremes would cause previous matchers some problems, though their errors are quite small here.
Conclusion It can be seen that embodiments of the present invention provide a simple formulation of the size-standard matching problem which leads to a simple algorithm for its solution. In some embodiments the formulation emphasises corresponding ratios of edge lengths combined with the heights of the matched peaks . In preferred embodiments, the corresponding algorithm employs dynamic programming directly on the input sequences. It runs in 0 (m3 nB) time, where ns is the number of fragment sizes in the size-standard definition, and m = np - nε is the number of extra peaks in the in-lane size standard.
Preferred embodiments of the invention employ an algorithm that requires just a very few (three) parameters:
1. the weight for the ratio cost,
. the minimum acceptable quality, and
3. the number m of extra peaks to consider. System Architecture
Figure 11 is a block diagram that illustrates a computer system 500, according to certain embodiments, upon which embodiments of the invention may be implemented. Computer system 500 includes a bus 502 or other communication mechanism for communicating information, and a processor 504 coupled with bus 502 for processing information. Computer system 500 also includes a memory 506, which can be a random access memory (RAM) or other dynamic storage device, coupled to bus 502 for determining size matches, and instructions to be executed by processor 50 . Memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504. Computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504. A storage device 510, such as a magnetic disk or optical disk, is provided and coupled to bus 502 for storing information and instructions.
Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT) or liquid crystal display (LCD) , for displaying information to a computer user. An input device 514, including alphanumeric and other keys, is coupled to bus 502 for communicating information and command selections to processor 504. Another type of user input device is cursor control 516, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y) , that allows the device to specify positions in a plane.
A computer system 500 evaluates size matchings and provides a quality level for each matching. Consistent with certain implementations of the invention, a matching and associated quality is provided by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in memory 506. Such instructions may be read into memory 506 from another computer-readable medium, such as storage device 510. Execution of the sequences of instructions contained in memory 506 causes processor 504 to perform the process states described herein. Alternatively hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus implementations of the invention are not limited to any specific combination of hardware circuitry and software.
The term "computer-readable medium" as used herein refers to any media that participates in providing instructions to processor 504 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510. Volatile media includes dynamic memory, such as memory 506. Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 502. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, papertape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution. For example, the instructions may initially be carried on magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector coupled to bus 502 can receive the data carried in the infra-red signal and place the data on bus 502. Bus 502 carries the data to memory 506, from which processor 504 retrieves and executes the instructions. The instructions received by memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
It will be appreciated that the above embodiments and results are exemplary only and various modifications to that which have been specifically described are possible within the scope of the invention. In particular it is to be noted that the applicability of the invention is not limited to the described biological data, but is also applicable to Other biological and non-biological data.

Claims

Claims
1. A computer implemented method for evaluating a matching between a plurality of measured data points representative of a physical property and a standard comprising a sequence of pre-defined values of said property, the method comprising:
(a) selecting a sequence of the measured data points corresponding in number to the number of pre-defined values of said property comprised in said standard and matching them in sequence to the values of the standard; (b) defining a sequence of measured edges as the set of intervals between sequential ones of the measured data points selected in step (a) ;
(c) calculating the ratio of each pair of adjacent measured edges defined in step (b) ; (d) defining a sequence of standard edges as the sequence of intervals between sequential ones of the pre-defined values in the standard, each standard edge corresponding to a respective different measured edge;
(e) calculating the ratio of each pair of adjacent standard edges defined in step (d) ; and
(f) evaluating the matching by comparing the ratios of the measured edges with the corresponding ratios of the standard edges .
2. A method according to claim 1, wherein the step of comparing ratios comprises calculating the ratios of corresponding ones of the measured edge ratios and the standard edge ratios to generate a ratio cost for each corresponding pair of ratios.
3. A method according to claim 2, comprising summing the ratio costs to generate a total cost for the matching.
4. A method according to claim 2 , wherein each of said measured data points has an associated magnitude, the method further comprising calculating a magnitude cost for each of the selected points as a ratio of the magnitude of the point to the maximum magnitude across all of the measured data points .
5. A method according to claim 4, comprising generating a total cost for the matching as a weighted sum of the ratio costs and magnitude costs for the selected points.
6. A method according to claim 3 or claim 5, comprising determining a quality for the matching by comparing the total cost of the matching with a maximum possible total cost.
7. A method according to claim 6, wherein a matching is rejected if the quality is not sufficiently high.
8. A method according to any one of the preceding claims, wherein there are more measured data points than there are values in the standard, and the evaluation is repeated for one of more different sub-sequences of measured data points.
9. A method according to claim 8, further comprising calculating total costs for matching each of said sub-sequences of measured data points to said standard, and identifying the matching having the highest cost as the best matching.
10. A method according to any one of the preceding claims, wherein the plurality of measured data points from which the sequence of matched points are selected is itself a sub-set of a larger set of measured data points .
11. A method according to claim 10, wherein the number of data points selected from said larger set equals the number of values in said standard plus a pre-defined number of additional data points.
12. A method according to claim 11 or claim 12, wherein the measured data points of said larger set each have a magnitude, and the plurality of measured points selected from said larger set of data points is selected on the basis of said magnitude.
13. A method according to any one of the preceding claims, wherein the evaluation of the matching comprises employing an algorithm for which the execution time increases no more than substantially linearly with the number of pre-defined values in the standard.
14. A method according to any one of the preceding claims, wherein said plurality of measured data points exceeds the number of predetermined values in the standard and the evaluation of the matching comprises employing an algorithm for which the execution time increase no more than as the cube of the number of measured data points in excess of the number of values in the standard.
15. A method according to any one of the preceding claims, wherein the matching is evaluated using a dynamic programming algorithm.
16. A computer implemented method for evaluating a matching between a plurality of measured peak locations, the locations representing
DNA fragment size, and a size standard definition comprising a sequence of pre-defined DNA fragment sizes, the method comprising:
(a) selecting a sequence of the measured peak locations corresponding in number to the number of pre-defined sizes in the size standard definition;
(b) defining a sequence of measured edges as the set of intervals between sequential ones of the peak locations selected in step (a) ;
(c) calculating the ratio of each pair of adjacent measured edges defined in step (b) ;
(d) defining a sequence of standard edges as the sequence of intervals between sequential ones of the sizes of the size standard definition, each standard edge corresponding to a respective different measured edge; (e) calculating the ratio of each pair of adjacent standard edges defined in step (d) ; and
(f) evaluating the matching by comparing the ratios of the measured edges with the corresponding ratios of the standard edges.
17. A method according to claim 16, wherein the step of comparing ratios comprises calculating the ratios of corresponding ones of the measured edge ratios and the standard edge ratios to generate a ratio cost for each corresponding pair of ratios.
18. A method according to claim 17, comprising summing the ratio costs to generate a total cost for the matching .
19. A method according to claim 17, wherein each of said peak locations has an associated peak height, the method further comprising calculating a height cost for each of the selected peaks as a ratio of the height of the peak to the maximum height across all of the peaks .
20. A method according to claim 19, comprising generating a total cost for the matching as a weighted sum of the ratio costs and height costs for the selected points.
21. A method according to claim 18 or claim 20, comprising determining a quality for the matching by comparing the total cost of the matching with a maximum possible total cost .
22. A method according to claim 21, wherein a matching is rejected if the quality is not sufficiently high.
23. A method according to any one of claims 16 to 22, wherein there are more measured peak locations than there are sizes in the size standard definition, and the evaluation is repeated for one of more different sub-sequences of measured peak locations.
24. A method according to claim 23, further comprising calculating total costs for matching each of said sub-sequences of measured peak locations to said size standard definition, and identifying the matching having the highest cost as the best matching.
25. A method according to any one of claims 16 to 24, wherein the plurality of peak locations from which the sequence of matched peaks are selected is itself a sub-set of a larger set of peak locations .
26. A method according to claim 25, wherein the number of peak locations selected from said larger set equals the number of sizes in said size standard definition plus a pre-defined number of additional peak locations .
27. A method according to claim 25 or claim 26, wherein the peak locations of said larger set each have an associated height, and the plurality of peak locations selected from said larger set of data points is selected on the basis of said height.
28. A method according to any one of claims 16 to 27, wherein the evaluation of the matching comprises employing an algorithm for which the execution time increases no more than substantially linearly with the number of sizes in the size standard definition.
29. A method according to any one of claims 16 to 27, wherein said plurality of peak locations exceeds the number of sizes in the size standard definition and the evaluation of the matching comprises employing an algorithm for which the execution time increase no more than as the cube of the number of peak locations in excess of the number of sizes in the size standard definition.
30. A method according to any one of claims 16 to 29, wherein the matching is evaluated using a dynamic programming algorithm.
31. A computer system for evaluating a matching between a plurality of measured data points representative of a physical property and a standard comprising a sequence of pre-defined values of said property, the system comprising:
(i) memory for storing values of said measured data points and said pre-defined values of said standard; and (ii) processing means for: (a) selecting a sequence of the measured data points corresponding in number to the number of pre-defined values of said property comprised in said standard and matching them in sequence to the values of the standard;
(b) defining a sequence of measured edges as the set of intervals between sequential ones of the measured data points selected in step (a) ;
(c) calculating the ratio of each pair of adjacent measured edges defined in step (b) ;
(d) defining a sequence of standard edges as the sequence of intervals between sequential ones of the pre-defined values in the standard, each standard edge corresponding to a respective different measured edge;
(e) calculating the ratio of each pair of adjacent standard edges defined in step (d) ; and (f) evaluating the matching by comparing the ratios of the measured edges with the corresponding ratios of the standard edges; and (iii) output means for outputting the result of said evaluation.
32. A system according to claim 31, wherein said processing means is operable in accordance with any one of the methods of claims 1 to 15.
33. A computer system for evaluating a matching between a plurality of measured peak locations, the locations representing DNA fragment size, and a size standard definition comprising a sequence of pre-defined DNA fragment sizes, the system comprising: (i) memory for storing values of said peak locations and said sizes of said size standard definition; and (ii) processing means for:
(a) selecting a sequence of the measured peak locations corresponding in number to the number of pre-defined sizes in the size standard definition;
(b) defining a sequence of measured edges as the set of intervals between sequential ones of the peak locations selected in step (a) ;
(c) calculating the ratio of each pair of adjacent measured edges defined in step (b) ;
(d) defining a sequence of standard edges as the sequence of intervals between sequential ones of the sizes of the size standard definition, each standard edge corresponding to a respective different measured edge; (e) calculating the ratio of each pair of adjacent standard edges defined in step (d) ; and
(f) evaluating the matching by comparing the ratios of the measured edges with the corresponding ratios of the standard edges; and (iii) output means for outputting the result of said evaluation.
34. A system according to claim 33, wherein said processing means is operable in accordance with any one of the methods of claims 16 to 30.
35. A computer program which is executable on a computer or a distributed network of computers to cause the computer or network of computers to operate in accordance with the method of any one of claims 1 to 30.
36. A computer program according to claim 35 stored on a computer- readable medium.
PCT/US2001/023630 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards WO2002008949A2 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
AU2001283003A AU2001283003B2 (en) 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards
AU8300301A AU8300301A (en) 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards
JP2002514583A JP2004505254A (en) 2000-07-21 2001-07-23 Method and system for evaluating the match between measured data and a standard
CA002416611A CA2416611A1 (en) 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards
EP01961761A EP1350179A2 (en) 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US21969700P 2000-07-21 2000-07-21
US60/219,697 2000-07-21

Publications (2)

Publication Number Publication Date
WO2002008949A2 true WO2002008949A2 (en) 2002-01-31
WO2002008949A3 WO2002008949A3 (en) 2003-07-31

Family

ID=22820389

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/023630 WO2002008949A2 (en) 2000-07-21 2001-07-23 Methods and systems for evaluating a matching between measured data and standards

Country Status (5)

Country Link
EP (1) EP1350179A2 (en)
JP (1) JP2004505254A (en)
AU (2) AU2001283003B2 (en)
CA (1) CA2416611A1 (en)
WO (1) WO2002008949A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012002930A1 (en) * 2010-06-29 2012-01-05 Analogic Corporation Internal sizing/lane standard signal verification
WO2017001597A1 (en) * 2015-07-01 2017-01-05 Ge Healthcare Bio-Sciences Ab Method for determining a size of biomolecules

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1993014224A1 (en) * 1992-01-14 1993-07-22 Applied Biosystems, Inc. Size-calibration dna fragment mixture and method
US5876933A (en) * 1994-09-29 1999-03-02 Perlin; Mark W. Method and system for genotyping
WO2000016087A1 (en) * 1998-09-16 2000-03-23 The Perkin-Elmer Corporation Spectral calibration of fluorescent polynucleotide separation apparatus
US6054268A (en) * 1994-06-17 2000-04-25 Perlin; Mark W. Method and system for genotyping
EP1128311A2 (en) * 2000-02-15 2001-08-29 Mark W. Perlin A method and system for DNA analysis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1993014224A1 (en) * 1992-01-14 1993-07-22 Applied Biosystems, Inc. Size-calibration dna fragment mixture and method
US6054268A (en) * 1994-06-17 2000-04-25 Perlin; Mark W. Method and system for genotyping
US5876933A (en) * 1994-09-29 1999-03-02 Perlin; Mark W. Method and system for genotyping
WO2000016087A1 (en) * 1998-09-16 2000-03-23 The Perkin-Elmer Corporation Spectral calibration of fluorescent polynucleotide separation apparatus
EP1128311A2 (en) * 2000-02-15 2001-08-29 Mark W. Perlin A method and system for DNA analysis

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012002930A1 (en) * 2010-06-29 2012-01-05 Analogic Corporation Internal sizing/lane standard signal verification
US20130090861A1 (en) * 2010-06-29 2013-04-11 Analogic Corporation Internal sizing/lane standard signal verification
US10503573B2 (en) * 2010-06-29 2019-12-10 Analogic Corporation Internal sizing/lane standard signal verification
WO2017001597A1 (en) * 2015-07-01 2017-01-05 Ge Healthcare Bio-Sciences Ab Method for determining a size of biomolecules
CN107787452A (en) * 2015-07-01 2018-03-09 通用电气健康护理生物科学股份公司 For the method for the size for determining biomolecule
CN113916964A (en) * 2015-07-01 2022-01-11 思拓凡瑞典有限公司 Method for determining the size of a biomolecule
US11237130B2 (en) 2015-07-01 2022-02-01 Cytiva Sweden Ab Method for determining a size of biomolecules

Also Published As

Publication number Publication date
AU2001283003B2 (en) 2004-04-08
AU8300301A (en) 2002-02-05
JP2004505254A (en) 2004-02-19
WO2002008949A3 (en) 2003-07-31
EP1350179A2 (en) 2003-10-08
CA2416611A1 (en) 2002-01-31

Similar Documents

Publication Publication Date Title
Kaur et al. A neural network method for prediction of β-turn types in proteins using evolutionary information
Panconesi et al. Fast hare: A fast heuristic for single individual SNP haplotype reconstruction
US6789020B2 (en) Expert system for analysis of DNA sequencing electropherograms
Macken et al. Evolutionary walks on rugged landscapes
US6898596B2 (en) Evolution of library data sets
KR102425673B1 (en) How to reorder sequencing data reads
US6842261B2 (en) Integrated circuit profile value determination
US20190213328A1 (en) Model for detection of anomalous discrete data sequences
US10002232B2 (en) Biological sample analysis system and method
JP7286872B2 (en) Gene alignment technology
US9390163B2 (en) Method, system and software arrangement for detecting or determining similarity regions between datasets
AU2001283003B2 (en) Methods and systems for evaluating a matching between measured data and standards
AU2001283003A1 (en) Methods and systems for evaluating a matching between measured data and standards
Nam et al. An efficient top-down search algorithm for learning boolean networks of gene expression
US6810357B2 (en) Systems and methods for mining model accuracy display for multiple state prediction
US20130289891A1 (en) Rank Normalization for Differential Expression Analysis of Transcriptome Sequencing Data
Kousathanas et al. A guide to general-purpose ABC software
McGeoch et al. Milestones on the quantum utility highway
Frith A simple theory for finding related sequences by adding probabilities of alternative alignments
CN106776598B (en) Information processing method and device
Mamrak et al. Statistical Methods for Comparing Computer Services
Suh et al. Parallel computing methods for analyzing gene expression relationships
US20070196851A1 (en) System and method for selecting differentially dependent genes from gene expression data
Karp et al. An algorithm for analysing probed partial digestion experiments
Fillon et al. Symbolic regression of discontinuous and multivariate functions by Hyper-Volume Error Separation (HVES)

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2416611

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2001961761

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2001283003

Country of ref document: AU

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP Wipo information: published in national office

Ref document number: 2001961761

Country of ref document: EP

WWG Wipo information: grant in national office

Ref document number: 2001283003

Country of ref document: AU