IL309786A - Quality score calibration of basecalling systems

Quality score calibration of basecalling systems

Info

Publication number
IL309786A
Authority
IL
Israel
Prior art keywords
sensor data
range
clusters
computer
subset
Application number
IL309786A
Other languages
Hebrew (he)
Original Assignee
Illumina Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-07-28
Filing date
2022-07-28
Publication date
2024-02-01
Priority claimed from US17/839,387 (US20230029970A1)
Application filed by Illumina Inc
Publication of IL309786A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Analytical Chemistry (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Organic Chemistry (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Signal Processing (AREA)
  • Biochemistry (AREA)
  • Genetics & Genomics (AREA)

Claims (20)

1. A computer-implemented method of generating base calls by a base caller, comprising: receiving, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identifying a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; mapping at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; processing the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remapping each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
2. The computer-implemented method of claim 1, wherein the second range is fully encompassed within the first range.
3. The computer-implemented method of claim 1 or 2, wherein one or more outlier sensor data within the first range are absent from the second range of sensor data.
4. The computer-implemented method of any of claims 1-3, wherein identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
5. The computer-implemented method of claim 4, wherein at least one of the lower threshold percentage or the upper threshold percentage is 0.5% or less.
6. The computer-implemented method of claim 4, wherein at least one of the lower threshold percentage or the upper threshold percentage is 1.0% or less.
7. The computer-implemented method of any of claims 4-6, wherein each of the lower threshold percentage and the upper threshold percentage is 0.5% or less.
8. The computer-implemented method of any of claims 4-6, wherein each of the lower threshold percentage and the upper threshold percentage is 1% or less.
9. The computer-implemented method of any of claims 4-8, further comprising: identifying (i) a first outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is lower than the low value and (ii) a second outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is higher than the high value; and prior to the mapping, assigning the low value to the first outlier sensor data, and assigning the high value to the second outlier sensor data, such that the first outlier sensor data and the second outlier sensor data are within the second range subsequent to the assignment.
10. The computer-implemented method of any of claims 4-9, further comprising: identifying (i) a first outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is lower than the low value and (ii) a second outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is higher than the high value; and excluding the first outlier sensor data and the second outlier sensor data from the subset of the plurality of sensor data representing the subset of the plurality of clusters incorporating different nucleotide bases with different labels during the mapping, for being outside the second range, such that the first outlier sensor data and the second outlier sensor data are not mapped to the third range.
11. The computer-implemented method of any of claims 1-10, wherein mapping at least a subset of the plurality of sensor data representing the subset of the plurality of clusters incorporating different nucleotide bases with different labels comprises: mapping a first sensor data within the subset from a first value that is within the second range to a second value that is within the third range; and mapping a second sensor data within the subset from a third value that is within the second range to a fourth value that is within the third range.
12. The computer-implemented method of any of claims 1-11, wherein individual sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels comprises corresponding intensity of a corresponding section of an image generated from the flow cell.
13. The computer-implemented method of any of claims 1-12, further comprising: processing the plurality of normalized sensor data in a base caller, to assign, to the corresponding base called for the target cluster, a first quality score indicating a first probability of the corresponding base being an A, a second quality score indicating a second probability of the corresponding base being a C, a third quality score indicating a third probability of the corresponding base being a T, and a fourth quality score indicating a fourth probability of the corresponding base being a G.
14. The computer-implemented method of claim 13, wherein the plurality of quality scores corresponding to the base call comprise the first quality score, the second quality score, the third quality score, and the fourth quality score.
15. The computer-implemented method of claim 14, further comprising: quantizing each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
16. A non-transitory computer readable storage medium comprising computer program instructions that, when executed on a processor, cause a computing device to: receive, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identify a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; map at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; process the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remap each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
17. The non-transitory computer readable storage medium of claim 16, wherein identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
18. The non-transitory computer readable storage medium of claim 17, further comprising computer program instructions that, when executed on the processor, cause the computing device to quantize each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
19. A system comprising: at least one processor; and a non-transitory computer-readable medium comprising instructions thereon that, when executed by the at least one processor, cause the system to: receive, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identify a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; map at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; process the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remap each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
20. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to quantize each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
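
Illustrative sketch (not part of the claims). The steps recited in claims 1, 4, 9, 10, 11 and 13-15 can be pictured roughly as follows: find a percentile-bounded second range, clip or drop outliers, map the rest linearly to a third range, then remap and quantize the per-base quality scores. The claims do not specify the remapping function, the quantization levels, or the third range, so the linear mapping, the piecewise-linear calibration table, the level values, and every function and variable name below are assumptions chosen only for illustration. The base caller itself is not modeled; its output is replaced by made-up scores.

```python
from bisect import bisect_left


def find_second_range(values, lower_pct=0.5, upper_pct=0.5):
    """Return (low, high) such that about lower_pct % of the values lie
    below low and about upper_pct % lie above high (cf. claim 4)."""
    if not values:
        raise ValueError("no sensor data")
    ordered = sorted(values)
    n = len(ordered)
    lo_idx = int(n * lower_pct / 100.0)
    hi_idx = n - 1 - int(n * upper_pct / 100.0)
    return ordered[lo_idx], ordered[hi_idx]


def normalize(values, low, high, target=(0.0, 1.0), clip=True):
    """Linearly map values from [low, high] to the target (third) range.

    clip=True assigns low/high to outliers before mapping (cf. claim 9);
    clip=False leaves outliers out of the mapped subset (cf. claim 10).
    The linear form of the mapping is an assumption.
    """
    t_lo, t_hi = target
    span = high - low
    if span <= 0:
        raise ValueError("second range must have positive width")
    out = []
    for v in values:
        if v < low or v > high:
            if not clip:
                continue  # outlier is not mapped to the third range
            v = min(max(v, low), high)
        out.append(t_lo + (v - low) * (t_hi - t_lo) / span)
    return out


def remap_quality(q, table):
    """Remap a raw quality score through a sorted (raw, remapped) table,
    interpolating linearly between entries (hypothetical calibration)."""
    xs = [x for x, _ in table]
    if q <= xs[0]:
        return table[0][1]
    if q >= xs[-1]:
        return table[-1][1]
    i = bisect_left(xs, q)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (q - x0) * (y1 - y0) / (x1 - x0)


def quantize(q, levels):
    """Snap a remapped score to the nearest allowed level (cf. claim 15)."""
    return min(levels, key=lambda level: abs(level - q))


# Toy data: 198 ordinary intensities plus two artificial outliers.
intensities = [float(i) for i in range(1, 199)] + [-500.0, 5000.0]
low, high = find_second_range(intensities, lower_pct=0.5, upper_pct=0.5)
normalized = normalize(intensities, low, high, target=(0.0, 1.0), clip=True)

# Made-up per-base scores (A, C, T, G) standing in for base caller output.
raw_scores = {"A": 0.91, "C": 0.04, "T": 0.03, "G": 0.02}
calibration = [(0.0, 0.0), (0.5, 0.4), (1.0, 1.0)]  # hypothetical table
levels = [0.0, 0.25, 0.5, 0.75, 1.0]                # hypothetical levels
final = {base: quantize(remap_quality(q, calibration), levels)
         for base, q in raw_scores.items()}
```

Passing clip=False to normalize gives the claim 10 variant, in which the two outliers are simply left out of the normalized set instead of being pulled to the edges of the second range.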

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202163226707P 2021-07-28 2021-07-28
US17/839,387 (US20230029970A1) 2021-07-28 2022-06-13 Quality score calibration of basecalling systems
PCT/US2022/038729 (WO2023009758A1) 2021-07-28 2022-07-28 Quality score calibration of basecalling systems

Publications (1)

Publication Number Publication Date
IL309786A 2024-02-01

Family

ID=83149575

Family Applications (1)

Application Number Priority Date Filing Date Title
IL309786A 2021-07-28 2022-07-28 Quality score calibration of basecalling systems

Country Status (7)

Country Link
EP (1) EP4377960A1 (en)
JP (1) JP2024532049A (en)
KR (1) KR20240037882A (en)
AU (1) AU2022319125A1 (en)
CA (1) CA3223746A1 (en)
IL (1) IL309786A (en)
WO (1) WO2023009758A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118053503B (en) * 2024-01-11 2024-12-06 中国农业科学院农业基因组研究所 A method and system for constructing a multi-omics database of invasive organisms

Family Cites Families (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0450060A1 (en) 1989-10-26 1991-10-09 Sri International Dna sequencing
US5641658A (en) 1994-08-03 1997-06-24 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid with two primers bound to a single solid support
US6090592A (en) 1994-08-03 2000-07-18 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid on supports
EP1498494A3 (en) 1997-04-01 2007-06-20 Solexa Ltd. Method of nucleic acid sequencing
EP2327797B1 (en) 1997-04-01 2015-11-25 Illumina Cambridge Limited Method of nucleic acid sequencing
AR021833A1 (en) 1998-09-30 2002-08-07 Applied Research Systems METHODS OF AMPLIFICATION AND SEQUENCING OF NUCLEIC ACID
US20020150909A1 (en) 1999-02-09 2002-10-17 Stuelpnagel John R. Automated information processing in randomly ordered arrays
US20030064366A1 (en) 2000-07-07 2003-04-03 Susan Hardin Real-time sequence determination
EP1354064A2 (en) 2000-12-01 2003-10-22 Visigen Biotechnologies, Inc. Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity
AR031640A1 (en) 2000-12-08 2003-09-24 Applied Research Systems ISOTHERMAL AMPLIFICATION OF NUCLEIC ACIDS IN A SOLID SUPPORT
US7057026B2 (en) 2001-12-04 2006-06-06 Solexa Limited Labelled nucleotides
US20040002090A1 (en) 2002-03-05 2004-01-01 Pascal Mayer Methods for detecting genome-wide sequence variations associated with a phenotype
AU2003259350A1 (en) 2002-08-23 2004-03-11 Solexa Limited Modified nucleotides for polynucleotide sequencing
ES2864086T3 (en) 2002-08-23 2021-10-13 Illumina Cambridge Ltd Labeled nucleotides
GB0321306D0 (en) 2003-09-11 2003-10-15 Solexa Ltd Modified polymerases for improved incorporation of nucleotide analogues
WO2005065814A1 (en) 2004-01-07 2005-07-21 Solexa Limited Modified molecular arrays
EP1790202A4 (en) 2004-09-17 2013-02-20 Pacific Biosciences California Apparatus and method for analysis of molecules
WO2006064199A1 (en) 2004-12-13 2006-06-22 Solexa Limited Improved method of nucleotide detection
EP1888743B1 (en) 2005-05-10 2011-08-03 Illumina Cambridge Limited Improved polymerases
US8045998B2 (en) 2005-06-08 2011-10-25 Cisco Technology, Inc. Method and system for communicating using position information
GB0514936D0 (en) 2005-07-20 2005-08-24 Solexa Ltd Preparation of templates for nucleic acid sequencing
GB0517097D0 (en) 2005-08-19 2005-09-28 Solexa Ltd Modified nucleosides and nucleotides and uses thereof
US7405281B2 (en) 2005-09-29 2008-07-29 Pacific Biosciences Of California, Inc. Fluorescent nucleotide analogs and uses therefor
GB0522310D0 (en) 2005-11-01 2005-12-07 Solexa Ltd Methods of preparing libraries of template polynucleotides
EP2021503A1 (en) 2006-03-17 2009-02-11 Solexa Ltd. Isothermal methods for creating clonal single molecule arrays
CA2648149A1 (en) 2006-03-31 2007-11-01 Solexa, Inc. Systems and devices for sequence by synthesis analysis
US20080242560A1 (en) 2006-11-21 2008-10-02 Gunderson Kevin L Methods for generating amplified nucleic acid arrays
US7595882B1 (en) 2008-04-14 2009-09-29 Geneal Electric Company Hollow-core waveguide-based raman systems and methods
CA3104322C (en) 2011-09-23 2023-06-13 Illumina, Inc. Methods and compositions for nucleic acid sequencing
WO2016064703A1 (en) * 2014-10-21 2016-04-28 Life Technologies Corporation Methods, systems, and computer-readable media for blind deconvolution dephasing of nucleic acid sequencing data
CA3088687A1 (en) * 2018-01-26 2019-08-01 Quantum-Si Incorporated Machine learning enabled pulse and base calling for sequencing devices
US11347965B2 (en) * 2019-03-21 2022-05-31 Illumina, Inc. Training data generation for artificial intelligence-based sequencing

Also Published As

Publication number Publication date
WO2023009758A1 (en) 2023-02-02
AU2022319125A1 (en) 2024-01-18
EP4377960A1 (en) 2024-06-05
CA3223746A1 (en) 2023-02-02
JP2024532049A (en) 2024-09-05
KR20240037882A (en) 2024-03-22
