IL309786A - Quality score calibration of basecalling systems

Quality score calibration of basecalling systems

Info

Publication number
IL309786A
Authority
IL
Israel
Prior art keywords
sensor data
range
clusters
computer
subset
Prior art date
Application number
IL309786A
Other languages
Hebrew (he)
Original Assignee
Illumina Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/839,387 (US20230029970A1)
Application filed by Illumina Inc
Publication of IL309786A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Signal Processing (AREA)
  • Biochemistry (AREA)
  • Genetics & Genomics (AREA)

Claims (20)

1. A computer-implemented method of generating base calls by a base caller, comprising: receiving, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identifying a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; mapping at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; processing the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remapping each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
2. The computer-implemented method of claim 1, wherein the second range is fully encompassed within the first range.
3. The computer-implemented method of claim 1 or 2, wherein one or more outlier sensor data within the first range are absent from the second range of sensor data.
4. The computer-implemented method of any of claims 1-3, wherein identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
5. The computer-implemented method of claim 4, wherein at least one of the lower threshold percentage or the upper threshold percentage is 0.5% or less.
6. The computer-implemented method of claim 4, wherein at least one of the lower threshold percentage or the upper threshold percentage is 1.0% or less.
7. The computer-implemented method of any of claims 4-6, wherein each of the lower threshold percentage and the upper threshold percentage is 0.5% or less.
8. The computer-implemented method of any of claims 4-6, wherein each of the lower threshold percentage and the upper threshold percentage is 1% or less.
9. The computer-implemented method of any of claims 4-8, further comprising: identifying (i) a first outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is lower than the low value and (ii) a second outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is higher than the high value; and prior to the mapping, assigning the low value to the first outlier sensor data, and assigning the high value to the second outlier sensor data, such that the first outlier sensor data and the second outlier sensor data are within the second range subsequent to the assignment.
10. The computer-implemented method of any of claims 4-9, further comprising: identifying (i) a first outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is lower than the low value and (ii) a second outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is higher than the high value; and excluding the first outlier sensor data and the second outlier sensor data from the subset of the plurality of sensor data representing the subset of the plurality of clusters incorporating different nucleotide bases with different labels during the mapping, for being outside the second range, such that the first outlier sensor data and the second outlier sensor data are not mapped to the third range.
11. The computer-implemented method of any of claims 1-10, wherein mapping at least a subset of the plurality of sensor data representing the subset of the plurality of clusters incorporating different nucleotide bases with different labels comprises: mapping a first sensor data within the subset from a first value that is within the second range to a second value that is within the third range; and mapping a second sensor data within the subset from a third value that is within the second range to a fourth value that is within the third range.
12. The computer-implemented method of any of claims 1-11, wherein individual sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels comprises corresponding intensity of a corresponding section of an image generated from the flow cell.
13. The computer-implemented method of any of claims 1-12, further comprising: processing the plurality of normalized sensor data in a base caller, to assign, to the corresponding base called for the target cluster, a first quality score indicating a first probability of the corresponding base being an A, a second quality score indicating a second probability of the corresponding base being a C, a third quality score indicating a third probability of the corresponding base being a T, and a fourth quality score indicating a fourth probability of the corresponding base being a G.
14. The computer-implemented method of claim 13, wherein the plurality of quality scores corresponding to the base call comprise the first quality score, the second quality score, the third quality score, and the fourth quality score.
15. The computer-implemented method of claim 14, further comprising: quantizing each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
16. A non-transitory computer readable storage medium comprising computer program instructions that, when executed on a processor, cause a computing device to: receive, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identify a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; map at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; process the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remap each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
17. The non-transitory computer readable storage medium of claim 16, wherein identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
18. The non-transitory computer readable storage medium of claim 17, further comprising computer program instructions that, when executed on the processor, cause the computing device to quantize each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
19. A system comprising: at least one processor; and a non-transitory computer-readable medium comprising instructions thereon that, when executed by the at least one processor, cause the system to: receive, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identify a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; map at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; process the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remap each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
20. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to quantize each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
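
The steps recited in claims 1, 4, 9-11 and 13-15 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several choices are assumptions made only for illustration: the claims say "mapping" and "remapping" without fixing a formula, so the linear rescaling, the Phred-style conversion Q = -10*log10(1 - p), the affine `remap_quality` with its `slope` and `offset`, and the `bins` values are all hypothetical, as are every function and parameter name.

```python
import math
from bisect import bisect_right


def percentile_bounds(values, lower_pct=0.5, upper_pct=0.5):
    """Pick (low, high) so roughly lower_pct % of values fall below low and
    upper_pct % fall above high (the "second range" of claim 4)."""
    if not values:
        raise ValueError("no sensor data")
    if lower_pct < 0 or upper_pct < 0 or lower_pct + upper_pct >= 100:
        raise ValueError("threshold percentages must be >= 0 and sum to < 100")
    ordered = sorted(values)
    n = len(ordered)
    lo_idx = int(n * lower_pct / 100.0)
    hi_idx = n - 1 - int(n * upper_pct / 100.0)
    return ordered[lo_idx], ordered[hi_idx]


def normalize(values, low, high, out_min=0.0, out_max=1.0, clip=True):
    """Map [low, high] onto [out_min, out_max] (the "third range").

    clip=True assigns low/high to outliers before mapping (claim 9);
    clip=False leaves outliers out of the mapping (claim 10).
    The linear form is an assumption; the claims only say "mapping".
    """
    if high <= low:
        raise ValueError("second range must have positive width")
    result = []
    for v in values:
        if v < low or v > high:
            if not clip:
                continue
            v = min(max(v, low), high)
        result.append((v - low) / (high - low) * (out_max - out_min) + out_min)
    return result


def phred(p_correct, cap=60.0):
    """Phred-style score from a probability, capped to avoid log10(0)."""
    p_err = max(1.0 - p_correct, 10 ** (-cap / 10.0))
    return -10.0 * math.log10(p_err)


def remap_quality(score, slope=1.0, offset=0.0):
    """Hypothetical affine calibration standing in for the claimed remapping."""
    return max(0.0, slope * score + offset)


def quantize(score, bins=(2, 12, 23, 37)):
    """Snap a score to the largest bin not exceeding it (lowest bin as floor).
    The bin values are illustrative only."""
    idx = bisect_right(bins, score) - 1
    return bins[max(idx, 0)]


def calibrate(base_probs, slope=1.0, offset=0.0, bins=(2, 12, 23, 37)):
    """Per-base probabilities (claim 13) to remapped, quantized scores (claim 15)."""
    return {
        base: quantize(remap_quality(phred(p), slope, offset), bins)
        for base, p in base_probs.items()
    }
```

For example, `low, high = percentile_bounds(intensities)` followed by `normalize(intensities, low, high)` yields the normalized inputs for a base caller, and `calibrate({"A": 0.999, "C": 0.0005, "T": 0.0003, "G": 0.0002})` turns its four per-base probabilities into binned scores. Clipping keeps one normalized value per cluster, while `clip=False` shortens the output, so a caller using it would need to track which clusters were dropped.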

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202163226707P 2021-07-28 2021-07-28
US17/839,387 US20230029970A1 (en) 2021-07-28 2022-06-13 Quality score calibration of basecalling systems
PCT/US2022/038729 WO2023009758A1 (en) 2021-07-28 2022-07-28 Quality score calibration of basecalling systems

Publications (1)

Publication Number Publication Date
IL309786A IL309786A (en) 2024-02-01

Family

ID=83149575

Family Applications (1)

Application Number Title Priority Date Filing Date
IL309786A IL309786A (en) 2021-07-28 2022-07-28 Quality score calibration of basecalling systems

Country Status (7)

Country Link
EP (1) EP4377960A1 (en)
JP (1) JP2024532049A (en)
KR (1) KR20240037882A (en)
AU (1) AU2022319125A1 (en)
CA (1) CA3223746A1 (en)
IL (1) IL309786A (en)
WO (1) WO2023009758A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118053503A (en) * 2024-01-11 2024-05-17 中国农业科学院农业基因组研究所 Method and system for constructing invasive biology multi-group database

Family Cites Families (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2044616A1 (en) 1989-10-26 1991-04-27 Roger Y. Tsien Dna sequencing
US5641658A (en) 1994-08-03 1997-06-24 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid with two primers bound to a single solid support
US6090592A (en) 1994-08-03 2000-07-18 Mosaic Technologies, Inc. Method for performing amplification of nucleic acid on supports
JP2001517948A (en) 1997-04-01 2001-10-09 グラクソ、グループ、リミテッド Nucleic acid sequencing
AU6846698A (en) 1997-04-01 1998-10-22 Glaxo Group Limited Method of nucleic acid amplification
AR021833A1 (en) 1998-09-30 2002-08-07 Applied Research Systems METHODS OF AMPLIFICATION AND SEQUENCING OF NUCLEIC ACID
US20020150909A1 (en) 1999-02-09 2002-10-17 Stuelpnagel John R. Automated information processing in randomly ordered arrays
US20030064366A1 (en) 2000-07-07 2003-04-03 Susan Hardin Real-time sequence determination
WO2002044425A2 (en) 2000-12-01 2002-06-06 Visigen Biotechnologies, Inc. Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity
AR031640A1 (en) 2000-12-08 2003-09-24 Applied Research Systems ISOTHERMAL AMPLIFICATION OF NUCLEIC ACIDS IN A SOLID SUPPORT
US7057026B2 (en) 2001-12-04 2006-06-06 Solexa Limited Labelled nucleotides
US20040002090A1 (en) 2002-03-05 2004-01-01 Pascal Mayer Methods for detecting genome-wide sequence variations associated with a phenotype
PT2119722T (en) 2002-08-23 2016-12-12 Illumina Cambridge Ltd Labelled nucleotides
SI3363809T1 (en) 2002-08-23 2020-08-31 Illumina Cambridge Limited Modified nucleotides for polynucleotide sequencing
GB0321306D0 (en) 2003-09-11 2003-10-15 Solexa Ltd Modified polymerases for improved incorporation of nucleotide analogues
EP3175914A1 (en) 2004-01-07 2017-06-07 Illumina Cambridge Limited Improvements in or relating to molecular arrays
WO2006044078A2 (en) 2004-09-17 2006-04-27 Pacific Biosciences Of California, Inc. Apparatus and method for analysis of molecules
WO2006064199A1 (en) 2004-12-13 2006-06-22 Solexa Limited Improved method of nucleotide detection
EP1888743B1 (en) 2005-05-10 2011-08-03 Illumina Cambridge Limited Improved polymerases
US8045998B2 (en) 2005-06-08 2011-10-25 Cisco Technology, Inc. Method and system for communicating using position information
GB0514936D0 (en) 2005-07-20 2005-08-24 Solexa Ltd Preparation of templates for nucleic acid sequencing
GB0517097D0 (en) 2005-08-19 2005-09-28 Solexa Ltd Modified nucleosides and nucleotides and uses thereof
US7405281B2 (en) 2005-09-29 2008-07-29 Pacific Biosciences Of California, Inc. Fluorescent nucleotide analogs and uses therefor
GB0522310D0 (en) 2005-11-01 2005-12-07 Solexa Ltd Methods of preparing libraries of template polynucleotides
WO2007107710A1 (en) 2006-03-17 2007-09-27 Solexa Limited Isothermal methods for creating clonal single molecule arrays
CA2648149A1 (en) 2006-03-31 2007-11-01 Solexa, Inc. Systems and devices for sequence by synthesis analysis
US20080242560A1 (en) 2006-11-21 2008-10-02 Gunderson Kevin L Methods for generating amplified nucleic acid arrays
US7595882B1 (en) 2008-04-14 2009-09-29 Geneal Electric Company Hollow-core waveguide-based raman systems and methods
PT3623481T (en) 2011-09-23 2021-10-15 Illumina Inc Methods and compositions for nucleic acid sequencing
US20160110499A1 (en) * 2014-10-21 2016-04-21 Life Technologies Corporation Methods, systems, and computer-readable media for blind deconvolution dephasing of nucleic acid sequencing data
BR112020014542A2 (en) * 2018-01-26 2020-12-08 Quantum-Si Incorporated MACHINE LEARNING ENABLED BY PULSE AND BASE APPLICATION FOR SEQUENCING DEVICES
US11347965B2 (en) * 2019-03-21 2022-05-31 Illumina, Inc. Training data generation for artificial intelligence-based sequencing

Also Published As

Publication number Publication date
AU2022319125A1 (en) 2024-01-18
WO2023009758A1 (en) 2023-02-02
EP4377960A1 (en) 2024-06-05
CA3223746A1 (en) 2023-02-02
JP2024532049A (en) 2024-09-05
KR20240037882A (en) 2024-03-22
