IL309786A - Quality score calibration of basecalling systems - Google Patents
Quality score calibration of basecalling systems
Info
- Publication number
- IL309786A
- Authority
- IL
- Israel
- Prior art keywords
- sensor data
- range
- clusters
- computer
- subset
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Medical Informatics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biotechnology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Engineering & Computer Science (AREA)
- Chemical & Material Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Epidemiology (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Analytical Chemistry (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Organic Chemistry (AREA)
- Wood Science & Technology (AREA)
- Zoology (AREA)
- Immunology (AREA)
- Microbiology (AREA)
- Signal Processing (AREA)
- Biochemistry (AREA)
- Genetics & Genomics (AREA)
Claims (20)
1. A computer-implemented method of generating base calls by a base caller, comprising: receiving, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identifying a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; mapping at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; processing the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remapping each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
2. The computer-implemented method of claim 1, wherein the second range is fully encompassed within the first range.
3. The computer-implemented method of claim 1 or 2, wherein one or more outlier sensor data within the first range are absent from the second range of sensor data.
4. The computer-implemented method of any of claims 1-3, wherein identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
5. The computer-implemented method of claim 4, wherein at least one of the lower threshold percentage or the upper threshold percentage is 0.5% or less.
6. The computer-implemented method of claim 4, wherein at least one of the lower threshold percentage or the upper threshold percentage is 1.0% or less.
7. The computer-implemented method of any of claims 4-6, wherein each of the lower threshold percentage and the upper threshold percentage is 0.5% or less.
8. The computer-implemented method of any of claims 4-6, wherein each of the lower threshold percentage and the upper threshold percentage is 1% or less.
9. The computer-implemented method of any of claims 4-8, further comprising: identifying (i) a first outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is lower than the low value and (ii) a second outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is higher than the high value; and prior to the mapping, assigning the low value to the first outlier sensor data, and assigning the high value to the second outlier sensor data, such that the first outlier sensor data and the second outlier sensor data are within the second range subsequent to the assignment.
10. The computer-implemented method of any of claims 4-9, further comprising: identifying (i) a first outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is lower than the low value and (ii) a second outlier sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels that is higher than the high value; and excluding the first outlier sensor data and the second outlier sensor data from the subset of the plurality of sensor data representing the subset of the plurality of clusters incorporating different nucleotide bases with different labels during the mapping, for being outside the second range, such that the first outlier sensor data and the second outlier sensor data are not mapped to the third range.
11. The computer-implemented method of any of claims 1-10, wherein mapping at least a subset of the plurality of sensor data representing the subset of the plurality of clusters incorporating different nucleotide bases with different labels comprises: mapping a first sensor data within the subset from a first value that is within the second range to a second value that is within the third range; and mapping a second sensor data within the subset from a third value that is within the second range to a fourth value that is within the third range.
12. The computer-implemented method of any of claims 1-11, wherein individual sensor data of the plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels comprises corresponding intensity of a corresponding section of an image generated from the flow cell.
13. The computer-implemented method of any of claims 1-12, further comprising: processing the plurality of normalized sensor data in a base caller, to assign, to the corresponding base called for the target cluster, a first quality score indicating a first probability of the corresponding base being an A, a second quality score indicating a second probability of the corresponding base being a C, a third quality score indicating a third probability of the corresponding base being a T, and a fourth quality score indicating a fourth probability of the corresponding base being a G.
14. The computer-implemented method of claim 13, wherein the plurality of quality scores corresponding to the base call comprise the first quality score, the second quality score, the third quality score, and the fourth quality score.
15. The computer-implemented method of claim 14, further comprising: quantizing each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
16. A non-transitory computer readable storage medium comprising computer program instructions that, when executed on a processor, cause a computing device to: receive, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identify a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; map at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; process the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remap each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
17. The non-transitory computer readable storage medium of claim 16, wherein identifying the second range comprises: identifying, within the first range, a low value, such that a lower threshold percentage of the plurality of sensor data have a value that is lower than the low value; and identifying, within the first range, a high value, such that an upper threshold percentage of the plurality of sensor data have a value that is higher than the high value, wherein the second range is defined by the low value and the high value.
18. The non-transitory computer readable storage medium of claim 17, further comprising computer program instructions that, when executed on the processor, cause the computing device to quantize each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
19. A system comprising: at least one processor; and a non-transitory computer-readable medium comprising instructions thereon that, when executed by the at least one processor, cause the system to: receive, from a plurality of clusters within a region of a flow cell, a plurality of sensor data representing the plurality of clusters incorporating different nucleotide bases with different labels, wherein the plurality of sensor data is within a first range; identify a second range, such that at least a threshold percentage of the plurality of sensor data are within the second range; map at least a subset of the plurality of sensor data, that are within the second range and that represent, from the region of the flow cell, a subset of the plurality of clusters incorporating different nucleotide bases with different labels, to a third range, thereby generating a plurality of normalized sensor data for a target cluster of the plurality of clusters; process the plurality of normalized sensor data in a base caller, to determine, for the target cluster, a base call and a plurality of quality scores corresponding to the base call; and remap each of at least a subset of the plurality of quality scores to a remapped quality score for the base call.
20. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to quantize each of a plurality of remapped quality scores to a corresponding one of a plurality of quantized remapped quality scores.
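The normalization and score-handling steps recited in claims 1, 4, 9, 10 and 15 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`identify_second_range`, `normalize`, `remap_quality`, `quantize`), the linear mapping, the interpolated percentile, the calibration table, and the quantization levels are all assumptions introduced here for illustration; the claims do not specify any of them.

```python
import math


def percentile(sorted_vals, pct):
    """Linearly interpolated percentile (pct in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("no sensor data")
    pos = (len(sorted_vals) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def identify_second_range(data, lower_pct=0.5, upper_pct=0.5):
    """Claim 4 (sketch): find a low and a high value so that roughly lower_pct
    percent of the data lie below low and upper_pct percent lie above high."""
    s = sorted(data)
    return percentile(s, lower_pct), percentile(s, 100.0 - upper_pct)


def normalize(data, low, high, target=(0.0, 1.0), clip=True):
    """Map values from the second range [low, high] onto the third range.

    Assumption: the mapping is linear.
    clip=True  assigns low/high to outliers before mapping (claim 9).
    clip=False leaves outliers out of the mapping entirely (claim 10).
    """
    if high <= low:
        raise ValueError("second range must have positive width")
    t_lo, t_hi = target
    out = []
    for v in data:
        if v < low or v > high:
            if not clip:
                continue
            v = min(max(v, low), high)
        out.append(t_lo + (v - low) * (t_hi - t_lo) / (high - low))
    return out


def remap_quality(score, table):
    """Hypothetical remapping: piecewise-linear lookup through
    (raw, remapped) pairs sorted by strictly increasing raw score."""
    if score <= table[0][0]:
        return table[0][1]
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if score <= x1:
            return y0 + (y1 - y0) * (score - x0) / (x1 - x0)
    return table[-1][1]


def quantize(score, levels):
    """Claim 15 (sketch): snap a remapped score to the nearest allowed level."""
    return min(levels, key=lambda q: abs(q - score))


if __name__ == "__main__":
    intensities = list(range(1001))            # stand-in for per-cluster sensor data
    low, high = identify_second_range(intensities)
    normalized = normalize(intensities, low, high)
    calibration = [(0.0, 0.0), (0.5, 0.4), (1.0, 0.9)]   # invented values
    remapped = remap_quality(0.75, calibration)
    print(low, high, min(normalized), max(normalized))
    print(remapped, quantize(remapped, [0.1, 0.5, 0.9]))
```

The `clip` flag separates the two outlier treatments the claims describe: clamping outliers to the range boundaries before mapping, or excluding them from the mapped subset. The base caller itself, which sits between normalization and remapping in claim 1, is not modeled here.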
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202163226707P | 2021-07-28 | 2021-07-28 | |
| US17/839,387 US20230029970A1 (en) | 2021-07-28 | 2022-06-13 | Quality score calibration of basecalling systems |
| PCT/US2022/038729 WO2023009758A1 (en) | 2021-07-28 | 2022-07-28 | Quality score calibration of basecalling systems |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| IL309786A (en) | 2024-02-01 |
Family
ID=83149575
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| IL309786A IL309786A (en) | 2021-07-28 | 2022-07-28 | Quality score calibration of basecalling systems |
Country Status (7)
| Country | Link |
|---|---|
| EP (1) | EP4377960A1 (en) |
| JP (1) | JP2024532049A (en) |
| KR (1) | KR20240037882A (en) |
| AU (1) | AU2022319125A1 (en) |
| CA (1) | CA3223746A1 (en) |
| IL (1) | IL309786A (en) |
| WO (1) | WO2023009758A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118053503B (en) * | 2024-01-11 | 2024-12-06 | 中国农业科学院农业基因组研究所 | A method and system for constructing a multi-omics database of invasive organisms |
Family Cites Families (32)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0450060A1 (en) | 1989-10-26 | 1991-10-09 | Sri International | Dna sequencing |
| US5641658A (en) | 1994-08-03 | 1997-06-24 | Mosaic Technologies, Inc. | Method for performing amplification of nucleic acid with two primers bound to a single solid support |
| US6090592A (en) | 1994-08-03 | 2000-07-18 | Mosaic Technologies, Inc. | Method for performing amplification of nucleic acid on supports |
| EP1498494A3 (en) | 1997-04-01 | 2007-06-20 | Solexa Ltd. | Method of nucleic acid sequencing |
| EP2327797B1 (en) | 1997-04-01 | 2015-11-25 | Illumina Cambridge Limited | Method of nucleic acid sequencing |
| AR021833A1 (en) | 1998-09-30 | 2002-08-07 | Applied Research Systems | METHODS OF AMPLIFICATION AND SEQUENCING OF NUCLEIC ACID |
| US20020150909A1 (en) | 1999-02-09 | 2002-10-17 | Stuelpnagel John R. | Automated information processing in randomly ordered arrays |
| US20030064366A1 (en) | 2000-07-07 | 2003-04-03 | Susan Hardin | Real-time sequence determination |
| EP1354064A2 (en) | 2000-12-01 | 2003-10-22 | Visigen Biotechnologies, Inc. | Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity |
| AR031640A1 (en) | 2000-12-08 | 2003-09-24 | Applied Research Systems | ISOTHERMAL AMPLIFICATION OF NUCLEIC ACIDS IN A SOLID SUPPORT |
| US7057026B2 (en) | 2001-12-04 | 2006-06-06 | Solexa Limited | Labelled nucleotides |
| US20040002090A1 (en) | 2002-03-05 | 2004-01-01 | Pascal Mayer | Methods for detecting genome-wide sequence variations associated with a phenotype |
| AU2003259350A1 (en) | 2002-08-23 | 2004-03-11 | Solexa Limited | Modified nucleotides for polynucleotide sequencing |
| ES2864086T3 (en) | 2002-08-23 | 2021-10-13 | Illumina Cambridge Ltd | Labeled nucleotides |
| GB0321306D0 (en) | 2003-09-11 | 2003-10-15 | Solexa Ltd | Modified polymerases for improved incorporation of nucleotide analogues |
| WO2005065814A1 (en) | 2004-01-07 | 2005-07-21 | Solexa Limited | Modified molecular arrays |
| EP1790202A4 (en) | 2004-09-17 | 2013-02-20 | Pacific Biosciences California | Apparatus and method for analysis of molecules |
| WO2006064199A1 (en) | 2004-12-13 | 2006-06-22 | Solexa Limited | Improved method of nucleotide detection |
| EP1888743B1 (en) | 2005-05-10 | 2011-08-03 | Illumina Cambridge Limited | Improved polymerases |
| US8045998B2 (en) | 2005-06-08 | 2011-10-25 | Cisco Technology, Inc. | Method and system for communicating using position information |
| GB0514936D0 (en) | 2005-07-20 | 2005-08-24 | Solexa Ltd | Preparation of templates for nucleic acid sequencing |
| GB0517097D0 (en) | 2005-08-19 | 2005-09-28 | Solexa Ltd | Modified nucleosides and nucleotides and uses thereof |
| US7405281B2 (en) | 2005-09-29 | 2008-07-29 | Pacific Biosciences Of California, Inc. | Fluorescent nucleotide analogs and uses therefor |
| GB0522310D0 (en) | 2005-11-01 | 2005-12-07 | Solexa Ltd | Methods of preparing libraries of template polynucleotides |
| EP2021503A1 (en) | 2006-03-17 | 2009-02-11 | Solexa Ltd. | Isothermal methods for creating clonal single molecule arrays |
| CA2648149A1 (en) | 2006-03-31 | 2007-11-01 | Solexa, Inc. | Systems and devices for sequence by synthesis analysis |
| US20080242560A1 (en) | 2006-11-21 | 2008-10-02 | Gunderson Kevin L | Methods for generating amplified nucleic acid arrays |
| US7595882B1 (en) | 2008-04-14 | 2009-09-29 | General Electric Company | Hollow-core waveguide-based raman systems and methods |
| CA3104322C (en) | 2011-09-23 | 2023-06-13 | Illumina, Inc. | Methods and compositions for nucleic acid sequencing |
| WO2016064703A1 (en) * | 2014-10-21 | 2016-04-28 | Life Technologies Corporation | Methods, systems, and computer-readable media for blind deconvolution dephasing of nucleic acid sequencing data |
| CA3088687A1 (en) * | 2018-01-26 | 2019-08-01 | Quantum-Si Incorporated | Machine learning enabled pulse and base calling for sequencing devices |
| US11347965B2 (en) * | 2019-03-21 | 2022-05-31 | Illumina, Inc. | Training data generation for artificial intelligence-based sequencing |
2022
- 2022-07-28 WO PCT/US2022/038729 patent/WO2023009758A1/en not_active Ceased
- 2022-07-28 JP JP2023579782A patent/JP2024532049A/en active Pending
- 2022-07-28 IL IL309786A patent/IL309786A/en unknown
- 2022-07-28 KR KR1020237043770A patent/KR20240037882A/en active Pending
- 2022-07-28 CA CA3223746A patent/CA3223746A1/en active Pending
- 2022-07-28 EP EP22761681.0A patent/EP4377960A1/en active Pending
- 2022-07-28 AU AU2022319125A patent/AU2022319125A1/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| WO2023009758A1 (en) | 2023-02-02 |
| AU2022319125A1 (en) | 2024-01-18 |
| EP4377960A1 (en) | 2024-06-05 |
| CA3223746A1 (en) | 2023-02-02 |
| JP2024532049A (en) | 2024-09-05 |
| KR20240037882A (en) | 2024-03-22 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| JP2023080096A5 (en) | Deep learning based variant classifier | |
| US20230064991A1 (en) | Display method, display panel and display control device | |
| CN109492561B (en) | Optical remote sensing image ship detection method based on improved YOLO V2 model | |
| IL309786A (en) | Quality score calibration of basecalling systems | |
| EP3620982A1 (en) | Sample processing method and device | |
| CN108875779A (en) | Training method, device and the terminal device of neural network | |
| CN113516939A (en) | Brightness correction method, device, display device, computing device and storage medium | |
| US20220147441A1 (en) | Method and apparatus for allocating memory and electronic device | |
| CN108074220B (en) | Image processing method and device and television | |
| AU2018287566A1 (en) | Method for adjusting color temperature based on screen brightness, non-transitory computer-readable storage medium and terminal device | |
| CN112070682A (en) | Method and device for compensating image brightness | |
| CN113360105A (en) | Laser printer imaging system based on laser unit self-adaptive adjustment | |
| WO2019232870A1 (en) | Method for acquiring handwritten character training sample, apparatus, computer device, and storage medium | |
| KR20240137030A (en) | Method and system for automatically annotating sensor data | |
| US20240062521A1 (en) | Method and electronic device for object detection, and computer-readable storage medium | |
| CN111144270B (en) | Neural network-based handwritten text integrity evaluation method and evaluation device | |
| MX2022011775A (en) | Apparatus and method to facilitate identification of items. | |
| CN113570507A (en) | An image noise reduction method, device, equipment and storage medium | |
| KR20240132283A (en) | Detection method of OLED wet film defects based on lightweight semantic segmentation network | |
| CN118279329B (en) | Three-dimensional model singulation and semantically segmentation method, device, equipment and medium | |
| CN107423307B (en) | Internet information resource allocation method and device | |
| CN114972721B (en) | A method for identifying and locating insulator strings of transmission lines based on deep learning | |
| KR102851105B1 (en) | Real-time object detection method | |
| CN111695550A (en) | Character extraction method, image processing device and computer readable storage medium | |
| GB2603366A (en) | Method and system for performing de-identified location analytics |