WO2021030952A1 - Base recognition method, system, computer program product and sequencing system - Google Patents

Info

Publication number
WO2021030952A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
bright spot
pixel
bright
bright spots
Prior art date
Application number
PCT/CN2019/101067
Other languages
English (en)
French (fr)
Inventor
李林森
金欢
徐伟彬
姜泽飞
孙雷
Original Assignee
深圳市真迈生物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市真迈生物科技有限公司
Priority to EP19941798.1A (EP4015645A4, en)
Priority to CN201980058420.6A (CN112823352B, zh)
Priority to PCT/CN2019/101067 (WO2021030952A1, zh)
Publication of WO2021030952A1 (zh)

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/04 Recognition of patterns in DNA microarrays

Definitions

  • the present invention relates to the field of information processing and identification, in particular to the processing and analysis of data related to nucleic acid sequence determination, and more specifically to a base identification method, a base identification system, a sequencing system and a computer program product.
  • nucleic acid molecules templates
  • biochemical reactions to determine the nucleotide sequence of the nucleic acid molecules
  • how to determine the nucleotide composition and sequence of at least a part of the template, including how to identify and process the images collected at multiple different time points and the information on those images, is an issue worthy of attention.
  • the embodiments of the present invention aim to solve at least one of the technical problems existing in the related art, or at least to provide a practical alternative solution.
  • the present invention provides a base recognition method, a base recognition system, a computer program product and a sequencing system.
  • a base identification method comprising: mapping the coordinates of the bright spots in the bright spot set corresponding to the sequencing template onto the image to be inspected, to determine the corresponding coordinate positions on the image to be inspected; determining the intensity at the corresponding coordinate positions on the image to be inspected; and comparing the intensity at each corresponding coordinate position with a preset threshold, and performing base identification based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold.
  • the bright spot set corresponding to the sequencing template is constructed based on multiple images.
  • the images and the image to be inspected are all collected from base extension reactions.
  • the images and the image to be inspected correspond to the same field of view.
  • a base identification system is provided to implement the base identification method in any one of the above embodiments of the present invention.
  • the system includes: a mapping module for mapping the coordinates of the bright spots in the bright spot set corresponding to the sequencing template onto the image to be inspected, to determine the corresponding coordinate positions on the image to be inspected; an intensity determination module for determining the intensity at the corresponding coordinate positions received from the mapping module; and an identification module for comparing the intensity at each corresponding coordinate position received from the intensity determination module with a preset threshold, and performing base identification based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold.
  • the bright spot set corresponding to the sequencing template is constructed based on multiple images.
  • the images and the image to be inspected are all collected from base extension reactions and correspond to the same field of view.
  • during the base extension reaction, multiple nucleic acid molecules carrying optically detectable labels are present in the field of view, and at least some of these nucleic acid molecules appear as bright spots on the images and/or the image to be inspected.
  • a base recognition system, including: a memory for storing data, including a computer-executable program; and a processor for executing the computer-executable program to implement the base recognition method in any embodiment of the present invention described above.
  • a sequencing system which includes the base recognition system of any one of the embodiments of the present invention described above.
  • a computer-readable storage medium for storing a program for computer execution, and the execution of the program includes completing the base identification method in any of the above embodiments.
  • Computer-readable storage media include but are not limited to read-only memory, random access memory, magnetic or optical disks, etc.
  • a computer program product including instructions which, when the program is executed by a computer, cause the computer to execute the base identification method in any of the above embodiments of the present invention.
  • a sequencing system which includes the computer program product of any one of the foregoing embodiments of the present invention.
  • the base identification method, base identification system, computer-readable storage medium, computer program product and/or sequencing system in any of the above embodiments map the coordinates of the bright spots in the bright spot set corresponding to the sequencing template directly onto the image to be inspected, and perform base identification based on the information at the corresponding coordinate positions on the image to be inspected.
  • in this way, the type and order of the bases bound to the template nucleic acid during the base extension reaction can be identified simply, efficiently and with high throughput, enabling simple, fast and accurate determination of the template nucleic acid sequence.
  • FIG. 1 is a schematic flow chart of a base recognition method in a specific embodiment of the present invention.
  • FIG. 2 is a schematic flow chart of a base identification method in a specific embodiment of the present invention.
  • FIG. 3 is a schematic diagram of the process and result of merging bright spots in images Repeat1-20 in a specific embodiment of the present invention to obtain a set of bright spots corresponding to the sequencing template.
  • FIG. 4 is a schematic diagram of the deviation correction process and the deviation correction result in the specific embodiment of the present invention.
  • FIG. 5 is a schematic diagram of the corresponding matrix and pixels of candidate bright spots in a specific embodiment of the present invention.
  • FIG. 6 is a schematic diagram of pixel values in the range of m1*m2 centered on the central pixel point of the pixel point matrix in the specific embodiment of the present invention.
  • FIG. 7 is a schematic diagram of the comparison of the bright spot detection results before and after the determination based on the second bright spot detection threshold in the specific embodiment of the present invention.
  • FIG. 8 is a distribution curve of the number of pixels in a 300*300 area arranged in ascending order of pixel values in a specific embodiment of the present invention.
  • FIG. 9 is a distribution curve of the number of pixels in a 300*300 area arranged in ascending order of pixel values in a specific embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a base recognition system in a specific embodiment of the present invention.
  • "bright spots" (spots or peaks) refer to luminous spots or luminous points on an image; one bright spot occupies at least one pixel.
  • "sequencing", also called sequence determination or gene sequencing, refers to nucleic acid sequence determination, including DNA sequencing and/or RNA sequencing, and including long fragment sequencing and/or short fragment sequencing, including nucleic acid sequence
  • SBS sequencing by synthesis
  • SBL sequencing by ligation
  • the process of binding to the template is the base extension reaction.
  • the detection of "bright spots" is the detection of optical signals from extended bases or base clusters.
  • Sequencing can be performed through a sequencing platform.
  • the sequencing platform may be, but is not limited to, Illumina's HiSeq/MiSeq/NextSeq/NovaSeq sequencing platforms, Thermo Fisher/Life Technologies' Ion Torrent platform, BGI's BGISEQ and MGISEQ platforms, or a single-molecule sequencing platform; the sequencing mode may be single-end sequencing or paired-end sequencing; the sequencing results/data obtained are the fragments that are read out, called reads, and the length of a read is called the read length.
  • the images collected from the sequencing reaction/base extension reaction or the images converted or constructed based on these images may be grayscale images or color images.
  • for a grayscale image, the pixel value is the same as the grayscale value; for a 16-bit grayscale image such as a TIFF grayscale image, the pixel value ranges from 0 to 65535.
  • for an 8-bit grayscale image, the pixel value ranges from 0 to 255.
  • one pixel of a color image has three pixel values.
  • for a color image, the provided method and/or system can directly use the array of pixel values for image detection/target information recognition, or can first convert the color image to a grayscale image and then process the converted grayscale image and recognize information on it, which reduces the computation and complexity of the image detection and signal recognition process; methods for converting a non-grayscale image to a grayscale image include, but are not limited to, the floating-point method, the integer method, the shift method and the average method.
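  • The four conversion methods are named above but not spelled out. As a minimal sketch, assuming the widely used luma weights 0.299, 0.587 and 0.114 for the R, G and B channels (these coefficients and all function names are illustrative assumptions, not taken from the source), they could look like this:

```python
def gray_float(r, g, b):
    """Floating-point method: weighted sum with fractional weights."""
    return r * 0.299 + g * 0.587 + b * 0.114


def gray_integer(r, g, b):
    """Integer method: the same weights scaled by 1000, with integer division."""
    return (r * 299 + g * 587 + b * 114) // 1000


def gray_shift(r, g, b):
    """Shift method: weights scaled to 8 bits (77 + 150 + 29 = 256), then >> 8."""
    return (r * 77 + g * 150 + b * 29) >> 8


def gray_average(r, g, b):
    """Average method: unweighted mean of the three channels."""
    return (r + g + b) // 3
```

  • The integer and shift variants avoid floating-point arithmetic, which is the usual reason for choosing them when many large images have to be converted.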
  • "intensity" and pixel (pixel value) are used interchangeably, and the intensity or pixel value can be a true or objective absolute value, or a relative value.
  • a relative value includes various transformations of the real pixel value, such as an enlarged pixel value, a reduced pixel value, or a ratio or relationship based on the pixel value; generally, when the intensities/pixel values of multiple images, bright spots or positions are compared, they are the intensities/pixel values after the same processing, such as objective pixel values or pixel values after the same transformation.
  • when the information at specific positions of one or more images is compared and analyzed, it is preferable to align the images and place them in the same coordinate system before determining the specific positions.
  • A commonly used base identification method determines whether the nucleic acid molecule represented on the sequencing template reacted in the current round (that is, lit up; a base extension reaction) by comparing the distance between a bright spot on the image to be inspected and a bright spot on the sequencing template. If a reaction occurred, the type of base added, or the type of base at the corresponding position of the nucleic acid molecule, is determined.
  • the inventors found through a large amount of data testing that this method is strongly affected by image quality, the bright spot localization algorithm, the density distribution of the spots, etc., and is prone to misidentifying bases.
  • the inventors provide a base identification method in one embodiment, which includes: S2, mapping the coordinates of the bright spots in the bright spot set corresponding to the sequencing template onto the image to be inspected, to determine the corresponding coordinate positions on the image to be inspected; S4, determining the intensity at the corresponding coordinate positions on the image to be inspected; and S6, comparing the intensity at each corresponding coordinate position with a preset threshold, and performing base recognition based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold.
  • the bright spot set corresponding to the sequencing template is constructed based on multiple images.
  • the images and the image to be inspected are all collected from base extension reactions.
  • the images and the image to be inspected are from the same field of view.
  • the nucleic acid molecule is the template to be sequenced or a nucleic acid complex containing the template, and the base extension reaction includes the process in which nucleotides, including nucleotide analogues, bind to the template or nucleic acid complex; as those skilled in the art know, in images obtained on a sequencing platform that determines sequence by optical detection, nucleotides or nucleic acid molecules carrying optically detectable labels, such as fluorescent molecules, are excited by a laser to emit light and appear as bright spots in the image, and one bright spot occupies a few pixels.
  • in this base calling method, the coordinates of the bright spots in the bright spot set corresponding to the sequencing template are mapped directly onto the image to be inspected, and base identification is performed based on the information at the corresponding coordinate positions on the image to be inspected.
  • this method can identify the type and order of the bases bound to the template nucleic acid during the base extension reaction simply, efficiently and with high throughput, enabling simple, rapid and accurate determination of the template nucleic acid sequence.
  • comparative tests further found that the number of reads obtained by this method, including the number of reads that match a unique position on the reference sequence, is 30% or more higher than that obtained by the general base recognition method.
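  • Steps S2, S4 and S6 can be sketched as follows. This is a minimal illustration under stated assumptions: the image is a 2D list of pixel values, the template coordinates are mapped to the nearest pixel, and the intensity is taken to be that single pixel value. The source does not specify how the intensity at a position is computed, and the function name `call_positions` is hypothetical.

```python
def call_positions(template_spots, image, threshold):
    """Return the template spots whose intensity on `image` exceeds `threshold`.

    template_spots: list of (x, y) coordinates from the bright spot set.
    image: 2D list of pixel values, indexed as image[row][col], i.e. image[y][x].
    """
    height, width = len(image), len(image[0])
    called = []
    for x, y in template_spots:
        # S2: map the template coordinate onto the image (nearest pixel; an assumption).
        col, row = int(round(x)), int(round(y))
        if not (0 <= row < height and 0 <= col < width):
            continue  # the spot falls outside this image
        # S4: intensity at the mapped position (here simply the pixel value).
        intensity = image[row][col]
        # S6: keep only positions brighter than the preset threshold.
        if intensity > threshold:
            called.append((x, y, intensity))
    return called
```

  • Run once per image of a given base type, the returned positions indicate which template molecules incorporated that base in the current reaction; no bright spot detection on the image to be inspected is needed.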
  • the bright spot set corresponding to the sequencing template can be constructed when the base identification method is performed, or can be constructed in advance.
  • in one example, the bright spot set corresponding to the sequencing template is constructed in advance based on the images and saved for later use.
  • the bright spot set corresponding to the sequencing template is constructed based on the image.
  • the images include four types, a first image, a second image, a third image and a fourth image, corresponding respectively to the bases/nucleotides/nucleotide analogues A, T/U, G and C.
  • the first image includes image M1 and image M2, the second image includes image N1 and image N2, the third image includes image P1 and image P2, and the fourth image includes image Q1 and image Q2; carrying out the base extension reactions of the four types of bases sequentially and/or simultaneously is defined as one round of sequencing reaction; image M1 and image M2 come from two different rounds of sequencing reaction, as do image N1 and image N2, image P1 and image P2, and image Q1 and image Q2.
  • the method includes: S8, merging the bright spots on the first image, the second image, the third image and the fourth image, recording the number of bright spots at each position, and removing the bright spots whose count is 1, to obtain the bright spot set corresponding to the sequencing template.
  • This method can quickly and easily construct a bright spot set corresponding to nucleic acid molecules (sequencing templates) by directly merging bright spots on the image.
  • the constructed bright spot set can effectively, accurately and comprehensively reflect the information of the sequencing template, which is conducive to the accurate identification of subsequent bases and obtaining accurate nucleic acid sequences.
  • one round of sequencing reaction, in which the four types of base extension reactions are carried out sequentially and/or simultaneously, can be realized with the reaction substrates of all four base types (such as nucleotide analogues/base analogues) together in one base extension reaction system; or with two types of base analogues in one base extension reaction system and the other two types of reaction substrates in the next base extension reaction system; or with one type of base analogue in each base extension reaction system, the four types of base analogues being added in turn to four consecutive base extension reaction systems.
  • the first image, the second image, the third image, and the fourth image can be collected from two base extension reactions or more base extension reactions.
  • a base extension reaction may include one image acquisition or multiple image acquisitions.
  • a round of sequencing reaction may include multiple base extension reactions, such as in single-color sequencing: the reaction substrates (nucleotide analogues) corresponding to the four types of bases all carry the same fluorescent dye, and one round of sequencing reaction includes four base extension reactions (4 repeats).
  • one base extension reaction includes one image acquisition, and image M1, image N1, image P1 and image Q1 come from the same field of view in the four base extension reactions of one round of sequencing reaction, respectively.
  • a round of sequencing reaction may include two base extension reactions, such as in two-color sequencing: of the reaction substrates (nucleotide analogues) corresponding to the four types of bases, two carry one fluorescent dye and the other two carry another fluorescent dye with a different excitation wavelength; in each base extension reaction, the reaction substrates of two base types carrying different dyes undergo the nucleotide binding/extension reaction.
  • one base extension reaction includes two image acquisitions at different excitation wavelengths, and image M1, image N1, image P1 and image Q1 come from the same field of view under the two excitation wavelengths of the two base extension reactions in one round of sequencing reaction, respectively.
  • a round of sequencing reaction may include one base extension reaction, such as the two-color sequencing reaction of a second-generation sequencing platform.
  • the four types of base reaction substrates carry dye a, dye b, both dye a and dye b, and no dye, respectively, and the excitation wavelengths of dye a and dye b are different; the four types of reaction substrates realize one round of sequencing reaction in the same base extension reaction, and one base extension reaction includes two image acquisitions at different excitation wavelengths; the first image is the same as the third image, and the second image is the same as the fourth image.
  • image M1 and image N1 come from different rounds of sequencing reaction, or from the same field of view at different excitation wavelengths in the same round of sequencing reaction.
  • a round of sequencing reaction may include one base extension reaction, such as a four-color sequencing reaction.
  • the four types of base reaction substrates (such as nucleotide analogues) carry dye a, dye b, dye c and dye d, respectively, and the excitation wavelengths of dye a, dye b, dye c and dye d are all different; the four types of reaction substrates realize one round of sequencing reaction in the same base extension reaction, and one base extension reaction includes four image acquisitions at different excitation wavelengths.
  • image M1, image N1, image P1 and image Q1 come from different rounds of sequencing reaction, or from the same field of view at different excitation wavelengths in the same round of sequencing reaction.
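  • The four labelling schemes described above differ only in how many base extension reactions make up one round and how many images each reaction yields per field of view. The following sketch tabulates those numbers as stated in the text; the class name, field names and dictionary keys are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    extensions_per_round: int        # base extension reactions in one round
    acquisitions_per_extension: int  # image acquisitions per reaction, per field of view

    @property
    def images_per_round(self):
        """Images collected for one field of view in one round of sequencing reaction."""
        return self.extensions_per_round * self.acquisitions_per_extension


SCHEMES = {
    "single_color": Scheme(4, 1),              # same dye on all four substrates
    "two_color_two_extensions": Scheme(2, 2),  # two substrates per reaction, two dyes
    "two_color_one_extension": Scheme(1, 2),   # dye a, dye b, a+b, no dye
    "four_color": Scheme(1, 4),                # four dyes in one reaction
}
```

  • Only the one-reaction two-color scheme yields two images per round, which is consistent with the statement that the first image is the same as the third and the second the same as the fourth.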
  • S8 includes: (a) merging the bright spots on image N1 into image M1 to obtain a once-merged image M1, marking each overlapping bright spot in the once-merged image M1 with the number of bright spots it contains, and marking each non-overlapping bright spot as 1, where multiple bright spots in the once-merged image M1 whose distance from one another is less than a first predetermined pixel distance are regarded as one overlapping bright spot; (b) replacing image N1 with image P1, image Q1, image M2, image N2, image P2 or image Q2, replacing image M1 with the once-merged image M1, and performing (a) again, until the bright spots on all images have been merged, to obtain an original bright spot set; (c) removing the bright spots marked 1 from the original bright spot set, to obtain the bright spot set corresponding to the sequencing template.
  • in this way, the weights of images from different rounds of sequencing reaction can be balanced, more accurate template bright spots can be obtained, and the bright spot set corresponding to the sequencing template can be obtained quickly, simply and accurately, which helps to identify bases accurately and obtain reads.
  • the size of the electronic sensor used in the imaging system is 6.5 μm, the microscope magnification is 60 times, and the smallest size that can be seen is 0.1 μm.
  • the size of the bright spot corresponding to the nucleic acid molecule is generally less than 10*10 pixels.
  • in one example, the first predetermined pixel distance is set to 1.05 pixels. In this way, overlapping bright spots can be determined accurately, which helps to construct the sequencing template (the bright spot set) accurately.
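  • The quoted figures can be related by simple arithmetic. As a sketch, assuming that the 6.5 μm figure is the size of one sensor pixel (the text only says "the size of the electronic sensor") and that no additional optics change the effective magnification, one image pixel corresponds to roughly 0.108 μm on the sample:

```python
def object_plane_pixel_um(sensor_pixel_um, magnification):
    """Size on the sample that one image pixel covers, in micrometres."""
    return sensor_pixel_um / magnification


pixel_um = object_plane_pixel_um(6.5, 60)  # about 0.108 um per pixel
merge_radius_um = 1.05 * pixel_um          # the 1.05-pixel merge distance, in um
spot_extent_um = 10 * pixel_um             # upper bound of a 10*10-pixel bright spot
print(round(pixel_um, 4), round(merge_radius_um, 4), round(spot_extent_um, 3))
```

  • Under these assumptions the 1.05-pixel merging distance is about 0.114 μm on the sample, close to the stated 0.1 μm smallest visible size.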
  • a template vector (TemplateVec) is set up to hold the merged result of the bright spots (peaks) on the images of the same field of view from Repeat1 to Repeat20. Each merge counts the successfully merged bright spots, and after all merges are complete, the bright spots with a count of 1 are removed. Specifically, when the bright spots of image Repeat1 are merged into TemplateVec, TemplateVec initially contains no bright spots, so the total number of bright spots in TemplateVec equals the number of bright spots on image Repeat1, and every bright spot has a count of 1. When the bright spots of image Repeat2 are merged into TemplateVec, it is first determined, for each Repeat2 bright spot, whether TemplateVec contains a bright spot at a distance of less than 1.05 pixels.
  • FIG. 3 illustrates the above sequencing template construction process, and the circles in the figure indicate bright spots.
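  • The template construction described above (merge bright spots image by image, count how often each position is seen, drop positions seen only once) can be sketched as below. This is a simplified illustration, not the patented implementation: the brute-force nearest-neighbour search, the running-mean coordinate update, and the rule that spots from the same image are never merged with each other are all assumptions, and `build_template` is a hypothetical name.

```python
import math


def build_template(spot_lists, max_dist=1.05):
    """Merge per-image bright spot lists into one template bright spot set.

    spot_lists: one list of (x, y) coordinates per image (e.g. Repeat1..Repeat20).
    Returns [(x, y, count)] for positions seen in more than one image.
    """
    template = []  # each entry is a mutable [x, y, count]
    for spots in spot_lists:
        new_entries = []
        for x, y in spots:
            # Look for the closest template spot within max_dist pixels.
            best, best_d = None, max_dist
            for entry in template:
                d = math.hypot(entry[0] - x, entry[1] - y)
                if d < best_d:
                    best, best_d = entry, d
            if best is None:
                new_entries.append([x, y, 1])  # unseen position, count starts at 1
            else:
                n = best[2]
                # Running mean of coordinates (an assumption), then bump the count.
                best[0] = (best[0] * n + x) / (n + 1)
                best[1] = (best[1] * n + y) / (n + 1)
                best[2] = n + 1
        # New spots join the template only after the whole image is processed.
        template.extend(new_entries)
    # Remove the spots that were seen in a single image only.
    return [(x, y, c) for x, y, c in template if c > 1]
```

  • The nested loops make this quadratic in the number of bright spots; for real images with many spots a spatial index (grid or k-d tree) would be the natural replacement for the inner loop.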
  • the position/coordinates of a merged bright spot can be determined from the centroid coordinates of the bright spots before merging, for example by taking the centroid coordinates of any one of those bright spots, or their average, as the coordinates of the merged bright spot; alternatively, weights can be set for the coordinates of the bright spots before merging to determine the coordinates of the merged bright spot, for example by setting different contribution values (weights) according to the number of bright spots that a bright spot before merging already contains and/or according to which round of sequencing reaction the bright spot before merging comes from. This helps to obtain more accurate bright spot information that reflects the real situation (corresponding to the sequencing template).
  • in one example, a relatively high weight is set for a bright spot before merging that contains a relatively large number of bright spots, and/or a relatively high weight is set for a bright spot before merging that comes from the image of an earlier round of sequencing reaction, so that the information of the merged bright spot objectively and accurately reflects the information before merging, which helps to construct an accurate bright spot set corresponding to the sequencing template and to identify bases accurately.
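  • A weighted choice of merged coordinates can be sketched as a weighted mean. How the weights are derived (from the number of contained bright spots, from the sequencing round, or both) is left open by the text, so the weights are simply passed in here; `weighted_center` is a hypothetical helper.

```python
def weighted_center(spots):
    """Weighted mean coordinate of the bright spots being merged.

    spots: list of (x, y, weight) with positive weights.
    """
    total = sum(w for _, _, w in spots)
    x = sum(px * w for px, _, w in spots) / total
    y = sum(py * w for _, py, w in spots) / total
    return x, y
```

  • With equal weights this reduces to the plain average mentioned in the text; giving a spot that already contains three merged spots a weight of 3 against a new spot with weight 1 pulls the result toward the established position.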
  • the image is a registered image. This helps to obtain the bright spot set corresponding to the sequencing template accurately, and to identify bases accurately.
  • image registration is performed using the following method: performing a first registration of the image to be registered based on a reference image, where the reference image and the image to be registered correspond to the same object and both contain multiple bright spots; the first registration includes determining a first offset between a predetermined area on the image to be registered and the corresponding predetermined area on the reference image, and moving all the bright spots on the image to be registered based on the first offset, to obtain the first registration
  • This image registration method uses two associated registrations, which can be referred to as coarse registration and fine registration, the fine registration using the bright spots on the image; it can quickly achieve high-precision correction of the image based on a small amount of data.
  • it is especially suitable for scenarios that require high-precision image correction, for example single-molecule-level image detection, such as images of sequencing reactions from third-generation sequencing platforms.
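  • The coarse step (determine an offset against the reference, then move every bright spot by it) can be sketched as follows. The excerpt does not say how the offset is computed, so the exhaustive search over small integer shifts used here is purely an assumption, as are the function names and the `max_shift` parameter.

```python
def estimate_offset(ref_spots, moving_spots, max_shift=5):
    """Return the integer (dx, dy) that aligns moving_spots best with ref_spots.

    Tries every shift in [-max_shift, max_shift]^2 and keeps the one for which
    the largest number of shifted spots land on a reference spot pixel.
    """
    ref = {(int(round(x)), int(round(y))) for x, y in ref_spots}
    best, best_hits = (0, 0), -1
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            hits = sum(
                (int(round(x)) + dx, int(round(y)) + dy) in ref
                for x, y in moving_spots
            )
            if hits > best_hits:
                best, best_hits = (dx, dy), hits
    return best


def shift_spots(spots, offset):
    """Move all bright spots by the given (dx, dy) offset."""
    dx, dy = offset
    return [(x + dx, y + dy) for x, y in spots]
```

  • To restrict the estimate to a predetermined area, the two spot lists can be filtered to that area before calling `estimate_offset`, while `shift_spots` is still applied to all bright spots of the image to be registered.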
  • the single-molecule level refers to the size of a single molecule or a few molecules, for example no more than 10, 8, 5, 4 or 3 molecules.
  • the "bright spots" correspond to optical signals of extended bases or base clusters, or to interference signals from other bright substances.
  • the predetermined area on the image may be the entire image or a part of the image.
  • the predetermined area on the image is a part of the image, for example, a 512*512 area in the center of the image.
  • the so-called image center is the center of the field of view. The intersection of the optical axis of the imaging system and the imaging plane can be called the image center point, and the area centered on this center point can be regarded as the image center area.
  • the image to be registered comes from a nucleic acid sequencing platform. Specifically, it comes from a sequencing platform that uses the principle of optical imaging to perform sequence determination.
  • the platform includes an imaging system and a nucleic acid sample carrying system.
  • the nucleic acid molecules to be tested with optical detection labels are fixed in a reactor, which is also called a chip,
  • the chip is loaded on a movable table, and the movable table drives the movement of the chip to realize image acquisition of nucleic acid molecules to be tested at different positions (different fields of view) of the chip.
  • the movement of the optical system and/or the movable stage has precision limitations.
  • the reference image is obtained by construction; it can be constructed at the time the image to be registered is registered, or constructed and saved in advance and called when needed.
  • constructing the reference image includes: acquiring a fifth image and a sixth image, where the fifth image and the sixth image correspond to the same field of view/object as the image to be registered; performing coarse registration of the sixth image based on the fifth image, including determining the offset between the sixth image and the fifth image and moving the sixth image based on the offset, to obtain the coarsely registered sixth image; and combining the fifth image and the coarsely registered sixth image to obtain the reference image; the fifth image and the sixth image both contain multiple bright spots.
  • in this way, constructing a reference image that contains more, or relatively more complete, information and using it as the reference for deviation correction helps to achieve more accurate image registration.
  • multiple images are used to construct the reference image, which helps the reference image to capture the complete bright spot information of the corresponding nucleic acid molecules, which in turn benefits bright-spot-based image correction and, further, the acquisition of the bright spot set corresponding to the sequencing template and base recognition.
  • the fifth image and the sixth image are respectively from the same field of view at different times of the nucleic acid sequencing reaction (sequencing reaction).
  • a round of sequencing reaction may include multiple base extension reactions, such as in single-color sequencing: the reaction substrates (nucleotide analogues) corresponding to the four types of bases all carry the same fluorescent dye, and one round of sequencing reaction includes four base extension reactions (4 repeats).
  • one base extension reaction includes one image acquisition, and the fifth image and the sixth image come from the same field of view in different base extension reactions. In this way, the reference image obtained by processing and integrating the information of the fifth image and the sixth image is used as the reference for correction, which facilitates more accurate image correction.
  • a single-molecule two-color sequencing reaction is performed, in which two of the reaction substrates (nucleotide analogs) corresponding to the four types of bases carry one fluorescent dye and the other two carry the other fluorescent dye.
  • a round of sequencing reaction includes two base extension reactions. Two types of base reaction substrates with different dyes undergo binding reactions in one base extension reaction. For one field of view, one base extension reaction includes two image acquisitions at different excitation wavelengths. The fifth image and the sixth image are from the same field of view in different base extension reactions or at different excitation wavelengths in the same base extension reaction. In this way, the reference image obtained by processing and integrating the information of the fifth image and the sixth image is used as a reference for correction, which facilitates more accurate image correction.
  • a round of sequencing reaction includes a base extension reaction, such as the two-color sequencing reaction of the second-generation sequencing platform.
  • the four types of base reaction substrates respectively carry dye a, carry dye b, carry both dye a and dye b, and carry no dye, where dye a and dye b have different emission wavelengths after excitation; or, for example, in four-color sequencing, the four types of base reaction substrates (such as nucleotide analogs) carry dye a, dye b, dye c, and dye d respectively, where dyes a, b, c, and d have different emission wavelengths after excitation. The four types of reaction substrates react in the same base extension reaction,
  • so that one round of sequencing reaction is realized, and the fifth image and the sixth image are from the same field of view in different rounds of sequencing reactions or under different excitation wavelengths in the same round of sequencing reaction. In this way, the reference image obtained by processing and integrating the information of the fifth image and the sixth image is used as a reference for correction, which facilitates more accurate image correction.
  • the fifth image and/or the sixth image may be one image or multiple images.
  • the fifth image is the first image
  • the sixth image is the second image.
  • it further includes using a seventh image and an eighth image to construct the reference image, and the image to be registered, the fifth image, the sixth image, the seventh image, and the eighth image come from the sequencing reaction.
  • the fifth image, sixth image, seventh image, and eighth image correspond to the four types of base extension reactions A, T/U, G, and C, respectively.
  • the reference image construction also includes: performing coarse registration on the seventh image based on the fifth image, including determining the offset of the seventh image relative to the fifth image and moving the seventh image based on the offset to obtain the coarsely registered seventh image; performing coarse registration on the eighth image based on the fifth image, including determining the offset of the eighth image relative to the fifth image and moving the eighth image based on the offset to obtain the coarsely registered eighth image; and
  • combining the fifth image, the coarsely registered sixth image, the coarsely registered seventh image, and the coarsely registered eighth image to obtain the reference image.
  • the implementation of the first registration is not limited.
  • Fourier transform and frequency-domain registration can be used to determine the first offset.
  • for example, the two-dimensional discrete Fourier transform used in the phase-only correlation function (Phase-Only Correlation Function) can be employed.
  • the first registration/coarse registration can reach an accuracy of 1 pixel. In this way, the first offset can be determined quickly and accurately, and/or a reference image that is conducive to accurate correction can be constructed.
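The frequency-domain coarse registration described above can be sketched as follows. This is a minimal illustration of phase-only correlation, assuming two equally sized single-channel images related by a pure circular translation; the function name `poc_offset` and the (row, column) ordering of the returned shift are choices made for this example and are not taken from the source.

```python
import numpy as np

def poc_offset(reference, moving):
    """Estimate the integer (dy, dx) shift of `moving` relative to `reference`
    using phase-only correlation (accuracy: 1 pixel)."""
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moving)
    cross = f_mov * np.conj(f_ref)
    # Discard the magnitude so that only the phase difference remains.
    cross /= np.maximum(np.abs(cross), 1e-12)
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative shifts (FFT wrap-around).
    shift = [p - n if p > n // 2 else p for p, n in zip(peak, corr.shape)]
    return int(shift[0]), int(shift[1])
```

Moving the image back by the returned shift (for example with `np.roll(moving, (-dy, -dx), axis=(0, 1))`) gives the coarsely registered image in this simplified setting.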
  • the reference image and the image to be registered are binarized images. In this way, it is beneficial to reduce the amount of calculation and quickly correct the deviation.
  • the image to be corrected and the reference image are both binarized images, that is, each pixel in the image is either a or b; for example, a is 1 and b is 0, and a pixel marked 1 is brighter, that is, of higher intensity, than a pixel marked 0;
  • the reference image is constructed using the images repeat1, repeat2, repeat3 and repeat4 of the four base extension reactions of a round of sequencing reaction, and the fifth image and the sixth image are selected from any one, two, or three of the images repeat1-4.
  • the fifth image is the image repeat1, and the images repeat2, repeat3, and repeat4 are the sixth images.
  • the images repeat2-4 are sequentially coarsely registered to obtain the coarsely registered images repeat2-4;
  • the image repeat1 and the coarsely registered image repeat2-4 are used to obtain the reference image.
  • merging images here means merging the overlapping bright spots in the images.
  • two bright spots with a distance of not more than 1.5 pixels on the two images are set as overlapping bright spots.
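A minimal sketch of this merging rule, assuming each image has already been reduced to a list of (x, y) bright spot coordinates in a common coordinate system. The function name `merge_spots` and the choice to keep the first image's coordinate for an overlapping pair are assumptions made for illustration.

```python
import math

def merge_spots(spots_a, spots_b, max_dist=1.5):
    """Merge two bright spot lists; spots no more than `max_dist` pixels
    apart are treated as one overlapping bright spot."""
    merged = list(spots_a)
    for xb, yb in spots_b:
        overlaps = any(
            math.hypot(xb - xa, yb - ya) <= max_dist for xa, ya in spots_a
        )
        if not overlaps:
            merged.append((xb, yb))
    return merged
```

For dense images a spatial index (grid or k-d tree) would avoid the quadratic comparison; the brute-force loop is kept here for clarity.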
  • the central area of the synthesized image of 4 repeats is used as the reference image.
  • 1) repeat5 is a binarized image. Take the center of the image, for example a 512*512 area, together with the center of the image synthesized from repeat1-4 (the central 512*512 area of the reference image), perform a two-dimensional discrete Fourier transform, and use frequency-domain registration to obtain the offset (x0, y0). This achieves coarse image registration; x0 and y0 can reach an accuracy of 1 pixel. 2) The coarsely registered image and the reference image are merged based on the bright spots on the images, which includes calculating
  • the offset (x1, y1) between the overlapping bright spots in the central area of the repeat5 image and the corresponding area of the reference image: offset (x1, y1) = coordinate position of the bright spot on the image to be corrected - coordinate position of the corresponding bright spot on the reference image.
  • detecting and identifying bright spots on the image includes using a k1*k2 matrix to perform bright spot detection on the image: a matrix whose center pixel value is not less than any non-center pixel value of the matrix is determined to correspond to a candidate bright spot, and it is then determined whether the candidate bright spot is a bright spot. k1 and k2 are both odd numbers greater than 1, and the k1*k2 matrix contains k1*k2 pixels.
  • the image here is, for example, the image to be registered, an image used to construct the reference image, and the like.
  • This method to detect bright spots on the image can quickly and effectively realize the detection of bright spots (spots or peaks) on the image, especially for images collected from nucleic acid sequence determination reactions.
  • This method places no special restrictions on the image to be detected, that is, the original input data. It is suitable for processing and analyzing images generated by any platform that uses optical detection principles for nucleic acid sequence determination, including but not limited to second- and third-generation sequencing. It features high accuracy and high efficiency and can obtain more representative sequence information from the image. It is especially advantageous for random images and for signal recognition with high accuracy requirements.
  • the image comes from a nucleic acid sequence determination reaction, and the nucleic acid molecule has an optically detectable label, such as a fluorescent label.
  • the fluorescent molecule can be excited to emit fluorescence under the irradiation of a laser of a specific wavelength, and the image is collected by the imaging system.
  • the captured image includes light spots/bright spots that may correspond to the locations of fluorescent molecules. Understandably, when imaging at the focal plane position, the bright spot corresponding to a fluorescent molecule in the collected image is smaller and brighter; when imaging at a non-focal-plane position, the bright spot corresponding to a fluorescent molecule is larger and dimmer.
  • a single molecule here means a small number of molecules, for example no more than 10, 8, 6, 5, or 3 molecules, such as one, two, three, four, five, six, or eight.
  • the center pixel value of the matrix is greater than the first preset value
  • any pixel value other than the center of the matrix is greater than the second preset value
  • the first preset value and the second preset value are related to the average pixel value of the image.
  • a k1*k2 matrix may be used to perform traversal detection on the image, and the so-called first preset value and/or second preset value settings are related to the average pixel value of the image.
  • the pixel value is the same as the grayscale value.
  • in the k1*k2 matrix, k1 and k2 may be equal or unequal.
  • the relevant parameters of the imaging system are: the objective lens is 60x; the size of the electronic sensor is 6.5 μm; for an image formed by the microscope and then the electronic sensor, the smallest size that can be seen is 0.1 μm; the obtained image or the input image can be a 16-bit grayscale or color image of 512*512, 1024*1024, or 2048*2048.
  • based on statistics from a large amount of image processing, the inventors took the first preset value to be 1.4 times the average pixel value of the image and the second preset value to be 1.1 times the average pixel value of the image, which eliminates interference and yields bright spot detection results that originate from the optically detectable labels.
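The candidate detection rule above can be sketched as a window traversal. This is an illustrative reading, not the patented implementation: it assumes a grayscale NumPy array, interprets "any pixel value other than the center" as every non-center pixel in the window, and uses the hypothetical name `candidate_spots`; the 1.4 and 1.1 factors are the values quoted in the text.

```python
import numpy as np

def candidate_spots(img, k1=3, k2=3, center_factor=1.4, neighbor_factor=1.1):
    """Return (row, col) centers of k1*k2 windows that qualify as candidates."""
    img = np.asarray(img, dtype=float)
    mean = img.mean()
    first_preset = center_factor * mean     # threshold for the center pixel
    second_preset = neighbor_factor * mean  # threshold for non-center pixels
    r1, r2 = k1 // 2, k2 // 2
    found = []
    for y in range(r1, img.shape[0] - r1):
        for x in range(r2, img.shape[1] - r2):
            win = img[y - r1:y + r1 + 1, x - r2:x + r2 + 1]
            center = img[y, x]
            others = np.delete(win.ravel(), win.size // 2)  # drop the center
            if (center >= others.max()
                    and center > first_preset
                    and others.min() > second_preset):
                found.append((y, x))
    return found
```

A vectorized maximum filter would be faster on 2048*2048 images; the explicit loops are kept so each condition maps directly onto the text.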
  • the size of the candidate bright spot, its degree of similarity to an ideal bright spot, and/or its intensity can be used to further screen and judge the candidate bright spot.
  • the size of the connected domain corresponding to the candidate bright spot is used to quantitatively reflect the size of the candidate bright spot on the image, so as to screen and determine whether the candidate bright spot is the desired bright spot.
  • connected pixels in a k1*k2 matrix whose values are greater than the average pixel value are defined as the connected domain corresponding to a candidate bright spot. In this way, bright spots that correspond to the labeled molecules and are suitable for subsequent sequence identification can be effectively obtained, and nucleic acid sequence information can be obtained.
  • two or more adjacent pixels that are not less than the average pixel value are called connected pixels, as shown in Figure 5.
  • the bold, enlarged value indicates the center of the matrix corresponding to the candidate bright spot, and the thick-lined frame indicates the 3*3 matrix corresponding to the candidate bright spot.
  • the third preset value may be determined from the connected domain sizes of all the candidate bright spots on the image. For example, the connected domain size of each candidate bright spot on the image is calculated, and the average connected domain size, which represents a characteristic of the image, is taken as the third preset value; as another example,
  • the connected domain sizes corresponding to the candidate bright spots on the image are sorted from small to large, and the 50th, 60th, 70th, 80th, or 90th percentile connected domain size is taken as the third preset value. In this way, bright spot information can be effectively obtained, which facilitates subsequent identification of nucleic acid sequences.
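A small sketch of how a connected domain size and a percentile-based third preset value might be computed. The 4-neighbour adjacency, the requirement that the domain contain the window center, and the names `connected_size` and `third_preset_value` are assumptions for illustration; the source does not fix these details in this passage.

```python
import numpy as np

def connected_size(window, mean_value):
    """Count pixels above `mean_value` that are 4-connected to the window center."""
    window = np.asarray(window)
    h, w = window.shape
    start = (h // 2, w // 2)
    if window[start] <= mean_value:
        return 0
    seen = {start}
    stack = [start]
    while stack:
        y, x = stack.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen
                    and window[ny, nx] > mean_value):
                seen.add((ny, nx))
                stack.append((ny, nx))
    return len(seen)

def third_preset_value(sizes, percentile=50):
    """Percentile of the connected domain sizes of all candidate bright spots."""
    return float(np.percentile(sizes, percentile))
```

Candidates whose connected domain size falls below the chosen percentile would then be screened out, under this reading.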
  • the fourth preset value can be determined from the Score values of all candidate bright spots on the image. For example, when the number of candidate bright spots on the image exceeds a certain number that meets the statistical requirement for quantity, for example more than 30, the Score values of all candidate bright spots on the image can be calculated and sorted in ascending order, and the fourth preset value can be set to the 50th, 60th, 70th, 80th or 90th quantile Score value,
  • so that candidate bright spots with Score values below the 50th, 60th, 70th, 80th or 90th quantile Score value can be excluded, which helps to obtain the target bright spots effectively and benefits accurate recognition of the subsequent base sequence.
  • the basis for performing this processing or the screening setting is that, generally, the bright spot with a large difference in intensity/pixel value between the center and the edge and converging is the bright spot corresponding to the location of the molecule to be detected.
  • the number of candidate bright spots on the image is greater than 50, greater than 100, or greater than 1,000.
  • candidate bright spots are screened in combination with morphology and intensity/brightness.
  • identifying and detecting bright spots includes: preprocessing an image to obtain a preprocessed image, the image being at least one selected from the first image, the second image, the third image, the fourth image, the fifth image, the sixth image, the seventh image, and the eighth image; determining a critical value to simplify the preprocessed image, including assigning a first preset value to the pixel value of each pixel on the preprocessed image that is smaller than the critical value and assigning a second preset value to the pixel value of each pixel on the preprocessed image that is not less than the critical value, to obtain a simplified image; determining the first bright spot detection threshold c1 based on the preprocessed image; and identifying candidate bright spots on the image based on the preprocessed image and the simplified image, including determining that a pixel matrix that meets at least two of the following conditions i)-iii) is a candidate bright spot: i) in the preprocessed image, the pixel
  • using this method to detect bright spots on an image, including the judgment conditions or combinations of judgment conditions determined by the inventor through training on a large amount of data, can quickly and effectively detect bright spots on the image, especially for images collected from nucleic acid sequence determination reactions.
  • This method places no special restrictions on the image to be detected, that is, the original input data. It is suitable for processing and analyzing images generated by any platform that uses optical detection principles for nucleic acid sequence determination, including but not limited to second- and third-generation sequencing. It features high accuracy and high efficiency and can obtain more representative sequence information from the image. It is especially advantageous for random images and for signal recognition with high accuracy requirements.
  • the pixel value is the same as the grayscale value. If the image is a color image, each pixel has three pixel values; the color image can be converted into a grayscale image before bright spot detection to reduce the computation and complexity of the detection process. The conversion may use, without limitation, a floating-point algorithm, an integer method, a shift method, or an averaging method.
  • preprocessing the image includes: using an open operation to determine the background of the image; based on the background, using a top hat operation to convert the image into a first image; performing Gaussian blur processing on the first image to obtain a second image; and sharpening the second image to obtain the preprocessed image.
  • the image can be effectively denoised, or the signal-to-noise ratio of the image can be improved, which is conducive to the accurate detection of bright spots.
  • preprocessing can be performed with reference to the method disclosed in CN107945150A (image processing method and system for gene sequencing); specifically, the open operation is a morphological operation, namely erosion followed by dilation.
  • erosion shrinks the foreground (the part of interest), and dilation enlarges it; the open operation can be used to eliminate small objects, separate objects at thin connecting points, and smooth the boundaries of larger objects without significantly changing their area.
  • the size of the structure element p1*p2 (the basic template used to process the image) for the image opening operation is not particularly limited, and p1 and p2 are odd numbers.
  • the structure element p1*p2 may be 15*15, 31*31, etc., and finally a preprocessed image that is beneficial for subsequent processing and analysis can be obtained.
  • the top hat operation is often used to separate patches that are brighter than their neighbors (bright spots). When an image has a large background and the small objects are fairly regular, the top hat operation can be used for background extraction.
  • performing the top hat transformation on the image includes first performing an open operation on the image, and then subtracting the result of the open operation from the original image to obtain the first image, that is, the image after the top hat transformation.
  • the image after the open operation is subtracted from the original image, and the resulting image highlights the brighter area than the area around the original image.
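The open operation and top hat transform described above can be sketched with plain NumPy. This is an illustrative stand-in that assumes a square, flat structuring element and a grayscale image; in practice a library routine such as OpenCV's morphology functions would normally be used. The names `opening` and `top_hat` are chosen for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _window_filter(img, size, func, pad_value):
    # Apply `func` (min or max) over every size*size neighbourhood.
    pad = size // 2
    padded = np.pad(img, pad, mode="constant", constant_values=pad_value)
    return func(sliding_window_view(padded, (size, size)), axis=(-2, -1))

def opening(img, size):
    """Grayscale opening: erosion (local minimum) followed by dilation (local maximum)."""
    img = np.asarray(img, dtype=float)
    eroded = _window_filter(img, size, np.min, np.inf)
    return _window_filter(eroded, size, np.max, -np.inf)

def top_hat(img, size):
    """Original image minus its opening; keeps features smaller than the element."""
    img = np.asarray(img, dtype=float)
    return img - opening(img, size)
```

With a 3*3 element, an isolated bright pixel on a flat background survives the top hat transform while the background is removed.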
  • the operation depends on the size of the selected kernel, which can be considered related to the expected size of the bright spots. If the bright spots are not of the expected size, the processing produces many small bumps across the entire image,
  • for example in an out-of-focus image, in which the bright spots are smeared into blobs.
  • the expected size of the bright spot, that is, the size of the selected kernel, is 3*3, and the resulting top hat transformed image is beneficial to subsequent further denoising.
  • Gaussian blur, also known as Gaussian filtering, is a linear smoothing filter that is suitable for eliminating Gaussian noise and is widely used for denoising in image processing.
  • Gaussian filtering is a weighted averaging of the entire image: the value of each pixel is obtained as a weighted average of itself and the other pixel values in its neighborhood.
  • the specific operation of Gaussian filtering is to scan each pixel in the image with a template (also called a convolution kernel or mask) and replace the value of the pixel at the center of the template with the weighted average gray value of the pixels in the neighborhood determined by the template.
  • Gaussian blur processing is performed on the first image
  • the Gaussian filter function GaussianBlur in OpenCV is used
  • the Gaussian distribution parameter sigma is 0.9
  • the two-dimensional filter matrix (convolution kernel) used is 3*3
  • the second image, that is, the image after Gaussian filtering,
  • is sharpened, for example by two-dimensional Laplacian sharpening. Visually, after this processing the edges are sharpened and the blurring introduced by the Gaussian filter is recovered.
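The blur-then-sharpen step can be sketched without OpenCV as follows. The kernel size 3*3 and sigma 0.9 come from the text; the 4-neighbour Laplacian kernel, the edge-replicating border, and the names `gaussian_kernel`, `convolve2d`, and `blur_and_sharpen` are assumptions made for this illustration, and the source itself uses OpenCV's GaussianBlur.

```python
import numpy as np

def gaussian_kernel(size=3, sigma=0.9):
    """Normalized size*size Gaussian kernel."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def convolve2d(img, kernel):
    """Same-size filtering with edge replication (kernels here are symmetric)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def blur_and_sharpen(img):
    """Gaussian blur followed by Laplacian sharpening (image minus its Laplacian)."""
    img = np.asarray(img, dtype=float)
    blurred = convolve2d(img, gaussian_kernel())
    return blurred - convolve2d(blurred, LAPLACIAN)
```

A flat region is left unchanged by both steps, while the peak of a blurred spot is raised again by the sharpening.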
  • simplifying the preprocessed image includes: determining a critical value based on the background and the preprocessed image; and comparing the pixel value of each pixel on the preprocessed image with the critical value.
  • the pixel values of pixels on the preprocessed image that are smaller than the critical value are assigned the first preset value, and the pixel values of pixels that are not less than the critical value are assigned the second preset value, to obtain a simplified image.
  • simplifying the preprocessed image, for example by binarization, benefits the subsequent accurate detection of bright spots, the subsequent accurate identification of bases, the acquisition of high-quality data, and so on.
  • obtaining the simplified image includes: dividing the sharpened result obtained after preprocessing by the result of the open operation to obtain a set of values corresponding to the image pixels; using the set of values to determine the binarization The critical value of the preprocessed image.
  • the group of values can be arranged in ascending order of magnitude, and the value corresponding to the 20th, 30th, or 40th percentile in the group of values can be used as the binarization critical value/threshold value. In this way, the obtained binarized image facilitates the subsequent accurate detection and recognition of bright spots.
  • the structure element of the open operation used in image preprocessing is p1*p2. Dividing the preprocessed image (the sharpened result) by the result of the open operation yields a set of arrays/matrices of the same size as the structure element, p1*p2.
  • within each array, the p1*p2 values are arranged in ascending order, and the value at the 30th percentile of the array is taken as the binarization critical value/threshold of that area (numerical matrix).
  • in this way, a threshold is determined separately for each area of the image; the resulting binarization suppresses noise while highlighting the required information, which is conducive to the accurate detection of subsequent bright spots.
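A minimal sketch of the area-by-area binarization just described, assuming non-overlapping square tiles, 0 and 1 as the first and second preset values, and a small floor on the background to avoid division by zero. The function name `simplify` and these defaults are illustrative choices rather than details confirmed by the source.

```python
import numpy as np

def simplify(sharpened, background, block=15, percentile=30):
    """Binarize the ratio sharpened/background tile by tile."""
    sharpened = np.asarray(sharpened, dtype=float)
    background = np.asarray(background, dtype=float)
    ratio = sharpened / np.maximum(background, 1e-9)
    out = np.zeros(ratio.shape, dtype=np.uint8)
    for y in range(0, ratio.shape[0], block):
        for x in range(0, ratio.shape[1], block):
            tile = ratio[y:y + block, x:x + block]
            critical = np.percentile(tile, percentile)  # per-tile critical value
            # Pixels not less than the critical value get 1, the rest stay 0.
            out[y:y + block, x:x + block] = (tile >= critical).astype(np.uint8)
    return out
```

Here `block` plays the role of the p1*p2 structure element size, so 15 or 31 would match the examples given for the open operation.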
  • the Otsu method is used to determine the first bright spot detection threshold.
  • the Otsu method (OTSU algorithm) can also be called the maximum between-class variance method.
  • the Otsu method segments the image by maximizing the between-class variance, which means that the probability of misclassification is small and the accuracy is high. Suppose the threshold separating the foreground and background of the preprocessed image is T(c1); the proportion of pixels belonging to the foreground in the whole image is w0, with average gray level μ0; the proportion of pixels belonging to the background in the whole image is w1, with average gray level μ1.
  • the total average gray level of the image to be processed is denoted as μ
  • the variance between classes is denoted as var
  • $\mu = w_0 \mu_0 + w_1 \mu_1$
  • $\mathrm{var} = w_0 (\mu_0 - \mu)^2 + w_1 (\mu_1 - \mu)^2$
  • the traversal method is used to obtain the segmentation threshold T that maximizes the variance between classes, which is the first bright spot detection threshold c1 sought.
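The traversal can be sketched as below, following the definitions of w0, w1, μ0, μ1, μ, and var given above. The function name `otsu_threshold` and the convention that pixels with value at least T count as foreground are assumptions for this example; for 16-bit images `levels` would be 65536.

```python
import numpy as np

def otsu_threshold(img, levels=256):
    """Return the threshold T that maximizes the between-class variance."""
    values = np.asarray(img).ravel()
    hist = np.bincount(values, minlength=levels).astype(float)
    prob = hist / hist.sum()
    grey = np.arange(levels)
    mu = (prob * grey).sum()                 # total average gray level
    best_t, best_var = 0, -1.0
    for t in range(1, levels):
        w1 = prob[:t].sum()                  # background proportion (value < t)
        w0 = 1.0 - w1                        # foreground proportion (value >= t)
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu1 = (prob[:t] * grey[:t]).sum() / w1
        mu0 = (prob[t:] * grey[t:]).sum() / w0
        var = w0 * (mu0 - mu) ** 2 + w1 * (mu1 - mu) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

The input must hold non-negative integer gray levels; a cumulative-sum formulation would make the search linear in `levels`, but the direct form mirrors the equations more closely.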
  • the candidate bright spots on the image are identified based on the preprocessed image and the simplified image, including determining that a pixel matrix that meets all three conditions i)-iii) simultaneously is a candidate bright spot. In this way, the accuracy of the subsequent determination of the nucleic acid sequence based on the bright spot information and the quality of the off-machine data can be effectively improved.
  • the conditions that need to be met for the determination of candidate bright spots include i), and r1 and r2 may be equal or unequal.
  • the relevant parameters of the imaging system are: the objective lens is 60x; the size of the electronic sensor is 6.5 μm; for an image formed by the microscope and then the electronic sensor, the smallest size that can be seen is 0.1 μm; the obtained image or the input image can be a 16-bit grayscale or color image of 512*512, 1024*1024, or 2048*2048.
  • the conditions that need to be met for the determination of candidate bright spots include ii).
  • the pixel value of the center pixel of the pixel matrix is the second preset value, and the number of connected pixels of the pixel matrix is greater than (2/3)*r1*r2, that is, the pixel value of the central pixel is greater than the critical value and the connected pixels make up more than two-thirds of the matrix.
  • the pixel matrix does not meet the condition b) and is not a candidate bright spot.
  • the conditions that need to be met for the determination of candidate bright spots include iii).
  • g2 is the corrected pixel value of the m1*m2 range, that is, the corrected sum of the pixel values in the m1*m2 range.
  • determining whether the candidate bright spot is a bright spot further includes: determining a second bright spot detection threshold based on the preprocessed image, and determining that a candidate bright spot whose pixel value is not less than the second bright spot detection threshold is a bright spot.
  • the pixel value of the position where the coordinates of the candidate bright spot is located is used as the pixel value of the candidate bright spot.
  • by using the second bright spot detection threshold determined from the preprocessed image to further screen the candidate bright spots, at least some spots that resemble bright spots in brightness (intensity) and/or shape but are more likely to be image background or interference can be eliminated,
  • which is conducive to the accurate recognition of subsequent bright-spot-based sequences and improves the quality of the off-machine data.
  • the center of gravity method may be used to obtain the coordinates of the candidate bright spot, including sub-pixel coordinates.
  • the pixel value/gray value of the coordinate position of the candidate bright spot is calculated by bilinear interpolation.
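The two calculations mentioned above can be sketched in pure Python. Both functions assume the image or window is a list of rows and use (row, column) coordinates; the names `centroid` and `bilinear` and the clamping at the image border are choices made for this illustration.

```python
import math

def centroid(window, origin=(0, 0)):
    """Intensity-weighted center of gravity of a small window, as sub-pixel
    (row, col) coordinates offset by the window's top-left `origin`."""
    total = sum(sum(row) for row in window)
    cy = sum(y * v for y, row in enumerate(window) for v in row) / total
    cx = sum(x * v for row in window for x, v in enumerate(row)) / total
    return origin[0] + cy, origin[1] + cx

def bilinear(img, y, x):
    """Gray value at a sub-pixel position by bilinear interpolation."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(img) - 1)       # clamp at the bottom border
    x1 = min(x0 + 1, len(img[0]) - 1)    # clamp at the right border
    fy, fx = y - y0, x - x0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy
```

The value returned by `bilinear` at the centroid position would then serve as the pixel value of the candidate bright spot when it is compared with the second bright spot detection threshold.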
  • determining whether the candidate bright spot is a bright spot includes: dividing the preprocessed image into a set of blocks of a predetermined size, and sorting the pixel values of the pixels in each block to determine the second bright spot detection threshold corresponding to the block; for a candidate bright spot located in a block, the candidate bright spot is determined to be a bright spot if its pixel value is not less than the second bright spot detection threshold corresponding to the block. In this way, differences between areas of the image, such as an overall drop in light intensity, are taken into account and bright spots are further detected and identified area by area, which helps identify bright spots accurately and obtain more bright spots.
  • the pre-processed image is divided into a set of blocks of predetermined size, and the blocks may or may not overlap. In one example, there is no overlap between blocks.
  • the size of the preprocessed image is not less than 512*512, for example 512*512, 1024*1024, 1800*1800, or 2056*2056.
  • the predetermined block size can be set to 200*200, which is conducive to quickly calculating and identifying bright spots.
  • the pixel values of the pixels in each block are arranged in ascending order, and p10+(p10-p1)*4.1 is taken as
  • the second bright spot detection threshold corresponding to the block, that is, the background of the block, where
  • p1 represents the pixel value at the 1st percentile and
  • p10 represents the pixel value at the 10th percentile.
  • the threshold is a relatively stable threshold obtained by the inventor through training and testing on a large amount of data. It adapts to images acquired in various optical environments, with different laser powers and/or various bright spot densities. Screening bright spots with it can eliminate a large number of non-target bright spots, which is conducive to rapid subsequent analysis and accurate results.
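The block-wise threshold can be sketched as follows, reading p1 and p10 as the 1st and 10th percentile pixel values of each block. The non-overlapping tiling, the dictionary keyed by block index, and the names `block_thresholds` and `filter_candidates` are assumptions for illustration; candidates are taken to be (row, col, value) tuples.

```python
import numpy as np

def block_thresholds(img, block=200, factor=4.1):
    """Second bright spot detection threshold p10 + (p10 - p1) * factor per block."""
    img = np.asarray(img, dtype=float)
    thresholds = {}
    for y in range(0, img.shape[0], block):
        for x in range(0, img.shape[1], block):
            tile = img[y:y + block, x:x + block]
            p1, p10 = np.percentile(tile, [1, 10])
            thresholds[(y // block, x // block)] = p10 + (p10 - p1) * factor
    return thresholds

def filter_candidates(candidates, thresholds, block=200):
    """Keep candidates whose value is not less than their block's threshold."""
    return [
        (y, x, v) for y, x, v in candidates
        if v >= thresholds[(int(y) // block, int(x) // block)]
    ]
```

Because each block gets its own threshold, a gradual fall-off of illumination toward the edge of the field of view lowers the threshold there instead of discarding dim but genuine spots.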
  • FIG. 7 is a schematic diagram of the comparison of the bright spot detection results before and after the processing, that is, the schematic diagram of the bright spot detection results before and after excluding the area background.
  • the upper half of Figure 7 is the bright spot detection result after the processing, and the lower half is the bright spot detection result without this processing; the spots marked by crosses are candidate bright spots or bright spots.
  • the coordinates of the bright spots in the bright spot set corresponding to the sequencing template are mapped to the image to be inspected; for example, the bright spot set corresponding to the sequencing template and
  • the image to be inspected are both placed in the coordinate system of the image obtained in the first round of sequencing reaction, and the coordinates of each bright spot in the bright spot set corresponding to the sequencing template are marked on the image to be inspected.
  • the coordinates of each bright spot can be determined by the center of gravity method or the like. In this way, the coordinates of the corresponding position on the image to be inspected can be quickly and accurately determined.
  • the method of determining the intensity of a certain position on the image is not limited.
  • bilinear interpolation, quadratic function interpolation, or quadratic spline interpolation can be used to calculate the sub-pixel value/gray value of the position as the intensity of the position.
  • the intensity of the corresponding coordinate position of the image to be inspected in S4 may be an absolute intensity, for example the pixel value of the position, or a relative intensity, for example a value derived from the pixel value of the position, such as a value obtained after noise reduction or background removal of the image to be inspected and/or a value that takes the neighboring pixels into account.
  • the preset threshold corresponds to the image to be inspected, and one preset threshold corresponds to one or more images to be inspected; that is, when the base recognition method is applied to one or more images to be inspected,
  • a preset threshold can be shared.
  • the intensity of the corresponding coordinate position of the image to be inspected is a relative intensity; for example, the relative intensity is determined from the absolute intensity of the corresponding coordinate position and the background intensity of the area where the corresponding coordinate position is located.
  • the area where the corresponding coordinate position is located is an area containing the position; preferably, it is an area of x1*y1 pixels that need not be strictly centered on the position.
  • both x1 and y1 are natural numbers, and x1*y1 is not less than 100, preferably not less than 1000. In this way, it is conducive to fast and accurate base recognition.
  • the background intensity of the area where the corresponding coordinate position is located is determined as follows: the pixels in the x1*y1 area where the corresponding coordinate position is located are sorted by pixel value to obtain
  • the distribution curve of the number of pixels in the x1*y1 area; the background intensity of the area is then determined from the distribution curve.
  • the sorting can be ascending or descending. The following takes the ascending result as an example; from it, those skilled in the art can derive the curve sorted in descending order and the related parameters or conditions determined from that curve, and calculate the background intensity of the area.
  • Figures 8 and 9 show the 300*300 area arranged in ascending order according to the pixel value, that is, the distribution curve of 90,000 pixels.
  • the abscissa is the pixel (pixel value)
  • the ordinate is the number of pixels.
  • Figures 8 and 9 are taken from the edge of the field of view and the center of the field of view of the same image. The number of background pixels on the image is symmetrically distributed with respect to the pixel value, obeying or approximately obeying a normal distribution, as shown in Figure 8.
  • the distribution of the curve widens, the peak shifts relatively to the left, and the decline on the right side slows, that is, the right side shows a tail, as shown in Figure 9.
  • other interfering bright spots in the area, such as bright spots of abnormal intensity or with large differences in distribution, also make the curve tend toward an asymmetric distribution, including abnormal bumps or depressions at or near the peak or the right trough (that is, departures from the original trend).
  • the highest-frequency pixel value of the distribution curve, that is, the peak, is taken as the background intensity of the region where the corresponding coordinate position is located; the relatively stable highest-frequency pixel value represents the background intensity of the region, which is conducive to the subsequent quick, simple, and accurate identification of bases.
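Taking the peak of the pixel-value distribution as the regional background can be sketched as follows. This is a minimal illustration only: the function name `background_from_histogram` and the list-of-rows image representation are assumptions, not part of the source, and ties between equally frequent values are broken toward the smaller pixel value by choice.

```python
from collections import Counter


def background_from_histogram(region):
    """Return the most frequent pixel value of a 2-D region (list of rows).

    The region plays the role of the x1*y1 area around a template position.
    """
    counts = Counter(value for row in region for value in row)
    # Highest count wins; among equal counts, prefer the smaller pixel value.
    peak_value, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return peak_value


region = [
    [10, 11, 10],
    [12, 10, 11],
    [200, 10, 11],  # 200 is a bright spot and does not move the peak
]
background = background_from_histogram(region)
```

For 16-bit images the raw histogram can be sparse, so in practice the values would probably be binned first; that step is left out here.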
  • the method of estimating or determining the background intensity of the region is not limited in this embodiment; for example, the opening operation in OpenCV can be used.
  • based on induction, testing, and verification over a large amount of image data, the inventors developed a formula for determining the peak pixel value of this type of distribution curve, that is, the background intensity I_block of the region where the corresponding coordinate position is located.
  • I_block = I_j1 + (I_j2 - I_j3) × t1, where I_j1, I_j2, and I_j3 are the pixel values corresponding to the j1th, j2th, and j3th percentiles of the distribution curve, respectively; j1, j2, and j3 are all integers less than 50 and not less than 1, and j2 > 8 + j3; t1 is the first correction coefficient, and the value of t1 is determined by j1, j2, and j3.
  • the peak pixel value estimated by this formula is relatively reliable, and is suitable for images generated by various sequencing platforms, especially for images generated by single-molecule sequencing platforms.
  • j1 is selected from [1,40]
  • j2 is selected from [6,40]
  • j3 is selected from [1,30], and 40 ≤ j1 + (j2 - j3) × t1 ≤ 50; in this way, the peak pixel value can be estimated more accurately, which is conducive to accurate base identification.
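The percentile formula I_block = I_j1 + (I_j2 - I_j3) × t1 can be sketched as below. The particular values j1 = 20, j2 = 30, j3 = 10, and t1 = 1.25 are illustrative assumptions chosen only to satisfy the stated constraints (j2 > 8 + j3 and 40 ≤ j1 + (j2 - j3) × t1 ≤ 50); the source does not give concrete values. The nearest-rank percentile definition and the function names are also assumptions.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile (p in 1..100) of an ascending-sorted list."""
    rank = max(1, -(-p * len(sorted_values) // 100))  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def estimate_background(region, j1=20, j2=30, j3=10, t1=1.25):
    """Estimate I_block = I_j1 + (I_j2 - I_j3) * t1 for a 2-D region."""
    # Constraints stated in the text; the default parameters satisfy them.
    assert j2 > 8 + j3
    assert 40 <= j1 + (j2 - j3) * t1 <= 50
    values = sorted(v for row in region for v in row)  # ascending sort
    i_j1, i_j2, i_j3 = (percentile(values, j) for j in (j1, j2, j3))
    return i_j1 + (i_j2 - i_j3) * t1


# A 10*10 region holding the values 1..100 (x1*y1 = 100, the stated minimum).
ramp = [[r * 10 + c + 1 for c in range(10)] for r in range(10)]
i_block = estimate_background(ramp)
```

On this linear ramp the estimate equals the 45th-percentile value, which shows why the constraint on j1 + (j2 - j3) × t1 places the estimate just below the median, where the background peak of a right-tailed distribution is expected to sit.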
  • the background intensity of the area where the corresponding coordinate position is located is determined by the above formula, and the intensity of the corresponding coordinate position of the image to be inspected is the ratio of the absolute intensity of the corresponding coordinate position to the background intensity of the area where the corresponding coordinate position is located.
  • the preset threshold is selected from [0.85, 0.95]; S6 is performed to compare the intensity of the corresponding coordinate position of the image to be inspected with the preset threshold, and base recognition is then performed based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold. In this way, bases can be identified accurately with less loss of effective information.
  • the preset threshold value is 0.9, which is suitable for base identification based on images generated by a single molecule sequencing platform.
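The relative-intensity comparison can be sketched as follows. This is an illustration under assumptions: template coordinates are taken to be integer (row, column) pairs already aligned with the image to be inspected, and `background_of` is a hypothetical callback returning the background intensity of the area containing a position. None of these names come from the source.

```python
def call_positions(image, template_coords, background_of, threshold=0.9):
    """Return the template positions whose relative intensity exceeds threshold.

    image           -- image to be inspected, as a list of rows of pixel values
    template_coords -- (row, col) positions taken from the bright spot set
    background_of   -- hypothetical callback: (row, col) -> background intensity
    """
    called = []
    for row, col in template_coords:
        absolute = image[row][col]
        relative = absolute / background_of(row, col)  # ratio to the background
        if relative > threshold:
            called.append((row, col))
    return called


image = [
    [100, 100, 100],
    [100, 250, 100],
    [100, 100, 80],
]
called = call_positions(image, [(1, 1), (2, 2)], lambda r, c: 100)
```

No bright spot detection is run on the image to be inspected here; only the template positions are read, which is the point of the mapping approach.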
  • the preset threshold corresponds to the corresponding coordinate position of the image to be inspected, that is, the preset threshold generally changes with position, and one preset threshold corresponds to one corresponding coordinate position of the image to be inspected. In these cases, the intensity of the corresponding coordinate position of the image to be inspected is the absolute intensity, for example, the pixel value of the position.
  • the preset threshold corresponding to each position can be determined when S6 is performed, or can be determined and saved in advance.
  • the determination of the preset threshold is related to the background intensity of the area where the corresponding coordinate position is located.
  • the area where the corresponding coordinate position is located is an area of the image to be inspected that contains the position, with x2*y2 pixels, and need not be strictly centered on the position; x2 and y2 are both natural numbers, and x2*y2 is not less than 100; preferably, x2*y2 is not less than 1,000. This makes it possible to determine the preset threshold quickly and effectively and to identify bases accurately.
  • determining the preset threshold includes: sorting the pixels in the x2*y2 area where the corresponding coordinate position is located by pixel value to obtain the distribution curve of the number of pixels in the x2*y2 area; and determining the preset threshold based on the distribution curve.
  • the sorting can be ascending or descending. The following takes the ascending result as an example; from this example, those skilled in the art can obtain the curve sorted in descending order and the related parameters or conditions determined from that curve.
  • the preset threshold is determined by calculation.
  • the valley pixel value on the right side of the distribution curve is used as the preset threshold.
  • the preset threshold is expected to be determined such that the reliability of the identified bases reaches 90%, 95%, or more than 99%, which is conducive to subsequent accurate base identification.
  • the right trough pixel value estimated by this formula is more reliable and is suitable for images generated by various sequencing platforms, including images with uniform or uneven distribution
  • j4 is selected from [1,40]
  • j5 is selected from [6,40]
  • j6 is selected from [1,30], and 85 ≤ j4 + (j5 - j6) × t2 ≤ 100; in this way, the right trough pixel value can be estimated more accurately, which is conducive to accurate base identification.
  • base identification is performed based on a position determined to be greater than the preset threshold, which is better suitable for images generated based on a single molecule sequencing platform.
  • the corresponding position (the bright spot) of the image to be inspected can be further screened using its shape and other features, so that the information of the determined position objectively reflects whether a base extension reaction actually occurred at that position, which facilitates accurate base recognition.
  • a computer-readable storage medium for storing a program for execution by a computer. Executing the program includes completing the base identification method in any of the above embodiments.
  • Computer-readable storage media include but are not limited to read-only memory, random access memory, magnetic or optical disks, etc.
  • the computer-readable storage medium may be any device that can contain, store, communicate, propagate, or transmit a program for use by the instruction execution system, apparatus, or device or in combination with these instruction execution systems, devices, or equipment.
  • computer-readable storage media include the following: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM).
  • the computer-readable storage medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in computer memory.
  • the above description of the technical features and advantages of the base identification method in any embodiment is also applicable to the computer-readable storage medium, and will not be repeated here.
  • a base identification system 100 for implementing the base identification method in any of the above embodiments of the present invention.
  • the system includes: a mapping module 10 for mapping the coordinates of the bright spots in the bright spot set corresponding to the sequencing template onto the image to be inspected, to determine the corresponding coordinate positions of the image to be inspected; an intensity determination module 20 for calculating the intensity of the corresponding coordinate positions of the image to be inspected from the mapping module 10; and an identification module 30 for comparing the intensity of the corresponding coordinate positions of the image to be inspected from the intensity determination module 20 with the preset threshold, and performing base identification based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold;
  • the bright spot set corresponding to the sequencing template is constructed based on multiple images.
  • the images and the image to be inspected are all collected from base extension reactions.
  • the images and the image to be inspected correspond to the same field of view.
  • the base recognition method in any embodiment also applies to the base recognition system in this embodiment of the present invention, and will not be repeated here.
  • it further includes a template construction module 12, connected to the mapping module 10, for constructing the bright spot set corresponding to the sequencing template based on multiple images.
  • the images include a first image, a second image, a third image, and a fourth image of the same field of view, corresponding respectively to the four types of base extension reactions A, T/U, G, and C.
  • the first image includes image M1 and image M2, and the second image includes image N1 and image N2.
  • the third image includes image P1 and image P2, and the fourth image includes image Q1 and image Q2.
  • image M1 and image M2 come from two rounds of sequencing reactions, respectively
  • image N1 and image N2 come from two rounds of sequencing reactions, respectively
  • image P1 and image P2 come from two rounds of sequencing reactions, respectively
  • image Q1 and image Q2 come from two rounds of sequencing reactions, respectively; the following is performed in the template construction module 12: merge the bright spots on the first image, the second image, the third image, and the fourth image, record the number of bright spots at the same position, and remove the bright spots whose count is 1, to obtain the bright spot set corresponding to the sequencing template.
  • merging the bright spots on the first image, the second image, the third image, and the fourth image includes: (a) merging the bright spots on image N1 into image M1 to obtain a once-merged image M1, marking each coincident bright spot in the once-merged image M1 as A and each non-coincident bright spot as B, where multiple bright spots in the once-merged image M1 whose distance is less than a first predetermined pixel constitute one coincident bright spot; (b) replacing image N1 with image P1, image Q1, image M2, image N2, image P2, or image Q2, replacing image M1 with the once-merged image M1, and performing (a) repeatedly until the bright spots on all images have been merged, to obtain the original bright spot set; and (c) removing the bright spots labeled B from the original bright spot set to obtain the bright spot set corresponding to the sequencing template.
  • the so-called image is a registered image.
  • the system 100 also includes a registration module 14, connected to the template construction module 12, for performing the following to register images: performing a first registration of the image to be registered based on a reference image, where the reference image and the image to be registered correspond to the same field of view,
  • including: determining a first offset between a predetermined area on the image to be registered and the corresponding predetermined area on the reference image, and moving all bright spots on the image to be registered based on the first offset to obtain the image to be registered after the first registration;
  • performing a second registration of the image to be registered after the first registration, including: merging the image to be registered after the first registration with the reference image to obtain a merged image, and calculating the offsets of all second coincident bright spots in the predetermined area of the merged image to determine a second offset.
  • the multiple bright spots on the combined image whose distance is less than the second predetermined pixel are one second coincident bright spot, based on The second
  • the registration module 14 includes a reference image construction unit 142, configured to perform the following to construct the reference image: acquiring a fifth image and a sixth image, where the fifth image and the sixth image correspond to the same field of view as the image to be registered; performing coarse registration of the sixth image based on the fifth image, including determining the offset of the sixth image relative to the fifth image and moving the sixth image based on the offset to obtain the sixth image after coarse registration; and merging the fifth image and the sixth image after coarse registration to obtain the reference image.
  • constructing the reference image further includes using a seventh image and an eighth image.
  • the fifth image, the sixth image, the seventh image, and the eighth image correspond to the same field of view, respectively during the four types of base extension reactions A, T/U, G, and C.
  • constructing the reference image also includes: performing coarse registration of the seventh image based on the fifth image, including determining the offset of the seventh image relative to the fifth image and moving the seventh image based on the offset to obtain the seventh image after coarse registration; performing coarse registration of the eighth image based on the fifth image, including determining the offset of the eighth image relative to the fifth image and moving the eighth image based on the offset to obtain the eighth image after coarse registration; and merging the fifth image with the sixth image after coarse registration, the seventh image after coarse registration, and the eighth image after coarse registration to obtain the reference image.
  • the so-called reference image and the image to be registered are both binarized images.
  • a two-dimensional discrete Fourier transform is used to determine the first offset, the offset of the sixth image relative to the fifth image, the offset of the seventh image relative to the fifth image, and/or the offset of the eighth image relative to the fifth image.
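One common way to obtain an offset from a two-dimensional discrete Fourier transform is circular cross-correlation computed in the frequency domain; the sketch below assumes that approach, which the source does not spell out. The naive O(N^4) transform is for illustration on tiny binarized images only (a real implementation would use an FFT library), and the names `dft2` and `estimate_offset` are hypothetical.

```python
import cmath


def dft2(img, inverse=False):
    """Naive 2-D discrete Fourier transform of a list-of-rows image."""
    h, w = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2 * cmath.pi * (u * y / h + v * x / w)
                    total += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out


def estimate_offset(reference, moving):
    """Return (dy, dx) such that moving is reference circularly shifted by (dy, dx)."""
    h, w = len(reference), len(reference[0])
    f_ref, f_mov = dft2(reference), dft2(moving)
    # Cross-correlation theorem: multiply one spectrum by the conjugate of the other.
    cross = [[f_mov[u][v] * f_ref[u][v].conjugate() for v in range(w)]
             for u in range(h)]
    corr = dft2(cross, inverse=True)
    # The correlation peak sits at the shift between the two images.
    _, dy, dx = max((corr[y][x].real, y, x) for y in range(h) for x in range(w))
    # Map shifts past the midpoint to negative offsets.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx


def spots_to_image(spots, size=6):
    """Build a binarized size*size image with 1 at each (row, col) spot."""
    img = [[0] * size for _ in range(size)]
    for y, x in spots:
        img[y][x] = 1
    return img


reference = spots_to_image([(1, 1), (2, 4), (4, 2)])
moving = spots_to_image([(2, 5), (3, 2), (5, 0)])  # reference shifted by (1, -2)
offset = estimate_offset(reference, moving)
```

Registering `moving` onto `reference` would then mean shifting its bright spots by the negated offset.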
  • the system 100 further includes a bright spot detection module 16, connected to the mapping module 10, the template construction module 12, and/or the registration module 14, for performing the following to detect bright spots on an image: preprocessing the image to obtain a preprocessed image; determining a critical value to simplify the preprocessed image, including assigning a first preset value to the pixels of the preprocessed image whose pixel value is less than the critical value and assigning a second preset value to the pixels of the preprocessed image whose pixel value is not less than the critical value, to obtain a binarized image;
  • determining a first bright spot detection threshold c1 based on the preprocessed image; and identifying the bright spots on the image based on the preprocessed image and the binarized image, including determining that a pixel matrix meeting at least two of the following conditions i)-iii) is a candidate bright spot:
  • i) in the preprocessed image, the pixel value of the central pixel of the pixel matrix is the largest, where the pixel matrix is a k1*k2 matrix, k1 and k2 are both odd numbers greater than 1, and the k1*k2 pixel matrix contains k1*k2 pixels;
  • ii) in the binarized image, the pixel value of the central pixel of the pixel matrix is the second preset value and the number of connected pixels of the pixel matrix is greater than (2/3)*k1*k2; and
  • iii) in the preprocessed image, the pixel value of the central pixel of the pixel matrix is greater than a third preset value and satisfies g1*g2>c1, where g1 is the correlation coefficient of the two-dimensional Gaussian distribution over the m1*m2 range centered on the central pixel of the pixel matrix, g2 is the pixels of the m1*m2 range, m1 and m2 are both odd numbers greater than 1, and the m1*m2 range includes m1*m2 pixels.
  • the bright spot detection module 16 further performs the following to determine whether a candidate bright spot is a bright spot: determining a second bright spot detection threshold based on the preprocessed image, comparing the pixel value of the candidate bright spot with the second bright spot detection threshold, and determining that a candidate bright spot whose pixel value is not less than the second bright spot detection threshold is a bright spot, where the pixel value at the coordinates of the candidate bright spot is used as the pixel value of the candidate bright spot.
  • determining whether the candidate bright spot is a bright spot includes: dividing the preprocessed image into a set of regions of a predetermined size, sorting the pixel values of the pixels in each region to determine the second bright spot detection threshold corresponding to that region, comparing the pixel values of the candidate bright spots in the region with the second bright spot detection threshold corresponding to the region, and determining that a candidate bright spot whose pixel value is not less than the second bright spot detection threshold corresponding to the region is a bright spot.
  • preprocessing the image includes: using the opening operation to determine the background of the image; based on the background, using the top-hat operation to convert the image; applying Gaussian blur to the converted image; and sharpening the Gaussian-blurred image to obtain the preprocessed image.
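The first two preprocessing steps, background estimation by opening followed by the top-hat conversion, can be sketched in plain Python as below. This is a minimal illustration: the square k*k structuring element, the clipping of windows at the image border, and the function names are assumptions, and the Gaussian blur and sharpening steps are left out.

```python
def _window_filter(img, k, reduce_fn):
    """Apply reduce_fn (min or max) over a k*k window, clipped at the borders."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def opening(img, k=3):
    """Morphological opening: erosion (window min) then dilation (window max)."""
    return _window_filter(_window_filter(img, k, min), k, max)


def top_hat(img, k=3):
    """Top-hat transform: the image minus its opening (the estimated background)."""
    background = opening(img, k)
    return [[p - b for p, b in zip(img_row, bg_row)]
            for img_row, bg_row in zip(img, background)]


# Flat background of 10 with a single bright pixel at the center.
img = [[10] * 5 for _ in range(5)]
img[2][2] = 50
converted = top_hat(img)
```

The opening removes features smaller than the structuring element, so subtracting it leaves the small bright spots on a near-zero background.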
  • determining the critical value to simplify the preprocessed image and obtain a binarized image includes: determining the critical value based on the background and the preprocessed image, and comparing the pixel values of the pixels on the preprocessed image with the critical value to obtain the binarized image.
  • g2 is the corrected pixels of the m1*m2 range; the correction is made according to the proportion of pixels in the corresponding m1*m2 range of the binarized image whose pixel value is the second preset value, to obtain the corrected pixels of the m1*m2 range.
  • one preset threshold corresponds to one or more images to be inspected.
  • the intensity of the corresponding coordinate position of the image to be inspected is the relative intensity, for example, determined by the absolute intensity of the corresponding coordinate position and the background intensity of the region where the corresponding coordinate position is located.
  • determining the background intensity of the area where the corresponding coordinate position is located includes: sorting the pixels in the x1*y1 area where the corresponding coordinate position is located by pixel value to obtain the distribution curve of the number of pixels in the x1*y1 area, where x1 and y1 are both natural numbers and x1*y1 is not less than 100; and determining the background intensity of the region where the corresponding coordinate position is located based on the distribution curve.
  • the sorting is ascending sorting
  • the peak pixel value of the distribution curve is taken as the background intensity of the region where the corresponding coordinate position is located
  • the peak pixel value is determined by the formula I_block = I_j1 + (I_j2 - I_j3) × t1
  • I_j1, I_j2, and I_j3 are the pixel values corresponding to the j1th, j2th, and j3th percentiles, respectively; j1, j2, and j3 are all integers less than 50 and not less than 1, and j2 > 8 + j3
  • t1 is the first correction coefficient
  • the value of t1 is determined by j1, j2, and j3.
  • j1 is selected from [1,40]
  • j2 is selected from [6,40]
  • j3 is selected from [1,30]
  • 40 ≤ j1 + (j2 - j3) × t1 ≤ 50.
  • the preset threshold is selected from any value in [0.85, 0.95].
  • a so-called preset threshold corresponds to a corresponding coordinate position of an image to be inspected.
  • the intensity of the corresponding coordinate position of the image to be inspected is the absolute intensity.
  • the system 100 further includes a threshold determination module 40, connected to the recognition module 30, for determining the preset threshold: sorting the pixels in the x2*y2 area where the corresponding coordinate position is located by pixel value to obtain the distribution curve of the number of pixels in the x2*y2 area, where x2 and y2 are both natural numbers and x2*y2 is not less than 100; and determining the preset threshold based on the distribution curve.
  • the so-called sorting is ascending sorting
  • the right trough pixel value of the distribution curve is used as the preset threshold
  • I_j4, I_j5, and I_j6 are the pixel values corresponding to the j4th, j5th, and j6th percentiles, respectively
  • j4, j5, and j6 are all integers less than 50 and not less than 1, j5 > 8 + j6, t2 is the second correction coefficient, and the value of t2 is determined by j4, j5, and j6.
  • j4 is selected from [1,40]
  • j5 is selected from [6,40]
  • j6 is selected from [1,30], and 85 ≤ j4 + (j5 - j6) × t2 ≤ 100.
  • in another embodiment, a base recognition system includes: a memory for storing data, including a computer-executable program; and a processor for executing the computer-executable program to implement the base recognition method in any of the above embodiments of the present invention.
  • This system is used to implement the base recognition method in any of the above specific embodiments.
  • the above description of the technical features and advantages of the base recognition method in any embodiment is also applicable to the base recognition system, and will not be repeated here.
  • a sequencing system is provided, and the sequencing system includes the base recognition system in any of the above embodiments.
  • a computer program product including instructions which, when the program is executed by a computer, cause the computer to execute the base identification method in any of the above embodiments of the present invention.
  • the above description of the technical features and advantages of the base recognition method in any embodiment is also applicable to the computer program product, and will not be repeated here.
  • a sequencing system including the computer program product of any one of the above-mentioned embodiments of the present invention.
  • the above description of the technical features and advantages of the base recognition method and/or computer program product in any embodiment is also applicable to the sequencing system, and will not be repeated here.
  • in addition to implementing the controller/processor purely as computer-readable program code, it is entirely possible, by logically programming the method steps, to have the controller achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers.
  • Therefore, such a controller/processor can be regarded as a hardware component, and the devices included in it for realizing various functions can also be regarded as structures within the hardware component. The devices for realizing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.

Abstract

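The present invention discloses a base recognition method, a system, a computer program product, and a sequencing system. The base recognition method includes: mapping the coordinates of the bright spots in a bright spot set corresponding to a sequencing template onto an image to be inspected, to determine the corresponding coordinate positions of the image to be inspected; determining the intensity of the corresponding coordinate positions of the image to be inspected; and comparing the intensity of the corresponding coordinate positions of the image to be inspected with a preset threshold, and performing base recognition based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold. By mapping the coordinates of the bright spots in the bright spot set corresponding to the sequencing template onto the image to be inspected and performing base recognition based on the information at the corresponding coordinate positions, the method and/or system can recognize bases quickly, simply, and with high throughput, and determine the nucleic acid sequence.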

Description

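Base recognition method, system, computer program product, and sequencing system
Technical Field
The present invention relates to the field of information processing and recognition, in particular to the processing and analysis of data related to nucleic acid sequence determination, and more particularly to a base recognition method, a base recognition system, a sequencing system, and a computer program product.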
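Background Art
In the related art, including sequencing platforms that use an imaging system to acquire images of nucleic acid molecules (templates) in a biochemical reaction multiple times in order to determine the nucleotide order of those molecules, how to obtain the nucleotide composition and order of at least part of the nucleic acid template effectively, accurately, and with high throughput, including how to recognize and process the images acquired at multiple different time points and the information on those images, is a problem worthy of attention.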
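Summary of the Invention
The embodiments of the present invention aim to solve at least one of the technical problems existing in the related art, or at least to provide an alternative practical solution. To this end, the present invention provides a base recognition method, a base recognition system, a computer program product, and a sequencing system.
According to one embodiment of the present invention, a base recognition method is provided, the method including: mapping the coordinates of the bright spots in the bright spot set corresponding to the sequencing template onto the image to be inspected, to determine the corresponding coordinate positions of the image to be inspected; determining the intensity of the corresponding coordinate positions of the image to be inspected; and comparing the intensity of the corresponding coordinate positions of the image to be inspected with a preset threshold, and performing base recognition based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold. The bright spot set corresponding to the sequencing template is constructed from multiple images; the images and the image to be inspected are all acquired from base extension reactions and correspond to the same field of view; during the base extension reactions, the field of view contains multiple nucleic acid molecules bearing optically detectable labels, and at least some of the nucleic acid molecules appear as bright spots on the images and/or the image to be inspected.
According to one embodiment of the present invention, a base recognition system is provided for implementing the base recognition method in any of the above embodiments of the present invention, the system including: a mapping module for mapping the coordinates of the bright spots in the bright spot set corresponding to the sequencing template onto the image to be inspected, to determine the corresponding coordinate positions of the image to be inspected; an intensity determination module for calculating the intensity of the corresponding coordinate positions of the image to be inspected from the mapping module; and a recognition module for comparing the intensity of the corresponding coordinate positions of the image to be inspected from the intensity determination module with a preset threshold, and performing base recognition based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold. The bright spot set corresponding to the sequencing template is constructed from multiple images; the images and the image to be inspected are all acquired from base extension reactions and correspond to the same field of view; during the base extension reactions, the field of view contains multiple nucleic acid molecules bearing optically detectable labels, and at least some of the nucleic acid molecules appear as bright spots on the images and/or the image to be inspected.
According to yet another embodiment of the present invention, a base recognition system is provided, the system including: a memory for storing data, including a computer-executable program; and a processor for executing the computer-executable program to implement the base recognition method in any of the above embodiments of the present invention.
According to one embodiment of the present invention, a sequencing system is provided, which includes the base recognition system of any of the above embodiments of the present invention.
According to one embodiment of the present invention, a computer-readable storage medium is provided for storing a program for execution by a computer; executing the program includes completing the base recognition method in any of the above embodiments. Computer-readable storage media include, but are not limited to, read-only memory, random access memory, magnetic disks, and optical disks.
According to one embodiment of the present invention, a computer program product is provided, including instructions which, when the program is executed by a computer, cause the computer to execute the base recognition method in any of the above embodiments of the present invention.
According to one embodiment of the present invention, a sequencing system is provided, including the computer program product of any of the above embodiments of the present invention.
The base recognition method, base recognition system, computer-readable storage medium, computer program product, and/or sequencing system in any of the above embodiments map the coordinates of the bright spots in the bright spot set corresponding to the sequencing template directly onto the image to be inspected and perform base recognition based on the information at the corresponding coordinate positions. They can thereby recognize, simply, efficiently, and with high throughput, the type and order of the bases bound to the template nucleic acid during base extension reactions, enabling simple, fast, and accurate determination of the template nucleic acid sequence.
Additional aspects and advantages of the embodiments of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the embodiments of the present invention.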
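Brief Description of the Drawings
Figure 1 is a schematic flowchart of the base recognition method in a specific embodiment of the present invention.
Figure 2 is a schematic flowchart of the base recognition method in a specific embodiment of the present invention.
Figure 3 is a schematic diagram of the process and result of merging the bright spots in images Repeat1-20 to obtain the bright spot set corresponding to the sequencing template in a specific embodiment of the present invention.
Figure 4 is a schematic diagram of the deviation correction process and its result in a specific embodiment of the present invention.
Figure 5 is a schematic diagram of the matrix corresponding to a candidate bright spot and its connected pixels in a specific embodiment of the present invention.
Figure 6 is a schematic diagram of the pixel values in the m1*m2 range centered on the central pixel of the pixel matrix in a specific embodiment of the present invention.
Figure 7 is a schematic comparison of the bright spot detection results before and after the determination based on the second bright spot detection threshold in a specific embodiment of the present invention.
Figure 8 is the distribution curve of the number of pixels of a 300*300 area sorted in ascending order by pixel value in a specific embodiment of the present invention.
Figure 9 is the distribution curve of the number of pixels of a 300*300 area sorted in ascending order by pixel value in a specific embodiment of the present invention.
Figure 10 is a schematic structural diagram of the base recognition system in a specific embodiment of the present invention.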
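Detailed Description
The embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and shall not be construed as limiting it.
In the description herein, the terms "first", "second", "third", "fourth", and so on are used for descriptive purposes only and shall not be construed as indicating or implying relative importance, or as implicitly indicating the number or order of the technical features referred to. In the description of the present invention, "multiple" means two or more, unless otherwise explicitly and specifically defined.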
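In the description herein, a "bright spot" or "bright point" (spots or peaks) refers to a luminous spot or luminous point on an image; one luminous spot occupies at least one pixel.
In the description herein, sequencing, also called sequence determination or gene sequencing, refers to nucleic acid sequence determination, including DNA sequencing and/or RNA sequencing, including long-fragment sequencing and/or short-fragment sequencing, and including identifying the types of the bases at multiple consecutive or non-consecutive specific positions of a nucleic acid sequence and determining their order. Sequencing by synthesis (SBS) or sequencing by ligation (SBL) can be used, which includes the process of nucleotides or nucleotide analogs binding to the template, that is, the base extension reaction. Detecting the "bright spots" is detecting the optical signals from extended bases or base clusters.
Sequencing can be performed on a sequencing platform. The sequencing platform can be selected from, but is not limited to, the Hiseq/Miseq/Nextseq/Novaseq platforms of Illumina, the Ion Torrent platform of Thermo Fisher/Life Technologies, the BGISEQ and MGISEQ platforms of BGI, and single-molecule sequencing platforms. The sequencing mode can be single-end or paired-end sequencing. The sequencing results/data obtained, that is, the fragments read out, are called reads. The length of a read is called the read length.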
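In the description herein, the images acquired from sequencing reactions/base extension reactions, or the images converted or constructed from them, can be grayscale images or color images. For a grayscale image, the pixel value is the same as the gray value; for a 16-bit grayscale image such as a tiff grayscale image, the pixel value ranges from 0 to 65535, and for an 8-bit grayscale image, the pixel value ranges from 0 to 255. For a color image, each pixel has three pixel values; the provided method and/or system can use this array of pixel values directly for image detection/target information recognition, or the color image can first be converted into a grayscale image, and the converted grayscale image then processed and its information recognized, to reduce the computation and complexity of the image detection and signal recognition process. Methods for converting a non-grayscale image into a grayscale image can be selected from, but are not limited to, the floating-point algorithm, the integer method, the shift method, and the average method.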
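The four conversion methods named above can be sketched for a single 8-bit RGB pixel as follows. The source names the methods but gives no coefficients; the weights below are the widely used ITU-R BT.601 luma weights and are an assumption here, as are the function names.

```python
def gray_float(r, g, b):
    """Floating-point method: weighted sum with fractional weights."""
    return r * 0.299 + g * 0.587 + b * 0.114


def gray_integer(r, g, b):
    """Integer method: the same weights scaled by 1000, with rounding."""
    return (r * 299 + g * 587 + b * 114 + 500) // 1000


def gray_shift(r, g, b):
    """Shift method: weights scaled to sum to 256, divided by a right shift."""
    return (r * 77 + g * 150 + b * 29) >> 8


def gray_average(r, g, b):
    """Average method: unweighted mean of the three channels."""
    return (r + g + b) // 3


pixel = (100, 150, 200)
grays = (gray_integer(*pixel), gray_shift(*pixel), gray_average(*pixel))
```

The integer and shift variants avoid floating-point arithmetic, which matters when every pixel of many large images has to be converted.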
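In the description herein, unless otherwise explicitly defined, based on image information, "intensity" and pixel (pixel value) are interchangeable. The magnitude of the intensity or pixel can be a real or objective absolute value, or a relative value, including various transformations of the real pixel value, such as an enlarged pixel value, a reduced pixel value, or a ratio or relationship based on pixel values. Generally, where the intensities/pixel magnitudes of multiple images, bright spots, or positions are compared, they are intensities/pixel magnitudes after the same processing, for example all objective pixel values or all pixel values after the same transformation. Where information at specific positions of one or more images is compared and analyzed, the images are preferably aligned and placed in the same coordinate system when those specific positions are determined.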
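On platforms that determine nucleic acid sequences based on optical imaging, especially single-molecule sequencing platforms, the signal from the target molecule is weak and its intensity changes within a short time. After the images are acquired, base recognition generally proceeds by first detecting and identifying the bright spots on the acquired image, that is, detecting and identifying the real signals from base/base cluster extension, and then matching each of these bright spots against the bright spots on the sequencing template. For example, every bright spot on the image to be inspected is traversed; if the distance between a bright spot on the image to be inspected and a bright spot on the sequencing template is small enough (depending on resolution and so on), the two bright spots are considered coincident, and it is considered that a nucleic acid molecule to be tested is present at the position corresponding to that bright spot of the image to be inspected and has undergone a nucleotide binding reaction (base extension reaction); the type of nucleotide/base bound to the nucleic acid molecule to be tested is thereby recognized. This commonly used base recognition method compares the distances between the coordinates of the bright spots on the image to be inspected and those on the sequencing template to determine whether the nucleic acid molecule represented on the sequencing template has reacted in the current round (lit up; base extension reaction); if it has reacted, the type of base added, or the type of base at the corresponding position of the nucleic acid molecule, is read. Through testing on a large amount of data, the inventors found that this method is seriously affected by image quality, the bright spot localization algorithm, the spot density distribution, and so on, and is prone to misrecognizing bases.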
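Thus, referring to Figure 1, in one embodiment the inventors provide a base recognition method, including: S2, mapping the coordinates of the bright spots in the bright spot set corresponding to the sequencing template onto the image to be inspected, to determine the corresponding coordinate positions of the image to be inspected; S4, determining the intensity of the corresponding coordinate positions of the image to be inspected; and S6, comparing the intensity of the corresponding coordinate positions of the image to be inspected with a preset threshold, and performing base recognition based on the information of the positions on the image to be inspected whose intensity is greater than the preset threshold. The bright spot set corresponding to the sequencing template is constructed from multiple images; the images and the image to be inspected are all acquired from base extension reactions and come from the same field of view; during the base extension reactions, the field of view contains multiple nucleic acid molecules bearing optically detectable labels, and at least some of the nucleic acid molecules appear as bright spots on the images and/or the image to be inspected. The nucleic acid molecule is the template to be sequenced or a nucleic acid complex containing the template; the base extension reaction includes the process of nucleotides, including nucleotide analogs, binding to the template or the nucleic acid complex. Those skilled in the art know that, in images obtained on sequencing platforms that determine sequences by optical detection, nucleotides or nucleic acid molecules bearing optically detectable labels, for example fluorescent molecules, are excited by a laser to emit light and appear as bright spots in the image, each bright spot occupying several pixels.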
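This base recognition (base call) method maps the coordinates of the bright spots in the bright spot set corresponding to the sequencing template directly onto the image to be inspected and performs base recognition based on the information at the corresponding coordinate positions. Compared with the usual approach of detecting the bright spots on the image to be inspected and judging whether each detected bright spot matches a specific bright spot in the bright spot set corresponding to the sequencing template, this method can recognize, simply, efficiently, and with high throughput, the type and order of the bases bound to the template nucleic acid during base extension reactions, enabling simple, fast, and accurate determination of the template nucleic acid sequence. Comparative tests further found that the number of reads obtained by this method, including reads mapped to a unique position of the reference sequence, is 30% or more higher than that obtained by the usual base recognition method.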
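The bright spot set corresponding to the sequencing template can be constructed when the base recognition method is performed, or constructed in advance. In one example, the bright spot set corresponding to the sequencing template is constructed in advance from images and saved for later use. Specifically, referring to Figure 2, the bright spot set corresponding to the sequencing template is constructed from images; the images include a first image, a second image, a third image, and a fourth image of one same field of view, corresponding respectively to the base extension reactions of the four types of bases/nucleotides/nucleotide analogs A, T/U, G, and C. The first image includes image M1 and image M2, the second image includes image N1 and image N2, the third image includes image P1 and image P2, and the fourth image includes image Q1 and image Q2. One round of sequencing reaction is defined as performing the four types of base extension reactions once, sequentially and/or simultaneously. Image M1 and image M2 come from two rounds of sequencing reactions, respectively, as do image N1 and image N2, image P1 and image P2, and image Q1 and image Q2. The method includes: S8, merging the bright spots on the first image, the second image, the third image, and the fourth image, recording the number of bright spots at the same position, and removing the bright spots whose count is 1, to obtain the bright spot set corresponding to the sequencing template. By merging the bright spots on the images directly, this method can construct the bright spot set corresponding to the nucleic acid molecules (sequencing template) quickly and simply. The constructed bright spot set reflects the information of the sequencing template effectively, accurately, and comprehensively, which facilitates subsequent accurate base recognition and yields accurate nucleic acid sequences.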
所称的一轮测序反应,顺序和/或同时实现一次四种类型碱基延伸反应,可以是四种类型碱基反应底物(例如核苷酸类似物/碱基类似物)同时于一个碱基延伸反应体系中实现一轮测序反应,可以是两种类型碱基类似物于一个碱基延伸反应体系中、另外两种类型反应底物于下一个碱基延伸反应体系以实现一轮测序反应,也可以是一种类型碱基类似物于一个碱基延伸反应体系中、依次在四个连续的碱基延伸反应体系中加入该四种类型碱基类似物以实现一轮测序反应。可知,第一图像、第二图像、第三图像和第四图像可以采集自两次碱基延伸反应或者更多次的碱基延伸反应。另外,一个碱基延伸反应可能包含一次图像采集,也可能包含多次图像采集。
在一个示例中,一轮测序反应包括多次碱基延伸反应,例如单色测序,利用的四种类型碱基对应的反应底物(核苷酸类似物)均带有同一种荧光染料,一轮测序反应包括四次碱基延伸反应(4repeats),对于一个视野来说,一次碱基延伸反应包含一次图像采集,图像M1、图像N1、图像P1和图像Q1分别为来自一轮测序反应的四次碱基延伸反应的同一视野。
在另一个示例中,一轮测序反应包括两次碱基延伸反应,例如双色测序,利用的四种类型碱基对应的反应底物(核苷酸类似物)中的两种带有一种荧光染料、另两种带有另一种不同激发波长的荧光染料,一轮测序反应包括两次碱基延伸反应,带有不同染料的两种类型碱基反应底物于一次碱基延伸反应中进行核苷酸结合/延伸反应,对于一个视野,一次碱基延伸反应包括两次于不同激发波长下的图像采集,图像M1、图像N1、图像P1和图像Q1分别来自一轮测序反应的两次碱基延伸反应的两种激发波长下的同一视野。
在再一个示例中,一轮测序反应包括一次碱基延伸反应,例如二代测序平台的双色测序反应,四种类型碱基反应底物(例如核苷酸类似物)分别带有染料a、带有染料b、带有染料a和染料b以及不带任何染料,染料a和染料b的激发波长不一样;四种类型反应底物于同一次碱基延伸反应中实现一轮测序反应,一次碱基延伸反应包括两次于不同激发波长下的图像采集,第一图像同第三图像、第二图像同第四图像,图像M1和图像N1分别来自不同轮测序反应或者同一轮测序反应中的不同激发波长下的同一视野。
在又一个示例中,一轮测序反应包括一次碱基延伸反应,例如四色测序反应,四种类型碱基反应底物(例如核苷酸类似物)分别带有染料a、带有染料b、带有染料c和染料d,染料a、染料b、染料c和染料d的激发波长均不一样;四种类型反应底物于同一次碱基延伸反应中实现一轮测序反应,一次碱基延伸反应包括四次于不同激发波长下的图像采集,图像M1、图像N1、图像P1和图像Q1分别来自不同轮测序反应或者同一轮测序反应中的不同激发波长下的同一视野。
发明人设计测序模板(template)的构建算法发现,在部分模板构建算法中,由于部分亮斑在构建模板过程中被丢弃,且一般地,采集自第一轮测序反应的图像信息对template构建的影响远大于采集自后续反应该视野的图像,容易损失对应测序模板的亮斑。在某些具体实施方式中,S8包括:(a)合并图像N1上的亮斑至图像M1中,获得一次合并图像M1,对于一次合并图像M1中的重合亮斑依据重合亮斑包含的亮斑的数目进行计数标记,对于非重合亮斑记为1,在一次合并图像M1中的距离小于第一预定像素的多个亮斑为一个重合亮斑;(b)以图像P1、图像Q1、图像M2、图像N2、图像P2或图像Q2替代图像N1,以一次合并图像M1替代图像M1,多次进行(a)直至完成所有图像上的亮斑的合并,获得原始亮斑集合;(c)去除原始亮斑集合中的标记为1的亮斑,以获得对应测序模板的亮斑集合。如此,能够平衡不同轮测序反应的图像的权重,获得更多更准确的模板亮斑,能够快速简便且准确地获取对应测序模板的亮斑集,利于准确识别碱基获得读段。
在一个示例中,所使用的成像系统,电子传感器的尺寸为6.5μm,显微镜放大倍率60倍,能看到的最小尺寸就是0.1μm。对应核酸分子的亮斑的大小一般为小于10*10像素。所称的第一预定像素,在一个示例中,设置为1.05像素。如此,能够准确的进行重合亮斑的判断,利于测序模板(亮斑集合)的准确构建。
在一个示例中,设置一个模板向量(TemplateVec)承载第1-20Repeats相同视野的图像上的亮斑(Peaks)的合并结果,每次合并都对合并成功的亮斑进行计数,所有合并完成后,去掉计数为1的点。具体地,将图像Repeat1的peaks合并至TemplateVec时,由于初始时TemplateVec上没有任何亮斑,所以TemplateVec中亮斑总数等于图像Repeat1上的亮斑数,且所有亮斑计数为1;将图像Repeat2上的亮斑合并至TemplateVec时,先判断每个Repeat2的亮斑在TemplateVec中是否有距离小于1.05像素的亮斑,若有则合并在一起得重合亮斑,取这两个亮斑中的任一个的位置作为该重合亮斑的位置或者取这两个亮斑的平均位置作为该重合亮斑的位置,并对该重合亮斑计数加1;若不存在距离小于1.05像素的亮斑,则把该亮斑追加至TemplateVec中,计数为1;重复上述步骤直至完成图像Repeat20上的亮斑合并至TemplateVec;最后,对TemplateVec中的亮斑进行筛选,去除计数为1的亮斑。
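The TemplateVec merging described above can be sketched as follows. This is an illustration under assumptions: peaks are (x, y) tuples, the merged position is the plain average of the two positions (one of the two options the text allows), and each new peak is only compared against entries that were in the template before the current image was processed. The function names are hypothetical.

```python
import math


def merge_into_template(template, peaks, max_dist=1.05):
    """Merge one image's peaks into the template, a list of [x, y, count]."""
    n_existing = len(template)  # only compare against earlier images' spots
    for px, py in peaks:
        for entry in template[:n_existing]:
            if math.hypot(entry[0] - px, entry[1] - py) < max_dist:
                # Coincident spot: average the positions and bump the count.
                entry[0] = (entry[0] + px) / 2.0
                entry[1] = (entry[1] + py) / 2.0
                entry[2] += 1
                break
        else:
            # No spot within max_dist: append as a new entry with count 1.
            template.append([px, py, 1])
    return template


def build_template(peaks_per_repeat, max_dist=1.05):
    """Merge all repeats in order, then drop spots seen only once."""
    template = []
    for peaks in peaks_per_repeat:
        merge_into_template(template, peaks, max_dist)
    return [(x, y, count) for x, y, count in template if count > 1]
```

The linear scan is quadratic in the number of spots; a spatial index (for example a grid keyed by integer pixel coordinates) would be a natural replacement for dense fields of view.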
由于不同的加入顺序的原因,某些早期距离较远的亮斑,在合并后,会成为近距离亮斑,在一个示例中,进一步对TemplateVec内距离小于1.05像素的亮斑进行再一次合并。如此,利于获得更准确的测序模板。图3示意上述测序模板构建过程,图中的圆圈示意亮斑。
对于涉及多个亮斑合并成一个亮斑的情形,合并后的亮斑的位置/坐标,在一些示例中,可以通过合并前的多个亮斑的重心坐标来确定,例如取多个亮斑的重心坐标的任一个或者平均值作为合并后的亮斑的坐标,或者依据合并前的多个亮斑的坐标,设置权重,来确定合并后的亮斑的坐标,例如,依据合并前的各亮斑包含的亮斑数目和/或合并前的各亮斑来自第几轮测序反应,设置不同的贡献值,即对合并前的各亮斑坐标设置不同的权重来确定合并后的亮斑的坐标,如此,有利于获得较准确的反映真实情况(对应测序模板)的亮斑的信息。在一个示例中,合并前的一个亮斑包含多个亮斑的,设置相对高的权重,和/或合并前的亮斑来自第一轮 测序反应图像,设置相对高的权重,如此,能够使合并后的亮斑的信息客观、准确地反映出合并前的信息,利于构建出准确的对应测序模板的亮斑集合,利于碱基的准确识别。在某些具体实施方式中,图像为经过配准的图像。如此,利于准确地获取对应与测序模板的亮斑集合,利于准确的碱基识别。
对实现图像配准即进行图像纠偏的方式不作限制。在一些示例中,利用如下方法进行图像配准,包括:基于参考图像对待配准图像进行第一配准,参考图像和待配准图像对应相同对象,参考图像和待配准图像均包含多个亮斑,包括确定待配准图像上的预定区域和参考图像上的相应预定区域的第一偏移量,基于第一偏移量移动待配准图像上的所有亮斑,获得第一配准后的待配准图像;基于参考图像对第一配准后的待配准图像进行第二配准,包括合并第一配准后的待配准图像和参考图像,获得合并图像,计算合并图像上的预定区域的所有重合亮斑的偏移量,以确定第二偏移量,距离小于预定像素的两个或多个亮斑为一个重合亮斑,基于该第二偏移量移动第一配准后的待配准图像上的所有亮斑,以实现对待配准图像的配准。该图像配准方法通过两次关联配准,可相对称为粗配准和细配准,包括利用图像上的亮斑进行细配准,能够基于少量数据信息快速地实现图像的高精度纠偏,特别适于高精度图像纠偏要求的场景。例如,单分子级别的图像检测,比如来自第三代测序平台的测序反应的图像。所称单分子级别指分辨率为单个或少数几个分子的大小,例如不多于10个、8个、5个、4个或3个分子。
在某些具体实施方式中,所称的“亮斑”对应延伸碱基或碱基簇的光学信号或者其它发亮物质的干扰信号。所称的图像上的预定区域,可以是整个图像,也可以是图像的一部分。在一个示例中,图像上的预定区域为图像的一部分,例如为图像中心的512*512区域。所称的图像中心,为该视野的中心,成像系统的光轴与成像平面的交点可称为图像中心点,以该中心点为中心的区域可视为图像中心区域。
在某些具体实施方式中,待配准图像来自核酸测序平台。具体地,来自利用光学成像原理进行序列测定的测序平台,该平台包括成像系统和核酸样本承载系统,带有光学检测标记的待测核酸分子固定于反应器中,该反应器也称为芯片,芯片装载在一个可移动台子上,通过该可移动台子带动芯片运动来实现对位于芯片不同位置(不同视野)的待测核酸分子进行图像采集。一般地,光学系统和/或可移动台子的运动存在精度限制,例如,指令指定运动至某个位置和该机械结构实际运动达到的位置存在偏差,特别是在对精度高要求的应用情景,由此,在依据指令移动硬件以对不同时间点的同一位置(视野)进行多次图像采集的过程中,不同时间点采集的同一视野的多个图像难以完全对齐,对该些图像进行纠偏对齐,有利于基于该多个时间点采集的多个图像中的信息的变化来准确确定核酸分子核苷酸顺序。
在某些具体实施方式中,所称的参考图像是通过构建获得的,参考图像可以在对待配准图像进行配准时构建,也可以预先构建保存需要时调用。
在一些示例中,构建参考图像包括:获取第五图像和第六图像,第五图像和第六图像与待配准图像对应相同视野/对象;基于第五图像对第六图像进行粗配准,包括确定第六图像和第五图像的偏移量,基于该偏移量移动第六图像,获得粗配准后的第六图像;合并第五图像和粗配准后的第六图像,以获得参考图像,第五图像和第六图像均包含多个亮斑。如此,利用构建获得包含更多或相对更完整的信息的参考图像,利用该图像作为纠偏的基准,利于实现更准确的图像配准。对于核酸序列测定得到的图像,利用多个图像进行参考图像构建,利于使得该参考图像获得完整的对应核酸分子的亮斑信息,利于基于亮斑的图像纠偏,进而利于对应测序模板的亮斑集合的获取以及碱基识别。
在一些实施例中,第五图像、第六图像分别来自核酸序列测定反应(测序反应)的不同时刻的同一个视野。在一个示例中,一轮测序反应包括多次碱基延伸反应,例如单色测序,利用的四种类型碱基对应的反应底物(核苷酸类似物)均带有同一种荧光染料,一轮测序反应包括四次碱基延伸反应(4repeats),对于一个视野来说,一次碱基延伸反应包含一次图像采集,第五图像和第六图像分别来自不同次的碱基延伸反应的同一视野。如此,通过处理以及集合第五图像和第六图像的信息获得的参考图像作为纠偏的基准,利于进行更准确的图像纠偏。
在另一个示例中,进行单分子双色测序反应,利用的四种类型碱基对应的反应底物(核苷酸类似物)中的两种带有一种荧光染料、另两种带有另一种不同激发波长和发射波长的荧光染料,一轮测序反应包括两次碱基延伸反应,带有不同染料的两种类型碱基反应底物于一次碱基延伸反应中进行结合反应,对于一个视野,一次碱基延伸反应包括两次于不同激发波长下的图像采集,第五图像和第六图像分别来自不同次的碱基延伸反应或者同一次碱基延伸反应中的不同激发波长下的同一视野。如此,通过处理以及集合第五图像和第六图像的信息获得的参考图像作为纠偏的基准,利于进行更准确的图像纠偏。
在又一个示例中,一轮测序反应包括一次碱基延伸反应,例如二代测序平台的双色测序反应,四种类型碱基反应底物(例如核苷酸类似物)分别带有染料a、带有染料b、带有染料a和染料b以及不带任何染料,染料a和染料b被激发后的发射波长不一样;或者,例如四色测序,四种类型碱基反应底物(例如核苷酸类似物)分别带有染料a、染料b、染料c和染料d,染料a、b、c和d被激发后的发射波长不一样;四种类型反应底物于同一次碱基延伸反应中实现一轮测序反应,第五图像和第六图像分别来自不同轮测序反应或者同一轮测序反应中的不同激发波长下的同一视野。如此,通过处理以及集合第五图像和第六图像的信息获得的参考图像作为纠偏的基准,利于进行更准确的图像纠偏。
第五图像和/或第六图像,可以是一个图像也可以是多个图像。在一个示例中,第五图像为第一图像,第六图像为第二图像。进一步地,在一些具体实施方式中,还包括利用第七图像和第八图像构建所称的参考图像,待配准图像、第五图像、第六图像、第七图像和第八图像来自测序反应的相同视野,第五图像、第六图像、第七图像和第八图像分别对应A、T/U、G和C四种类型碱基延伸反应时的视野,构建参考图像还包括:基于第五图像对第七图像进行粗配准,包括确定第七图像相对于第五图像的偏移量,基于该偏移量移动第七图像,获得粗配准后的第七图像;基于第五图像对第八图像进行粗配准,包括确定第八图像相对于第五图像的偏移量,基于该偏移量移动第八图像,获得粗配准后的第八图像;合并第五图像和粗配准后的第六图像、粗配准后的第七图像以及粗配准后的第八图像,以获得参考图像。
对第一配准的实现方式不作限制,例如可利用傅里叶变换,使用频域配准,确定第一偏移量。具体地,例如可参考Kenji TAKITA et al,IEICE TRANS.FUNDAMENTALS,VOL.E86-A,NO.8AUGUST 2003.中的纯相位相关函数(Phase-Only Correlation Function)中的二维离散傅里叶变换确定第一偏移量、第六图像和第五图像的偏移量、第七图像和第五图像的偏移量和/或第八图像和第五图像的偏移量。第一配准/粗配准可达到1像素(1pixel)的精度。如此,可快速准确地确定第一偏移量和/或构建利于精确纠偏的参考图像。
在某些具体实施方式中,参考图像和待配准图像为二值化图像。如此,利于减少运算量快速纠偏。
在一个示例中,待纠偏图像和参考图像均为二值化图像,即图像中的各个像素非a即b,例如a为1,b为0,像素标记为1的较像素标记为0的亮,或者说强度大;参考图像是利用一轮测序反应的四次碱基延伸反应的图像repeat1、repeat2、repeat3和repeat4构建的,第五图像、第六图像选自图像repeat1-4中的任一个、两个或三个。
在一个示例中,第五图像为图像repeat1,图像repeat2、repeat3和repeat4为第六图像,基于图像repeat1依次对图像repeat2-4进行粗配准,分别获得粗配准后的图像repeat2-4;合并图像repeat1和粗配准后的图像repeat2-4,获得参考图像。所称的合并图像为合并图像中的重合亮斑。基于对应核酸分子的亮斑的大小和成像系统分辨率,在一个示例中,设定两个图像上距离不大于1.5个像素的两个亮斑为重合亮斑。这里,采用4个repeat的合成的图像中心区域作为参考图像,一来利于使得参考图像具有足够量的亮斑,利于后续配准,二来检测及定位出的图像中心区域中的亮斑,亮斑信息是相对更准确的,利于准确配准。
在一个示例中,进行如下步骤对图像进行纠偏:1)对采集自另一轮测序反应的一次碱基延伸反应的某个视野的图像repeat5进行粗纠偏,repeat5为二值化后的图像,取该图像中心例如512*512区域,与repeat1-4合成的中心图像(相应参考图像的中心512*512区域),进行二维离散傅里叶变换,使用频域配准,得到偏移量offset(x0,y0),即实现图像粗配准,x0、y0能达到1pixel的精度;2)将上述粗配准后的图像和参考图像基于图像上的亮斑进行合并(merge),包括计算repeat5图像的中心区域内与参考图像相应区域内的重合亮斑的偏移量offset(x1,y1)=待纠偏图像的该亮斑的坐标位置-参考图像上的相应亮斑的坐标位置,可表示为offset(x1,y1)=curRepeatPoints-basePoints;求取所有重合亮斑的平均偏移量,从而得到[0,0]到[1,1]范围内的细偏移量。在一个示例中,设定两个图像上距离不大于1.5个像素的两个亮斑为重合亮斑;3)综上,得到一个视野图像(fov)不同cycle的偏移量(x0,y0)-(x1,y1),对于一个亮斑(peak)可表示为:curRepeatPoints+(x0,y0)-(x1,y1),curRepeatPoints表示该亮斑原始坐标,即在纠偏前的图像中的坐标。上述图像纠偏获得的纠偏结果具有较高的准确性,且纠偏精度小于或等于0.1像素。图4示意纠偏过程及结果,图4中,基于图像A对图像C进行纠偏,图像A和图像C中的圆圈表示亮斑、相同数字标记的亮斑为重合亮斑,图像C->A表示纠偏结果,即图像C对齐至图像A的结果。
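The fine-registration step and the final coordinate correction, curRepeatPoints+(x0,y0)-(x1,y1), can be sketched as below. The sketch assumes the coarse offset (x0, y0) has already been obtained (for example by frequency-domain registration) and applied to the spots passed to `fine_offset`; pairing each spot with its nearest reference spot is an assumption, and the function names are hypothetical.

```python
import math


def fine_offset(cur_points, base_points, max_dist=1.5):
    """Mean offset (x1, y1) = cur - base over all coincident spot pairs.

    cur_points must already be shifted by the coarse offset (x0, y0).
    """
    dxs, dys = [], []
    for cx, cy in cur_points:
        # Pair each spot with its nearest reference spot (an assumed strategy).
        bx, by = min(base_points, key=lambda b: math.hypot(cx - b[0], cy - b[1]))
        if math.hypot(cx - bx, cy - by) <= max_dist:
            dxs.append(cx - bx)
            dys.append(cy - by)
    if not dxs:
        return (0.0, 0.0)
    return (sum(dxs) / len(dxs), sum(dys) / len(dys))


def correct_point(point, coarse, fine):
    """Apply curRepeatPoints + (x0, y0) - (x1, y1) to one original coordinate."""
    return (point[0] + coarse[0] - fine[0], point[1] + coarse[1] - fine[1])
```

Averaging over many coincident spots is what lets the fine step reach sub-pixel precision even though each individual spot position is noisy.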
对图像上亮斑的识别检测方式不作限定。在某些具体实施方式中,检测识别图像上的亮斑,即检测出图像上来自于延伸碱基/碱基簇的信号,包括利用k1*k2矩阵对图像进行亮斑检测, 判定矩阵的中心像素值不小于矩阵非中心任一像素值的矩阵对应一个候选亮斑,以及确定候选亮斑是否为亮斑,k1和k2均为大于1的奇数,k1*k2矩阵包含k1*k2个像素点。所称的图像例如为待配准图像、构建参考图像中的图像等。利用该方式检测图像上的亮斑,能够快速有效地实现图像上的亮斑(spots或peaks)的检测,特别是对采集自核酸序列测定反应的图像。该方法对待检测图像即原始输入数据没有特别的限制,适用于任何利用光学检测原理进行核酸序列测定的平台所产生的图像的处理分析,包括但不限于二代和三代测序,具有高准确性和高效的特点,能从图像中获取更多的代表序列的信息。特别是对于随机图像及高准确度要求的信号识别,尤其具有优势。
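In some embodiments, the image comes from a nucleic acid sequencing reaction, and the nucleic acid molecules carry an optically detectable label, for example a fluorescent label. Fluorescent molecules emit fluorescence when excited by laser light of a specific wavelength, and the image is acquired by the imaging system. The acquired image contains light spots/bright spots that may correspond to the positions of fluorescent molecules. Understandably, at the focal plane the bright spots corresponding to fluorescent molecules are smaller and brighter, while away from the focal plane they are larger and dimmer. In addition, the field of view may contain other non-target substances or information that is difficult to use later, such as impurities; furthermore, when imaging a single-molecule field of view, large aggregates of molecules (clusters) also interfere with the acquisition of target single-molecule information. The term single molecule refers to one or a few molecules, for example no more than 10, 8, 6, 5 or 3 molecules, such as one, two, three, four, five, six or eight.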
在一些示例中,矩阵的中心像素值大于第一预设值,矩阵非中心任一像素值大于第二预设值,第一预设值和第二预设值与图像的平均像素值相关。
在一些实施例中,可以利用k1*k2矩阵对图像进行遍历检测,所称的第一预设值和/或第二预设值的设置与该图像的平均像素值相关。对于灰度图像,像素值同灰度值。k1*k2矩阵,k1、k2可以相等也可以不相等。在一个示例中,成像系统相关参数为:物镜60倍,电子传感器的尺寸为6.5μm,经过显微镜成的像再经过电子传感器,能看到的最小尺寸为0.1μm,获得的图像或者输入的图像可为512*512、1024*1024或2048*2048的16位的灰度或彩色图像,k1和k2的取值范围均为大于1且小于10。在一个示例中,k1=k2=3;在另一个示例中,k1=k2=5。
在一个示例中,发明人经过大量图像处理统计,取第一预设值为该图像的平均像素的1.4倍,取第二预设值为该图像的平均像素值的1.1倍,能够排除干扰、获得来自于光学检测标记的亮斑检测结果。
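The k1*k2 matrix scan with the 1.4x and 1.1x mean-based preset values can be sketched as follows, assuming a square window (k1 = k2 = k), a grayscale image stored as a nested list, and windows restricted to positions where the full matrix fits inside the image. The function name is hypothetical.

```python
def find_candidates(image, k=3, center_factor=1.4, edge_factor=1.1):
    """Return (x, y) centers of k*k windows that qualify as candidate spots."""
    height, width = len(image), len(image[0])
    mean = sum(sum(row) for row in image) / (height * width)
    half = k // 2
    candidates = []
    for y in range(half, height - half):
        for x in range(half, width - half):
            center = image[y][x]
            neighbours = [
                image[y + dy][x + dx]
                for dy in range(-half, half + 1)
                for dx in range(-half, half + 1)
                if (dy, dx) != (0, 0)
            ]
            if (
                center >= max(neighbours)                 # center not below any other pixel
                and center > center_factor * mean         # first preset value
                and min(neighbours) > edge_factor * mean  # second preset value
            ):
                candidates.append((x, y))
    return candidates
```

Tying both preset values to the image mean makes the test scale with overall exposure, which is why the same factors can be reused across images of different brightness.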
可利用大小、与理想亮斑的相似程度和/或强度来对候选亮斑进一步进行筛选判断。在某些具体实施方式中,利用候选亮斑对应的连通域的大小来定量反映比较图像上候选亮斑的大小,以此来筛选判断候选亮斑是否为要的亮斑。
在一个示例中,确定候选亮斑是否为亮斑包括:计算一个候选亮斑对应的连通域的大小Area=A*B,判定对应的连通域的大小大于第三预设值的候选亮斑为一个亮斑,A表示以该候选亮斑对应的矩阵的中心的所在行的相连像素/连通像素的大小,B表示以该候选亮斑对应的矩阵的中心的所在列的相连像素/连通像素的大小,定义一个k1*k2矩阵中大于平均像素值的相连像素为一个所称的候选亮斑对应的连通域。如此,能够有效获得对应标记分子且符合后续序列识别的亮斑,获得核酸序列信息。
在一个例子中,以该图像的平均像素值作为基准,相邻的不小于平均像素值的两个或多个像素为所称的相连像素/连通像素(pixel connectivity),如图5所示,加粗加大的表示候选亮斑对应的矩阵的中心,粗线框表示候选亮斑对应的3*3矩阵,标记为1的像素为不小于该图像的平均像素值的像素点,标记为0的像素为小于平均像素值的像素点,可看出A=3,B=6,该候选亮斑对应的连通域的大小为A*B=3*6。
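One way to read Area=A*B is as the product of two run lengths through the matrix center: A counts the consecutive pixels not below the mean along the center's row, and B counts them along the center's column. This reading is an assumption drawn from the wording above, and the helper names are hypothetical.

```python
def run_length(line, index, threshold):
    """Length of the run of values >= threshold that contains line[index]."""
    if line[index] < threshold:
        return 0
    lo = index
    while lo > 0 and line[lo - 1] >= threshold:
        lo -= 1
    hi = index
    while hi < len(line) - 1 and line[hi + 1] >= threshold:
        hi += 1
    return hi - lo + 1


def connected_area(image, x, y, threshold):
    """Area = A * B for the candidate spot centered at (x, y)."""
    a = run_length(image[y], x, threshold)                   # along the center's row
    b = run_length([row[x] for row in image], y, threshold)  # along the center's column
    return a * b
```

With a row run of 3 and a column run of 6, as in the A=3, B=6 example, the sketch returns 18.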
所称的第三预设值可依据该图像上所有候选亮斑对应的连通域的大小这一信息来确定。例如通过计算该图上各候选亮斑对应的连通域的大小,取亮斑的连通域大小的平均值代表该图像一个特性,作为第三预设值;又例如,可将该图像上各个候选亮斑对应的连通域大小按从小到大排序,取第50、第60、第70、第80或第90百分位数连通域大小作为该第三预设值。如此,可有效获得亮斑信息,利于后续识别核酸序列。
在某些示例中,通过统计设置参数来定量反映比较候选亮斑的强度特征,以此来筛选候选亮斑。在一个示例中,确定候选亮斑是否为亮斑包括:计算一个候选亮斑的分值Score=((k1*k2-1)CV-EV)/((CV+EV)/(k1*k2)),判定分值大于第四预设值的候选亮斑为一个亮斑,CV表示候选亮斑对应的矩阵的中心像素值,EV表示亮斑对应的矩阵的非中心像素值的总和。如此,能够有效获得对应标记分子且符合后续序列识别的亮斑,获得核酸序列信息。
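The Score formula can be evaluated directly on the k1*k2 window of a candidate spot. The sketch below assumes the window is a nested list with odd dimensions and a non-zero pixel sum; the function name is hypothetical.

```python
def spot_score(window):
    """Score = ((k1*k2 - 1) * CV - EV) / ((CV + EV) / (k1*k2))."""
    k1, k2 = len(window), len(window[0])
    n = k1 * k2
    cv = window[k1 // 2][k2 // 2]               # center pixel value
    ev = sum(sum(row) for row in window) - cv   # sum of the non-center pixel values
    return ((n - 1) * cv - ev) / ((cv + ev) / n)
```

The numerator compares the center with the average of its neighbours and the denominator normalizes by the window mean, so a flat window scores 0 and a sharply peaked one scores high regardless of overall brightness.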
所称的第四预设值可依据该图像上所有候选亮斑的分值的大小这一信息来确定。例如,当 该图像上的候选亮斑的数量大于一定数目符合统计上对量的要求,例如该图像上候选亮斑的数目大于30,可计算且将该图像的所有候选亮斑的Score值按升序排序,第四预设值可设置为第50、第60、第70、第80或90分位数Score值,如此,可排除掉小于第50、第60、第70、第80或第90分位数Score值的候选亮斑,利于有效获得目标亮斑,利于后续碱基序列准确识别。进行该处理或者说该筛选设置的依据是,一般地,认为中心与边缘强度/像素值差异大且汇聚的亮斑为与待检分子所在位置相对应的亮斑。一般情况下,图像上的候选亮斑的数量大于50、大于100或大于1000。
在某些示例中,结合形态和强度/亮度对候选亮斑进行筛选。在一个示例中,确定候选亮斑是否为亮斑包括:计算一个候选亮斑对应的连通域的大小Area=A*B,以及计算一个候选亮斑的分值Score=((k1*k2-1)CV-EV)/((CV+EV)/(k1*k2)),A表示以该候选亮斑对应的矩阵的中心的所在行的相连像素/连通像素的大小,B表示以该候选亮斑对应的矩阵的中心的所在列的相连像素/连通像素的大小,定义一个k1*k2矩阵中大于平均像素值的相连像素为一个所称的候选亮斑对应的连通域,CV表示候选亮斑对应的矩阵的中心像素值,EV表示亮斑对应的矩阵的非中心像素值的总和;判定对应的连通域的大小大于第三预设值且分值大于第四预设值的候选亮斑为一个亮斑。如此,能够有效地获得对应核酸分子且利于后续序列识别的亮斑信息。对于所称的第三预设值和/或第四预设值,可以参照前面具体实施方式进行考虑和设置。
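In some embodiments, detecting bright spots includes: preprocessing an image to obtain a preprocessed image, the image being at least one selected from the first, second, third, fourth, fifth, sixth, seventh and eighth images; determining a critical value to simplify the preprocessed image, which includes assigning a first preset value to pixels of the preprocessed image whose pixel value is less than the critical value and a second preset value to pixels whose pixel value is not less than the critical value, to obtain a simplified image; determining a first bright spot detection threshold c1 based on the preprocessed image; identifying candidate bright spots on the image based on the preprocessed image and the simplified image, which includes judging a pixel matrix that satisfies at least two of the following conditions i)-iii) to be a candidate bright spot: i) in the preprocessed image, the pixel value of the center pixel of the pixel matrix is the largest, where the pixel matrix can be expressed as r1*r2, r1 and r2 are both odd numbers greater than 1, and the r1*r2 pixel matrix contains r1*r2 pixels; ii) in the simplified image, the pixel value of the center pixel of the pixel matrix is the second preset value and the connected pixels of the pixel matrix exceed (2/3)*r1*r2; and iii) in the preprocessed image, the pixel value of the center pixel of the pixel matrix is greater than a third preset value and g1*g2>c1 is satisfied, where g1 is the correlation coefficient of a two-dimensional Gaussian distribution over the m1*m2 range centered on the center pixel of the pixel matrix, g2 is the pixels of that m1*m2 range, m1 and m2 are both odd numbers greater than 1, and the m1*m2 range contains m1*m2 pixels; and determining whether a candidate bright spot is a bright spot. Detecting bright spots in this way, using judgment conditions or combinations of conditions that the inventors determined by training on a large amount of data, allows bright spots on an image to be detected quickly and effectively, especially on images acquired from nucleic acid sequencing reactions. The method places no particular restriction on the image to be detected, i.e. the raw input data, and is applicable to the processing and analysis of images produced by any platform that determines nucleic acid sequences by optical detection, including but not limited to second- and third-generation sequencing. It is accurate and efficient and can extract more sequence-representing information from an image. It is particularly advantageous for random images and for signal identification with high accuracy requirements.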
对于灰度图像,像素值同灰度值。若图像是彩色图像,彩色图像的一个像素点具有三个像素值,可以将彩色图像转化为灰度图像,再进行亮斑检测,以降低图像检测过程的计算量和复杂度。可选择但不限于利用浮点算法、整数方法、移位方法或平均值法等将非灰度图像转换成灰度图像。
在一些实施例中,预处理图像包括:利用开运算确定图像的背景;基于背景,利用顶帽运算将图像转化为第一图像;对第一图像进行高斯模糊处理,获得第二图像;对第二图像进行锐化,以获得所称的预处理后的图像。如此,能对图像进行有效的降噪或者说提高图像的信噪比,利于亮斑的准确检测。
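The preprocessing may follow the method disclosed in CN107945150A (image processing method and system for gene sequencing). Specifically, opening is a morphological operation consisting of erosion followed by dilation: erosion shrinks the foreground (the part of interest), while dilation enlarges it. Opening can be used to remove small objects, separate objects at thin connections, and smooth the boundaries of larger objects without noticeably changing their area. This embodiment places no particular limit on the size of the structuring element p1*p2 (the basic template used to process the image) used for the opening; p1 and p2 are odd numbers. In one example, the structuring element p1*p2 may be 15*15, 31*31, etc., and in each case a preprocessed image suited to subsequent processing and analysis is obtained.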
顶帽运算往往用来分离比临近点(亮点/亮斑)亮一些的斑块,在一幅图像具有大幅的背景,而微小物品比较有规律的情况下,可以使用顶帽运算进行背景提取。在一个示例中,对图像进行顶帽变换包括先对图像做开运算,进而利用原图像减去开运算结果,获得第一图像即顶帽变换后的图像。顶帽变换的数学表达式为dst=tophat(src,element)=src-open(src,element)。发明人认为,开运算的结果放大了裂缝或者局部低亮度的区域,因此从原图中减去开运算后的 图,得到的图像突出了比原图轮廓周围的区域更明亮的区域,这一操作与选择的核的大小相关,可以认为与亮点/亮斑的预期大小相关,若亮点不是预期大小,处理后的效果会使得整张图产生许多小凸起,具体可以参考虚焦图片,即亮点/亮斑晕染成一团。在一个示例中,亮点的预期大小即选择的核的大小为3*3,得到的顶帽变换后的图像利于后续进一步去噪处理。
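The relation dst=tophat(src,element)=src-open(src,element) can be checked with a small pure-Python sketch. It assumes a square structuring element and simply clips windows at the image border, which may differ from the border handling of a library such as OpenCV; the helper names are hypothetical.

```python
def _morph(image, size, pick):
    """Apply `pick` (min for erosion, max for dilation) over size*size windows."""
    height, width = len(image), len(image[0])
    half = size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def opening(image, size=3):
    """Opening: erosion followed by dilation."""
    return _morph(_morph(image, size, min), size, max)


def top_hat(image, size=3):
    """Top-hat: the source image minus its opening."""
    opened = opening(image, size)
    return [[s - o for s, o in zip(src_row, open_row)]
            for src_row, open_row in zip(image, opened)]
```

A feature smaller than the structuring element is removed by the opening and therefore survives the subtraction, while a slowly varying background cancels out.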
高斯模糊(Gaussian Blur)也称为高斯滤波,是一种线性平滑滤波,适用于消除高斯噪声,广泛应用于图像处理的减噪过程。通俗的讲,高斯滤波就是对整幅图像进行加权平均的过程,每一个像素点的值,都由其本身和邻域内的其他像素值经过加权平均后得到。高斯滤波的具体操作是:用一个模板(或称卷积、掩模)扫描图像中的每一个像素,用模板确定的邻域内像素的加权平均灰度值去替代模板中心像素点的值。在一个示例中,对第一图像进行高斯模糊处理,在OpenCV中使用高斯滤波GaussianBlur函数进行,高斯分布参数Sigma取0.9,所使用的二维滤波器矩阵(卷积核)是3*3,从图像角度看经过该高斯模糊处理后,第一图像上的小突起被抹平,图像边缘光滑。进一步地,对第二图像即高斯过滤后的图像进行锐化,例如进行二维拉普拉斯锐化,从图像角度看经过处理后,边缘被锐化,高斯模糊后的图像得以恢复。
在一些实施例中,简化预处理后的图像包括:基于背景和预处理后的图像,确定临界值;比较预处理后的图像上的像素点的像素值与临界值,对小于临界值的预处理后的图像上的像素点的像素值赋值为第一预设值,对不小于临界值的预处理后的图像上的像素点的像素值赋值为第二预设值,获得简化图像。如此,根据发明人大量测试数据总结的确定临界值的方式以及确定的临界值,据此将预处理后的图像简化,例如二值化,利于后续亮斑准确检测,利于后续碱基准确识别、获得高质量数据等。
具体地,在一些示例中,获得简化图像包括:将预处理后获得的锐化后的结果除以开运算结果,获得和图像像素点对应的一组数值;通过该组数值,确定二值化预处理后的图像的临界值。例如,可将该组数值按大小升序排列,取该组数值中第20、30或40百分位数对应的数值作为二值化临界值/阈值。如此,获得的二值化图像利于后续亮斑的准确检测识别。
在一个示例中,图像预处理时的开运算的结构元为p1*p2,所称的将预处理后的图像(锐化后的结果)除以开运算结果,获得一组和结构元一样大小的数组/矩阵p1*p2,在每个数组中,将该数组包含的p1*p2个数值按大小升序排列,取该数组中第三十百分位数对应的数值作为该区域(数值矩阵)的二值化临界值/阈值,如此,分别确定阈值对图像上的各个区域进行二值化,最终获得的二值化结果在去噪的同时更加突出所需信息,利于后续亮斑的准确检测。
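In some examples, the first bright spot detection threshold is determined by Otsu's method. Otsu's method (the OTSU algorithm), also called the maximum between-class variance method, segments an image by maximizing the between-class variance, which implies a small misclassification probability and high accuracy. Suppose the threshold separating the foreground and background of the preprocessed image is T (c1); the proportion of foreground pixels in the whole image is $\omega_0$ and their mean gray level is $\mu_0$; the proportion of background pixels in the whole image is $\omega_1$ and their mean gray level is $\mu_1$. Denote the overall mean gray level of the image to be processed by $\mu$ and the between-class variance by var. Then $\mu=\omega_0\mu_0+\omega_1\mu_1$ and $\mathrm{var}=\omega_0(\mu_0-\mu)^2+\omega_1(\mu_1-\mu)^2$. Substituting the former into the latter, with $\omega_0+\omega_1=1$, gives the equivalent formula $\mathrm{var}=\omega_0\omega_1(\mu_1-\mu_0)^2$. The threshold T that maximizes the between-class variance is found by traversal and is taken as the first bright spot detection threshold c1.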
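A minimal sketch of Otsu's traversal follows, assuming integer pixel values in the range [0, levels) and treating pixels with value not greater than the trial threshold as background. It uses the equivalent form of the between-class variance (the product of the two class proportions and the squared difference of the class means). The function name is hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_bg, sum_bg = 0, 0
    for t in range(levels):
        count_bg += hist[t]        # pixels with value <= t form the background
        sum_bg += t * hist[t]
        count_fg = total - count_bg
        if count_bg == 0:
            continue
        if count_fg == 0:
            break
        w1 = count_bg / total                    # background proportion
        w0 = count_fg / total                    # foreground proportion
        mu1 = sum_bg / count_bg                  # background mean gray level
        mu0 = (sum_all - sum_bg) / count_fg      # foreground mean gray level
        var = w0 * w1 * (mu1 - mu0) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

For 16-bit images, `levels=65536` would be the corresponding setting; the running sums keep the traversal linear in the number of gray levels.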
在一些实施例中,基于预处理后的图像和简化图像识别图像上的候选亮斑,包括判断同时满足i)-iii)三个条件的像素点矩阵为一个候选亮斑。如此,能有效地提高后续基于亮斑信息确定核酸序列的准确性和下机数据的质量。
具体地,在一个示例中,候选亮斑的判定需要满足的条件包括i),r1和r2可以相等也可以不相等。在一个示例中,成像系统相关参数为:物镜60倍,电子传感器的尺寸为6.5μm,经过显微镜成的像再经过电子传感器,能看到的最小尺寸为0.1μm,获得的图像或者输入的图像可为512*512、1024*1024或2048*2048的16位的灰度或彩色图像,r1和r2的取值范围均为大于1且小于10。在一个示例中,在一个预处理后的图像中,依据亮斑的预期大小设置r1=r2=3;在另一个示例中,设置r1=r2=5。
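In one example, the conditions for a candidate bright spot include ii): in the simplified image, the pixel value of the center pixel of the pixel matrix is the second preset value, and the connected pixels of the pixel matrix exceed (2/3)*r1*r2; that is, the pixel value of the center pixel is not less than the critical value and the connected pixels cover more than two-thirds of the matrix. Here, two or more adjacent pixels whose pixel values are all the second preset value are called connected pixels (pixel connectivity). For example, the simplified image is a binarized image, the first preset value is 0 and the second preset value is 1. As shown in Figure 5, the bold, enlarged entry marks the center of the pixel matrix and the thick frame marks the 3*3 pixel matrix, i.e. r1=r2=3; the pixel value of the center pixel is 1 and the number of connected pixels is 4, which is less than (2/3)*r1*r2=6, so this pixel matrix does not satisfy condition ii) and is not a candidate bright spot.
In one example, the conditions for a candidate bright spot include iii): in the preprocessed image, g2 is the corrected pixels of the m1*m2 range, i.e. the corrected sum of the pixels over the m1*m2 range. In one example, the correction is made according to the proportion of pixels in the corresponding m1*m2 range of the simplified image whose pixel value is the second preset value; for example, as shown in Figure 6, m1=m2=5 and that proportion is 13/25 (thirteen "1"s), so the corrected g2 is 13/25 of the original. This helps detect and identify bright spots more accurately and facilitates the subsequent analysis and reading of bright spot information.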
在一些示例中,所称的判定候选亮斑是否为亮斑还包括:基于预处理后的图像确定第二亮斑检测阈值,以及判定像素值不小于第二亮斑检测阈值的候选亮斑为亮斑。在具体示例中,以候选亮斑的坐标所在的位置的像素值作为该候选亮斑的像素值。通过利用基于预处理后的图像确定的第二亮斑检测阈值对候选亮斑的进一步筛选,能够排除掉至少一部分更可能是图像背景或者干扰但亮度(强度)和/或形状表现为“斑”的亮斑,利于后续基于亮斑的序列的准确识别,提高下机数据的质量。
在一个示例中,可利用重心法获取候选亮斑的坐标,包括亚像素级坐标。利用双线性插值法计算候选亮斑的坐标位置的像素值/灰度值。
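The center-of-gravity coordinate and the bilinear interpolation of the pixel value at a sub-pixel position can be sketched as below. The sketch assumes a small window around the candidate spot, given as a nested list together with the image coordinates of its top-left pixel, and clamps interpolation at the image border; the function names are hypothetical.

```python
import math


def centroid(window, x0=0, y0=0):
    """Intensity-weighted center of gravity; (x0, y0) is the window's top-left pixel."""
    total = sum(sum(row) for row in window)
    cx = sum(v * (x0 + i) for row in window for i, v in enumerate(row)) / total
    cy = sum(v * (y0 + j) for j, row in enumerate(window) for v in row) / total
    return cx, cy


def bilinear(image, x, y):
    """Pixel value at the sub-pixel position (x, y), with image indexed image[y][x]."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1 = min(x0 + 1, len(image[0]) - 1)
    y1 = min(y0 + 1, len(image) - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy
```

The centroid yields a sub-pixel coordinate, and the interpolation then supplies the intensity at exactly that coordinate rather than at the nearest whole pixel.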
在某些具体示例中,判定候选亮斑是否为亮斑包括:将预处理后的图像划分为预定大小的一组区域(block),对该区域中的像素点的像素值进行排序,以确定该区域对应的第二亮斑检测阈值;对于位于区域的候选亮斑,判定像素值不小于该区域对应的第二亮斑检测阈值的候选亮斑为亮斑。如此,区分图像的不同区域的差异比如光强的整体落差,分开进行亮斑的进一步检测识别,利于准确识别亮斑并且获得更多的亮斑。
所称的将预处理后的图像划分为预定大小的一组区域(block),block之间可以有重叠也可以没有重叠。在一个示例中,block之间没有重叠。在一些实施例中,预处理后的图像的大小不小于512*512,例如为512*512、1024*1024、1800*1800或者2056*2056等,所称预定大小的区域可以设为200*200。如此,利于快速计算判断识别亮斑。
在一些实施例中,确定该区域对应的第二亮斑检测阈值时,对每个block中的像素点的像素值按大小进行升序排列,取p10+(p10-p1)*4.1作为该block对应的第二亮斑检测阈值,即该block的背景,p1表示第百分之一分位的像素值,p10表示第百分之十分位的像素值。该阈值是发明人通过大量数据训练测试得出的较为稳定的阈值,能够适应多种图像采集时的光学环境包括不同的激光功率和/或各种亮点密度的图像的检测,通过该阈值对候选亮斑进行筛选能够消除掉大量非目标亮斑,利于后续快速分析和获得准确的结果。可以理解地,当由于系统设置包括光学系统的较大调整,图像整体像素分布发生明显改变时,此阈值可能需要适当调整。图7为进行该处理之前和之后的亮斑检测结果对比示意图,即排除掉区域背景前后的亮斑检测结果示意图,图7的上半部分为作该处理后的亮斑检测结果、下半部分为不作该处理的亮斑检测结果,十字标记的为候选亮斑或亮斑。
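The per-block threshold p10+(p10-p1)*4.1 and the resulting candidate filter can be sketched as follows. The text does not fix a percentile convention, so the nearest-rank definition used here is an assumption, as are the non-overlapping block layout, the use of the pixel value at integer candidate coordinates, and the function names.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile (assumed convention) of an ascending list; p is an integer."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def block_threshold(block, factor=4.1):
    """Second bright spot detection threshold of one block: p10 + (p10 - p1) * factor."""
    values = sorted(v for row in block for v in row)
    p1, p10 = percentile(values, 1), percentile(values, 10)
    return p10 + (p10 - p1) * factor


def filter_candidates(image, candidates, block_size=200):
    """Keep candidates whose pixel value is not less than their block's threshold."""
    thresholds = {}
    spots = []
    for x, y in candidates:
        bx, by = x // block_size, y // block_size
        if (bx, by) not in thresholds:
            rows = image[by * block_size:(by + 1) * block_size]
            block = [row[bx * block_size:(bx + 1) * block_size] for row in rows]
            thresholds[(bx, by)] = block_threshold(block)
        if image[y][x] >= thresholds[(bx, by)]:
            spots.append((x, y))
    return spots
```

Computing the threshold per block lets a dim corner of the field of view keep its own, lower cut-off instead of inheriting one set by the brighter center.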
在某些具体实施方式中,在S2中,基于相同的坐标体系,将对应测序模板的亮斑集合中的亮斑的坐标对应到待检图像上,比如将对应测序模板的亮斑集合和待检图像均置于第一轮测序反应获得的图像的坐标系中,将对应测序模板的亮斑集合中的各个亮斑的坐标标记到待检图像上,各亮斑的坐标可以通过重心法等确定,如此,可快速且准确确定待检图像上的相应位置的坐标。
对图像上的某个位置的强度的确定方式不作限定,例如可以利用双线性插值法、二次函数插值法和二次样条插值法等计算该位置的亚像素值/灰度值作为该位置的强度。在一些具体实施方式中,S4中的待检图像的相应坐标位置的强度可以为绝对强度,例如为该位置的像素值,也可以为相对强度,例如为基于该位置的像素值的相关关系值,例如为对待检图像进行降噪、去背景和/或利用相邻像素点的像素做差之后的相关值。
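In some embodiments, the preset threshold corresponds to the image to be detected: one preset threshold corresponds to one or more images to be detected, that is, when this base calling method is applied to one or more images to be detected, they can share one preset threshold. In this case, the intensity at the corresponding coordinate position of the image to be detected is a relative intensity, for example one determined from the absolute intensity at that position and the background intensity of the region containing that position. The region containing the position is a region of x1*y1 pixels that includes the position and, preferably, need not be strictly centered on it; x1 and y1 are natural numbers, and x1*y1 is not less than 100, preferably not less than 1000. This facilitates fast and accurate base calling.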
在一些示例中,通过以下确定所称的相应坐标位置所在区域的背景强度:依据像素值对所称的相应坐标位置所在的x1*y1区域的像素点进行排序,以获得该x1*y1区域中的像素点的数目的分布曲线;基于分布曲线确定所称的相应坐标位置所在区域的背景强度。所称的排序可以是升序排序也可以是降序排序,以下以升序排序结果作为示例,本领域技术人员通过该示例能 够获得通过降序排序的曲线以及根据该曲线确定得相关参数或条件,同样也可以计算出所称的区域的强度。
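In some examples, the image to be detected is 1800*1800 in size and x1=y1=300. Figures 8 and 9 show distribution curves (histograms) of a 300*300 region, i.e. 90,000 pixels, sorted in ascending order of pixel value; the abscissa is the pixel value and the ordinate is the number of pixels. Figures 8 and 9 are taken from 300*300 regions (shown by boxes) at the edge and at the center of the field of view of the same image, respectively. It can be seen that the number of background pixels is symmetrically distributed over pixel value and follows, or approximately follows, a normal distribution, as shown in Figure 8; the superposition of background and bright spots broadens the distribution, shifts the peak relatively to the left and flattens the decline on the right side, giving the curve a right-hand tail, as shown in Figure 9. In addition, other interfering bright spots in the region, such as spots of abnormal intensity or a markedly uneven spot density, also make the curve asymmetric, including abnormal bumps or dips at or near the peak or the right-side trough that do not follow the original trend.
In one example, the most frequent pixel value of the distribution curve, i.e. its peak, is taken as the background intensity of the region containing the corresponding coordinate position. Representing the background intensity of the region by this relatively stable most frequent pixel value facilitates fast, simple and accurate base calling afterwards. This embodiment does not limit how the background intensity of the region is estimated or determined; for example, the opening operation of OpenCV may be used.
In other embodiments, based on induction, testing and validation over a large amount of image data, the inventors derived a formula for the peak pixel value of this type of distribution curve, i.e. the background intensity $I_{block}$ of the region containing the corresponding coordinate position: $I_{block}=I_{j1}+(I_{j2}-I_{j3})\times t1$, where $I_{j1}$, $I_{j2}$ and $I_{j3}$ are the pixel values at the j1-th, j2-th and j3-th percentiles of the distribution curve, j1, j2 and j3 are integers less than 50 and not less than 1, j2>8+j3, and t1 is a first correction coefficient whose value is determined by j1, j2 and j3. The peak pixel value estimated by this formula is fairly reliable and suits images produced by various sequencing platforms, in particular single-molecule sequencing platforms.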
在一些示例中,较佳地,j1选自[1,40],j2选自[6,40],j3选自[1,30],40<j1+(j2-j3)×t1<50;如此,能较准确地估得利于准确碱基识别的波峰像素值。
进一步地,所称的相应坐标位置所在区域的背景强度通过上述公式确定,而待检图像的相应坐标位置的强度为该相应坐标位置的绝对强度和该相应坐标位置所在区域的背景强度的比值,预设阈值选自[0.85,0.95],以进行S6比较待检图像的相应坐标位置的强度与预设阈值的大小,进而基于待检图像上强度大于预设阈值的位置的信息进行碱基识别,如此,能够较少丢失有效信息且准确地识别出碱基。在一个具体示例中,j1=j2=10,j3=1,t1=4.1,预设阈值为0.9,能较好的适用于基于单分子测序平台产生的图像的碱基识别。
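With the example parameters j1=j2=10, j3=1, t1=4.1 and a preset threshold of 0.9, the relative-intensity test can be sketched as follows. The nearest-rank percentile convention and the function names are assumptions, and the region is taken to be a nested list of pixel values around the position.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile (assumed convention) of an ascending list; p is an integer."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def background_intensity(region, j1=10, j2=10, j3=1, t1=4.1):
    """I_block = I_j1 + (I_j2 - I_j3) * t1 for the pixels of one region."""
    values = sorted(v for row in region for v in row)
    return percentile(values, j1) + (percentile(values, j2) - percentile(values, j3)) * t1


def is_signal(absolute_intensity, region, threshold=0.9):
    """Relative intensity (absolute / background) compared with the preset threshold."""
    return absolute_intensity / background_intensity(region) > threshold
```

Dividing by the regional background is what allows one threshold to be shared across a whole image, or several images, despite uneven illumination.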
在另一些具体实施方式中,预设阈值和待检图像的相应坐标位置是对应的,即预设阈值随着位置的不同一般会随着变化,一个预设阈值对应一个所称的待检图像的相应坐标位置。在该些情形下,所称的待检图像的相应坐标位置的强度为绝对强度,例如为该位置的像素值。
各位置对应的预设阈值可以在进行S6时确定,也可以预先确定保存。在一些示例中,预设阈值的确定与所称的该相应坐标位置所在区域的背景强度有关,所称的该相应坐标位置所在区域为待检图像上包含该位置的区域,为不严格地以该位置为中心的包含x2*y2像素点的区域,x2和y2均为自然数,x2*y2不小于100,较佳地,x2*y2不小于1000。如此,利于快速有效地确定预设阈值,利于准确地识别出碱基。
具体地,在一些示例中,确定预设阈值包括:依据像素值对所称的相应坐标位置所在的x2*y2区域的像素点进行排序,以获得该x2*y2区域的像素点的数目的分布曲线;以及基于所称的分布曲线确定预设阈值。所称的排序可以是升序排序也可以是降序排序,以下以升序排序结果作为示例,本领域技术人员通过该示例能够获得通过降序排序的曲线以及根据该曲线确定得相关参数或条件,同样也可以计算确定得预设阈值。
参见图8和图9,以分布曲线的右侧波谷像素值作为预设阈值,以期望确定得的预设阈值能使得识别出的碱基的准确性的可信度达90%、95%或者99%以上,有利于后续准确地的识别碱基。
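The method for determining the trough pixel value of the curve is not limited. In other embodiments, after induction, testing and validation over a large amount of image data, the inventors derived a formula for the right-side trough pixel value of this type of distribution curve, i.e. the preset threshold Threshold: $Threshold=I_{j4}+(I_{j5}-I_{j6})\times t2$, where $I_{j4}$, $I_{j5}$ and $I_{j6}$ are the pixel values at the j4-th, j5-th and j6-th percentiles, j4, j5 and j6 are integers less than 50 and not less than 1, j5>8+j6, and t2 is a second correction coefficient whose value is determined by j4, j5 and j6. The right-side trough pixel value estimated by this formula is fairly reliable and suits images produced by various sequencing platforms, including images with uniformly or non-uniformly distributed bright spots, and in particular images produced by single-molecule sequencing platforms.
Preferably, j4 is selected from [1,40], j5 from [6,40], j6 from [1,30], and 85<j4+(j5-j6)×t2<100; this helps obtain a more accurate estimate of the right-side trough pixel value and thus accurate base calling. In a specific example, j4=j5=10, j6=1 and t2=3.7; base calling based on the positions determined to exceed this preset threshold is well suited to images produced by a single-molecule sequencing platform.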
在一些示例中,除了与预设阈值进行比较,还可进一步地结合待检图像的相应位置(的亮斑)的形态等对该位置进行筛选判定,以使确定出的位置的信息能够客观反映出该位置是否真实地发生碱基延伸反应,以利于准确地碱基识别。
上述在流程图中表示或在此以其他方式描述的逻辑和/或步骤,例如,可以被认为是用于实现逻辑功能的可执行指令的序列表,可以具体实现在任何计算机可读存储介质中,以供指令执行系统、装置或设备(如基于计算机的系统、包括处理器的系统或其他可以从指令执行系统、装置或设备取指令并执行指令的系统)使用,或结合这些指令执行系统、装置或设备而使用。
在一个实施方式中,提供一种计算机可读存储介质,用于存储供计算机执行的程序,执行所称的程序包括完成上述任一实施方式中的碱基识别方法。计算机可读存储介质包括但不限于只读存储器、随机存储器、磁盘或光盘等。计算机可读存储介质可以是任何可以包含、存储、通信、传播或传输程序以供指令执行系统、装置或设备或结合这些指令执行系统、装置或设备而使用的装置。计算机可读存储介质的更具体的示例(非穷尽性列表)包括以下:具有一个或多个布线的电连接部(电子装置),便携式计算机盘盒(磁装置),随机存取存储器(RAM),只读存储器(ROM),可擦除可编辑只读存储器(EPROM或闪速存储器),光纤装置,以及便携式光盘只读存储器(CDROM)。另外,计算机可读存储介质甚至可以是可在其上打印程序的纸或其他合适的介质,因为可以例如通过对纸或其他介质进行光学扫描,接着进行编辑、解译或必要时以其他合适方式进行处理来以电子方式获得程序,然后将其存储在计算机存储器中。上述对任一实施方式中的碱基识别方法的技术特征和优点的描述,同样适用于该计算机可读存储介质,在此不再赘述。
在一个实施方式中,参见图10,还提供一种碱基识别系统100,用以实施上述本发明任一实施方式中的碱基识别方法,该系统包括:映射模块10,用于将对应测序模板的亮斑集合中的亮斑的坐标对应到待检图像上,以确定待检图像的相应坐标位置;强度确定模块20,用于计算来自映射模块10的待检图像的相应坐标位置的强度;以及识别模块30,用于比较来自强度确定模块20的待检图像的相应坐标位置的强度与预设阈值的大小,以及基于待检图像上强度大于预设阈值的位置的信息进行该碱基识别;所称的对应测序模板的亮斑集合基于多个图像构建获得,所称的图像和待检图像均采集自碱基延伸反应,图像和待检图像对应相同的视野,碱基延伸反应时的该视野中存在多个带有光学可检测标记的核酸分子,至少一部分核酸分子在图像和/或待检图像上表现为亮斑。
上述对任一实施方式中的碱基识别方法的技术特征和优点的描述,同样适用本发明这一实施方式中的碱基识别系统,在此不再赘述。例如,在一些示例中,还包括模板构建模块12,与映射模块10连接,用于基于多个图像构建对应测序模板的亮斑集合,所称的图像包括分别对应A、T/U、G和C四种类型碱基延伸反应时的一个相同视野的第一图像、第二图像、第三图像和第四图像,第一图像包括图像M1和图像M2,第二图像包括图像N1和图像N2,第三图像包括图像P1和图像P2,第四图像包括图像Q1和图像Q2,定义顺序和/或同时实现一次四种类型碱基延伸反应为一轮测序反应,图像M1和图像M2分别来自两轮测序反应,图像N1和图像N2分别来自两轮测序反应,图像P1和图像P2分别来自两轮测序反应,图像Q1和图像Q2分别来自两轮测序反应,在模板构建模块12中进行以下:合并第一图像、第二图像、第三图像和第四图像上的亮斑,记录相同位置上的亮斑的数目,去除数目为1的位置上的亮斑,以获得所称的对应测序模板的亮斑集合。
具体地,在模板构建模块12中,合并第一图像、第二图像、第三图像和第四图像上的亮斑,包括:(a)合并图像N1上的亮斑至图像M1中,获得一次合并图像M1,标记一次合并图像M1中的重合亮斑为A,标记非重合亮斑为B,在一次合并图像M1中的距离小于第一预定像素的多个亮斑为一个所述重合亮斑;(b)以图像P1、图像Q1、图像M2、图像N2、图像P2或图像Q2替代图像N1,以一次合并图像M1替代图像M1,多次进行(a)直至完成所有图像上的亮斑的合并,获得原始亮斑集合;以及(c)去除原始亮斑集合中的标记为B的亮斑,以获得对应测序模板的亮斑集合。
在一些示例中,所称的图像为经过配准的图像。该系统100还包括配准模块14,与模板构建模块12相连,用于实施以下以实现图像的配准:基于参考图像对待配准图像进行第一配准,参考图像和待配准图像对应相同视野,包括:确定待配准图像上的预定区域和参考图像上的相应预定区域的第一偏移量,基于第一偏移量移动待配准图像上的所有亮斑,获得第一配准后的 待配准图像;基于参考图像对第一配准后的待配准图像进行第二配准,包括:合并第一配准后的待配准图像和参考图像,获得合并图像,计算合并图像上的预定区域的所有第二重合亮斑的偏移量,以确定第二偏移量,在合并图像上的距离小于第二预定像素的多个亮斑为一个第二重合亮斑,基于该第二偏移量移动第一配准后的待配准图像上的所有亮斑,以实现对待配准图像的配准。
在一些示例中,配准模块14包括参考图像构建单元142,用于进行以下以实现参考图像的构建:获取第五图像和第六图像,第五图像和第六图像与待配准图像对应相同视野;基于第五图像对第六图像进行粗配准,包括确定第六图像相对于第五图像的偏移量,基于该偏移量移动所述第六图像,获得粗配准后的第六图像;合并第五图像和粗配准后的第六图像,以获得所称的参考图像。
进一步地,在参考图像构建单元142中,构建参考图像还包括利用第七图像和第八图像,第五图像、第六图像、第七图像和第八图像对应相同视野,第五图像、第六图像、第七图像和第八图像分别对应A、T/U、G和C四种类型碱基延伸反应时的视野,构建参考图像还包括:基于第五图像对第七图像进行粗配准,包括确定第七图像相对于第五图像的偏移量,基于该偏移量移动第七图像,获得粗配准后的第七图像;基于第五图像对第八图像进行粗配准,包括确定第八图像相对于第五图像的偏移量,基于该偏移量移动第八图像,获得粗配准后的第八图像;合并第五图像和粗配准后的第六图像、粗配准后的第七图像以及粗配准后的第八图像,以获得所称的参考图像。
在一些示例中,所称的参考图像和待配准图像均为二值化图像。
在一些示例中,利用二维离散傅里叶变换确定所称的第一偏移量、第六图像相对于第五图像的偏移量、第七图像相对于第五图像的偏移量和/或第八图像相对于第五图像的偏移量。
在一些示例中,该系统100还包括亮斑检测模块16,与映射模块10、模板构建模块12和/或配准模块14相连,用于进行以下以实现图像上的亮斑的检测:预处理图像,获得预处理后的图像;确定临界值以简化预处理后的图像,包括对小于临界值的预处理后的图像上的像素点的像素值赋值为第一预设值,对不小于临界值的预处理后的图像上的像素点的像素值赋值为第二预设值,以获得二值化图像;基于预处理后的图像确定第一亮斑检测阈值c1;基于预处理后的图像和二值化图像进行所述图像上的亮斑的识别,包括判定满足以下i)-iii)中至少两个条件的像素点矩阵为一个候选亮斑,i)在所述预处理后的图像中,像素点矩阵的中心像素点的像素值为最大,像素点矩阵可表示为k1*k2,k1和k2均为大于1的奇数,k1*k2像素点矩阵包含k1*k2个像素点,ii)在所述二值化图像中,像素点矩阵的中心像素点的像素值为第二预设值并且像素点矩阵的连通像素大于(2/3)*k1*k2,以及iii)在所述预处理后的图像中的像素点矩阵的中心像素点的像素值大于第三预设值,并且满足g1*g2>c1,g1为以像素点矩阵的中心像素点为中心的m1*m2范围的二维高斯分布的相关系数,g2为该m1*m2范围的像素,m1和m2均为大于1的奇数,m1*m2范围包含m1*m2个像素点。
进一步地,亮斑检测模块16中,还包括进行以下以判定候选亮斑是否为亮斑:基于预处理后的图像确定第二亮斑检测阈值,以及比较候选亮斑的像素值和第二亮斑检测阈值的大小,判定像素值不小于第二亮斑检测阈值的候选亮斑为亮斑,以该候选亮斑的坐标所在的位置的像素值作为该候选亮斑的像素值。
具体地,在一些示例中,在亮斑检测模块16中,判定候选亮斑是否为亮斑包括:将预处理后的图像划分为预定大小的一组区域,对该区域中的像素点的像素值进行排序,以确定该区域对应的第二亮斑检测阈值,比较该区域的候选亮斑的像素值和第二亮斑检测阈值的大小,判定像素值不小于该区域对应的第二亮斑检测阈值的候选亮斑为亮斑。
在一些示例中,所称的预处理图像,包括:利用开运算确定图像的背景,基于背景,利用顶帽运算转化图像,对转化后的图像进行高斯模糊处理,对高斯模糊处理后的图像进行锐化,获得所称的预处理后的图像。
在一些示例中,确定临界值以简化预处理后的图像,获得二值化图像,包括:基于背景和预处理后的图像,确定所述临界值,比较预处理后的图像上的像素点的像素值与该临界值,以获得二值化图像。
在一些示例中,g2为修正后的m1*m2范围的像素,依据二值化图像相应m1*m2范围中像素值为第二预设值的像素点所占的比例进行修正以获得修正后的m1*m2范围的像素。
在一些示例中,一个所称的预设阈值对应一个或多个待检图像。在该种情况下,待检图像的相应坐标位置的强度为相对强度,例如通过该相应坐标位置的绝对强度和该相应坐标位置所在区域的背景强度来确定。
在一些示例中,强度确定模块20中,确定相应坐标位置所在区域的背景强度,包括:依据像素值对相应坐标位置所在的x1*y1区域的像素点进行排序,以获得该x1*y1区域的像素点的数目的分布曲线,x1和y1均为自然数,x1*y1不小于100;以及基于所称的分布曲线确定该相应坐标位置所在区域的背景强度。
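Specifically, the sorting is in ascending order, the peak pixel value of the distribution curve is taken as the background intensity of the region containing the corresponding coordinate position, and the peak pixel value of the distribution curve is estimated by the formula $I_{block}=I_{j1}+(I_{j2}-I_{j3})\times t1$, where $I_{j1}$, $I_{j2}$ and $I_{j3}$ are the pixel values corresponding to the j1-th, j2-th and j3-th percentiles, j1, j2 and j3 are integers less than 50 and not less than 1, j2>8+j3, and t1 is a first correction coefficient whose value is determined by j1, j2 and j3.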
进一步地,j1选自[1,40],j2选自[6,40],j3选自[1,30],40<j1+(j2-j3)×t1<50。相应地,预设阈值选自[0.85,0.95]中的任意数值。
在另一些示例中,一个所称的预设阈值对应一个待检图像的相应坐标位置。在该种情形下,待检图像的相应坐标位置的强度为绝对强度。
在一些示例中,系统100还包括阈值确定模块40,与识别模块30相连,用以进行以下以确定预设阈值:依据像素值对所称的相应坐标位置所在的x2*y2区域的像素点进行排序,以获得该x2*y2区域的像素点的数目的分布曲线,x2和y2均为自然数,x2*y2不小于100;基于该分布曲线确定预设阈值。
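Specifically, the sorting is in ascending order, the right-side trough pixel value of the distribution curve is taken as the preset threshold, and the right-side trough pixel value is estimated by the formula $Threshold=I_{j4}+(I_{j5}-I_{j6})\times t2$, where $I_{j4}$, $I_{j5}$ and $I_{j6}$ are the pixel values corresponding to the j4-th, j5-th and j6-th percentiles, j4, j5 and j6 are integers less than 50 and not less than 1, j5>8+j6, and t2 is a second correction coefficient whose value is determined by j4, j5 and j6.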
进一步地,j4选自[1,40],j5选自[6,40],j6选自[1,30],85<j4+(j5-j6)×t2<100。
在另一个实施方式中,提供一种碱基识别系统,该系统包括:存储器,用于存储数据,包括计算机可执行程序;以及处理器,用于执行所称的计算机可执行程序,以实施上述本发明任一实施方式中的碱基识别方法。该系统用于实施上述任一具体实施方式中的碱基识别方法,上述对任一实施方式中的碱基识别方法的技术特征和优点的描述,同样适用于该碱基识别系统,在此不再赘述。
在一个实施方式中,提供一种测序系统,该测序系统包括上述任一实施方式中的碱基识别系统。
在一个实施方式中,提供一种计算机程序产品,包括指令,该指令在计算机执行所称的程序时,使该计算机执行上述本发明任一实施方式中的碱基识别方法。上述对任一实施方式中的碱基识别方法的技术特征和优点的描述,同样适用于该计算机程序产品,在此不再赘述。
在一个实施方式中,提供一种测序系统,包括上述本发明任一实施方式的计算机程序产品。上述对任一实施方式中的碱基识别方法和/或计算机程序产品的技术特征和优点的描述,同样适用于该测序系统,在此不再赘述。
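Those skilled in the art know that, besides implementing the controller/processor purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller/processor can therefore be regarded as a hardware component, and the means included in it for implementing the various functions can also be regarded as structures within the hardware component. Alternatively, the means for implementing the various functions can be regarded both as software modules implementing the method and as structures within the hardware component.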
在本说明书的描述中,一个实施方式、一些实施方式、一个或一些具体实施方式、一个或一些实施例、示例等的描述意指结合该实施方式或示例描述的具体特征、结构或者特点包含于本发明的至少一个实施例或示例中。
在本说明书中,对上述术语的示意性表述不一定指的是相同的实施例或示例。而且,描述的具体特征、结构等特点可以在任何的一个或多个实施例或示例中以合适的方式结合。尽管已经示出和描述了本发明的实施例,本领域的普通技术人员可以理解:在不脱离本发明的原理和宗旨的情况下可以对这些实施例进行多种变化、修改、替换和变型,本发明的范围由权利要求及其等同限定。

Claims (56)

  1. 一种碱基识别方法,其特征在于,包括:
    将对应测序模板的亮斑集合中的亮斑的坐标对应到待检图像上,以确定所述待检图像的相应坐标位置;
    确定所述待检图像的相应坐标位置的强度;
    比较所述待检图像的相应坐标位置的强度与预设阈值的大小,基于所述待检图像上强度大于预设阈值的位置的信息进行所述碱基识别;
    所述对应测序模板的亮斑集合基于多个图像构建获得,所述图像和所述待检图像均采集自碱基延伸反应,所述图像和所述待检图像对应相同的视野,碱基延伸反应时的该视野中存在多个带有光学可检测标记的核酸分子,至少一部分所述核酸分子在所述图像和/或所述待检图像上表现为亮斑。
  2. 权利要求1的方法,其特征在于,一个所述预设阈值对应一个或多个所述待检图像。
  3. 权利要求2的方法,其特征在于,所述待检图像的相应坐标位置的强度为相对强度,依据该相应坐标位置的绝对强度和该相应坐标位置所在区域的背景强度确定所述相对强度。
  4. 权利要求3的方法,其特征在于,确定所述相应坐标位置所在区域的背景强度,包括:
    依据像素值对所述相应坐标位置所在的x1*y1区域的像素点进行排序,以获得该x1*y1区域的像素点的数目的分布曲线,x1和y1均为自然数,x1*y1不小于100;
    基于所述分布曲线确定所述相应坐标位置所在区域的背景强度。
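  5. The method of claim 4, wherein the sorting is in ascending order, the peak pixel value of the distribution curve is taken as the background intensity of the region containing the corresponding coordinate position, and the peak pixel value of the distribution curve is estimated by the formula $I_{block}=I_{j1}+(I_{j2}-I_{j3})\times t1$, wherein
    $I_{j1}$, $I_{j2}$ and $I_{j3}$ are the pixel values corresponding to the j1-th, j2-th and j3-th percentiles, j1, j2 and j3 are integers less than 50 and not less than 1, j2>8+j3, and t1 is a first correction coefficient whose value is determined by j1, j2 and j3.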
  6. 权利要求5的方法,其特征在于,j1选自[1,40],j2选自[6,40],j3选自[1,30],40<j1+(j2-j3)×t1<50。
  7. 权利要求6的方法,其特征在于,所述预设阈值选自[0.85,0.95]。
  8. 权利要求1的方法,其特征在于,一个所述预设阈值对应一个所述待检图像的相应坐标位置。
  9. 权利要求8的方法,其特征在于,所述待检图像的相应坐标位置的强度为绝对强度。
  10. 权利要求9的方法,其特征在于,确定所述预设阈值,包括:
    依据像素值对所述相应坐标位置所在的x2*y2区域的像素点进行排序,以获得该x2*y2区域的像素点的数目的分布曲线,x2和y2均为自然数,x2*y2不小于100;
    基于所述分布曲线确定所述预设阈值。
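  11. The method of claim 10, wherein the sorting is in ascending order, the right-side trough pixel value of the distribution curve is taken as the preset threshold, and the right-side trough pixel value is estimated by the formula $Threshold=I_{j4}+(I_{j5}-I_{j6})\times t2$, wherein
    $I_{j4}$, $I_{j5}$ and $I_{j6}$ are the pixel values corresponding to the j4-th, j5-th and j6-th percentiles, j4, j5 and j6 are integers less than 50 and not less than 1, j5>8+j6, and t2 is a second correction coefficient whose value is determined by j4, j5 and j6.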
  12. 权利要求11的方法,其特征在于,j4选自[1,40],j5选自[6,40],j6选自[1,30],85<j4+(j5-j6)×t2<100。
  13. 权利要求1-12任一方法,其特征在于,所述图像包括分别对应A、T/U、G和C四种类型碱基延伸反应时的一个相同视野的第一图像、第二图像、第三图像和第四图像,所述第一图像包括图像M1和图像M2,所述第二图像包括图像N1和图像N2,所述第三图像包括图像P1和图像P2,所述第四图像包括图像Q1和图像Q2,定义顺序和/或同时实现一次四种类型碱基延伸反应为一轮测序反应,
    图像M1和图像M2分别来自两轮测序反应,图像N1和图像N2分别来自两轮测序反应,图像P1和图像P2分别来自两轮测序反应,图像Q1和图像Q2分别来自两轮测序反应,
    所述基于多个图像构建所述对应测序模板的亮斑集合,包括:
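    merging the bright spots on the first image, second image, third image and fourth image, recording the number of bright spots at the same position, and removing the bright spots at positions where the number is 1, to obtain the bright spot set corresponding to the sequencing template.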
  14. 权利要求13的方法,其特征在于,合并所述第一图像、第二图像、第三图像和第四图像上的亮斑,包括:
    (a)合并图像N1上的亮斑至图像M1中,获得一次合并图像M1,标记一次合并图像M1中的重合亮斑为A,标记非重合亮斑为B,在所述一次合并图像M1中的距离小于第一预定像素的多个亮斑为一个所述重合亮斑;
    (b)以图像P1、图像Q1、图像M2、图像N2、图像P2或图像Q2替代所述图像N1,以一次合并图像M1替代所述图像M1,多次进行(a)直至完成所有图像上的亮斑的合并,获得原始亮斑集合;
    (c)去除所述原始亮斑集合中的标记为B的亮斑,以获得所述对应测序模板的亮斑集合。
  15. 权利要求13的方法,其特征在于,所述图像为经过配准的图像。
  16. 权利要求15的方法,其特征在于,配准所述图像,包括:
    基于参考图像对待配准图像进行第一配准,所述参考图像和所述待配准图像对应相同视野,包括,
    确定所述待配准图像上的预定区域和所述参考图像上的相应预定区域的第一偏移量,基于所述第一偏移量移动所述待配准图像上的所有亮斑,获得第一配准后的待配准图像;
    基于所述参考图像对第一配准后的待配准图像进行第二配准,包括,
    合并所述第一配准后的待配准图像和所述参考图像,获得合并图像,
    计算所述合并图像上的预定区域的所有第二重合亮斑的偏移量,以确定第二偏移量,在所述合并图像上的距离小于第二预定像素的多个亮斑为一个所述第二重合亮斑,
    基于该第二偏移量移动所述第一配准后的待配准图像上的所有亮斑,以实现对所述待配准图像的配准。
  17. 权利要求16的方法,其特征在于,所述参考图像通过构建获得,构建所述参考图像包括:
    获取第五图像和第六图像,所述第五图像和所述第六图像与所述待配准图像对应相同视野;
    基于第五图像对第六图像进行粗配准,包括确定所述第六图像相对于所述第五图像的偏移量,基于该偏移量移动所述第六图像,获得粗配准后的第六图像;
    合并所述第五图像和粗配准后的第六图像,以获得所述参考图像。
  18. 权利要求17的方法,其特征在于,构建所述参考图像还包括利用第七图像和第八图像,所述第五图像、第六图像、第七图像和第八图像对应相同视野,所述第五图像、第六图像、第七图像和第八图像分别对应A、T/U、G和C四种类型碱基延伸反应时的所述视野,构建所述参考图像还包括:
    基于第五图像对第七图像进行粗配准,包括确定所述第七图像相对于所述第五图像的偏移量,基于该偏移量移动所述第七图像,获得粗配准后的第七图像;
    基于第五图像对第八图像进行粗配准,包括确定所述第八图像相对于所述第五图像的偏移量,基于该偏移量移动所述第八图像,获得粗配准后的第八图像;
    合并所述第五图像和粗配准后的第六图像、粗配准后的第七图像以及粗配准后的第八图像,以获得所述参考图像。
  19. 权利要求16-18任一方法,其特征在于,所述参考图像和所述待配准图像为二值化图像。
  20. 权利要求16-19任一方法,其特征在于,利用二维离散傅里叶变换确定所述第一偏移量、所述第六图像相对于所述第五图像的偏移量、所述第七图像相对于所述第五图像的偏移量和/或所述第八图像相对于所述第五图像的偏移量。
  21. 权利要求13-20任一方法,其特征在于,检测所述图像上的亮斑,包括:
    预处理图像,获得预处理后的图像;
    确定临界值以简化预处理后的图像,包括对小于临界值的预处理后的图像上的像素点的像素值赋值为第一预设值,对不小于临界值的预处理后的图像上的像素点的像素值赋值为第二预设值,以获得二值化图像;
    基于预处理后的图像确定第一亮斑检测阈值c1;
    基于预处理后的图像和二值化图像进行所述图像上的亮斑的识别,包括判定满足以下i)-iii)中至少两个条件的像素点矩阵为一个候选亮斑,
    i)在所述预处理后的图像中,像素点矩阵的中心像素点的像素值为最大,像素点矩阵可表示为r1*r2,r1和r2均为大于1的奇数,r1*r2像素点矩阵包含r1*r2个像素点,
    ii)在所述二值化图像中,像素点矩阵的中心像素点的像素值为第二预设值并且像素点矩阵的连通像素大于(2/3)*r1*r2,以及
    iii)在所述预处理后的图像中的像素点矩阵的中心像素点的像素值大于第三预设值,并且满足g1*g2>c1,g1为以像素点矩阵的中心像素点为中心的m1*m2范围的二维高斯分布的相关系数,g2为该m1*m2范围的像素,m1和m2均为大于1的奇数,m1*m2范围包含m1*m2个像素点。
  22. 权利要求21的方法,其特征在于,还包括判定候选亮斑是否为亮斑,包括:
    基于所述预处理后的图像确定第二亮斑检测阈值,以及
    比较所述候选亮斑的像素值和所述第二亮斑检测阈值的大小,判定像素值不小于所述第二亮斑检测阈值的候选亮斑为亮斑,以该候选亮斑的坐标所在的像素点的像素值作为所述候选亮斑的像素值。
  23. 权利要求22的方法,其特征在于,判定候选亮斑是否为亮斑包括:
    将所述预处理后的图像划分为预定大小的一组区域,
    对该区域中的像素点的像素值进行排序,以确定该区域对应的第二亮斑检测阈值,
    比较该区域的候选亮斑的像素值和所述第二亮斑检测阈值的大小,判定像素值不小于该区域对应的第二亮斑检测阈值的候选亮斑为亮斑。
  24. 权利要求21-23任一方法,其特征在于,所述预处理图像,包括:
    利用开运算确定所述图像的背景,
    基于背景,利用顶帽运算转化所述图像,
    对转化后的图像进行高斯模糊处理,
    对高斯模糊处理后的图像进行锐化,获得所述预处理后的图像。
  25. 权利要求21-24任一方法,其特征在于,确定临界值以简化预处理后的图像,获得二值化图像,包括:
    基于背景和所述预处理后的图像,确定所述临界值,
    比较所述预处理后的图像上的像素点的像素值与所述临界值,以获得所述二值化图像。
  26. 权利要求21-25任一方法,其特征在于,g2为修正后的m1*m2范围的像素,依据二值化图像相应m1*m2范围中像素值为第二预设值的像素点所占的比例进行修正以获得所述修正后的m1*m2范围的像素。
  27. 一种碱基识别系统,其特征在于,包括:
    存储器,用于存储数据,包括计算机可执行程序;
    处理器,用于执行所述计算机可执行程序,以实施权利要求1-26任一方法。
  28. 一种测序系统,其特征在于,包括权利要求27的碱基识别系统。
  29. 一种计算机程序产品,其特征在于,包括指令,当计算机执行所述程序时,权利要求1-26任一方法中的步骤得以实施。
  30. 一种测序系统,其特征在于,包括权利要求29的计算机程序产品。
  31. 一种碱基识别系统,其特征在于,包括:
    映射模块,用于将对应测序模板的亮斑集合中的亮斑的坐标对应到待检图像上,以确定所述待检图像的相应坐标位置;
    强度确定模块,用于计算所述待检图像的相应坐标位置的强度;以及
    识别模块,用于比较来自所述强度确定模块的待检图像的相应坐标位置的强度与预设阈值的大小,以及基于所述待检图像上强度大于所述预设阈值的位置的信息进行所述碱基识别;
    所述对应测序模板的亮斑集合基于多个图像构建获得,所述图像和所述待检图像均采集自碱基延伸反应,所述图像和所述待检图像对应相同的视野,碱基延伸反应时的该视野中存在多个带有光学可检测标记的核酸分子,至少一部分所述核酸分子在所述图像和/或所述待检图像上表现为亮斑。
  32. 权利要求31的系统,其特征在于,一个所述预设阈值对应一个或多个所述待检图像。
  33. 权利要求32的系统,其特征在于,所述待检图像的相应坐标位置的强度为相对强度,依据该相应坐标位置的绝对强度和该相应坐标位置所在区域的背景强度确定所述相对强度。
  34. 权利要求33的系统,其特征在于,所述强度确定模块中,确定所述相应坐标位置所在区域的背景强度,包括:
    依据像素值对所述相应坐标位置所在的x1*y1区域的像素点进行排序,以获得该x1*y1区域的像素点的数目的分布曲线,x1和y1均为自然数,x1*y1不小于100;
    基于所述分布曲线确定所述相应坐标位置所在区域的背景强度。
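  35. The system of claim 34, wherein the sorting is in ascending order, the peak pixel value of the distribution curve is taken as the background intensity of the region containing the corresponding coordinate position, and the peak pixel value of the distribution curve is estimated by the formula $I_{block}=I_{j1}+(I_{j2}-I_{j3})\times t1$, wherein
    $I_{j1}$, $I_{j2}$ and $I_{j3}$ are the pixel values corresponding to the j1-th, j2-th and j3-th percentiles, j1, j2 and j3 are integers less than 50 and not less than 1, j2>8+j3, and t1 is a first correction coefficient whose value is determined by j1, j2 and j3.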
  36. 权利要求35的系统,其特征在于,j1选自[1,40],j2选自[6,40],j3选自[1,30],40<j1+(j2-j3)×t1<50。
  37. 权利要求36的系统,其特征在于,所述预设阈值选自[0.85,0.95]。
  38. 权利要求31的系统,其特征在于,一个所述预设阈值对应一个所述待检图像的相应坐标位置。
  39. 权利要求38的系统,其特征在于,所述待检图像的相应坐标位置的强度为绝对强度。
  40. 权利要求39的系统,其特征在于,还包括阈值确定模块,与所述识别模块相连,用以确定所述预设阈值,包括:
    依据像素值对所述相应坐标位置所在的x2*y2区域的像素点进行排序,以获得该x2*y2区域的像素点的数目的分布曲线,x2和y2均为自然数,x2*y2不小于100;
    基于所述分布曲线确定所述预设阈值。
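  41. The system of claim 40, wherein the sorting is in ascending order, the right-side trough pixel value of the distribution curve is taken as the preset threshold, and the right-side trough pixel value is estimated by the formula $Threshold=I_{j4}+(I_{j5}-I_{j6})\times t2$, wherein
    $I_{j4}$, $I_{j5}$ and $I_{j6}$ are the pixel values corresponding to the j4-th, j5-th and j6-th percentiles, j4, j5 and j6 are integers less than 50 and not less than 1, j5>8+j6, and t2 is a second correction coefficient whose value is determined by j4, j5 and j6.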
  42. 权利要求41的系统,其特征在于,j4选自[1,40],j5选自[6,40],j6选自[1,30],85<j4+(j5-j6)×t2<100。
  43. 权利要求31-42任一系统,其特征在于,所述图像包括分别对应A、T/U、G和C四种类型碱基延伸反应时的一个相同视野的第一图像、第二图像、第三图像和第四图像,所述第一图像包括图像M1和图像M2,所述第二图像包括图像N1和图像N2,所述第三图像包括图像P1和图像P2,所述第四图像包括图像Q1和图像Q2,定义顺序和/或同时实现一次四种类型碱基延伸反应为一轮测序反应,
    图像M1和图像M2分别来自两轮测序反应,图像N1和图像N2分别来自两轮测序反应,图像P1和图像P2分别来自两轮测序反应,图像Q1和图像Q2分别来自两轮测序反应,
    还包括模板构建模块,与所述映射模块连接,用于实施以下以实现基于多个图像构建所述对应测序模板的亮斑集合:
    合并所述第一图像、第二图像、第三图像和第四图像上的亮斑,记录相同位置上的亮斑的数目,去除数目为1的位置上的亮斑,以获得所述对应测序模板的亮斑集合。
  44. 权利要求43的系统,其特征在于,在所述模板构建模块中,合并所述第一图像、第二图像、第三图像和第四图像上的亮斑,包括:
    (a)合并图像N1上的亮斑至图像M1中,获得一次合并图像M1,标记一次合并图像M1中的重合亮斑为A,标记非重合亮斑为B,在所述一次合并图像M1中的距离小于第一预定像素的多个亮斑为一个所述重合亮斑;
    (b)以图像P1、图像Q1、图像M2、图像N2、图像P2或图像Q2替代所述图像N1,以一次合并图像M1替代所述图像M1,多次进行(a)直至完成所有图像上的亮斑的合并,获得原始亮斑集合;
    (c)去除所述原始亮斑集合中的标记为B的亮斑,以获得所述对应测序模板的亮斑集合。
  45. 权利要求43的系统,其特征在于,所述图像为经过配准的图像。
  46. 权利要求45的系统,其特征在于,还包括配准模块,与所述模板构建模块相连,用于实施以下以实现所述图像的配准:
    基于参考图像对待配准图像进行第一配准,所述参考图像和所述待配准图像对应相同视野,包括,
    确定所述待配准图像上的预定区域和所述参考图像上的相应预定区域的第一偏移量,基于所述第一偏移量移动所述待配准图像上的所有亮斑,获得第一配准后的待配准图像;
    基于所述参考图像对第一配准后的待配准图像进行第二配准,包括,
    合并所述第一配准后的待配准图像和所述参考图像,获得合并图像,
    计算所述合并图像上的预定区域的所有第二重合亮斑的偏移量,以确定第二偏移量,在所述合并图像上的距离小于第二预定像素的多个亮斑为一个所述第二重合亮斑,
    基于该第二偏移量移动所述第一配准后的待配准图像上的所有亮斑,以实现对所述待配准图像的配准。
  47. 权利要求46的系统,其特征在于,所述配准模块包括参考图像构建单元,用于实施以下以实现所述参考图像的构建:
    获取第五图像和第六图像,所述第五图像和所述第六图像与所述待配准图像对应相同视野;
    基于第五图像对第六图像进行粗配准,包括确定所述第六图像相对于所述第五图像的偏移量,基于该偏移量移动所述第六图像,获得粗配准后的第六图像;
    合并所述第五图像和粗配准后的第六图像,以获得所述参考图像。
  48. 权利要求47的系统,其特征在于,在所述参考图像构建单元中,构建所述参考图像还包括利用第七图像和第八图像,所述第五图像、第六图像、第七图像和第八图像对应相同视野,所述第五图像、第六图像、第七图像和第八图像分别对应A、T/U、G和C四种类型碱基延伸反应时的所述视野,构建所述参考图像还包括:
    基于第五图像对第七图像进行粗配准,包括确定所述第七图像相对于所述第五图像的偏移量,基于该偏移量移动所述第七图像,获得粗配准后的第七图像;
    基于第五图像对第八图像进行粗配准,包括确定所述第八图像相对于所述第五图像的偏移量,基于该偏移量移动所述第八图像,获得粗配准后的第八图像;
    合并所述第五图像和粗配准后的第六图像、粗配准后的第七图像以及粗配准后的第八图像,以获得所述参考图像。
  49. 权利要求46-48任一系统,其特征在于,所述参考图像和所述待配准图像为二值化图像。
  50. 权利要求46-49任一系统,其特征在于,利用二维离散傅里叶变换确定所述第一偏移量、所述第六图像相对于所述第五图像的偏移量、所述第七图像相对于所述第五图像的偏移量和/或所述第八图像相对于所述第五图像的偏移量。
  51. 权利要求43-50任一系统,其特征在于,还包括亮斑检测模块,与所述映射模块、模板构建模块和/或配准模块相连,用于进行以下以实现所述图像上的亮斑的检测:
    预处理图像,获得预处理后的图像;
    确定临界值以简化预处理后的图像,包括对小于临界值的预处理后的图像上的像素点的像素值赋值为第一预设值,对不小于临界值的预处理后的图像上的像素点的像素值赋值为第二预设值,以获得二值化图像;
    基于预处理后的图像确定第一亮斑检测阈值c1;
    基于预处理后的图像和二值化图像进行所述图像上的亮斑的识别,包括判定满足以下i)-iii)中至少两个条件的像素点矩阵为一个候选亮斑,
    i)在所述预处理后的图像中,像素点矩阵的中心像素点的像素值为最大,像素点矩阵可表示为r1*r2,r1和r2均为大于1的奇数,r1*r2像素点矩阵包含r1*r2个像素点,
    ii)在所述二值化图像中,像素点矩阵的中心像素点的像素值为第二预设值并且像素点矩阵的连通像素大于(2/3)*r1*r2,以及
    iii)在所述预处理后的图像中的像素点矩阵的中心像素点的像素值大于第三预设值,并且满足g1*g2>c1,g1为以像素点矩阵的中心像素点为中心的m1*m2范围的二维高斯分布的相关系数,g2为该m1*m2范围的像素,m1和m2均为大于1的奇数,m1*m2范围包含m1*m2个像素点。
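  52. The system of claim 51, wherein the bright spot detection module further performs the following to determine whether a candidate bright spot is a bright spot:
    determining a second bright spot detection threshold based on the preprocessed image, and
    comparing the pixel value of the candidate bright spot with the second bright spot detection threshold, and determining a candidate bright spot whose pixel value is not less than the second bright spot detection threshold to be a bright spot, the pixel value of the pixel at the coordinates of the candidate bright spot being taken as the pixel value of the candidate bright spot.
  53. The system of claim 52, wherein, in the bright spot detection module, determining whether a candidate bright spot is a bright spot comprises:
    dividing the preprocessed image into a set of regions of a predetermined size,
    sorting the pixel values of the pixels in each region to determine the second bright spot detection threshold corresponding to that region,
    comparing the pixel value of each candidate bright spot in the region with the second bright spot detection threshold, and determining a candidate bright spot whose pixel value is not less than the second bright spot detection threshold corresponding to the region to be a bright spot.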
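The region-wise thresholding of claim 53 can be sketched as follows. The claim states that the pixel values in each region are sorted to determine the threshold but not which rank is chosen, so the `quantile` parameter is an assumption, as are the function name and the square `block` size.

```python
import numpy as np

def filter_candidates(pre, candidates, block=32, quantile=0.9):
    """Keep candidates whose pixel value is not less than the threshold
    of the block x block region that contains them."""
    kept = []
    thresholds = {}
    for (y, x) in candidates:
        by, bx = y // block, x // block
        if (by, bx) not in thresholds:
            region = pre[by * block:(by + 1) * block, bx * block:(bx + 1) * block]
            # Sort the region's pixel values and pick the value at the chosen rank.
            ordered = np.sort(region, axis=None)
            idx = min(int(quantile * ordered.size), ordered.size - 1)
            thresholds[(by, bx)] = ordered[idx]
        # The pixel at the candidate's coordinates stands for the candidate.
        if pre[y, x] >= thresholds[(by, bx)]:
            kept.append((y, x))
    return kept
```

Computing one threshold per region lets the cut-off follow local brightness, which a single global threshold would not do on an unevenly illuminated field of view.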
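  54. The system of any one of claims 51-53, wherein preprocessing the image comprises:
    determining the background of the image using an opening operation,
    transforming the image using a top-hat operation based on the background,
    applying Gaussian blur to the transformed image,
    sharpening the Gaussian-blurred image to obtain the preprocessed image.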
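The four preprocessing steps of claim 54 can be sketched with NumPy alone. The structuring element size, the Gaussian sigma and the use of unsharp masking for the sharpening step are assumptions (the claim does not specify a sharpening filter), and every function name below is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _rank_filter(img, size, reducer):
    """Apply np.min or np.max over a size x size neighbourhood (edge-padded)."""
    padded = np.pad(img, size // 2, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return reducer(windows, axis=(2, 3))

def opening(img, size=5):
    """Grey-scale opening: erosion (min filter) followed by dilation (max filter)."""
    return _rank_filter(_rank_filter(img, size, np.min), size, np.max)

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur with edge padding."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def preprocess(image, size=5, sigma=1.0, amount=1.0):
    """Return (background, preprocessed image)."""
    img = np.asarray(image, dtype=float)
    background = opening(img, size)       # step 1: estimate the background
    tophat = img - background             # step 2: top-hat = image minus opening
    blurred = gaussian_blur(tophat, sigma)  # step 3: Gaussian blur
    # Step 4: sharpen by unsharp masking (assumed sharpening method).
    sharpened = blurred + amount * (blurred - gaussian_blur(blurred, sigma))
    return background, sharpened
```

The structuring element should be larger than a bright spot so that the opening removes spots and leaves only the slowly varying background; the background is returned as well because claim 55 uses it when determining the critical value.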
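  55. The system of any one of claims 51-54, wherein determining the critical value to simplify the preprocessed image and obtain the binarized image comprises:
    determining the critical value based on the background and the preprocessed image,
    comparing the pixel values of the pixels of the preprocessed image with the critical value, to obtain the binarized image.
  56. The system of any one of claims 51-55, wherein g2 is the corrected pixels of the m1*m2 range, the correction being made according to the proportion of pixels whose pixel value is the second preset value in the corresponding m1*m2 range of the binarized image, to obtain the corrected pixels of the m1*m2 range.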
PCT/CN2019/101067 2019-08-16 2019-08-16 Base recognition method, system, computer program product and sequencing system WO2021030952A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP19941798.1A EP4015645A4 (en) 2019-08-16 2019-08-16 METHODS AND SYSTEM FOR BASE RECOGNITION, COMPUTER PROGRAM PRODUCT AND SYSTEM FOR SEQUENCING
CN201980058420.6A CN112823352B (zh) 2019-08-16 2019-08-16 Base recognition method, system and sequencing system
PCT/CN2019/101067 WO2021030952A1 (zh) 2019-08-16 2019-08-16 Base recognition method, system, computer program product and sequencing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/101067 WO2021030952A1 (zh) 2019-08-16 2019-08-16 Base recognition method, system, computer program product and sequencing system

Publications (1)

Publication Number Publication Date
WO2021030952A1 (zh) 2021-02-25

Family

ID=74659821

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/101067 WO2021030952A1 (zh) 2019-08-16 2019-08-16 Base recognition method, system, computer program product and sequencing system

Country Status (3)

Country Link
EP (1) EP4015645A4 (zh)
CN (1) CN112823352B (zh)
WO (1) WO2021030952A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
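CN115294035A (zh) * 2022-07-22 2022-11-04 深圳赛陆医疗科技有限公司 Bright spot positioning method, bright spot positioning device, electronic equipment and storage medium
CN116342984A (zh) * 2023-05-31 2023-06-27 之江实验室 Model training method, and image processing method and device
CN116703958A (zh) * 2023-08-03 2023-09-05 山东仕达思医疗科技有限公司 Edge contour detection method, system, equipment and storage medium for microscopic images
CN117392155A (zh) * 2023-12-11 2024-01-12 吉林大学 High-throughput gene sequencing data processing method based on image processing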

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
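CN115187643B (zh) * 2022-06-20 2023-05-09 深圳赛陆医疗科技有限公司 Image registration and template construction method, device, electronic equipment and storage medium
WO2024000288A1 (zh) * 2022-06-29 2024-01-04 深圳华大生命科学研究院 Image stitching method, gene sequencing system and corresponding gene sequencer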

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
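CN1771336A (zh) * 2003-02-12 2006-05-10 金尼松斯文斯卡股份公司 Methods and means for nucleic acid sequencing
CN102676657A (zh) * 2012-04-18 2012-09-19 盛司潼 System and method for recognizing sequencing images
CN107945150A (zh) 2016-10-10 2018-04-20 深圳市瀚海基因生物科技有限公司 Image processing method and system for gene sequencing
CN108192953A (zh) * 2017-11-22 2018-06-22 深圳市瀚海基因生物科技有限公司 Method for detecting specific and/or non-specific adsorption of nucleic acids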

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7200254B2 (en) * 2002-02-14 2007-04-03 Ngk Insulators, Ltd. Probe reactive chip, sample analysis apparatus, and method thereof
US8965076B2 (en) * 2010-01-13 2015-02-24 Illumina, Inc. Data processing system and methods
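CN105205788B (zh) * 2015-07-22 2018-06-01 哈尔滨工业大学深圳研究生院 Denoising method for high-throughput gene sequencing images
CN112322713B (zh) * 2017-12-15 2022-06-03 深圳市真迈生物科技有限公司 Imaging method, device, system and storage medium
CN109117796B (zh) * 2018-08-17 2021-01-08 广州市锐博生物科技有限公司 Base recognition method and device, and method and system for generating color images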

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KENJI TAKITA, vol. E86-A, no. 8, August 2003 (2003-08-01), Retrieved from the Internet <URL:IEICETRANS.FUNDAMENTALS>
See also references of EP4015645A4

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
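CN115294035A (zh) * 2022-07-22 2022-11-04 深圳赛陆医疗科技有限公司 Bright spot positioning method, bright spot positioning device, electronic equipment and storage medium
CN115294035B (zh) * 2022-07-22 2023-11-10 深圳赛陆医疗科技有限公司 Bright spot positioning method, bright spot positioning device, electronic equipment and storage medium
CN116342984A (zh) * 2023-05-31 2023-06-27 之江实验室 Model training method, and image processing method and device
CN116342984B (zh) * 2023-05-31 2023-08-08 之江实验室 Model training method, and image processing method and device
CN116703958A (zh) * 2023-08-03 2023-09-05 山东仕达思医疗科技有限公司 Edge contour detection method, system, equipment and storage medium for microscopic images
CN116703958B (zh) * 2023-08-03 2023-11-17 山东仕达思医疗科技有限公司 Edge contour detection method, system, equipment and storage medium for microscopic images
CN117392155A (zh) * 2023-12-11 2024-01-12 吉林大学 High-throughput gene sequencing data processing method based on image processing
CN117392155B (zh) * 2023-12-11 2024-02-09 吉林大学 High-throughput gene sequencing data processing method based on image processing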

Also Published As

Publication number Publication date
CN112823352B (zh) 2023-03-10
EP4015645A1 (en) 2022-06-22
EP4015645A4 (en) 2023-05-10
CN112823352A (zh) 2021-05-18

Similar Documents

Publication Publication Date Title
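WO2021030952A1 (zh) Base recognition method, system, computer program product and sequencing system
CN107918931B (zh) Image processing method and system, and computer-readable storage medium
WO2020037573A1 (zh) Method, device and computer program product for detecting bright spots on an image
EP3306566B1 (en) Method and system for processing image
WO2020037572A1 (zh) Method and device for detecting bright spots on an image, and image registration method and device
CN108520514B (zh) Computer-vision-based consistency detection method for electronic components on printed circuit boards
CN111444964B (zh) Multi-target fast image matching method based on adaptive ROI division
WO2019206968A1 (en) Systems and methods for segmentation and analysis of 3d images
CN112289377B (zh) Method, device and computer program product for detecting bright spots on an image
WO2010017206A1 (en) Image analysis
CN113012757B (zh) Method and system for identifying bases in nucleic acids
CN112289381B (zh) Method, device and computer product for constructing a sequencing template based on images
CN107274349B (zh) Method and device for determining the tilt angle of a biochip fluorescence image
CN116563298B (zh) Gaussian-fitting-based sub-pixel detection method for crosshair centers
WO2020037570A1 (zh) Image registration method, device and computer program product
US11170506B2 (en) Method for constructing sequencing template based on image, and base recognition method and device
WO2020037571A1 (zh) Method, device and computer program product for constructing a sequencing template based on images
US20190333212A1 (en) Visual cardiomyocyte analysis
CN112285070B (zh) Method and device for detecting bright spots on an image, and image registration method and device
CN112288783B (zh) Method for constructing a sequencing template based on images, and base recognition method and device
CN115546139A (zh) Machine-vision-based defect detection method, device and electronic equipment
CN112288781A (zh) Image registration method, device and computer program product
CN117152251A (zh) Position recognition and defect detection method for small-size backlight workpieces
CN115661098A (zh) Image recognition and data extraction method for two-dimensional scour profiles of submarine pipelines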

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19941798; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2019941798; Country of ref document: EP; Effective date: 20220316