CN117854594B - Space histology sequencing positioning matching method and device, space histology sequencing equipment and medium - Google Patents
- Publication number
- CN117854594B (application number CN202410076175.4)
- Authority
- CN
- China
- Prior art keywords
- position bar
- bar code
- candidate
- matching
- barcode
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06K—GRAPHICAL DATA READING; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K7/00—Methods or arrangements for sensing record carriers, e.g. for reading patterns
- G06K7/10—Methods or arrangements for sensing record carriers, e.g. for reading patterns by electromagnetic radiation, e.g. optical sensing; by corpuscular radiation
- G06K7/10544—Methods or arrangements for sensing record carriers, e.g. for reading patterns by electromagnetic radiation, e.g. optical sensing; by corpuscular radiation by scanning of the records by radiation in the optical part of the electromagnetic spectrum
- G06K7/10821—Methods or arrangements for sensing record carriers, e.g. for reading patterns by electromagnetic radiation, e.g. optical sensing; by corpuscular radiation by scanning of the records by radiation in the optical part of the electromagnetic spectrum further details of bar or optical code scanning devices
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application provides a space histology sequencing positioning matching method and device, space histology sequencing equipment and a medium. The sequencing positioning matching method comprises the following steps: acquiring first position bar codes with spatial position information and constructing an index information base of the first position bar codes; acquiring second position bar codes obtained by secondary sequencing; for each second position bar code, obtaining a plurality of short sequences to be selected, comparing the short sequences to be selected with the index information base to determine matching short sequences, and constructing a matching information base of the matching short sequences and the first position bar codes; determining the comparison priority of the first position bar codes according to the matching information base; comparing each second position bar code with the first position bar codes according to the comparison priority, and determining candidate first position bar codes whose edit distance to the second position bar code meets a target requirement; and calculating a spatial distance from the spatial position information of the candidate first position bar codes, and determining the successfully compared second position bar codes based on the spatial distance meeting a target condition.
Description
Technical Field
The application relates to the technical field of space histology, and in particular to a space histology sequencing positioning matching method and device, space histology sequencing equipment and a computer readable storage medium.
Background
Transcription is the process by which genetic information flows from DNA to RNA: RNA is synthesized under the catalysis of RNA polymerase, using one strand of double-stranded DNA as a template (the template strand is transcribed; the coding strand is not) and the four ribonucleotides A, U, C and G as raw materials. As the first step in protein biosynthesis, a gene is read and copied into mRNA; that is, a specific DNA fragment serves as the template of genetic information, and a DNA-dependent RNA polymerase acts as the catalyst to synthesize pre-mRNA according to the principle of base complementarity. When mRNA is transcribed, the double strand of the DNA molecule is opened and, under the action of RNA polymerase, the four free ribonucleotides pair with the single DNA strand according to the base complementary pairing principle and are joined into a single-stranded mRNA molecule, which completes transcription.
Traditional gene expression analysis, such as RNA-Seq, can provide rich information on gene expression, but it generally requires grinding the sample tissue into single cells or a mixture of RNA molecules, so the gene expression obtained by subsequent sequencing loses its spatial information. Spatial transcriptomics, herein collectively referred to as space histology (spatial omics), is an interdisciplinary field that combines histology and gene expression analysis. It measures mRNA from whole tissue sections, combines the spatial information of the mRNA with morphological content, and maps all locations where gene expression occurs to obtain a map of whole-gene expression in the organism, with the aim of understanding the spatial heterogeneity of gene expression at the cellular level.
Space histology preserves and resolves spatial information by applying microarray technology with position bar codes (barcodes) to fixed tissue sections. This allows researchers to map the gene expression data obtained by sequencing back to their original spatial locations, and provides new perspectives for analyzing tissue structure and function and the occurrence and progression of disease. In the prior art, two-step sequencing is typically used to determine the spatial location of gene expression. First, a tissue section fixed on a slide is brought into contact with a microarray carrying position bar codes, so that the position bar codes label the RNA molecules in the tissue. Then, the sequence (barcode) of each position bar code and its corresponding coordinate information (spatial coordinates X, Y) are obtained through the first sequencing, after which the RNA is reverse transcribed and amplified while the position bar code information is retained. Finally, a library is constructed again and, after amplification and other processes, the second sequencing is performed to obtain the RNA sequences and their corresponding position bar codes (barcodes). By comparing the position bar codes from the two sequencing runs, each RNA sequence is mapped back to its original spatial position.
However, errors may occur during both sequencing processes, resulting in a low match rate for the position bar codes obtained from the secondary sequencing. The errors may result from basic errors in the sequencing process, sequence duplication, barcode design defects, and so on. The technique of preserving spatial information with position bar codes is now fairly widespread in the space histology industry. For example, 10X Genomics predefines a whitelist of position bar codes and then determines the spatial position from the correspondence between the position bar codes and the whitelist. The whitelist does not need to be re-sequenced in each experiment, which avoids some sequencing errors and also improves the effective comparison rate of the corresponding position bar codes. Conventional gene quantification and comparison software used to compare position bar codes with the whitelist, such as STARsolo, is a tool for aligning RNA-Seq data based on STAR (Spliced Transcripts Alignment to a Reference). Although such a tool performs well, when processing space histology data it usually compares the position bar codes directly and applies strict matching criteria to reduce the possibility of mismatches, so that even when a whitelist is introduced for comparison, a certain proportion of actually valid position bar codes is wrongly removed. In particular, when the position bar codes are long, the comparison rate is low and the effective rate of the position bar codes is greatly reduced.
Disclosure of Invention
In order to solve the existing technical problems, the application provides a space histology barcode positioning method and device, computer equipment and a computer readable storage medium, which achieve a higher comparison rate and can improve the effective rate of position bar codes.
In a first aspect of the embodiment of the present application, a space histology sequencing positioning matching method is provided, including:
acquiring first position bar codes with spatial position information;
for each first position bar code, acquiring a plurality of reference short sequences, establishing index information according to the serial number of the first position bar code in which each reference short sequence is located and the position of the reference short sequence within it, and constructing an index information base of the first position bar codes;
acquiring second position bar codes obtained by secondary sequencing;
for each second position bar code, acquiring a plurality of short sequences to be selected, comparing the short sequences to be selected with the index information base to determine matching short sequences, and constructing a matching information base of the matching short sequences and the first position bar codes;
determining the comparison priority of the first position bar codes according to the matching information base;
comparing each second position bar code with the first position bar codes according to the comparison priority, determining candidate first position bar codes whose edit distance to the second position bar code meets a target requirement, calculating a spatial distance based on the spatial position information of the candidate first position bar codes, and determining the successfully compared second position bar codes based on the spatial distance meeting a target condition.
In a second aspect, there is also provided a sequencing localization matching device of space histology, comprising:
The acquisition module is used for acquiring a first position bar code with space position information;
The index establishing module is used for acquiring a plurality of reference short sequences for each first position bar code, establishing index information according to the serial numbers and the positions of the first position bar codes where the reference short sequences are positioned, and constructing an index information base of the first position bar codes;
The acquisition module is also used for acquiring a second position bar code obtained by secondary sequencing;
The matching module is used for acquiring a plurality of short sequences to be selected according to each second position bar code, comparing the short sequences to be selected with the index information base to determine a matching short sequence, and constructing a matching information base of the matching short sequence and the first position bar code;
The priority module is used for determining the comparison priority of the first position bar codes according to the matching information base;
The comparison module is used for comparing each second position bar code with the first position bar codes according to the comparison priority, determining candidate first position bar codes whose edit distance to the second position bar code meets a target requirement, calculating a spatial distance based on the spatial position information of the candidate first position bar codes, and determining the successfully compared second position bar codes based on the spatial distance meeting a target condition.
In a third aspect, a space histology sequencing device is provided, including a processor and a memory connected to the processor, where the memory stores a computer program executable by the processor, and the computer program when executed by the processor implements a space histology sequencing localization matching method according to any embodiment of the present application.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the space histology sequencing positioning matching method according to any embodiment of the present application is implemented.
In the above embodiment, each first position bar code is split into a plurality of reference short sequences to construct the index information base, and each second position bar code is split into a plurality of short sequences to be selected. The short sequences to be selected are compared with the reference short sequences to screen out the matching short sequences, and a matching information base of the matching short sequences and the first position bar codes is constructed. This yields the potential first position bar codes that are more likely to match a given second position bar code, from which the comparison priority of the first position bar codes is determined. First, when a second position bar code is compared with the first position bar codes, the comparison can follow this priority, which improves comparison efficiency. Second, the first position bar codes whose edit distance meets the target requirement can be locked in as candidates more accurately and quickly; the relative spatial distance is calculated from the spatial position information of the candidate first position bar codes, and the successfully compared second position bar codes are determined based on the spatial distance meeting the target condition. The comparison priority, the edit distance and the spatial distances among the plurality of candidate first position bar codes are thus combined to help judge whether a second position bar code can be retained, so the effective matching rate of position bar codes is improved while mismatches are reduced. Compared with the prior-art approach of comparing position bar codes directly, a large number of position bar codes that existing matching criteria would discard can be recovered, which greatly improves the effective rate of the position bar codes and provides more data for downstream analysis.
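The final acceptance decision described above (edit distance filter, then a spatial distance check among the surviving candidates) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names, the tuple layout `(x, y, edit_distance)`, the use of the largest pairwise Euclidean distance, and both thresholds are assumptions made only for illustration, since this passage does not state the concrete target requirement or target condition.

```python
import math
from itertools import combinations


def max_pairwise_distance(coords):
    """Largest Euclidean distance between any two (x, y) points; 0.0 for fewer than two points."""
    if len(coords) < 2:
        return 0.0
    return max(math.dist(a, b) for a, b in combinations(coords, 2))


def accept_second_barcode(candidates, max_edit, max_spatial):
    """Decide whether a second position bar code is kept.

    candidates: list of (x, y, edit_distance) for first position bar codes
    that were compared against the second position bar code.
    max_edit and max_spatial are hypothetical thresholds standing in for the
    "target requirement" and "target condition" of the text.
    """
    # Keep only candidates whose edit distance meets the (assumed) requirement.
    kept = [(x, y) for x, y, dist in candidates if dist <= max_edit]
    if not kept:
        return False
    # Accept when the surviving candidates lie close together on the chip.
    return max_pairwise_distance(kept) <= max_spatial


# Two candidates one unit apart are accepted; two far apart are rejected.
near = accept_second_barcode([(10, 20, 1), (11, 20, 1)], max_edit=1, max_spatial=2.0)
far = accept_second_barcode([(10, 20, 1), (500, 900, 1)], max_edit=1, max_spatial=2.0)
```

The design intuition is that several candidates with small edit distance are ambiguous only if they point to different places; when they cluster spatially, the second position bar code can still be assigned with little risk.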
In the above embodiments, the spatial histology sequencing positioning matching device, the spatial histology sequencing equipment and the computer readable storage medium belong to the same concept as the corresponding spatial histology sequencing positioning matching method embodiments, so that the same technical effects as the corresponding spatial histology sequencing positioning matching method embodiments are achieved, and are not described herein.
Drawings
FIG. 1 is a schematic diagram of an application scenario of a space histology sequencing positioning matching method in an embodiment;
FIG. 2 is a flow chart of a space histology sequencing positioning matching method in an embodiment;
FIG. 3 is a flow chart of a space histology sequencing positioning matching method in an alternative embodiment;
FIG. 4 is a schematic diagram of a space histology sequencing positioning matching device in an embodiment;
FIG. 5 is a schematic diagram showing the structure of space histology sequencing equipment in an embodiment.
Detailed Description
The technical scheme of the invention is further elaborated below by referring to the drawings in the specification and the specific embodiments.
In order to make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments. It should be noted that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other where no conflict arises.
In the following description, the terms "first, second, third" and the like are used merely to distinguish between similar objects and do not represent a specific ordering of the objects. It is understood that, where permitted, "first, second, third" may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.
Before the embodiments of the present invention are described in further detail, the terms involved in the embodiments are explained; the following explanations apply to these terms.
1. A position bar code (barcode) is a unique nucleotide sequence (DNA or RNA) used to mark and identify individual molecules or cells in a biological sample. A position bar code is typically a character string of a certain length. It plays a critical role in spatial transcriptomics, enabling RNA molecules to be mapped to their original locations in the tissue sample.
2. A position bar code whitelist (whitelist) is a list of all known barcode sequences contained in a detection kit (designed into the space histology sequencing platform) and is available during library preparation.
3. RNA (ribonucleic acid) is a linear macromolecule formed by polymerizing ribonucleotides through 3',5'-phosphodiester bonds. RNA in nature is usually single-stranded, and its four basic bases are adenine (A), uracil (U), guanine (G) and cytosine (C). In contrast, DNA, which like RNA is a nucleic acid, is usually a double-stranded molecule, and its nitrogenous bases contain thymine (T) in place of the uracil found in RNA.
4. The edit distance is a quantitative measure of the degree of difference between two character strings (e.g., the first and second position bar codes in the present application). It is obtained by counting how many edit operations are needed to change one string into the other; that number of operations is the edit distance.
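As a concrete illustration of the edit distance defined in item 4, the following minimal Python sketch computes the Levenshtein distance, in which each substitution, insertion or deletion counts as one operation. The choice of Levenshtein distance is an assumption: this passage does not state which edit operations the method counts, and `edit_distance` is a hypothetical helper name.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (assumed definition).

    Substitution, insertion and deletion each cost 1. Only two rows of the
    dynamic-programming table are kept in memory.
    """
    # prev[j] is the distance between the first i-1 characters of a and the first j of b.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# One substituted base between two short barcodes.
one_mismatch = edit_distance("ACGT", "AGGT")
```

For barcodes that differ by a single sequencing error, this value is 1, which is the kind of near match a strict exact-comparison criterion would discard.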
Space histology is a technique that spatially resolves RNA-Seq data: it resolves all mRNA in a single tissue section so that the active expression of functional genes can be located and distinguished within a specific tissue region. Referring to fig. 1, an alternative application scenario of the space histology sequencing positioning matching method is shown. The space histology sequencing process mainly comprises the following steps:
1. Sample preparation and quality inspection.
2. Tissue sectioning, staining, imaging, etc.
3. Tissue permeabilization and cDNA synthesis. A tissue section is placed on a glass slide containing capture probes that bind RNA (called a space transcriptome chip in the embodiment of the application; each capture probe carries a unique position bar code so as to preserve the spatial position information of the RNA), and is fixed and permeabilized so that the mRNA in the cells is released and binds to the corresponding capture probes, yielding gene expression information. This process is the one-time (first) sequencing.
4. Library construction. After cDNA is synthesized using the captured RNA as a template, a sequencing library is constructed from the cDNA sample.
5. Sequencing of the prepared library, for example on a high-throughput NGS (Next Generation Sequencing) platform. This process is the secondary sequencing.
6. Data analysis and visualization (quality control and analysis of the data, acquisition of the gene information expressed at each spatial position, etc.).
The sequencing positioning matching method of space histology provided by the embodiment of the application is generally used in data analysis and visualization, and the comparison of the first position bar code and the second position bar code is a key link for finally obtaining gene expression at a space position in a tissue.
Referring to fig. 2, a spatial histology sequencing localization matching method according to an embodiment of the present application includes the following steps:
S101, acquiring a first position bar code with spatial position information.
The first position barcode (first barcode) refers to barcode information obtained by one-time sequencing in space histology, wherein the barcode information records the spatial position information of RNA on a space transcriptome chip. In an alternative example, the processed first barcode file includes three columns of data, one column of the position barcode sequence consisting of A, T, C, G characters, one column of the spatial position coordinate X, and one column of the spatial position coordinate Y.
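The three-column layout of the processed first barcode file can be read as in the following minimal Python sketch. The tab delimiter, the integer coordinates, the 0-based serial numbers and the names `FirstBarcode` and `read_first_barcodes` are assumptions for illustration; the text only states that the file has a sequence column, an X column and a Y column.

```python
import csv
import io
from typing import NamedTuple


class FirstBarcode(NamedTuple):
    serial: int    # serial number assigned in file order (assumed 0-based)
    sequence: str  # position barcode sequence made of A, T, C, G
    x: int         # spatial position coordinate X (assumed integer)
    y: int         # spatial position coordinate Y (assumed integer)


def read_first_barcodes(handle, delimiter="\t"):
    """Parse rows of (sequence, X, Y) into FirstBarcode records."""
    records = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        sequence = row[0].strip().upper()
        if set(sequence) - set("ATCG"):
            raise ValueError(f"unexpected character in barcode: {sequence!r}")
        records.append(FirstBarcode(len(records), sequence, int(row[1]), int(row[2])))
    return records


# In-memory example standing in for a real first barcode file.
example_file = io.StringIO("ACGTACGTAC\t10\t20\nTTGCAAGGCC\t11\t20\n")
barcodes = read_first_barcodes(example_file)
```

The same function works on a real file by passing `open(path, newline="")` instead of the `io.StringIO` object.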
Optionally, step S101 includes:
acquiring a position bar code whitelist, and acquiring the first position bar code with spatial position information from the position bar code whitelist; or,
acquiring the first position bar code with spatial position information by performing one-time sequencing based on the space histology chip.
Here, the method of acquiring the first barcode with spatial location information is not limited to the white list or the form obtained by real-time sequencing, and may refer to reading the first barcode file obtained by performing one-time sequencing in advance, for example, reading a location barcode white list to acquire first barcode data; it may also mean that the first barcode data is read in real time during the execution of one sequencing.
S102, acquiring a plurality of reference short sequences for each first position bar code, establishing index information according to the serial numbers and the positions of the first position bar codes where the reference short sequences are located, and constructing an index information base of the first position bar codes.
Each reference short sequence is a segment of the character string of a first barcode. The position of a reference short sequence in the first barcode indicates which segment it is, and is usually expressed as the offset of the reference short sequence relative to a reference bit in the first barcode (such as the starting point of the first barcode). For example, a position of 10 indicates that the 1st bit of the reference short sequence is located at the 10th bit of the first barcode. Each first barcode is split into a plurality of reference short sequences, and for each reference short sequence, index information is established according to the serial number of the first barcode from which it is derived and its position within that first barcode. The index information base of the first barcodes is then constructed from the index information of the reference short sequences obtained from all the first barcodes.
It should be noted that the index information base expresses a data set containing index information of all reference short sequences, and forms thereof include, but are not limited to, tables, documents, key values, and the like. In the embodiment of the application, the index information of the reference short sequence is a hash value, and the index information base refers to a hash table containing index information corresponding to all the reference short sequences.
S103, acquiring a second position barcode obtained by secondary sequencing.
The second position barcode refers to barcode information, containing a position barcode sequence, obtained by secondary sequencing in spatial omics. According to the sequencing principle of spatial omics, the sequencing library of the second barcode is detached from the spatial transcriptome chip during library construction, so its spatial position information is lost.
S104, for each second position barcode, acquiring a plurality of candidate short sequences, comparing the candidate short sequences with the index information base to determine matching short sequences, and constructing a matching information base of the matching short sequences and the first position barcodes.
Each candidate short sequence is a substring of a second barcode. Candidate short sequences are generally obtained from the second barcode in the same way that reference short sequences are obtained from the first barcode, and the two have the same length. Each candidate short sequence is compared with the index information base to determine whether it is a matching short sequence. For a candidate short sequence determined to be a matching short sequence, its matching relation with a first barcode is obtained from the serial number of the first barcode containing the reference short sequence it matched, and the matching information base of matching short sequences and first barcodes is constructed accordingly. It can be understood that the matching information base contains only candidate short sequences that were successfully matched with at least one reference short sequence in the index information base; candidate short sequences that were not matched are not stored.
S105, determining the comparison priority of the first position barcodes according to the matching information base.
The comparison priority of a first barcode is its rank in the order in which first barcodes, as objects to be matched, are compared with a second barcode: the higher the comparison priority, the earlier the first barcode is compared with the second barcode. From the matching relations between matching short sequences and first barcodes in the matching information base, the matching rate between the candidate short sequences of the second barcode and each first barcode can be counted, and the comparison priority of each first barcode is associated with its matching rate so counted. In an alternative example, the first barcode that matches the largest number of candidate short sequences of the second barcode (the highest matching rate) has the highest comparison priority.
S106, comparing each second position barcode with the first position barcodes according to the comparison priority, determining candidate first position barcodes whose edit distance meets the target requirement, calculating spatial distances based on the spatial position information of the candidate first position barcodes, and determining successfully compared second position barcodes based on the spatial distances meeting the target condition.
In the process of matching a second barcode with the first barcodes, the first barcodes are compared with the current second barcode one by one in order of their comparison priority. In an alternative example, the matching information between the matching short sequences and the first position barcodes is {8: 5, 117: 3, 3890: 1}, where the number to the left of each colon is the serial number of a first barcode and the number to the right is the number of matching short sequences found in that first barcode. According to this matching information, three first barcodes (serial numbers 8, 117 and 3890) contain short sequences identical to those of the second barcode. Because the first barcode with serial number 8 has the most short sequence matches, it has the highest comparison priority, and full-sequence comparison with the second barcode starts from it.
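By way of non-limiting illustration, the ordering in this example can be sketched in Python as follows. The function name `comparison_order` and the tie-breaking rule (lower serial number first) are assumptions of this sketch and are not specified by the embodiment.

```python
def comparison_order(match_counts: dict[int, int]) -> list[int]:
    """Return first-barcode serial numbers, highest match count first.

    Ties are broken by serial number, which is an assumption of this sketch.
    """
    return sorted(match_counts, key=lambda serial: (-match_counts[serial], serial))


# Matching information from the example: serial number -> number of matches.
match_counts = {8: 5, 117: 3, 3890: 1}
print(comparison_order(match_counts))  # [8, 117, 3890]
```

Full-sequence comparison would then visit the first barcodes in the returned order.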
The edit distance meeting the target requirement may mean that the edit distance is smaller than a preset value E. The spatial distance meeting the target condition may mean that the spatial distance is smaller than a preset value D. The specific values of D and E can be adjusted in practical applications, and the application is not limited in this respect. In an alternative example, E is 2 and D is 1. For each second barcode, the first barcodes whose edit distance meets the target requirement are determined as candidate first barcodes, the relative spatial distances are calculated from the spatial position information of the candidate first barcodes, and if the spatial distances among the plurality of candidate first barcodes meet the target condition, the second barcode is judged to be successfully compared.
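As a non-limiting sketch, the edit distance screening can be illustrated in Python as below. The embodiment does not state which edit distance is used; the Levenshtein distance is assumed here, and the function names and the toy barcodes are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric): substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def candidate_first_barcodes(second: str, firsts: dict[int, str], e: int) -> dict[int, int]:
    """Map serial number -> edit distance for first barcodes with distance < e."""
    result = {}
    for serial, seq in firsts.items():
        d = edit_distance(second, seq)
        if d < e:
            result[serial] = d
    return result


# Toy first barcodes (illustrative only), screened with E = 2 as in the example.
firsts = {8: "ACGTACGT", 117: "ACGTTCGT", 3890: "TTTTTTTT"}
print(candidate_first_barcodes("ACGTACGA", firsts, e=2))  # {8: 1}
```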
In the above embodiment, the first barcodes are split into a plurality of reference short sequences to construct an index information base, the second barcode is split into a plurality of candidate short sequences, the candidate short sequences are compared with the reference short sequences to screen out the matching short sequences, and a matching information base of matching short sequences and first barcodes is constructed. This yields the first barcodes that are most likely to match the second barcode and determines the comparison priority of the first barcodes. First, when the second barcode is compared with the first barcodes, comparison proceeds in order of comparison priority, which improves comparison efficiency. Second, the first barcodes whose edit distance meets the target requirement can be locked in as candidates more accurately and quickly, the relative spatial distances are calculated from their spatial position information, and the successfully compared second barcodes are determined based on the spatial distances meeting the target condition. The comparison priority, the edit distance and the spatial distances among the plurality of first barcodes are thus combined to judge whether a second barcode can be retained, so the effective matching rate of position barcodes is improved while mismatches are tolerated. Compared with the prior-art approach of comparing position barcodes directly, a large number of second barcodes discarded under existing matching criteria can be recovered, which greatly improves the effective rate of position barcodes and provides more data for downstream analysis.
In some embodiments, step S106 includes:
for each second position barcode, determining the candidate first position barcodes whose edit distance meets the target requirement;
taking the spatial position information of the candidate first position barcode with the smallest edit distance as a standard point, calculating the spatial distances between the standard point and the spatial position information of the candidate first position barcodes with other edit distances, and determining the successfully compared second position barcode based on the spatial distances meeting the target condition.
In this embodiment, the spatial distances are calculated from the spatial position information of the candidate first barcodes as follows: the candidate first barcode with the smallest edit distance to the second barcode is taken as the standard point, and the spatial distance between each of the other candidate first barcodes and the standard point is calculated.
As an example, consider Table 1:
For a given second barcode, the candidate first barcodes whose edit distance meets the target requirement are candidate first barcode 1, candidate first barcode 2 and candidate first barcode 3, with edit distances of 1, 2 and 3 respectively. Candidate first barcode 1, which has the smallest edit distance (1), is taken as the standard point. The spatial distance D1 between candidate first barcode 2 and the standard point and the spatial distance D2 between candidate first barcode 3 and the standard point are calculated, and when D1 and D2 each meet the target condition, the second barcode is determined to be a successfully compared second position barcode.
In the above embodiment, whether a second barcode can be retained is further screened according to the spatial distances among the plurality of candidate first barcodes whose edit distance to it meets the target requirement, so the matching efficiency of second position barcodes can be improved while mismatches are reduced.
In some embodiments, taking the spatial position information of the candidate first position barcode with the smallest edit distance as the standard point, calculating the spatial distances between the standard point and the spatial position information of the candidate first position barcodes with other edit distances, and determining the successfully compared second position barcode based on the spatial distances meeting the target condition includes:
judging whether the candidate first position barcode with the smallest edit distance is unique;
if so, taking the spatial position information of that candidate first position barcode as the standard point, calculating the spatial distances between the standard point and the spatial position information of the candidate first position barcodes with other edit distances, and determining the second position barcode to be a successfully compared second position barcode when the spatial distances are smaller than a target threshold;
if not, performing iterative calculation with the spatial position information of each of the plurality of candidate first position barcodes with the smallest edit distance as the standard point in turn, and, in any iteration, calculating the spatial distances between the standard point and the spatial position information of the candidate first position barcodes with other edit distances, and determining the second position barcode to be a successfully compared second position barcode when the spatial distances are smaller than the target threshold.
In this embodiment, for the case where the candidate first barcode with the smallest edit distance to the second barcode is used as the standard point for calculating spatial distances, a further implementation is provided for handling multiple candidate first barcodes at the same edit distance.
Where the candidate first barcode with the smallest edit distance is not unique, consider Table 2 as an example:
For a given second barcode, the candidate first barcodes with the smallest edit distance are candidate first barcode 1 and candidate first barcode 4, both at edit distance 1, and iterative calculation is performed with each of them as the standard point in turn. In the iteration with candidate first barcode 1 as the standard point, the spatial distance D1 between candidate first barcode 2 and the standard point and the spatial distance D2 between candidate first barcode 3 and the standard point are calculated, and when D1 and D2 each meet the target condition, the second barcode is determined to be a successfully compared second position barcode. In the iteration with candidate first barcode 4 as the standard point, the spatial distance D3 between candidate first barcode 2 and the standard point and the spatial distance D4 between candidate first barcode 3 and the standard point are calculated, and when D3 and D4 each meet the target condition, the second barcode is determined to be a successfully compared second position barcode.
The second barcode is considered successfully compared if either iteration succeeds, that is, if D1 and D2 calculated with candidate first barcode 1 as the standard point meet the target condition, or D3 and D4 calculated with candidate first barcode 4 as the standard point meet the target condition.
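The iteration over tied standard points can be sketched in Python as follows. This is a minimal, non-limiting illustration: the function names, the coordinates, and the per-axis closeness test with a threshold of 1 are assumptions, and a second barcode with no candidates at other edit distances passes trivially in this sketch, which the embodiment does not address.

```python
from typing import Callable

Point = tuple[float, float]


def passes_spatial_check(
    candidates: list[tuple[int, Point]],
    close: Callable[[Point, Point], bool],
) -> bool:
    """candidates holds (edit distance, (x, y)) for each candidate first barcode."""
    if not candidates:
        return False
    min_ed = min(ed for ed, _ in candidates)
    standard_points = [xy for ed, xy in candidates if ed == min_ed]
    others = [xy for ed, xy in candidates if ed != min_ed]
    # One iteration per tied standard point; any passing iteration is enough.
    return any(all(close(std, xy) for xy in others) for std in standard_points)


def per_axis_close(a: Point, b: Point, threshold: float = 1.0) -> bool:
    """Assumed target condition: distance below the threshold on every axis."""
    return abs(a[0] - b[0]) < threshold and abs(a[1] - b[1]) < threshold


# Illustrative coordinates only: candidates 1 and 4 tie at edit distance 1.
candidates = [
    (1, (0.0, 0.0)),    # candidate first barcode 1
    (1, (50.0, 50.0)),  # candidate first barcode 4
    (2, (50.5, 50.2)),  # candidate first barcode 2
    (3, (49.8, 50.4)),  # candidate first barcode 3
]
print(passes_spatial_check(candidates, per_axis_close))  # True
```

Here the iteration with candidate 1 as the standard point fails, while the iteration with candidate 4 succeeds, so the second barcode is retained.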
In the above embodiment, because the candidate first barcodes whose edit distance to the second barcode meets the target requirement may initially be located at different positions in the tissue, iterating with different candidate first barcodes as the standard point reduces errors caused by a wrong initial positioning and improves the matching efficiency of second position barcodes while mismatches are reduced.
In some embodiments, judging whether the candidate first position barcode with the smallest edit distance is unique includes:
according to the results of comparing the second position barcode with the first position barcodes, retaining the candidate first position barcodes whose edit distance meets the target requirement in the same vector, and recording, with a flag bit in a sequence of preset length, whether multiple candidate first position barcodes share the same edit distance;
judging, according to the flag bit, whether the candidate first position barcode with the smallest edit distance is unique.
When a second barcode matches a first barcode exactly (the edit distance is 0), the matching process can be stopped immediately and the second barcode is considered successfully compared. When the second barcode does not match any first barcode exactly, all candidate first barcodes whose edit distance to the second barcode meets the target requirement are computed and retained in the same vector. For example:
If the edit distance threshold is set to 3 and the edit distance between candidate first barcode 1 and the matching short sequence is 2, the edit distance meets the target requirement.
Optionally, for the comparison result of each second barcode, the candidate first barcodes whose edit distance meets the target requirement may be recorded as follows:
Edit distance 1 (b2, b1_1) = 1; b1_1 (X, Y) = (100.5, 234.2)
Edit distance 2 (b2, b1_2) = 2; b1_2 (X, Y) = (100.1, 234.9)
Edit distance 2 (b2, b1_3) = 2; b1_3 (X, Y) = (32, 1198)
Here, b2 is the second barcode currently being compared, the value on the left is the edit distance determined by the comparison, and b1_1, b1_2 and b1_3 are the serial numbers of the first barcodes having the corresponding edit distance, each followed by its (X, Y) coordinates.
Optionally, an integer may be used to record which edit distances have been added for the candidate first barcodes, with one flag bit of the integer, such as the last bit, marking whether more than one candidate has been added at the same edit distance. For example, in 00001010 the 2nd and 4th bits counted from right to left are 1, which shows that edit distances of 2 and 4 occurred in the comparison results; the last bit counted from right to left is 0, which means that neither edit distance 2 nor edit distance 4 occurred in two or more results. As comparison results continue to be added, the last bit is set to 1 when another result with edit distance 2 or 4 is obtained. Thus, from the value of the flag bit in the integer, it can be determined whether the candidate first barcode with the smallest edit distance is unique.
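A minimal Python sketch of this bookkeeping is given below for illustration. It assumes an 8-bit integer whose leftmost bit is the flag bit ("the last bit from right to left") and whose remaining bits record edit distances 1 to 7; the function names are hypothetical. As in the description, the single flag is shared by all edit distances.

```python
FLAG_BIT = 7  # assumed: leftmost bit of an 8-bit integer is the repeat flag


def record(state: int, ed: int) -> int:
    """Record one candidate first barcode with edit distance ed (1..7)."""
    if not 1 <= ed <= 7:
        raise ValueError("edit distance must be between 1 and 7 in this sketch")
    bit = 1 << (ed - 1)          # ed-th bit counted from the right
    if state & bit:              # this edit distance was already recorded
        state |= 1 << FLAG_BIT   # mark that a repeat occurred
    return state | bit


def has_repeat(state: int) -> bool:
    return bool(state & (1 << FLAG_BIT))


def smallest_distance(state: int):
    """Smallest recorded edit distance, or None if nothing was recorded."""
    low = state & 0x7F
    return (low & -low).bit_length() if low else None


state = record(record(0, 2), 4)
print(format(state, "08b"), has_repeat(state))  # 00001010 False
state = record(state, 2)
print(format(state, "08b"), has_repeat(state))  # 10001010 True
```

An edit distance of 0 needs no bit here, since an exact match stops the comparison immediately.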
In the above embodiment, the candidates found when comparing the second barcode with the first barcodes are recorded in a vector, and repeated candidate first barcodes at the same edit distance are recorded by a flag bit in an integer, which helps improve comparison efficiency.
In some embodiments, calculating the spatial distances between the standard point and the spatial position information of the candidate first position barcodes with other edit distances, and determining the successfully compared second position barcode based on the spatial distances meeting the target condition includes:
calculating, from the spatial position information of the candidate first position barcodes with other edit distances and the spatial position information of the standard point, the spatial distance between each candidate first position barcode and the standard point in each coordinate axis direction, and determining the second position barcode to be a successfully compared second position barcode when the spatial distances in all coordinate axis directions are smaller than a target threshold.
That is, the spatial distance meets the target condition as follows: for each coordinate axis direction (such as the X and Y axis directions), the relative distance between the coordinate of the candidate first barcode and the coordinate of the standard point in that direction is calculated, and the comparison is considered successful when the distances in all coordinate axis directions are smaller than the target threshold.
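For illustration only, the per-axis check can be sketched in Python using the coordinates listed earlier for b1_1, b1_2 and b1_3 and the example threshold D = 1. The function name `within_threshold` is an assumption of this sketch.

```python
def within_threshold(point, standard, threshold):
    """True when the distance to the standard point is below the threshold on every axis."""
    return all(abs(p - s) < threshold for p, s in zip(point, standard))


standard = (100.5, 234.2)  # b1_1, the candidate with the smallest edit distance
print(within_threshold((100.1, 234.9), standard, 1.0))  # True  (b1_2)
print(within_threshold((32.0, 1198.0), standard, 1.0))  # False (b1_3)
```

With these values b1_2 lies next to the standard point while b1_3 does not, so the second barcode would not pass with b1_3 among its candidates.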
In the above embodiment, the spatial distance between each candidate first barcode and the standard point is calculated, and the judgment requires the distance in every axis direction to be smaller than the target threshold, which improves the accuracy of the comparison while better reducing errors caused by mismatches.
In some embodiments, step S104 includes:
for each second position barcode, using a sliding window that slides along the second position barcode to extract, at intervals, candidate short sequences of length K, wherein the interval between two adjacent candidate short sequences is W;
comparing each candidate short sequence with the index information base, and judging whether there is a hit reference short sequence identical to the current candidate short sequence;
if so, determining, from the index information of the hit reference short sequence, the hit first position barcode corresponding to the current candidate short sequence and the position deviation between the current candidate short sequence and the hit reference short sequence within their respective position barcodes; determining the current candidate short sequence to be a matching short sequence when the position deviation is within a preset range; converting the serial number of the hit first position barcode corresponding to the matching short sequence and the number of occurrences of the hit reference short sequence into a hash value to establish directory information for the matching short sequence; and forming the matching information base of matching short sequences and first position barcodes from the directory information.
Splitting the second barcode into short sequences with a sliding window improves the efficiency of the splitting. In this embodiment, a sliding window with a width of W slides along the second barcode; each time a candidate short sequence of length K has been extracted, the next candidate short sequence of length K is extracted at a distance of one window width W. For each second barcode, after a candidate short sequence of length K is extracted, it is compared with the reference short sequences in the index information base. If an identical reference short sequence is found, the position deviation is determined from the position of the candidate short sequence in the second barcode and the position of the hit reference short sequence in the corresponding first barcode. If the position deviation is within a preset range (such as M), the candidate short sequence is taken as a matching short sequence and directory information for it is established accordingly. Finally, the matching information base is formed from the directory information of all matching short sequences.
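The flow of this step can be sketched in Python as follows, as a simplified and non-limiting illustration. Several points are assumptions: start positions advance by W, a deviation equal to M is accepted, the index is a plain dictionary from short sequence to (serial number, position) pairs, and the matching information is kept as a counter rather than as hash values. All names and the toy data are hypothetical.

```python
from collections import Counter


def candidate_short_sequences(barcode: str, k: int, w: int) -> list[tuple[int, str]]:
    """(start position, short sequence) pairs; start positions advance by w (assumed)."""
    return [(p, barcode[p:p + k]) for p in range(0, len(barcode) - k + 1, w)]


def build_matching_info(second: str, index: dict[str, list[tuple[int, int]]],
                        k: int, w: int, m: int) -> Counter:
    """Count matching short sequences per first-barcode serial number."""
    matches = Counter()
    for pos, kmer in candidate_short_sequences(second, k, w):
        # A candidate with no hit in the index is simply dropped.
        for serial, ref_pos in index.get(kmer, []):
            if abs(pos - ref_pos) <= m:  # position deviation within the preset range
                matches[serial] += 1
    return matches


# Toy index (illustrative only): short sequence -> [(serial number, position), ...]
index = {"ACGT": [(8, 0), (117, 6)], "GTTC": [(8, 2)]}
print(build_matching_info("ACGTTCAA", index, k=4, w=2, m=1))  # Counter({8: 2})
```

In this toy case the hit on first barcode 117 is rejected because its position deviation is 6, so only first barcode 8 enters the matching information.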
It should be noted that the matching information base denotes a data set containing the directory information of all matching short sequences; its form includes, but is not limited to, a table, a document, or key-value pairs. In the embodiment of the present application, the directory information of a matching short sequence is a hash value, and the matching information base is a hash table containing the directory information of all matching short sequences.
In the above embodiment, splitting the second barcode into short sequences and comparing them with the index information base constructed from the first barcodes helps improve comparison efficiency, and the first barcodes with the highest matching probability can be screened out as candidates while a certain level of comparison error is tolerated, so that comparison efficiency and accuracy are improved while errors caused by mismatches are reduced.
Optionally, the sequencing positioning matching method further includes:
if not, discarding the current candidate short sequence; or,
if so, determining that the current candidate short sequence is a false match when the position deviation exceeds the preset range, and discarding the current candidate short sequence.
For each second barcode, after a candidate short sequence of length K is extracted, it is compared with the reference short sequences in the index information base. If no identical reference short sequence is found, the current candidate short sequence is not saved, and the next candidate short sequence is extracted and compared.
Alternatively, if an identical reference short sequence is found, it is taken as the hit reference short sequence, and the position deviation is determined from the position of the current candidate short sequence in the second barcode and the position of the hit reference short sequence in the corresponding first barcode. If the position deviation exceeds the preset range M, the current match is judged to be a false match; likewise, the current candidate short sequence is not saved, and the next candidate short sequence is extracted and compared. For example:
Second barcode: GCGGTCTGGATGTGCGAAACTACA, position: 13
First barcode X: CGCGAAACTCGGGGCGAAACTCTA, position: 1
With M set to 2, the candidate short sequence 'GCGAAACT' in the second barcode is compared with the index information base and hits the reference short sequence 'GCGAAACT' in first barcode X. From the position '13' of the current candidate short sequence in the second barcode and the position '1' of the hit reference short sequence in the corresponding first barcode, the position deviation is 12, which exceeds the preset range M. The match between the current candidate short sequence and the hit reference short sequence is therefore a false match and is not saved.
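The arithmetic of this example can be reproduced with a few lines of Python, given here purely as an illustration. The positions are counted from 0, which agrees with the values 13 and 1 above; the variable names are assumptions of this sketch.

```python
second = "GCGGTCTGGATGTGCGAAACTACA"
first_x = "CGCGAAACTCGGGGCGAAACTCTA"
short_seq = "GCGAAACT"
M = 2  # preset range for the position deviation

cand_pos = second.find(short_seq)  # position of the candidate short sequence
ref_pos = first_x.find(short_seq)  # first occurrence: the hit used in the example
deviation = abs(cand_pos - ref_pos)
is_false_match = deviation > M

print(cand_pos, ref_pos, deviation, is_false_match)  # 13 1 12 True
```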
In the above embodiment, when the short sequences split from the second barcode are compared with the index information base constructed from the first barcodes, the successfully matched short sequences are screened by combining sequence identity with position deviation, and only the information of the matching short sequences is saved. The first barcodes with the highest matching probability can thus be screened out as candidates while a certain level of comparison error is tolerated, which helps to screen out valid second barcodes quickly and accurately based on the candidate first barcodes.
In some embodiments, step S102 includes:
for each first position barcode, using a sliding window that slides along the first position barcode to extract, at intervals, reference short sequences of length K, wherein the number of bases between two adjacent reference short sequences is W;
converting the serial number of the first position barcode in which each reference short sequence is located and its position therein into a hash value to establish the index information of each reference short sequence, and forming the index information base of the first position barcodes from the index information.
The length of the reference short sequences is the same as the length of the candidate short sequences. In this embodiment, the first barcode is split into reference short sequences in the same way that the second barcode is split into candidate short sequences: a sliding window with a width of W slides along the first barcode, and each time a reference short sequence of length K has been extracted, the next reference short sequence of length K is extracted at a distance of one window width W. The serial number of the first barcode corresponding to each reference short sequence and the position of the reference short sequence in that first barcode serve as its index information, which is expressed as a hash value to construct the index information base of the first barcodes.
In an alternative example, the serial number of the first barcode containing the reference short sequence (e.g. 1, 2, 3) and the position of the reference short sequence on that first barcode (e.g. a sequence of length K starting from the 10th base) are converted by bit operations into a 64-bit integer, an example of whose binary representation is:
0010110001011010101010000110000000000000000000000000000000000010, where the first 32 bits represent the serial number of the first barcode and the last 32 bits represent the starting position on the first barcode. The converted value is stored in the hash table as the hash value of the corresponding reference short sequence.
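A minimal Python sketch of this bit packing follows, for illustration only. The function names `pack_index` and `unpack_index` and the example values (serial number 8, position 10) are assumptions; the 32/32 bit split follows the description above.

```python
MASK32 = 0xFFFFFFFF


def pack_index(serial: int, position: int) -> int:
    """Pack serial number (upper 32 bits) and start position (lower 32 bits)."""
    if not (0 <= serial <= MASK32 and 0 <= position <= MASK32):
        raise ValueError("serial and position must each fit in 32 bits")
    return (serial << 32) | position


def unpack_index(value: int) -> tuple[int, int]:
    """Recover (serial number, start position) from the packed integer."""
    return value >> 32, value & MASK32


packed = pack_index(8, 10)
print(format(packed, "064b"))  # 64-character binary representation
print(unpack_index(packed))    # (8, 10)
```

Packing both fields into one integer keeps each index entry compact and lets the serial number and position be recovered with a shift and a mask.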
In the above embodiment, an index is established for the first barcodes in the form of reference short sequences, and the short sequences are extracted from the corresponding first barcode at intervals. On the one hand, this reduces the overall amount of data for subsequent comparison; on the other hand, it facilitates subsequent comparison based on the index information base, so that all potentially matching first barcodes can be screened out while a certain level of comparison error is tolerated, improving efficiency while avoiding missed matches and misjudgments.
In some embodiments, using a sliding window that slides along each first position barcode to extract, at intervals, reference short sequences of length K includes:
for each first position barcode, using a sliding window that slides along the first position barcode to extract, at intervals, reference short sequences of length K, wherein, while sliding along the first position barcode, the sliding window skips endpoint bases of length O according to the endpoint positions of the first position barcode.
Based on the principles of sequencing and library construction, the sequencing quality at the junction (i.e., the endpoint) between the position barcode and other sequences (such as the adapter) is usually relatively low. For example, inconsistent biochemical reactions during synthesis can cause errors in the length of the probes synthesized on the spatial transcriptome chip, and sequencing errors during sequencing can cause errors at the boundary between the barcode and the adapter. While the sliding window slides along the first barcode, endpoint bases of length O are skipped according to the endpoint positions of the first barcode, so that sequences extracted at the endpoints are not used as reference short sequences. The value of O can be adjusted in practical applications, and the application is not limited in this respect. The endpoint positions of the first barcode generally include a left endpoint and a right endpoint. In practical applications, a certain position and interval in the sequencing may be preset as the endpoint, or sequencing quality may be assessed during sequencing and the positions where sequencing quality drops markedly regarded as endpoint positions.
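As a non-limiting sketch, interval extraction with endpoint skipping might look as follows in Python. It assumes that O bases are skipped at both the left and right endpoints and that start positions advance by W; the function name and the parameter values K = 8, W = 4, O = 2 are illustrative only.

```python
def extract_reference_short_sequences(barcode: str, k: int, w: int, o: int) -> list[tuple[int, str]]:
    """(position, short sequence) pairs taken only from barcode[o : len(barcode) - o]."""
    usable_end = len(barcode) - o  # right endpoint bases are excluded
    return [(pos, barcode[pos:pos + k])
            for pos in range(o, usable_end - k + 1, w)]


first_barcode = "CGCGAAACTCGGGGCGAAACTCTA"
refs = extract_reference_short_sequences(first_barcode, k=8, w=4, o=2)
print([pos for pos, _ in refs])  # [2, 6, 10, 14]
```

No extracted short sequence touches the first two or the last two bases of the barcode.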
In the above embodiment, skipping the endpoints when extracting reference short sequences from the first barcode makes the extracted reference short sequences characterize the corresponding first barcode more accurately.
In some embodiments, converting the serial number of the first position barcode in which each reference short sequence is located and its position therein into a hash value to establish the index information of each reference short sequence further includes:
storing the hash values corresponding to a plurality of identical reference short sequences in the same vector to form shared index information of the reference short sequence.
In this embodiment, when a repeated reference short sequence is encountered while the index information base of the first barcodes is being established, the corresponding index information is stored in one vector, for example [117, 235, 449], forming shared index information that associates the same reference short sequence with a plurality of first barcodes. With this vector design, the index information of identical reference short sequences is merged, which simplifies the index information base and improves the efficiency of subsequent comparison.
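The shared vector can be illustrated with the following Python sketch, which is a simplified assumption rather than the embodiment itself: the index is a dictionary from reference short sequence to a list of packed integers (serial number in the upper 32 bits, position in the lower 32 bits), and the function name and toy barcodes are hypothetical.

```python
from collections import defaultdict


def build_index(first_barcodes: dict[int, str], k: int, w: int) -> dict[str, list[int]]:
    """Map each reference short sequence to the vector of its packed index values."""
    index = defaultdict(list)
    for serial, barcode in first_barcodes.items():
        for pos in range(0, len(barcode) - k + 1, w):
            # Identical short sequences share one vector of index values.
            index[barcode[pos:pos + k]].append((serial << 32) | pos)
    return dict(index)


# Toy first barcodes (illustrative only) that all begin with the same short sequence.
first_barcodes = {117: "ACGTAAAA", 235: "ACGTCCCC", 449: "ACGTGGGG"}
index = build_index(first_barcodes, k=4, w=4)
print([value >> 32 for value in index["ACGT"]])  # [117, 235, 449]
```

A single lookup of a candidate short sequence then returns every first barcode that contains it.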
To provide a more complete understanding of the spatial omics sequencing positioning matching method according to the embodiment of the present application, a specific example is described below with reference to fig. 3. The spatial omics sequencing positioning matching method includes:
S11, reading in the first barcodes, extracting reference short sequences, and establishing an index information base.
Reference short sequences of length K are extracted from the first barcode at intervals of W; the serial number and position of the first barcode corresponding to each reference short sequence are converted into an integer through bit operations, and the integer is stored in a hash table. During extraction of the reference short sequences of length K, the endpoint positions of the first barcode are skipped according to the length O.
S12, reading in the second barcodes, extracting short sequences to be selected, comparing them with the index information base, and establishing a matching information base according to the comparison result.
A short sequence to be selected that is successfully compared with a reference short sequence, and whose position deviation within the corresponding position barcodes is smaller than M, is determined to be a matching short sequence. Directory information of the matching short sequence (the serial number of the corresponding first barcode + the number of occurrences in the corresponding first barcode) is formed according to the match between the matching short sequence and the reference short sequence, and the matching information base is constructed.
S13, counting the first barcodes that contain the most matching short sequences, and determining the comparison priority of each first barcode according to the counts.
S14, comparing the second barcode with the first barcodes according to the comparison priority of the first barcodes.
S15, if the current second barcode completely matches a first barcode, stopping the comparison of the current second barcode and storing the successfully compared second barcode.
S16, if the current second barcode does not completely match any first barcode, collecting the first barcodes within the edit distance E as candidate first barcodes.
S161, taking the candidate first barcode with the smallest edit distance as the standard point, and determining whether the spatial distance between each candidate first barcode corresponding to the other edit distances and the standard point is within the range D.
S162, if yes, stopping the comparison and storing the successfully compared second barcode.
S163, if not, stopping the comparison and discarding the current second barcode.
It should be noted that the values of the parameters K, W, O, M, E and D can be adjusted in practical applications, and the comparison of the position barcodes is carried out by means of dynamic programming.
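Steps S14 to S163 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the embodiment: the names `edit_distance` and `resolve` are hypothetical, the edit distance is taken to be the Levenshtein distance computed by dynamic programming, "within the range D" is read as a per-axis difference strictly smaller than D, ties at the smallest edit distance are broken by the lower serial number, and the serial number of the standard point is returned for a successful comparison.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def resolve(second, ordered_first, coords, e, d):
    """ordered_first: (serial, sequence) pairs already sorted by priority.
    coords: serial -> (x, y). Returns a serial number or None (discard)."""
    candidates = []
    for serial, seq in ordered_first:                   # S14
        dist = edit_distance(second, seq)
        if dist == 0:
            return serial                               # S15: exact match
        if dist <= e:
            candidates.append((dist, serial))           # S16
    if not candidates:
        return None                                     # assumption: discard
    candidates.sort()
    standard = candidates[0][1]                         # S161: standard point
    sx, sy = coords[standard]
    for _, serial in candidates[1:]:
        x, y = coords[serial]
        if abs(x - sx) >= d or abs(y - sy) >= d:
            return None                                 # S163: discard
    return standard                                     # S162: success


if __name__ == "__main__":
    coords = {0: (0, 0), 1: (1, 0), 2: (50, 50)}
    firsts = [(0, "ACGTACGT"), (1, "ACGTACGA"), (2, "ACGTTCGT")]
    print(resolve("ACGTACGG", firsts, coords, e=1, d=3))
    print(resolve("ACGTACGG", firsts, coords, e=2, d=3))
```

In the second call the larger E admits a candidate that lies far from the standard point, so the second barcode is discarded instead of being assigned ambiguously.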
The spatial omics sequencing positioning matching method provided by the embodiment of the application has the following characteristics:
First, the first barcodes and second barcodes are split into short sequences, which are positioned and matched in a manner that supports dynamic programming, improving the comparison efficiency.
Second, the first barcodes with a higher matching probability are identified by short-sequence positioning in order to determine the comparison priority; the edit distance is then introduced to screen candidate first barcodes, and the spatial distance is introduced to obtain the spatial relationship among the plurality of candidate first barcodes. The comparison priority, the edit distance and the spatial distance among the candidate first barcodes are thus combined to help judge whether a second barcode is successfully matched and to stop matching in time. This improves the comparison efficiency and raises the effective matching rate of position barcodes while reducing mismatches. Compared with the prior-art approach of comparing position barcodes directly, a large number of second barcodes discarded under the existing matching standard can be recovered, which greatly improves the effective rate of position barcodes and provides more data for downstream analysis.
Third, the prior-art approach of comparing position barcodes directly assumes by default that comparison against the whitelist involves no mismatches, and applies the strict criterion that the position barcodes must be identical. This assumption does not hold in practice, and a large number of position barcodes are wasted. The spatial omics sequencing positioning matching method provided by the embodiment of the application combines the comparison priority, the edit distance and the spatial distance among a plurality of candidate first barcodes to help judge whether a second barcode is successfully matched. By tolerating mismatches and using spatial information, a large number of position barcodes discarded by the existing algorithm can be recovered, which greatly improves the effective rate of position barcodes and provides more data for downstream analysis.
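The comparison priority mentioned in the second characteristic can be illustrated with a short sketch. It assumes that the priority is simply the descending count of matching short sequences per first barcode; the function name `comparison_priority` and the flat list of serial numbers used as input are hypothetical simplifications of the matching information base.

```python
from collections import Counter


def comparison_priority(hit_serials):
    """Order first-barcode serial numbers by how many matching short
    sequences point to them, most hits first.

    `hit_serials` holds one serial number per matching short sequence.
    Equal counts keep the order in which the serials were first seen.
    """
    counts = Counter(hit_serials)
    return [serial for serial, _ in counts.most_common()]


if __name__ == "__main__":
    # Barcode 3 collected three hits, barcode 1 two, barcode 2 one.
    print(comparison_priority([3, 1, 3, 2, 3, 1]))
```

Comparing the most frequently hit first barcodes first makes it likely that an exact match, if one exists, is found early and the comparison can stop.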
In another aspect of the present application, referring to fig. 4, a spatial omics sequencing positioning matching device is provided, which includes: an acquisition module 11, configured to acquire first position barcodes with spatial position information; an index establishing module 12, configured to obtain a plurality of reference short sequences for each first position barcode, establish index information according to the serial number and position of the first position barcode where each reference short sequence is located, and construct an index information base of the first position barcodes; the acquisition module 11 being further configured to acquire second position barcodes obtained by secondary sequencing; a matching module 13, configured to obtain a plurality of short sequences to be selected for each second position barcode, compare the short sequences to be selected with the index information base to determine matching short sequences, and construct a matching information base of the matching short sequences and the first position barcodes; a priority module 14, configured to determine the comparison priority of the first position barcodes according to the matching information base; and a comparison module 15, configured to compare each second position barcode with the first position barcodes according to the comparison priority, determine candidate first position barcodes whose edit distance from the second position barcode meets a target requirement, calculate a spatial distance based on the spatial position information of the candidate first position barcodes, and determine a successfully compared second position barcode based on the spatial distance meeting a target condition.
Optionally, the comparison module 15 is configured to determine whether the candidate first position barcode corresponding to the minimum editing distance is unique; if so, calculating the space distance between the space position information of the candidate first position bar codes corresponding to other editing distances and the standard point by taking the space position information of the candidate first position bar codes corresponding to the minimum editing distance as the standard point, and determining the second position bar code as a successfully-compared second position bar code when the space distance is smaller than a target threshold value; if not, respectively carrying out iterative computation by taking the spatial position information of the plurality of candidate first position bar codes corresponding to the minimum editing distance as standard points, and in any iterative computation, calculating the spatial distance between the spatial position information of the candidate first position bar codes corresponding to other editing distances and the standard points, and determining the second position bar code as a successfully-compared second position bar code when the spatial distance is smaller than a target threshold value.
Optionally, the comparison module 15 is further configured to keep the candidate first position barcodes whose editing distance meets the target requirement in the same vector according to the comparison situation of the second position barcode and the first position barcode, and record whether the corresponding candidate first position barcodes have the mark bits in the sequence repeatedly passing through the preset length at the same editing distance; and judging whether the candidate first position bar code corresponding to the minimum editing distance is unique or not according to the marking bit.
Optionally, the comparing module 15 is further configured to calculate, according to the spatial position information of the candidate first position bar codes corresponding to other editing distances and the spatial position information of the standard point, the spatial distances between each candidate first position bar code and the standard point in different coordinate axis directions, and determine that the comparison is successful when the spatial distances in different coordinate axis directions are all smaller than the target threshold, and determine the second position bar code as the successfully-compared second position bar code.
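The handling of a unique or non-unique minimum edit distance, together with the per-axis distance test, can be sketched as below. The function name `spatial_check` is hypothetical. The sketch assumes two-dimensional coordinates, compares only the candidates at other edit distances with each standard point (as the text states), and treats the comparison as successful as soon as any one standard point passes; how ties among several passing standard points should be resolved is not specified in the description.

```python
def spatial_check(candidates, coords, threshold):
    """candidates: (edit_distance, serial) pairs; coords: serial -> (x, y).

    Each candidate at the minimum edit distance is tried in turn as the
    standard point. The check passes if every candidate at another edit
    distance lies closer than `threshold` on both axes. Returns the serial
    of the first standard point that passes, or None.
    """
    if not candidates:
        return None
    min_dist = min(dist for dist, _ in candidates)
    standards = [s for dist, s in candidates if dist == min_dist]
    others = [s for dist, s in candidates if dist != min_dist]
    for std in standards:                     # one iteration per standard point
        sx, sy = coords[std]
        if all(abs(coords[s][0] - sx) < threshold and
               abs(coords[s][1] - sy) < threshold
               for s in others):
            return std
    return None


if __name__ == "__main__":
    coords = {1: (0, 0), 2: (1, 1), 3: (10, 10), 4: (11, 10)}
    print(spatial_check([(1, 1), (2, 2)], coords, threshold=3))
    print(spatial_check([(1, 1), (1, 3), (2, 4)], coords, threshold=3))
```

Testing each axis separately avoids a square root and corresponds to a square neighbourhood around the standard point rather than a circular one.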
Optionally, the matching module 13 is further configured to extract, for each of the second position barcodes, a short sequence to be selected with a length K by sequentially sliding on the second position barcode with a sliding window at intervals; wherein the interval between two adjacent short sequences to be selected is W; comparing each short sequence to be selected with the index information base, and judging whether hit reference short sequences which are the same as the current short sequence to be selected exist or not; if so, determining a hit first position bar code corresponding to the current short sequence to be selected and position deviations of the current short sequence to be selected and the hit reference short sequence in the respective position bar codes according to index information of the hit reference short sequence, determining that the current short sequence to be selected is a matching short sequence based on the position deviations meeting a preset range, converting the sequence number of the hit first position bar code corresponding to the matching short sequence and the repeated occurrence frequency of the hit reference short sequence into a hash value, establishing directory information corresponding to the matching short sequence, and forming a matching information base of the matching short sequence and the first position bar code based on the directory information.
Optionally, the matching module 13 is further configured to discard the current short sequence to be selected if no hit reference short sequence identical to it exists; or, if such a hit reference short sequence exists but the position deviation exceeds the preset range, to determine that the current short sequence to be selected is a false match and discard it.
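The lookup and position-deviation filter performed by the matching module can be sketched as follows. The function name `match_seeds` is hypothetical; for readability the index here maps each reference short sequence to plain (serial, position) tuples instead of hash values, W is read as the stride between window starts, and the preset range is read as a deviation strictly smaller than M.

```python
def match_seeds(second, index, k, w, m):
    """Return (serial, seed) pairs for the short sequences of `second`
    that hit the index with a position deviation smaller than `m`.

    index: reference short sequence -> list of (serial, position) tuples.
    """
    hits = []
    for pos in range(0, len(second) - k + 1, w):
        seed = second[pos:pos + k]
        # A short sequence with no entry in the index is simply dropped.
        for serial, ref_pos in index.get(seed, ()):
            if abs(pos - ref_pos) < m:
                hits.append((serial, seed))   # matching short sequence
            # otherwise: treated as a false match and discarded
    return hits


if __name__ == "__main__":
    index = {"ACGT": [(0, 0), (1, 5)], "GTAC": [(0, 2)]}
    print(match_seeds("ACGTAC", index, k=4, w=2, m=2))
```

The deviation filter removes hits where the same short sequence occurs at an unrelated offset in another barcode, which would otherwise inflate that barcode's priority.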
Optionally, the index creating module 12 is further configured to extract, for each of the first position barcodes, a reference short sequence with a length K by sequentially sliding over the first position barcode with a sliding window at intervals; wherein the number of interval bits between two adjacent reference short sequences is W; and converting the serial numbers and the positions of the first position bar codes where the reference short sequences are located into hash values, establishing index information corresponding to the reference short sequences, and forming an index information base of the first position bar codes based on the index information.
Optionally, the index creating module 12 is further configured to sequentially slide, at intervals, a sliding window on each first position barcode to extract a reference short sequence with a length of K, where the sliding window skips an endpoint bit number with a length of O according to an endpoint position of the first position barcode in a sliding process on the first position barcode.
Optionally, the index establishing module 12 is further configured to store hash values corresponding to a plurality of identical reference short sequences in a same vector, so as to form common index information of the reference short sequences.
Optionally, the acquisition module 11 is specifically configured to acquire a position barcode whitelist and acquire the first position barcodes with spatial position information from the position barcode whitelist; or to acquire first position barcodes with spatial position information obtained by primary sequencing on a spatial omics chip.
It should be noted that the division into program modules described for the spatial omics sequencing positioning matching device of the above embodiment, when it performs position barcode comparison, is only illustrative. In practical applications, the processing can be allocated to different program modules as needed; that is, the internal structure of the device can be divided into different program modules to complete all or part of the method steps described above. In addition, the spatial omics sequencing positioning matching device provided in the above embodiment and the spatial omics sequencing positioning matching method embodiments belong to the same concept; its detailed implementation is described in the method embodiments and is not repeated here.
In another aspect of the application, a spatial omics sequencing apparatus is also provided. Fig. 5 shows an optional hardware structure of the spatial omics sequencing apparatus, which includes a processor 212 and a memory 211 connected to the processor 212. The memory 211 stores a computer program for implementing the spatial omics sequencing positioning matching method provided by any embodiment of the present application, so that the steps of that method are implemented when the computer program is executed by the processor. The spatial omics sequencing apparatus loaded with the computer program has the same technical effects as the corresponding method embodiments, which are not repeated here.
In another aspect of the embodiments of the present application, a computer readable storage medium is further provided. A computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the processes of the foregoing embodiments of the spatial omics sequencing positioning matching method are implemented and the same technical effects can be achieved; to avoid repetition, details are not repeated here. The computer readable storage medium is, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) and comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a spatial omics sequencing platform, a gene sequencer, a network device, etc.) to perform the methods according to the embodiments of the present invention.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (14)
1. A spatial omics sequencing localization matching method, comprising:
Acquiring a first position bar code with space position information;
For each first position bar code, acquiring a plurality of reference short sequences, establishing index information according to the serial numbers and positions of the first position bar codes where the reference short sequences are located, and constructing an index information base of the first position bar codes;
Obtaining a second position bar code obtained by secondary sequencing;
For each second position bar code, a plurality of short sequences to be selected are obtained, the short sequences to be selected are compared with the index information base to determine a matching short sequence, and a matching information base of the matching short sequence and the first position bar code is constructed; wherein the length of the short sequence to be selected is equal to the length of the reference short sequence;
determining the comparison priority of the first position bar codes according to the matching information base;
comparing each second position bar code with the first position bar codes according to the comparison priority, determining candidate first position bar codes whose editing distance from the second position bar code meets the target requirement, calculating a space distance based on the space position information of the candidate first position bar codes, and determining a successfully-compared second position bar code based on the space distance meeting the target condition.
2. The sequencing localization matching method of claim 1, wherein the determining a candidate first location barcode whose edit distance from the second location barcode meets a target requirement, calculating a spatial distance based on the spatial location information of the candidate first location barcode, determining a successfully aligned second location barcode based on the spatial distance meeting a target condition, comprises:
determining candidate first position bar codes with editing distances meeting target requirements from the second position bar codes;
And calculating the space distance between the space position information of the candidate first position bar codes corresponding to other editing distances and the standard point by taking the space position information of the candidate first position bar codes corresponding to the minimum editing distance as the standard point, and determining a second position bar code successfully compared based on the space distance meeting the target condition.
3. The sequencing localization matching method of claim 2, wherein the calculating the spatial distance between the spatial position information of the candidate first position barcode corresponding to the other edit distance and the standard point by using the spatial position information of the candidate first position barcode corresponding to the minimum edit distance as the standard point, and determining the successfully aligned second position barcode based on the spatial distance satisfying the target condition comprises:
Judging whether the candidate first position bar code corresponding to the minimum editing distance is unique or not;
If so, calculating the space distance between the space position information of the candidate first position bar codes corresponding to other editing distances and the standard point by taking the space position information of the candidate first position bar codes corresponding to the minimum editing distance as the standard point, and determining the second position bar code as a successfully-compared second position bar code when the space distance is smaller than a target threshold value;
if not, respectively carrying out iterative computation by taking the spatial position information of the plurality of candidate first position bar codes corresponding to the minimum editing distance as standard points, and in any iterative computation, calculating the spatial distance between the spatial position information of the candidate first position bar codes corresponding to other editing distances and the standard points, and determining the second position bar code as a successfully-compared second position bar code when the spatial distance is smaller than a target threshold value.
4. The sequencing localization matching method of claim 3, wherein the determining whether the candidate first location barcode corresponding to the smallest edit distance is unique comprises:
According to the comparison condition of the second position bar code and the first position bar code, keeping the candidate first position bar codes with editing distances meeting the target requirement in the same vector, and recording whether the corresponding candidate first position bar codes with the same editing distances have mark bits in a sequence repeatedly passing through a preset length or not;
and judging whether the candidate first position bar code corresponding to the minimum editing distance is unique or not according to the marking bit.
5. The sequencing localization matching method of claim 2, wherein calculating the spatial distance between the spatial position information of the candidate first position barcode corresponding to the other edit distance and the standard point, and determining the successfully aligned second position barcode based on the spatial distance satisfying the target condition comprises:
And calculating the spatial distances between each candidate first position bar code and the standard point in different coordinate axis directions according to the spatial position information of the candidate first position bar codes corresponding to other editing distances and the spatial position information of the standard point, and determining the second position bar code as a successfully-compared second position bar code when the spatial distances in different coordinate axis directions are smaller than a target threshold value.
6. The sequencing localization matching method of claim 1, wherein the obtaining a plurality of short sequences to be selected for each of the second location barcodes, comparing the short sequences to be selected with the index information library to determine a matching short sequence, and constructing a matching information library of the matching short sequence and the first location barcode comprises:
for each second position bar code, a sliding window is adopted to sequentially slide and interval extract a short sequence to be selected, the length of which is K, on the second position bar code; wherein the interval between two adjacent short sequences to be selected is W;
Comparing each short sequence to be selected with the index information base, and judging whether hit reference short sequences which are the same as the current short sequence to be selected exist or not;
If so, determining a hit first position bar code corresponding to the current short sequence to be selected and position deviations of the current short sequence to be selected and the hit reference short sequence in the respective position bar codes according to index information of the hit reference short sequence, determining that the current short sequence to be selected is a matching short sequence based on the position deviations meeting a preset range, converting the sequence number of the hit first position bar code corresponding to the matching short sequence and the repeated occurrence frequency of the hit reference short sequence into a hash value, establishing directory information corresponding to the matching short sequence, and forming a matching information base of the matching short sequence and the first position bar code based on the directory information.
7. The sequencing-location matching method of claim 6, further comprising:
if not, discarding the current short sequence to be selected; or,
If so, determining that the current short sequence to be selected is false matching based on the position deviation exceeding the preset range, and discarding the current short sequence to be selected.
8. The sequencing localization matching method of claim 1, wherein the obtaining a plurality of reference short sequences for each of the first location barcodes, creating index information according to the serial numbers and the locations of the first location barcodes where the reference short sequences are located, and constructing an index information base of the first location barcodes comprises:
For each first position bar code, a sliding window is adopted to sequentially slide and interval extract a reference short sequence with the length of K on the first position bar code; wherein the number of interval bits between two adjacent reference short sequences is W;
and converting the serial numbers and the positions of the first position bar codes where the reference short sequences are located into hash values, establishing index information corresponding to the reference short sequences, and forming an index information base of the first position bar codes based on the index information.
9. The sequencing localization matching method of claim 8, wherein for each of the first location barcodes, a sliding window is used to sequentially slide over the first location barcode at intervals to extract a reference short sequence of length K, comprising:
And for each first position bar code, sequentially and slidingly extracting a reference short sequence with the length of K on the first position bar code at intervals by adopting a sliding window, wherein the sliding window skips the endpoint bit number with the length of O according to the endpoint position of the first position bar code in the sliding process of the sliding window on the first position bar code.
10. The sequencing localization matching method of claim 8, wherein the converting the serial number and the position of the barcode at the first position of each reference short sequence into the hash value to create the index information corresponding to each reference short sequence further comprises:
And storing hash values respectively corresponding to a plurality of identical reference short sequences in the same vector to form common index information of the reference short sequences.
11. The sequencing localization matching method of claim 1, wherein the obtaining the first location barcode with spatial location information comprises:
acquiring a position bar code white list, and acquiring a first position bar code with spatial position information from the position bar code white list; or,
acquiring a first position bar code with spatial position information obtained by primary sequencing on a spatial omics chip.
12. A spatial omics sequencing localization matching device, comprising:
The acquisition module is used for acquiring a first position bar code with space position information;
The index establishing module is used for acquiring a plurality of reference short sequences for each first position bar code, establishing index information according to the serial numbers and the positions of the first position bar codes where the reference short sequences are positioned, and constructing an index information base of the first position bar codes;
The acquisition module is also used for acquiring a second position bar code obtained by secondary sequencing;
The matching module is used for acquiring a plurality of short sequences to be selected according to each second position bar code, comparing the short sequences to be selected with the index information base to determine a matching short sequence, and constructing a matching information base of the matching short sequence and the first position bar code; wherein the length of the short sequence to be selected is equal to the length of the reference short sequence;
The priority module is used for determining the comparison priority of the first position bar codes according to the matching information base;
The comparison module is used for comparing each second position bar code with the first position bar code according to the comparison priority, determining candidate first position bar codes with editing distances meeting target requirements with the second position bar codes, calculating the space distance based on the space position information of the candidate first position bar codes, and determining successfully-compared second position bar codes based on the space distance meeting target conditions.
13. A spatial omics sequencing apparatus comprising a processor and a memory coupled to the processor, the memory having stored thereon a computer program executable by the processor, the computer program when executed by the processor implementing the spatial omics sequencing localization matching method of any of claims 1 to 11.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the spatial omics sequencing localization matching method of any of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410076175.4A CN117854594B (en) | 2024-01-18 | 2024-01-18 | Space histology sequencing positioning matching method and device, space histology sequencing equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117854594A (en) | 2024-04-09 |
CN117854594B (en) | 2024-06-04 |
Family
ID=90539880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410076175.4A Active CN117854594B (en) | 2024-01-18 | 2024-01-18 | Space histology sequencing positioning matching method and device, space histology sequencing equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117854594B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118571316A (en) * | 2024-07-29 | 2024-08-30 | 墨卓生物科技(浙江)有限公司 | Sequencing data splitting method of fastq file |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106295250A (en) * | 2016-07-28 | 2017-01-04 | 北京百迈客医学检验所有限公司 | Method and device for rapid alignment analysis of next-generation sequencing short reads |
CN108182346A (en) * | 2016-12-08 | 2018-06-19 | 杭州康万达医药科技有限公司 | Method for building a machine learning model that predicts the toxicity of siRNA to a certain class of cells, and application thereof |
CN112182140A (en) * | 2020-08-17 | 2021-01-05 | 北京来也网络科技有限公司 | Information input method and device combining RPA and AI, computer equipment and medium |
CN113094559A (en) * | 2021-04-25 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Information matching method and device, electronic equipment and storage medium |
CN113486993A (en) * | 2021-07-07 | 2021-10-08 | 杭州海康机器人技术有限公司 | Information matching method and information matching device |
Non-Patent Citations (2)
Title |
---|
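Quality control workflow specification for metagenomic big data analysis; Zheng Guangyong; Yang Zhen; Cao Ruifang; Liu Wan; Li Yixue; Zhang Guoqing; 大数据 (Big Data Research); 2018-05-15 (No. 03); full text * |
Zheng Guangyong; Yang Zhen; Cao Ruifang; Liu Wan; Li Yixue; Zhang Guoqing. Quality control workflow specification for metagenomic big data analysis. 大数据 (Big Data Research). 2018, (No. 03), full text. * |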
Also Published As
Publication number | Publication date |
---|---|
CN117854594A (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11560598B2 (en) | Systems and methods for analyzing circulating tumor DNA | |
CN117854594B (en) | Space histology sequencing positioning matching method and device, space histology sequencing equipment and medium | |
Krueger et al. | Large scale loss of data in low-diversity illumina sequencing libraries can be recovered by deferred cluster calling | |
Rougemont et al. | Probabilistic base calling of Solexa sequencing data | |
Dündar et al. | Introduction to differential gene expression analysis using RNA-seq | |
JP2017500004A (en) | Methods and systems for genotyping gene samples | |
CN113470743A (en) | Differential gene analysis method based on BD single cell transcriptome and proteome sequencing data | |
CN107403075A (en) | Comparison method, apparatus and system | |
CN110692101A (en) | Method for aligning targeted nucleic acid sequencing data | |
CN112270953A (en) | Analysis method, device and equipment based on BD single cell transcriptome sequencing data | |
CN108182348B (en) | DNA methylation data detection method and device based on seed sequence information | |
US20190139628A1 (en) | Machine learning techniques for analysis of structural variants | |
EP2972309A1 (en) | Characterization of biological material using unassembled sequence information, probabilistic methods and trait-specific database catalogs | |
CN115083521A (en) | Method and system for identifying tumor cell group in single cell transcriptome sequencing data | |
CN109658981B (en) | Data classification method for single cell sequencing | |
CN116596933B (en) | Base cluster detection method and device, gene sequencer and storage medium | |
Jing et al. | ScSmOP: a universal computational pipeline for single-cell single-molecule multiomics data analysis | |
CN107403076B (en) | Method and apparatus for treating DNA sequence | |
CN110684830A (en) | RNA analysis method for paraffin section tissue | |
CN117672343B (en) | Sequencing saturation evaluation method and device, equipment and storage medium | |
Yu et al. | Generating barcodes for nanopore sequencing data with PRO | |
CN116343923B (en) | Genome structural variation homology identification method | |
US20210304844A1 (en) | Method, apparatus, and computer-readable medium for optimal pooling of nucleic acid samples for next generation sequencing | |
Ebrahimi et al. | scTagger: fast and accurate matching of cellular barcodes across short-and long-reads of single-cell RNA-seq experiments | |
Copeland | Computational Analysis of High-replicate RNA-seq Data in Saccharomyces Cerevisiae: Searching for New Genomic Features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||