CN115641911B - Method for detecting overlapping between sequences - Google Patents

Method for detecting overlapping between sequences Download PDF

Info

Publication number
CN115641911B
CN115641911B CN202211280417.9A CN202211280417A CN115641911B CN 115641911 B CN115641911 B CN 115641911B CN 202211280417 A CN202211280417 A CN 202211280417A CN 115641911 B CN115641911 B CN 115641911B
Authority
CN
China
Prior art keywords
sequence
seed
mapping
overlapping
seed sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211280417.9A
Other languages
Chinese (zh)
Other versions
CN115641911A (en
Inventor
孔唐潮
刘博�
王亚东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202211280417.9A priority Critical patent/CN115641911B/en
Publication of CN115641911A publication Critical patent/CN115641911A/en
Application granted granted Critical
Publication of CN115641911B publication Critical patent/CN115641911B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for detecting overlapping among sequences, in particular to a method for detecting overlapping among sequences based on ONT data of a third generation sequencing technology, which aims to solve the problems that the comparison speed is slow and the memory consumption is high when the third generation sequencing technology is used for detecting overlapping, the third generation sequencing technology has higher error rate, the accuracy of the generated overlapping detection result among sequences is low and the reliability is low, and firstly, the acquired sequencing sequences are divided into seed sequences and mapping sequences; constructing a secondary hash index table of the seed sequence; calculating a filtering threshold value of the seed sequence; finally, the seed sequence and the mapping sequence are compared by utilizing a secondary hash index table of the seed sequence and a filtering threshold value to obtain overlapping information of the seed sequence and the mapping sequence, then the seed sequence and the mapping sequence are selected again in an iterative mode at the beginning of each iteration, and the operation is repeated to obtain the overlapping information of the sequencing sequence. Belonging to the field of sequence overlap detection.

Description

Method for detecting overlapping between sequences
Technical Field
The invention relates to a sequence overlap detection method, in particular to a method for detecting sequence overlap of ONT data based on a third generation sequencing technology, and belongs to the field of sequence overlap detection.
Background
De novo genome splicing refers to the process of reconstructing the genomic sequence of a species from the sequenced sequence. By detecting the overlapping relationship between the sequencing sequences, the invention of a graph model is utilized to determine the connection sequence between the sequencing sequences, and the process of reconstructing the continuous sequences on the genome is called contig splicing. Contig splicing is the first step in completing de novo genome splicing. The invention relates to a detection method for overlapping sequences among sequencing sequences, which is a process of finding the overlapping position among the sequences by mutual alignment among the sequencing sequences given a set of the sequencing sequences. The invention is the upstream work of methods such as genome sequence splicing, genome sequence error correction and the like, and has important roles in bioinformatics research.
Third generation sequencing Techniques (TGS) can produce longer but higher error rate sequencing sequences (up to 10kbp or even more than 100kbp in length with an error rate of about 12% -25%) compared to the sequencing sequences produced by high throughput sequencing techniques (NGS) (average length of 100-250bp with an error rate of less than 1%). Represented by PacBio single molecule real-time sequencing technology (SMRT) and oxford nanopore sequencing technology (ONT). The third generation sequencing sequence can cover most large structural variation and repeated sequence areas on the genome by virtue of the advantage of the read length, so that the capability of detecting the structural variation, the accuracy of comparison, the continuity of genome splicing and the like can be greatly improved. However, the higher error rate also provides more challenges for the comparison method, the accuracy of the generated overlapping detection results between sequences is low, the reliability is low, and the third generation sequencing technology is used for comparing all sequences during overlapping detection, so that the comparison speed is low and the memory consumption is high.
Disclosure of Invention
The invention aims to solve the problems that the comparison speed is low, the memory consumption is high, the error rate is high, the accuracy of the generated sequence-to-sequence overlap detection result is low, and the reliability is low in the third generation sequencing technology when all sequences are compared in the overlap detection, and further provides a method for detecting the sequence-to-sequence overlap.
The technical scheme adopted by the invention is as follows:
it comprises the following steps:
s1, acquiring a sequencing sequence;
s2, dividing the sequencing sequence into a seed sequence and a mapping sequence, and comparing the seed sequence with the mapping sequence to obtain overlapping information of the sequencing sequence, wherein the specific process is as follows:
s21, sequencing the sequencing sequences as mapping sequences, sequencing the sequencing sequences from long to short according to the sequence length, and selecting one or more sequencing sequences with the longest sequencing length as seed sequences;
s22, constructing a secondary hash index table of the seed sequence, and obtaining a constructed secondary hash index table;
s23, calculating a filtering threshold value of the seed sequence;
s24, comparing the seed sequence with the mapping sequence by utilizing a secondary hash index table of the seed sequence and a filtering threshold value to obtain an overlapped set of the seed sequence and the mapping sequence;
S25, symmetrically recording overlapping sets of the seed sequence and the mapping sequence to obtain corresponding information of each overlapping set on the seed sequence and the mapping sequence, and respectively calculating coverage of overlapping areas on the seed sequence and the mapping sequence according to the obtained information;
s26, taking all sequencing sequences with insufficient coverage as mapping sequences, selecting a certain number of sequencing sequences with insufficient coverage as seed sequences, and repeatedly executing S21-S25 until the reduced number of the sequencing sequences with insufficient coverage is lower than a threshold value, so as to obtain overlapping information of the sequencing sequences.
Further, the step S22 of constructing a secondary hash index table of the seed sequence, and obtaining a constructed secondary hash index table comprises the following specific steps:
s221, sequentially reading each seed sequence from the sequencing sequence to obtain information of each seed sequence, and combining the obtained information of each seed sequence to obtain an information set of the seed sequence;
the number of bases of all seed sequences is:
Figure BDA0003897758520000021
m represents a self-defined upper limit of a memory, and the unit is GB;
# reads represents the total number of sequenced sequences;
top n representing the amount of overlapping information that is at most retained per round of each sequencing sequence;
t represents the thread number;
B 1 Representing the number of bytes occupied by storing each seed sequence, mapping sequence marking information and overlapping information;
B 2 representing the number of bytes occupied to store each specific piece of overlapping information;
B 3 representing the number of bytes occupied by an element in the stored index table;
S 1 under the condition of simultaneous reading of multiple threads in the process of representation and comparison, S 1 The number of bytes occupied by a base;
S 2 representing the condition of simultaneous reading of multiple threads in the indexing process, S 2 The number of bytes occupied by a base;
s222, constructing a linear list and a pointer list of the seed sequence according to the information set of the seed sequence, and obtaining a constructed linear list and a pointer list;
s223, constructing a secondary hash index table of the seed sequence according to the constructed linear list and the pointer list.
Further, the information of each seed sequence in S221 is represented by a tetrad (id, length, name, seq), where id represents the order of each seed sequence in the sequence, and is used as an identifier corresponding to each seed sequence, length represents the length of each seed sequence, name represents the name of each seed sequence in the sequence, and seq represents a specific base sequence of each seed sequence.
Further, in S222, a linear list and a pointer list of the seed sequence are constructed according to the information set of the seed sequence, so as to obtain a constructed linear list and pointer list, which specifically include the following steps:
S2221, defining a window with the length of w, wherein the window starts to move rightwards from the first base position of each seed sequence until the last base is finished, repeatedly traversing each seed sequence by using the window, and each window is internally provided with w short character strings;
s2222, respectively processing the short character strings in each window by using a hash function to obtain hash values of each short character string in each window, selecting the short character string with the smallest hash value in each window as a minimizer, enabling the minimizer to represent the current window, when the hash values of two adjacent minimzers are consistent, keeping the same minimizer for the two adjacent windows when the hash values of the two adjacent minimzers are located at the same position on the same seed sequence, obtaining minimzers of all seed sequences, and sequencing all minimzers according to the obtained sequence to generate a linear list mi;
s2223, each element of the pointer list mi_num records the number of kmers having the same top l bits.
Further, each minimizer includes a quadruple minimizer (value, rid, pos, strand), where value represents a hash value corresponding to the minimizer, rid represents a seed sequence identifier, pos represents a position of the minimizer on a seed sequence, and strand represents a positive and negative chain corresponding to the minimizer.
Further, in S223, a secondary hash index table of the seed sequence is constructed according to the constructed linear list and the pointer list, which specifically includes:
s2231, calculating to obtain a pointer list mi_count according to the pointer list mi_num;
s2232, sorting the linear list mi according to the hash value corresponding to each minimizer to generate an ordered linear list mi';
s2233, merging the pointer list mi_count and the ordered linear list mi' to form a secondary hash index table of the seed sequence.
Further, the step S23 of calculating the filtering threshold of the seed sequence specifically includes:
calculating a filtering threshold of a minimizer of the seed sequence:
rep_n=rep_kmer[#minimizer*r n ] (2)
wherein rep_n represents a filtering threshold;
# minimizer represents the number of minimzers for different hash values;
r n representing the percentage of minimizer with the highest number of filtering repetition in a self-defined way;
rep_kmer [ ] represents a decrementally ordered set storing each different minimizer number.
Further, in S24, the seed sequence and the mapping sequence are compared by using a secondary hash index table of the seed sequence and a filtering threshold value, so as to obtain an overlapping set of the seed sequence and the mapping sequence, which comprises the following specific processes:
s241, sequentially reading each mapping sequence from the sequencing sequence to obtain information of each mapping sequence, and combining the obtained information of each mapping sequence to obtain an information set of the mapping sequence;
The information of each mapping sequence is represented by a tetrad read '(id', name ', seq'), wherein id 'represents the cis position of each mapping sequence in the file, and is taken as an identifier corresponding to each mapping sequence, length' represents the length of each mapping sequence, name 'represents the name of each mapping sequence in the file, and seq' represents a specific base sequence of each mapping sequence;
the number of bases of all mapped sequences is:
Figure BDA0003897758520000041
m represents a self-defined upper limit of a memory, and the unit is GB;
# reads represents the total number of sequenced sequences;
top n representing the amount of overlapping information that is at most retained per round of each sequencing sequence;
t represents the thread number;
B 1 representing the number of bytes occupied by storing each seed sequence, mapping sequence marking information and overlapping information;
B 2 representing the number of bytes occupied to store each specific piece of overlapping information;
B 3 representing the number of bytes occupied by an element in the stored index table;
S 1 under the condition of simultaneous reading of multiple threads in the process of representation and comparison, S 1 The number of bytes occupied by a base;
S 2 representing the condition of simultaneous reading of multiple threads in the indexing process, S 2 The number of bytes occupied by a base;
s242, respectively constructing a second linear list of each mapping sequence according to the obtained information of each mapping sequence to obtain a second linear list of each mapping sequence, wherein the specific process is as follows:
S2421, defining a window with the length w, wherein the window starts to move rightwards from the first base position of each mapping sequence until the last base is finished; repeatedly traversing each mapping sequence by utilizing the windows, wherein w short character strings are arranged in each window;
s2422, processing the short character strings in each window on each mapping sequence by using a hash function to obtain the hash value of each short character string in each window, selecting the short character string with the smallest hash value in each window as a minimizer ', enabling the minimizer' to represent the current window, and when the hash values of the minimizer 'of two adjacent windows are consistent, keeping the same minimizer' for the two adjacent windows at the same position on the same mapping sequence, so as to obtain all minimizer 'of the mapping sequence, and combining all minimzers' to generate a linear list two mi_q;
each minimizer 'includes a quadruple minimizer' (value ', rid', pos ', strand'), where value 'represents a hash value corresponding to the minimizer', rid 'represents a mapping sequence identifier corresponding to the minimizer', pos 'represents a position of the minimizer' on the mapping sequence, and strand 'represents a positive and negative chain corresponding to the minimizer';
S243, searching the matching blocks of each mapping sequence and the target seed sequence by using a secondary hash index table of the seed sequence to obtain a matching block set of each mapping sequence and the target seed sequence, wherein the specific process is as follows:
s2431, positioning the starting position and the ending position of the lmer of the minimizer 'in the ordered linear list mi' according to the pointer list mi_count in the secondary hash index table;
s2432, finding the position pos of the minimizer in the ordered linear list mi 'which is the same as the hash value of the minimizer' by using binary search;
s2433, finding the initial position and the final position of the hash value in the upper and lower directions of the position pos by using binary search to obtain each matching block corresponding to the mapping sequence and the target seed sequence, and if the matching blocks are found and the number of the matching blocks is greater than a filtering threshold rep_n, ignoring the matching block; otherwise, adding the matching block into the matching block set to obtain a preliminary matching block set of each mapping sequence and the target seed sequence;
s2434, sorting the matching blocks according to seed sequence identifiers of the matching blocks in the preliminary matching block set and the initial positions of the matching blocks on the mapping sequences, and merging the matching blocks into one matching block if two adjacent matching blocks after sorting meet the collinearity condition to obtain a matching block set of each mapping sequence and the target seed sequence;
The collinearity condition:
Figure BDA0003897758520000061
ts 1 and ts 2 Representing the initial positions of two adjacent matching blocks on the target seed sequence respectively;
qs 1 and qs 2 Representing the initial positions of two adjacent matching blocks on the mapping sequence respectively;
t represents the maximum value of the distance between two adjacent, combinable kmers on the genome;
tid 1 and tid 2 Identifiers representing target seed sequences corresponding to two adjacent matching blocks;
each matching block in the matching block set is represented by a six-tuple MB (qs, qe, tid, ts, te, cov '), wherein qs represents a starting position of the matching block on the mapping sequence, qe represents an ending position of the matching block on the mapping sequence, tid represents an identifier of a target seed sequence corresponding to the matching block, ts represents a starting position of the matching block on the target seed sequence, te represents an ending position of the matching block on the target seed sequence, and cov' represents a matched non-overlapping base number;
s244, obtaining a comparison skeleton of each mapping sequence and a target seed sequence by using a sparse dynamic programming algorithm according to the matching block set, wherein the specific process is as follows:
constructing a directed acyclic graph by taking the matching blocks of each mapping sequence and the target seed sequence as vertexes, and then constructing any two vertexes MB of the directed acyclic graph 3 (qs 3 ,qe 3 ,tid 3 ,ts 3 ,te 3 ,cov 3 ') and MB 4 (qs 4 ,qe 4 ,tid 4 ,ts 4 ,te 4 ,cov′ 4 ) The condition that there is an edge between:
Figure BDA0003897758520000062
calculating individual scores for all paths in the directed acyclic graph according to the edges of the directed acyclic graph and the recursive formula (6):
S(MB j )=max(S(MB i )+w(MB i →MB j )-p(MB i →MB j )) (6)
wherein S represents the score of the current vertex optimal path;
w(MB i →MB j ) Representing edge MB i →MB j Is used for the weight of the (c),
Figure BDA0003897758520000063
p(MB i →MB j ) Representing edge MB i →MB j Is used for the penalty coefficient of (1),
p(MB i →MB j )=abs((qs j -qe i )-(ts j -te i ))/w(MB i →MB j );
the path is to connect a plurality of matching blocks, the same mapping sequence is possibly matched with a plurality of target seed sequences, and the path with the highest score is used as a comparison skeleton of each mapping sequence and the corresponding target seed sequence aiming at each matched target seed sequence;
s245, according to the dynamic programming matrix and the comparison skeleton, obtaining overlapping information of each mapping sequence and each seed sequence, filtering the overlapping which does not meet the condition, and obtaining an overlapping set which meets the condition, wherein the specific process is as follows:
s2451, backtracking according to the dynamic programming matrix and the comparison skeleton to obtain overlapping information of each mapping sequence and each target seed sequence, and combining all overlapping information to obtain an overlapping information set of each mapping sequence and all target seed sequences; the dynamic programming matrix is a combination matrix of scoring values of each vertex in the directed loop-free path, wherein the scoring values of each vertex in the directed loop-free path are highest;
S2452, filtering and overlapping according to a conditional formula to obtain overlapping conforming to the condition;
the conditional formula:
Figure BDA0003897758520000071
wherein, mapping base represents the non-overlapping base number of the mapping sequence and the target seed sequence matching block;
T 1 representing a self-defined minimum matching base number;
the mapping length represents the length of overlap obtained by the mapping sequence and the target seed sequence;
length' represents the length of the mapping sequence;
ratio represents the ratio of the custom overlap region to the length of the mapping sequence;
ove min representing a custom minimum overlap length;
overlapping means that the number of base pairs which are not matched at the two ends of the mapping sequence and the target seed sequence;
T 2 representing the self-defined maximum number of bases with two unmatched ends;
s2453, for each mapping sequence, each iteration self-defining reserves a certain number of overlapping frameworks with the highest score, repeatedly executing S242-S245 to obtain all overlapping between the mapping sequence and the seed sequence, and merging all overlapping to obtain an overlapping set.
Further, each piece of overlapping information obtained in S2451 is ove (qid, ql, qs ', qe', tid, tl, ts ', te', rev, mbp, mln, score), wherein qid represents an identifier of the mapping sequence; ql represents the length of the mapping sequence; tl represents the length of the target seed sequence; qs' represents the starting position on the mapping sequence overlapping the target seed sequence; qe' represents the end position of the mapping sequence overlapping the target seed sequence; ts' represents the starting position on the target seed sequence overlapping the mapping sequence; te' represents the end position on the target seed sequence overlapping the mapping sequence; rev represents the positive and negative cases of the mapping sequence and the target seed sequence, and is "+" if on the same strand, and is "-" otherwise; mbp the number of bases overlapping the mapping sequence and the target seed sequence; mln the length of overlap of the mapping sequence and the target seed sequence; score represents the score of the overlapping backbone of the mapped sequence and the target seed sequence.
Further, in S25, the overlapping sets of the seed sequence and the mapping sequence are symmetrically recorded to obtain information corresponding to each overlapping set of the seed sequence and the mapping sequence, and according to the obtained information, the coverage of the overlapping areas on the seed sequence and the mapping sequence is calculated respectively, which comprises the following specific steps:
s251, symmetrically recording the overlapped set obtained in the S2453 to obtain a seed sequence and a mapping sequence containing overlapped information;
s252, each mapping sequence and each seed sequence are uniformly divided into a plurality of blocks of 1000bp, and coverage of each mapping sequence and each seed sequence obtained in S251 is calculated by taking the blocks as units, wherein the specific process is as follows:
for the seed sequence B, a seed sequence P and a mapping sequence Q which are directly connected with the seed sequence B are found through an overlapping set, and the number of the P and the Q is more than or equal to 1; calculating the coordinate range of the block where the initial position and the end position of the overlapping region of the seed sequence P and the mapping sequence Q are located according to the overlapping information, and adding one to the coverage in the range of the seed sequence B if the coordinate range is in the range of the seed sequence B; if the coordinate range exceeds the range of the seed sequence B, no coverage exists, and the next piece of overlapping information is continuously searched;
For the mapping sequence G, the seed sequences Y directly connected with the mapping sequence G are found through an overlapping set, the number of the seed sequences Y is larger than or equal to 1, the coordinate range of the blocks is calculated according to the overlapping information, the coverage of the blocks in the coordinate range of the mapping sequence G is equal to the coverage of the corresponding blocks of the connected seed sequences Y, if the mapping sequence G is connected with a plurality of seed sequences Y, the average coverage of the seed sequences Y is taken as the coverage of each block between the mapping sequence G and the seed sequences Y, and the coverage of the blocks which are not in the coordinate range of the mapping sequence G is not calculated.
The beneficial effects are that:
(1) According to the invention, the sequencing sequence is divided into the seed sequence and the mapping sequence, the iteration method is utilized to detect the overlapping between the seed sequence and the mapping sequence, the seed sequence and the mapping sequence are selected again at the beginning of each iteration, the one-to-one comparison among all sequences is avoided, the consumption of calculation resources is reduced, and the needed overlapping information can be obtained more rapidly and accurately. The detection of overlap between sequencing sequences of a region of the genome can be accomplished by aligning the mapped sequences to the seed sequences of each round over a partial region of the genome.
Sequencing from long to short according to the length of the sequencing sequence at the beginning of the first round of iteration, taking the sequencing sequence as a mapping sequence, and selecting one or more sequencing sequences with the longest sequencing length as seed sequences; constructing a secondary hash index table of the seed sequence; calculating a filtering threshold value of the seed sequence; and comparing the seed sequence with the mapping sequence by using a secondary hash index table of the seed sequence and a filtering threshold value to obtain an overlapped set of the seed sequence and the mapping sequence, and ending the first iteration. Symmetrically recording the overlapping set of the seed sequence and the mapping sequence obtained in the previous round at the beginning of the iteration of the second round and above to obtain corresponding information of each overlapping on the seed sequence and the mapping sequence, and respectively calculating the coverage of the overlapping area on the seed sequence and the mapping sequence according to the obtained information; and selecting the sequence with insufficient coverage as a mapping sequence to be detected in the round, randomly selecting the sequence with partial insufficient coverage as a seed sequence, and repeating the operation in the first round until the reduced number of the sequencing sequences with insufficient coverage is lower than a threshold value, so as to obtain overlapping information of the sequencing sequences.
(2) And constructing a secondary hash index table of the seed sequence in each iteration by using a minimum hash strategy, improving seed retrieval efficiency, and realizing quick search of an accurate matching block between sequences. And then the seed sequence and the mapping sequence are compared through a sparse dynamic programming algorithm, so that the comparison speed can be effectively improved, and the overlapping information between the mapping sequence and all the seed sequences can be detected at the same time.
(3) An operation strategy for controlling the memory consumption is designed, the upper limit of the available memory of the current computer can be set by a user during operation, and the total number of bases which can be processed in each round or batch is calculated according to the upper limit. The strategy enables the memory of the invention to be controllable, and the memory can be operated on various computers such as a personal computer, a server and the like. The method realizes the detection of the overlapping between sequences with rapid detection and controllable memory.
Drawings
FIG. 1 is an overall flow chart of the present invention;
Detailed Description
The first embodiment is as follows: referring to fig. 1, a method for detecting overlap between sequences according to this embodiment includes the following steps:
s1, obtaining a sequencing sequence.
A collection of sequencing sequences is obtained, the collection comprising a plurality of sequencing sequences.
S2, dividing the sequencing sequence into a seed sequence and a mapping sequence, and comparing the seed sequence with the mapping sequence by using an iterative method to obtain overlapping information of the sequencing sequence, wherein the specific process is as follows:
s21, sequencing the sequencing sequences as mapping sequences, sequencing the sequencing sequences from long to short according to the sequence length, and selecting one or more sequencing sequences with the longest sequencing length as seed sequences.
At the beginning of each iteration, the seed sequence and the mapped sequence need to be redetermined, and all sequencing sequences are stored in a fastq or fasta format file. It is understood that each sequence in the set of sequencing sequences has two markers, respectively, that mark whether it is a seed sequence and whether it is a mapped sequence in the current iteration round. The one or more sequencing sequences with the longest length after sequencing are sequencing sequences of the first few bits of the sequencing sequences after sequencing, and the selected sequencing sequences may be equal in length or different in length. According to different iteration rounds, the method for selecting the seed sequence and the mapping sequence can be divided into a first round of iteration and a second round and more than two cases of iteration.
In the first iteration, all sequencing sequences are not detected, so all sequencing sequences are used as mapping sequences, the lengths of all sequencing sequences are sequenced from long to short, a certain number of sequencing sequences with the longest length (the number is defined by a user) are selected as seed sequences, and only partial sequences with the longest lengths are selected in the first iteration, and are marked as seed sequences.
S22, constructing a secondary hash index table of the seed sequence, and obtaining the constructed secondary hash index table, wherein the specific process is as follows:
after determining the seed sequence, constructing a secondary hash index table for the selected seed sequence, wherein the construction process can be divided into the following steps:
s221, sequentially reading each seed sequence from a sequencing sequence in fastq or fasta format to obtain information of each seed sequence, combining the obtained information of each seed sequence to obtain an information set of the seed sequences, wherein the base number of all the seed sequences is about batch_size, and the M units are GB according to the upper limit M of a memory provided by a user.
Figure BDA0003897758520000101
M represents a self-defined upper limit of a memory, and the unit is GB; # reads represents the total number of sequenced sequences; top n Representing the amount of overlapping information that is at most retained per round of each sequencing sequence; t represents the thread number; b (B) 1 Representing the number of bytes occupied by storing each seed sequence, mapping sequence marking information and overlapping information; b (B) 2 Representing the number of bytes occupied to store each specific piece of overlapping information; b (B) 3 Representing the number of bytes occupied by an element in the stored index table; s is S 1 Under the condition of simultaneous reading of multiple threads in the process of representation and comparison, S 1 The number of bytes occupied by a base; s is S 2 Representing the condition of simultaneous reading of multiple threads in the indexing process, S 2 Number of bytes occupied by a base.
Each seed sequence is represented by a tetrad (id, length, name, seq), wherein id represents the cis position of each seed sequence in the sequencing sequence, and is taken as an identifier corresponding to each seed sequence, length represents the length of each seed sequence, name represents the name of each seed sequence in the sequencing sequence, and seq represents the specific base sequence of each seed sequence.
S222, constructing a linear list and a pointer list of the seed sequence according to the information set of the seed sequence to obtain the constructed linear list and pointer list, wherein the specific process is as follows:
s2221, defining a window with length w, wherein the window starts to move rightwards from the first base position of each seed sequence until the last base is finished, repeatedly traversing each seed sequence by using the window, and each window is internally provided with w short character strings kmer.
The window length is w, indicating that there are w short strings therein, respectively as the first character of the head of the w short strings kmer. Every time the window moves right, the first short string kmer is shifted out and the new short string kmer is appended, so there are always w short strings in the window.
S2222, respectively processing the short string kmers in each window by using a hash function to obtain hash values of each short string kmer in each window, selecting the short string kmer with the smallest hash value in each window as a minimizer, enabling the minimizer to represent the current window, when hash values of two adjacent minimzers are consistent, keeping the same minimizer on the same seed sequence by the two adjacent windows, obtaining minimzers of all seed sequences, ordering all minimzers according to the obtained sequence, and generating a linear list mi, wherein the linear list mi is a linear list with an indefinite length and is used for storing minimzers of all seed sequences.
Each minimizer is represented by a four-tuple minimizer (value, rid, pos, strand), where value represents the hash value corresponding to the minimizer, rid represents the seed sequence identifier corresponding to the minimizer, pos represents the location of the minimizer on the seed sequence, and strand represents the positive and negative chain corresponding to the minimizer.
S2223, each element of the pointer list mi_num records the number of kmers having the same top l bits. And obtaining a minimizer representing the current window, and taking the first l bit lmer of the hash value according to the hash value of the minimizer, wherein lmer is a short character string with the length of l, and the lmer element of the corresponding pointer list mi_num is added with one.
S223, constructing a secondary hash index table of the seed sequence according to the constructed linear list and the pointer list, wherein the specific process is as follows:
s2231, calculating to obtain a pointer list mi_count according to the pointer list mi_num.
The constructed two-level hash index table consists of a list mi and a pointer list mi_count, wherein the pointer list mi_count is used for recording the initial position of all kmers with the previous l bits as corresponding key values in the linear list mi, and the pointer list mi_count is 4 in length l The key value of the pointer list mi_count is lmer (l < k), and the data type of the element is unsigned integer. The pointer list mi_count is directly calculated from the pointer list mi_num.
S2232, sorting the linear list mi according to the hash value corresponding to each minimizer, and generating an ordered linear list mi'.
S2233, merging the pointer list mi_count and the ordered linear list mi' to form a secondary hash index table of the seed sequence.
S23, calculating a filtering threshold of a minimizer of the seed sequence, wherein the specific process is as follows:
rep_n=rep_kmer[#minimizer*r n ] (2)
where rep_n represents the filtering threshold; # minimizer represents the number of minimzers for different hash values; r is (r) n Representing the percentage of minimizer that the number of filter repetitions is the greatest (which can be defined by the user by parameter r); rep_kmer [ ]Representing storing a decreasing ordered number of sets of each different minimizer.
In order to improve the accuracy of comparison, the invention calculates the filtering threshold rep_n according to the formula. The number of minimzers of different hash values (the minimzers of the same hash value are grouped into one group, and how many different minimzers are in total, the number of minimzers in each group) and the number of times each minimizer occurs are obtained according to an ordered linear list mi', so that in order to obtain a threshold, the minimzers with the number of occurrence times being high are filtered in the following process, that is, the minimzers with the occurrence times being greater than the filtering threshold rep_n are not considered in the following comparison process, and the minimzers with the occurrence times being too high may come from repeated sequence areas on the genome.
S24, comparing the seed sequence with the mapping sequence by utilizing a secondary hash index table of the seed sequence and a filtering threshold value to obtain an overlapped set of the seed sequence and the mapping sequence, wherein the specific process is as follows:
s241, sequentially reading each mapping sequence from sequencing sequences in fastq or fasta formats to obtain information of each mapping sequence, combining the obtained information of each mapping sequence to obtain an information set of the mapping sequences, wherein the base number of all the mapping sequences is about batch_size, and the M units are GB according to the upper limit M of a memory provided by a user.
Figure BDA0003897758520000121
M represents a self-defined upper limit of a memory, and the unit is GB; # reads represents the total number of sequenced sequences; top n Representing the amount of overlapping information that is at most retained per round of each sequencing sequence; t represents the thread number; b (B) 1 Representing the number of bytes occupied by storing each seed sequence, mapping sequence marking information and overlapping information; b (B) 2 Representing the number of bytes occupied to store each specific piece of overlapping information; b (B) 3 Representing the number of bytes occupied by an element in the stored index table; s is S 1 Under the condition of simultaneous reading of multiple threads in the process of representation and comparison, S 1 The number of bytes occupied by a base; s is S 2 Representing the condition of simultaneous reading of multiple threads in the indexing process, S 2 Number of bytes occupied by a base.
Each mapping sequence is represented by a tetrad (id ', length', name ', seq'), wherein id 'represents the order of each mapping sequence in the file, and as an identifier corresponding to each mapping sequence, length' represents the length of each mapping sequence, name 'represents the name of each mapping sequence in the file, and seq' represents the specific base sequence of each mapping sequence.
S242, respectively constructing a second linear list of each mapping sequence according to the obtained information of each mapping sequence to obtain a second linear list of each mapping sequence, wherein the specific process is as follows:
S2421, defining a window with the length w, wherein the window starts to move rightwards from the first base position of each mapping sequence until the last base is finished; and repeatedly traversing each mapping sequence by utilizing the windows, wherein w short character strings kmer exist in each window.
The window length is w, indicating that there are w short strings therein, respectively as the first character of the head of the w short strings kmer. Every time the window moves right, the first short string kmer is shifted out and the new short string kmer is appended, so there are always w short strings in the window.
S2422, processing the short character string kmer in each window on each mapping sequence by using a hash function to obtain a hash value of each short character string kmer in each window, selecting the short character string kmer with the smallest hash value in each window as a minimizer ', enabling the minimizer ' to represent the current window, when the hash values of the minimizer ' of two adjacent windows are consistent, keeping the same minimizer ' of the two adjacent windows at the same position on the same mapping sequence, obtaining all minimizer ' of the mapping sequence, combining all minimzers ' to generate a linear list two mi_q, wherein the linear list two mi_q is a linear list with variable length, and storing the minimizer ' of each mapping sequence.
Each minimizer 'includes a quadruple minimizer' (value ', rid', pos ', strand'), where value 'represents a hash value corresponding to the minimizer', rid 'represents a mapping sequence identifier corresponding to the minimizer', pos 'represents a position of the minimizer' on the mapping sequence, and strand 'represents a positive and negative chain corresponding to the minimizer'.
S243, searching the matching blocks of each mapping sequence and the target seed sequence by using a secondary hash index table of the seed sequence to obtain a matching block set of each mapping sequence and the target seed sequence, wherein the specific process is as follows:
the matching block length is small and consistent with the kmer length above. Each minimizer' in each linear list two mi q is found binary in the index table. Positioning the starting position and the ending position of the front lmer of the minimizer 'in the ordered linear list mi' according to the pointer list mi_count in the secondary hash index table, finding the position pos of the minimizer in the ordered linear list mi 'which is the same as the hash value of the minimizer' through binary search, and finally finding the starting position and the ending position of the hash value in the upper and lower directions of the position pos through binary search (a plurality of minimzers with the same hash value can exist) to obtain each matching block corresponding to the mapping sequence and the target seed sequence, and if the matching blocks are found and the number of the matching blocks is larger than a filtering threshold rep_n, ignoring the matching block; otherwise, adding the matching block into the matching block set, repeating the above operation, and finally obtaining a preliminary matching block set of each mapping sequence and the target seed sequence. Since the seed sequence and the mapping sequence are to be compared one by one, the target seed sequence is the seed sequence to which the current and mapping sequences are compared.
Each matching block (EMs, exactMatches) in the preliminary matching block set is represented by a quadruple EM (tid, qs, ts, cov), where tid represents the identifier of the target seed sequence, qs represents the starting position of the matching block on the mapping sequence, ts represents the starting position of the matching block on the target seed sequence, and cov represents the length of the matching block, i.e. the number of bases matched.
Sorting the matching blocks according to seed sequence identifiers of the matching blocks in the preliminary matching block set and the initial positions of the matching blocks on the mapping sequence, and if the matching blocks are sorted, sorting adjacent matching blocks EM 1 (tid 1 ,qs 1 ,ts 1 ,cov 1 ) And EM 2 (tid 2 ,qs 2 ,ts 2 ,cov 2 ) And combining the two identical linear blocks into a Matching Block (MB) to obtain a final matching Block set of each mapping sequence and the target seed sequence, wherein each matching Block in the matching Block set is represented by a six-tuple MB (qs, qe, tid, ts, te, cov '), wherein qe represents the end position of the matching Block on the mapping sequence, te represents the end position of the matching Block on the target seed sequence, and cov' represents the number of matched non-overlapping bases.
Figure BDA0003897758520000141
T represents the maximum value of the distance between two adjacent mergers on the genome, which can be calculated by the prior method;
S244, obtaining a comparison skeleton of each mapping sequence and a target seed sequence by using a sparse dynamic programming algorithm according to the matching block set, wherein the specific process is as follows:
according to the obtained matching block set, a sparse dynamic programming algorithm is used for obtaining a comparison skeleton between each mapping sequence and a target seed sequence, and the specific steps are as follows:
constructing a directed acyclic graph DAG by taking the matching blocks MB of each mapping sequence and the target seed sequence as vertexes, and then arbitrarily setting two vertexes MB of the directed acyclic graph DAG 3 (qs 3 ,qe 3 ,tid 3 ,ts 3 ,te 3 ,cov′ 3 ) And MB (MB) 4 (qs 4 ,qe 4 ,tid 4 ,ts 4 ,te 4 ,cov′ 4 ) The conditions for the presence of edges therebetween are as follows:
Figure BDA0003897758520000151
from the edges of the directed acyclic graph and the recursive formula (6), individual scores are computed for all paths in the directed acyclic graph DAG, the paths connecting a number of matching blocks:
S(MB j )=max(S(MB i )+w(MB i →MB j )-p(MB i →MB j )) (6)
where S represents the score of the current vertex optimal path, w (MB i →MB j ) Representing edge MB i →MB j Weight, p (MB) i →MB j ) Representing edge MB i →MB j Penalty coefficients of (a).
Figure BDA0003897758520000152
p(MB i →MB j )=abs((qs j -qe i )-(ts j -te i ))/w(MB i →MB j ) (8)
Since the same mapping sequence may have a match with a plurality of target seed sequences, the matching target seed sequences are each present, and the highest-scoring path is used as an alignment skeleton for each mapping sequence and the target seed sequences.
S245, according to the dynamic programming matrix and the comparison skeleton, obtaining overlapping information of each mapping sequence and each seed sequence, filtering the overlapping which does not meet the condition, and obtaining an overlapping set which meets the condition, wherein the specific process is as follows:
The dynamic programming matrix is obtained when all path scores are calculated, one element in the dynamic programming matrix is the scoring value of the vertex with the highest scoring value in all paths, namely the dynamic programming matrix is a combination matrix of the scoring values of each vertex with the highest scoring value in all paths in the directed loop.
S2451, backtracking according to a dynamic programming matrix and the comparison skeleton (highest score value of a path) to obtain overlapping information of each mapping sequence and each target seed sequence, and combining all overlapping information to obtain an overlapping information set of each mapping sequence and all target seed sequences; each overlay information is denoted ove (qid, ql, qs ', qe', tid, tl, ts ', te', rev, mbp, mln, score), wherein qid denotes an identifier of the mapping sequence; ql represents the length of the mapping sequence; tl represents the length of the target seed sequence; qs' represents the starting position on the mapping sequence overlapping the target seed sequence; qe' represents the end position of the mapping sequence overlapping the target seed sequence; ts' represents the starting position on the target seed sequence overlapping the mapping sequence; te' represents the end position on the target seed sequence overlapping the mapping sequence; rev represents the positive and negative cases of the mapping sequence and the target seed sequence, and is "+" if on the same strand, and is "-" otherwise; mbp the number of bases overlapping the mapping sequence and the target seed sequence; mln the length of overlap of the mapping sequence and the target seed sequence; score represents the score of the overlapping backbone of the mapped sequence and the target seed sequence.
S2452, filtering the overlapping according to a conditional formula to obtain overlapping conforming to the condition, and reserving when the overlapping information ove meets the following condition, otherwise discarding, wherein the conditional formula is as follows:
Figure BDA0003897758520000161
wherein, mapping base represents the non-overlapping base number of the mapping sequence and the target seed sequence matching block; t (T) 1 Representing a minimum number of matching bases that can be defined by a user; the mapping length represents the length of overlap obtained by the mapping sequence and the target seed sequence; length' represents the length of the mapping sequence; ratio represents the ratio of the overlap region defined by the user to the length of the mapping sequence; ove min A minimum overlap length that can be defined by a user; overlapping means that the number of base pairs which are not matched at the two ends of the mapping sequence and the target seed sequence; t (T) 2 Is the maximum number of bases that can be user-defined for both end mismatches. The final result is a qualified overlap set ove collection
S2453, for each mapping sequence, reserves n (n may be user-defined) overlapping sets ove per round of iterations collection The overlapping skeletons of the two parts are overlapped with the highest score, and corresponding overlapping information ove is obtained. And repeatedly executing S242-S245 to obtain all the overlaps between the mapping sequence and the seed sequence, and merging all the overlaps to obtain an overlap set.
S25, symmetrically recording overlapping sets of the seed sequence and the mapping sequence to obtain corresponding information of each overlapping set on the seed sequence and the mapping sequence, and respectively calculating coverage of overlapping areas on the seed sequence and the mapping sequence according to the obtained information, wherein the specific process is as follows:
At the second and higher iterations, the first iteration has detected overlapping information between portions of the sequencing sequences. According to the invention, by randomly selecting the seed sequences, the overlap detection between the sequencing sequences of the genome region can be completed by comparing the mapping sequences to the seed sequences of each round in each partial region of the genome. And meanwhile, the overlapping information between the mapping sequences can be calculated through seed sequence transmission. If a sequence from a region of the genome is not selected as a seed sequence, then the coverage of the seed sequence and mapped sequences from that region as well will be lower than the average coverage of the collection of sequenced sequences, even zero. Therefore, the invention selects the sequence with insufficient coverage to enter the second round and above round number comparison by calculating the coverage of the overlapping area on all the sequences. Sequences r satisfying the following conditions are referred to as undercoverage sequences: m (m is a program parameter, which can be defined by a user, and a default value is equal to 3) in the sequence r is less than a certain threshold T (T is a program parameter, which can be defined by a user, which is a percentage of the coverage of the set of sequencing sequences).
S251, before estimating coverage in the second round and above round iteration, overlapping set ove obtained in the previous round is obtained collection Symmetrically recording to obtain seed sequence and mapping sequence containing overlapping information, wherein if the mapping sequence with identifier of qid can be compared with seed sequence with identifier of tid, namely overlapping set ove collection There is record overlap information ove (qid, ql, qs ', qe', tid, tl, ts ', te', rev, mbp, mln, score) representing overlap information between the mapping sequence and the target seed sequence, then ove in the overlap set collection Its symmetric overlay information ove (qid, ql, qs ', qe', tid, tl, ts ', te', rev, mbp, mln, score) is recorded so that the map sequence and seed sequence can be found to each other bi-directionally.
S252, in order to avoid the waste of calculation resources, the invention does not estimate the base level of the coverage, but equally divides each mapping sequence and each seed sequence into a plurality of 1000bp blocks, and calculates the coverage of each mapping sequence and each seed sequence obtained in S251 by taking the blocks as units.
For seed sequence B, its coverage is calculated by: by overlapping sets ove collection Finding a seed sequence P and a mapping sequence Q which are directly connected with the seed sequence B, wherein the number of the P and the Q is more than or equal to 1; calculating overlap region of seed sequence P and mapping sequence Q according to overlap information The coordinate range of the block where the starting position and the ending position are located, if the coordinate range is within the range of the seed sequence B, adding one to the coverage within the range of the seed sequence B; if the coordinate range exceeds the range of the seed sequence B, the coverage is not available, and the next piece of overlapping information is continuously searched.
For seed sequences P that are contiguous with and within seed sequence B, a stack (auxiliary structure) lookup is used to find a mapping sequence that is not directly contiguous with seed sequence P, but has an overlapping region, passed through seed sequence B.
For mapping sequence G, its coverage is calculated by: if the detected overlap between the seed sequence and the mapped sequence is correct, the overlap region between any two sequences is the same region on the genome, so the coverage of the overlap regions is equal. By overlapping sets ove collection Finding seed sequences Y directly connected with the mapping sequence G, wherein the number of the seed sequences Y is more than or equal to 1, calculating the coordinate range of the blocks according to the overlapping information, enabling the coverage of the blocks in the coordinate range of the mapping sequence G to be equal to the coverage of the corresponding blocks of the connected seed sequences Y, and if the mapping sequence G is connected with a plurality of seed sequences Y, taking the average coverage of the seed sequences Y as the coverage of each block between the mapping sequence G and the seed sequences Y, wherein the coverage of the blocks not in the coordinate range of the mapping sequence G is not calculated.
In practice, due to sequencing errors and the presence of repeated sequences on the genome, false inter-sequence overlap information may be detected, where sequences from different regions on the genome are linked together by the transfer of the false overlap information, resulting in a calculated coverage of the corresponding sequences that is higher than the actual coverage.
S26, taking all sequencing sequences with insufficient coverage as mapping sequences, selecting a certain number of sequencing sequences with insufficient coverage (the number is defined by a user) as seed sequences, repeatedly executing S21-S25 until the number of sequencing sequences with insufficient coverage, which are reduced in a certain round of iteration, is lower than a certain threshold value, and obtaining overlapping information of the sequencing sequences.
After all iteration rounds are finished, the result of the overlapping information among the sequences is output to the hard disk in the PAF format.
The invention realizes a plurality of threads through a double-layer parallel architecture when in application. Taking three main threads as an example, the architecture firstly generates three main threads according to the requirement of calculation, and the three main threads are used for the following functions:
step one: data is read from the hard disk to the thread buffer.
Step two: and carrying out a required calculation process, and temporarily storing the calculation result into a thread buffer area.
Step three: and writing the calculation result of the thread buffer area into the hard disk.
The three main threads respectively execute the steps one, two and three in sequence, and cannot execute the same steps at the same time. If one main thread finishes executing a step, and the next step to be executed is occupied by other threads, the execution of the other threads is waited for. After each thread finishes executing the third step, the process goes to re-executing the first step, and the first step to the third step are repeatedly executed until all data are read.
The multithreading implementation method can separate the steps of data reading, data calculating and the like at the same time, and avoids the overall too slow speed caused by too slow speed of the I/O of the hard disk.
The specific method for implementing multithreading by different functions of the invention is as follows.
(1) Multithreading method for constructing index
First two main threads are generated, each main thread having two steps:
step one: the seed sequence is read from the file.
Step two: indexing the seed sequence and constructing a secondary hash index table.
(2) Aligned multithreading method
First three main threads are generated, each main thread has three steps:
step one: reading a mapping sequence from a file; memory space required by the main thread is allocated.
Step two: a comparison of the mapping sequences is made. When this step is performed, the main thread generates t sub-threads (the parameter can be defined by the user), each sub-thread processes a mapping sequence separately, and after the processing is completed, the next sub-thread is allocated.
Step three: and releasing the memory space used by the main thread.
(3) Multithreading method for calculating coverage
When the coverage is calculated, only one main thread is used, the main thread generates t sub threads (the parameter can be defined by a user), each sub thread independently processes a mapping sequence, and the next sub thread is distributed after the processing is finished.
Examples
To test the performance of the invention to detect inter-sequence overlap on different genomic sequences, the PBSIM software was used to model R103 ONT datasets with a sequencing depth of 50x, an average length of 15000, a sequencing error rate of 13% (the ratio of insertion error, deletion error and substitution error of 23:31:46) for 8 different species (e.g., escherichia coli, caenorhabditis elegans, drosophila, human) and to download real ONT datasets with different sequencing depths for 4 different species (e.g., caenorhabditis elegans, drosophila, human). Genomic information of the mock species and generated dataset information are shown in table 1 (a) below, and real dataset information is shown in table 1 (b) below.
Table 1 (a):
Figure BDA0003897758520000191
table 1 (b)
Figure BDA0003897758520000192
/>
Figure BDA0003897758520000201
The experiment compares the invention with a minimap2 tool in terms of both sequence level alignment accuracy and recall. The consumption degree of the invention on the computing resource is evaluated according to the time and the memory peak value required in the running process of the program. Each evaluation criterion was defined as follows.
(1) Accuracy of
The correct position of overlap between sequences requires that the following conditions be met: i.e. the start and end positions overlapping the reference genome are calculated from the actual positions of the two sequences on the reference genome, and the overlap is considered correct if the difference between the two start and end positions is within a certain range.
Figure BDA0003897758520000202
Wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0003897758520000203
respectively, the start and end positions, seq, of the overlap between the two sequences on the reference genome length For sequence length, error rate To be a sequencing error rate.
Assume that the number of overlapping information satisfying the above condition is num correct The number of output overlapping information is num reported The overall accuracy of the overlap between sequences is defined as:
Figure BDA0003897758520000204
(2) Recall rate of recall
Because the invention only keeps the overlapping with the highest score for each sequence, the total overlapping information between each sequence and other sequences can be calculated by a transmission method. Assuming that the number of overlaps calculated through the above transfer operation is num transitive The number of real overlaps between sequences is num true The overall recall of the inter-sequence overlap is defined as:
Figure BDA0003897758520000205
experimental results
The experiment utilizes the invention and the minimap2 to carry out overlapping detection on 8 simulated data sets, and the experimental results are shown in the following table 2, so that the time and the memory peak value required by the running of the program, and the accuracy and the recall rate of the result obtained by the program are counted.
TABLE 2
Figure BDA0003897758520000211
The accuracy and recall of the results obtained by the procedure are shown in table 3 below, wherein, since minimap2 is dedicated to obtaining overlapping information between all sequences, the calculation of recall is not performed according to the method of calculating transfer overlapping information as described in the previous section, but the overlapping information directly obtained therefrom is used.
TABLE 3 Table 3
Figure BDA0003897758520000221
Experiment Using the present invention and minimap2 to overlap the 4 real data sets, the experimental results are shown in Table 4 below, statistics of the time and memory peaks required for program operation, and accuracy and recall of the results obtained by the program
TABLE 4 Table 4
Figure BDA0003897758520000222
/>
Figure BDA0003897758520000231
The accuracy and recall of the results obtained by the procedure are given in table 5 below.
TABLE 5
Figure BDA0003897758520000232
As can be seen from the above tables 2 and 4, the comparison efficiency of the detection of the inter-sequence overlap detection method provided by the invention exceeds the minimum 2 in the sequencing data of all species by 3-6 times of the tool; the memory consumption is 10% -20% of that of minimum 2.
As can be seen from the above tables 3 and 5, the detection accuracy of the present invention can reach 99% or more on a small-scale genome with fewer repeated sequences; the recall rate on the simulated data set reaches more than 98%, and the recall rate on the real data set is more than 87% and is 1% -12% lower than the detection tool minimap 2; on a large-scale genome with more repeated sequences, the detection accuracy can be maintained to be more than 90%, far exceeding other detection tools, and the recall rate is more than 94%.

Claims (9)

1. A method for inter-sequence overlap detection, characterized by: it comprises the following steps:
s1, acquiring a sequencing sequence;
s2, dividing the sequencing sequence into a seed sequence and a mapping sequence, and comparing the seed sequence with the mapping sequence to obtain overlapping information of the sequencing sequence, wherein the specific process is as follows:
s21, sequencing the sequencing sequences as mapping sequences, sequencing the sequencing sequences from long to short according to the sequence length, and selecting one or more sequencing sequences with the longest sequencing length as seed sequences;
s22, constructing a secondary hash index table of the seed sequence, and obtaining a constructed secondary hash index table;
s23, calculating a filtering threshold value of the seed sequence;
s24, comparing the seed sequence with the mapping sequence by utilizing a secondary hash index table of the seed sequence and a filtering threshold value to obtain an overlapped set of the seed sequence and the mapping sequence, wherein the specific process is as follows:
S241, sequentially reading each mapping sequence from the sequencing sequence to obtain information of each mapping sequence, and combining the obtained information of each mapping sequence to obtain an information set of the mapping sequence;
the information of each mapping sequence is represented by a tetrad read '(id', name ', seq'), wherein id 'represents the cis position of each mapping sequence in the file, and is taken as an identifier corresponding to each mapping sequence, length' represents the length of each mapping sequence, name 'represents the name of each mapping sequence in the file, and seq' represents a specific base sequence of each mapping sequence;
the number of bases of all mapped sequences is:
Figure FDA0004149155460000011
m represents a self-defined upper limit of a memory, and the unit is GB;
# reads represents the total number of sequenced sequences;
top n representing the amount of overlapping information that is at most retained per round of each sequencing sequence;
t represents the thread number;
B 1 representing the number of bytes occupied by storing each seed sequence, mapping sequence marking information and overlapping information;
B 2 representing the number of bytes occupied to store each specific piece of overlapping information;
B 3 representing the number of bytes occupied by an element in the stored index table;
S 1 under the condition of simultaneous reading of multiple threads in the process of representation and comparison, S 1 The number of bytes occupied by a base;
S 2 representing the condition of simultaneous reading of multiple threads in the indexing process, S 2 The number of bytes occupied by a base;
s242, respectively constructing a second linear list of each mapping sequence according to the obtained information of each mapping sequence to obtain a second linear list of each mapping sequence, wherein the specific process is as follows:
s2421, defining a window with the length w, wherein the window starts to move rightwards from the first base position of each mapping sequence until the last base is finished; repeatedly traversing each mapping sequence by utilizing the windows, wherein w short character strings are arranged in each window;
s2422, processing the short character strings in each window on each mapping sequence by using a hash function to obtain the hash value of each short character string in each window, selecting the short character string with the smallest hash value in each window as a minimizer ', enabling the minimizer' to represent the current window, and when the hash values of the minimizer 'of two adjacent windows are consistent, keeping the same minimizer' for the two adjacent windows at the same position on the same mapping sequence, so as to obtain all minimizer 'of the mapping sequence, and combining all minimzers' to generate a linear list two mi_q;
Each minimizer 'includes a quadruple minimizer' (value ', rid', pos ', strand'), where value 'represents a hash value corresponding to the minimizer', rid 'represents a mapping sequence identifier corresponding to the minimizer', pos 'represents a position of the minimizer' on the mapping sequence, and strand 'represents a positive and negative chain corresponding to the minimizer';
s243, searching the matching blocks of each mapping sequence and the target seed sequence by using a secondary hash index table of the seed sequence to obtain a matching block set of each mapping sequence and the target seed sequence, wherein the specific process is as follows:
s2431, positioning the starting position and the ending position of the lmer of the minimizer 'in the ordered linear list mi' according to the pointer list mi_count in the secondary hash index table;
s2432, finding the position pos of the minimizer in the ordered linear list mi 'which is the same as the hash value of the minimizer' by using binary search;
s2433, finding the initial position and the final position of the hash value in the upper and lower directions of the position pos by using binary search to obtain each matching block corresponding to the mapping sequence and the target seed sequence, and if the matching blocks are found and the number of the matching blocks is greater than a filtering threshold rep_n, ignoring the matching block; otherwise, adding the matching block into the matching block set to obtain a preliminary matching block set of each mapping sequence and the target seed sequence;
S2434, sorting the matching blocks according to seed sequence identifiers of the matching blocks in the preliminary matching block set and the initial positions of the matching blocks on the mapping sequences, and merging the matching blocks into one matching block if two adjacent matching blocks after sorting meet the collinearity condition to obtain a matching block set of each mapping sequence and the target seed sequence;
the collinearity condition:
Figure FDA0004149155460000031
ts 1 and ts 2 Representing the initial positions of two adjacent matching blocks on the target seed sequence respectively;
qs 1 and qs 2 Representing the initial positions of two adjacent matching blocks on the mapping sequence respectively;
t represents the maximum value of the distance between two adjacent, combinable kmers on the genome;
tid 1 and tid 2 Identifiers representing target seed sequences corresponding to two adjacent matching blocks;
each matching block in the matching block set is represented by a six-tuple MB (qs, qe, tid, ts, te, cov '), wherein qs represents a starting position of the matching block on the mapping sequence, qe represents an ending position of the matching block on the mapping sequence, tid represents an identifier of a target seed sequence corresponding to the matching block, ts represents a starting position of the matching block on the target seed sequence, te represents an ending position of the matching block on the target seed sequence, and cov' represents a matched non-overlapping base number;
S244, obtaining a comparison skeleton of each mapping sequence and a target seed sequence by using a sparse dynamic programming algorithm according to the matching block set, wherein the specific process is as follows:
constructing a directed acyclic graph by taking the matching blocks of each mapping sequence and the target seed sequence as vertexes, and then constructing any two vertexes MB of the directed acyclic graph 3 (qs 3 ,qe 3 ,tid 3 ,ts 3 ,te 3 ,cov′ 3 ) And MB (MB) 4 (qs 4 ,qe 4 ,tid 4 ,ts 4 ,te 4 ,cov′ 4 ) The condition that there is an edge between:
Figure FDA0004149155460000032
calculating individual scores for all paths in the directed acyclic graph according to the edges of the directed acyclic graph and the recursive formula (6):
S(MB j )=max(S(MB i )+w(MB i →MB j )-p(MB i →MB j )) (6)
wherein S represents the score of the current vertex optimal path;
w(MB i →MB j ) Representing edge MB i →MB j Is used for the weight of the (c),
Figure FDA0004149155460000033
p(MB i →MB j ) Representing edge MB i →MB j Penalty coefficient, p (MB) i →MB j )=abs((qs j -qe i )-(ts j -te i ))/w(MB i →MB j );
The path is to connect a plurality of matching blocks, the same mapping sequence is possibly matched with a plurality of target seed sequences, and the path with the highest score is used as a comparison skeleton of each mapping sequence and the corresponding target seed sequence aiming at each matched target seed sequence;
s245, according to the dynamic programming matrix and the comparison skeleton, obtaining overlapping information of each mapping sequence and each seed sequence, filtering the overlapping which does not meet the condition, and obtaining an overlapping set which meets the condition, wherein the specific process is as follows:
s2451, backtracking according to the dynamic programming matrix and the comparison skeleton to obtain overlapping information of each mapping sequence and each target seed sequence, and combining all overlapping information to obtain an overlapping information set of each mapping sequence and all target seed sequences; the dynamic programming matrix is a combination matrix of scoring values of each vertex in the directed loop-free path, wherein the scoring values of each vertex in the directed loop-free path are highest;
S2452, filtering and overlapping according to a conditional formula to obtain overlapping conforming to the condition;
the conditional formula:
Figure FDA0004149155460000041
wherein, matringbases represents the non-overlapping base number of the mapping sequence and the target seed sequence matching block;
T 1 representing a self-defined minimum matching base number;
matringalength represents the length of overlap obtained by mapping sequence and target seed sequence;
length' represents the length of the mapping sequence;
ratio represents the ratio of the custom overlap region to the length of the mapping sequence;
ove min representing a custom minimum overlap length;
overlapping means that the number of base pairs which are not matched at the two ends of the mapping sequence and the target seed sequence;
T 2 representing the self-defined maximum number of bases with two unmatched ends;
s2453, for each mapping sequence, each iteration self-defining and reserving a certain number of overlapping frameworks with the highest score, repeatedly executing S242-S245 to obtain all overlapping between the mapping sequence and the seed sequence, and merging all overlapping to obtain an overlapping set;
s25, symmetrically recording overlapping sets of the seed sequence and the mapping sequence to obtain corresponding information of each overlapping set on the seed sequence and the mapping sequence, and respectively calculating coverage of overlapping areas on the seed sequence and the mapping sequence according to the obtained information;
S26, taking all sequencing sequences with insufficient coverage as mapping sequences, selecting a certain number of sequencing sequences with insufficient coverage as seed sequences, and repeatedly executing S21-S25 until the reduced number of the sequencing sequences with insufficient coverage is lower than a threshold value, so as to obtain overlapping information of the sequencing sequences.
2. A method for inter-sequence overlap detection according to claim 1, wherein: the step S22 of constructing a secondary hash index table of the seed sequence to obtain a constructed secondary hash index table, which comprises the following specific steps:
s221, sequentially reading each seed sequence from the sequencing sequence to obtain information of each seed sequence, and combining the obtained information of each seed sequence to obtain an information set of the seed sequence;
the number of bases of all seed sequences is:
Figure FDA0004149155460000051
m represents a self-defined upper limit of a memory, and the unit is GB;
# reads represents the total number of sequenced sequences;
top n representing the amount of overlapping information that is at most retained per round of each sequencing sequence;
t represents the thread number;
B 1 representing the number of bytes occupied by storing each seed sequence, mapping sequence marking information and overlapping information;
B 2 representing the number of bytes occupied to store each specific piece of overlapping information;
B 3 representing the number of bytes occupied by an element in the stored index table;
S 1 Under the condition of simultaneous reading of multiple threads in the process of representation and comparison, S 1 The number of bytes occupied by a base;
S 2 representing the condition of simultaneous reading of multiple threads in the indexing process, S 2 The number of bytes occupied by a base;
s222, constructing a linear list and a pointer list of the seed sequence according to the information set of the seed sequence, and obtaining a constructed linear list and a pointer list;
s223, constructing a secondary hash index table of the seed sequence according to the constructed linear list and the pointer list.
3. A method for inter-sequence overlap detection according to claim 1, wherein: the information of each seed sequence in S221 is represented by a tetrad (id, name, seq), where id represents the cis position of each seed sequence in the sequence, and as an identifier corresponding to each seed sequence, length represents the length of each seed sequence, name represents the name of each seed sequence in the sequence, and seq represents a specific base sequence of each seed sequence.
4. A method for inter-sequence overlap detection as claimed in claim 3, wherein: in S222, a linear list and a pointer list of the seed sequence are constructed according to the information set of the seed sequence, and the constructed linear list and pointer list are obtained, which comprises the following specific steps:
S2221, defining a window with the length of w, wherein the window starts to move rightwards from the first base position of each seed sequence until the last base is finished, repeatedly traversing each seed sequence by using the window, and each window is internally provided with w short character strings;
s2222, respectively processing the short character strings in each window by using a hash function to obtain hash values of each short character string in each window, selecting the short character string with the smallest hash value in each window as a minimizer, enabling the minimizer to represent the current window, when the hash values of two adjacent minimzers are consistent, keeping the same minimizer for the two adjacent windows when the hash values of the two adjacent minimzers are located at the same position on the same seed sequence, obtaining minimzers of all seed sequences, and sequencing all minimzers according to the obtained sequence to generate a linear list mi;
s2223, each element of the pointer list mi_num records the number of kmers having the same top l bits.
5. A method for inter-sequence overlap detection as claimed in claim 4, wherein: each minimizer includes a quadruple minimizer (value, rid, pos, strand), where value represents the hash value corresponding to the minimizer, rid represents the seed sequence identifier, pos represents the location of the minimizer on the seed sequence, and strand represents the positive and negative chain corresponding to the minimizer.
6. A method for inter-sequence overlap detection as claimed in claim 5, wherein: in the step S223, a secondary hash index table of the seed sequence is constructed according to the constructed linear list and the pointer list, and the specific process is as follows:
s2231, calculating to obtain a pointer list mi_count according to the pointer list mi_num;
s2232, sorting the linear list mi according to the hash value corresponding to each minimizer to generate an ordered linear list mi';
s2233, merging the pointer list mi_count and the ordered linear list mi' to form a secondary hash index table of the seed sequence.
7. A method for inter-sequence overlap detection according to claim 6, wherein: the filtering threshold value of the seed sequence is calculated in the step S23, and the specific process is as follows:
calculating a filtering threshold of a minimizer of the seed sequence:
rep_n=rep_kmer[#minimizer*r n ] (2)
wherein rep_n represents a filtering threshold;
# minimizer represents the number of minimzers for different hash values;
r n representing the percentage of minimizer with the highest number of filtering repetition in a self-defined way;
rep_kmer [ ] represents a decrementally ordered set storing each different minimizer number.
8. A method for inter-sequence overlap detection as claimed in claim 7, wherein: each piece of overlapping information obtained in S2451 is ove (qid, ql, qs ', qe', tid, tl, ts ', te', rev, mbp, mln, score), wherein qid represents an identifier of the mapping sequence; ql represents the length of the mapping sequence; tl represents the length of the target seed sequence; qs' represents the starting position on the mapping sequence overlapping the target seed sequence; qe' represents the end position of the mapping sequence overlapping the target seed sequence; ts' represents the starting position on the target seed sequence overlapping the mapping sequence; te' represents the end position on the target seed sequence overlapping the mapping sequence; rev represents the positive and negative cases of the mapping sequence and the target seed sequence, and is "+" if on the same strand, and is "-" otherwise; mbp the number of bases overlapping the mapping sequence and the target seed sequence; mln the length of overlap of the mapping sequence and the target seed sequence; score represents the score of the overlapping backbone of the mapped sequence and the target seed sequence.
9. A method for inter-sequence overlap detection as claimed in claim 8, wherein: and S25, symmetrically recording overlapping sets of the seed sequence and the mapping sequence to obtain corresponding information of each overlapping set on the seed sequence and the mapping sequence, and respectively calculating the coverage of overlapping areas on the seed sequence and the mapping sequence according to the obtained information, wherein the specific process is as follows:
s251, symmetrically recording the overlapped set obtained in the S2453 to obtain a seed sequence and a mapping sequence containing overlapped information;
s252, each mapping sequence and each seed sequence are uniformly divided into a plurality of blocks of 1000bp, and coverage of each mapping sequence and each seed sequence obtained in S251 is calculated by taking the blocks as units, wherein the specific process is as follows:
for the seed sequence B, a seed sequence P and a mapping sequence Q which are directly connected with the seed sequence B are found through an overlapping set, and the number of the P and the Q is more than or equal to 1; calculating the coordinate range of the block where the initial position and the end position of the overlapping region of the seed sequence P and the mapping sequence Q are located according to the overlapping information, and adding one to the coverage in the range of the seed sequence B if the coordinate range is in the range of the seed sequence B; if the coordinate range exceeds the range of the seed sequence B, no coverage exists, and the next piece of overlapping information is continuously searched;
For the mapping sequence G, the seed sequences Y directly connected with the mapping sequence G are found through an overlapping set, the number of the seed sequences Y is larger than or equal to 1, the coordinate range of the blocks is calculated according to the overlapping information, the coverage of the blocks in the coordinate range of the mapping sequence G is equal to the coverage of the corresponding blocks of the connected seed sequences Y, if the mapping sequence G is connected with a plurality of seed sequences Y, the average coverage of the seed sequences Y is taken as the coverage of each block between the mapping sequence G and the seed sequences Y, and the coverage of the blocks which are not in the coordinate range of the mapping sequence G is not calculated.
CN202211280417.9A 2022-10-19 2022-10-19 Method for detecting overlapping between sequences Active CN115641911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211280417.9A CN115641911B (en) 2022-10-19 2022-10-19 Method for detecting overlapping between sequences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211280417.9A CN115641911B (en) 2022-10-19 2022-10-19 Method for detecting overlapping between sequences

Publications (2)

Publication Number Publication Date
CN115641911A CN115641911A (en) 2023-01-24
CN115641911B true CN115641911B (en) 2023-05-23

Family

ID=84944697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211280417.9A Active CN115641911B (en) 2022-10-19 2022-10-19 Method for detecting overlapping between sequences

Country Status (1)

Country Link
CN (1) CN115641911B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373538B (en) * 2023-12-08 2024-03-19 山东大学 Biological sequence comparison method and system based on multithread calculation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809510B2 (en) * 2002-02-27 2010-10-05 Ip Genesis, Inc. Positional hashing method for performing DNA sequence similarity search
WO2017143585A1 (en) * 2016-02-26 2017-08-31 深圳华大基因研究院 Method and apparatus for assembling separated long fragment sequences
CN111292805B (en) * 2020-03-19 2023-08-18 山东大学 Third generation sequencing data overlap detection method and system
CN114999573B (en) * 2022-04-14 2023-07-07 哈尔滨因极科技有限公司 Genome variation detection method and detection system

Also Published As

Publication number Publication date
CN115641911A (en) 2023-01-24

Similar Documents

Publication Publication Date Title
Ekim et al. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer
US9576073B2 (en) Distance queries on massive networks
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
CN115641911B (en) Method for detecting overlapping between sequences
JP4912646B2 (en) Gene transcript mapping method and system
CN109656798B (en) Vertex reordering-based big data processing capability test method for supercomputer
WO2022037232A1 (en) Method and apparatus for image grouping in three-dimensional reconstruction, electronic device, and computer readable storage medium
CN111552692A (en) Plus-minus cuckoo filter
CN105335624B (en) A kind of gene order fragment method for rapidly positioning based on bitmap
Kim Boosting graph similarity search through pre-computation
Rasheed et al. LSH-Div: Species diversity estimation using locality sensitive hashing
CN110097581A (en) Method based on point cloud registering ICP algorithm building K-D tree
CN112397148B (en) Sequence comparison method, sequence correction method and device thereof
Wei et al. A branch elimination-based efficient algorithm for large-scale multiple longest common subsequence problem
CN108509618B (en) Big data multidimensional data indexing method based on space filling curve
CN103309951A (en) Method and device for searching multimedia file on the net
CN114265886B (en) Similarity model retrieval system based on improved Apriori algorithm
CN113010882B (en) Custom position sequence pattern matching method suitable for cache loss attack
CN102402692B (en) Method and system for recognizing feature character strings
Li et al. Seeding with minimized subsequence
Schulz et al. Exact Sketch-Based Read Mapping
Müller Alignments and beyond: A versatile swarm-based framework for de novo amplicon clustering
Wang et al. Research on the Optimal Methods for Graph Edit Distance
Ekim Scalable sketching and indexing algorithms for large biological datasets
Vyverman ALFALFA: fast and accurate mapping of long next generation sequencing reads

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant