CN115831225A

CN115831225A - Structural variation verification system and method suitable for genome repetitive sequence

Info

Publication number: CN115831225A
Application number: CN202211400920.3A
Authority: CN
Inventors: 叶凯; 车肖飞; 王松渤
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-11-09
Filing date: 2022-11-09
Publication date: 2023-03-21

Abstract

The invention discloses a structural variation verification system and method suitable for genome repetitive sequences, comprising: a sequence alignment module, which realigns the sorted BAM file of sequencing reads, the fasta file of the reference genome, and the VCF file to obtain realignment results and sends them to a de-duplication module; the de-duplication module, which de-duplicates the received realignment results and sends the de-duplicated results to a structural variation evaluation module; and the structural variation evaluation module, which evaluates the received de-duplicated results with a distance-metric-based structural variation evaluation method, thereby verifying structural variations in genome repetitive sequences. The system helps users examine the sequencing evidence for a structural variation efficiently, accurately, and comprehensively, and it strengthens and simplifies manual review.

Description

Structural variation verification system and method suitable for genome repetitive sequence

Technical Field

The invention belongs to the technical field of genome structural variation identification, and particularly relates to a structural variation verification system and method suitable for a genome repetitive sequence.

Background

Since the discovery of the double-helix structure of DNA, life science research has moved to the molecular level, and the sequencing technologies that emerged in the 1970s have contributed greatly to deciphering the genetic code. Single-molecule sequencing technologies, which have emerged in recent years and read nucleotide sequences at the single-molecule level, are also called third-generation sequencing technologies; they are represented mainly by the Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) platforms. Compared with first- and second-generation sequencing, third-generation sequencing produces longer reads, can sequence RNA directly without reverse transcription, and sequences very quickly. Single-molecule sequencing also makes it possible to detect structural variations more completely and at higher resolution. Its average read length is 15 kbp or more, which greatly improves the reliability and resolution of structural variation detection, especially in repetitive regions of the human genome and for complex structural variations.

Genomic structural variation (SV) refers to genomic rearrangements longer than 50 bp, typically including deletions, insertions, inversions, duplications, and translocations. Structural variation is relevant to everyone: it is involved in human disease (cancer, autism, Alzheimer's disease, etc.), chromosomal evolution (gene loss and transposon activity), gene regulation (rearrangement of transcription factors), and other phenotypes (mating and intrinsic reproductive isolation). Characterizing structural variation is therefore of great importance to human medicine and genetics, since it contributes to the early detection of disease and to elucidating the underlying genetic and molecular processes.

Accurate identification of structural variations is a prominent and important problem in genomics. The rapid development of single-molecule sequencing provides better resolution and a more comprehensive opportunity for genome-wide structural variation detection. In recent years, many studies and tools for detecting structural variations from single-molecule sequencing data have appeared, for example Sniffles (published in 2018), SVIM (2019), and cuteSV (2020). However, even the most advanced tools still report a large number of false positives, so the results of structural variation detection need to be verified and evaluated.

At present there is little work on verifying and evaluating structural variations; the main tools are VaPoR and TT-Mars. Neither tool evaluates structural variations inside repetitive regions of the genome effectively, and both offer limited visualization; in addition, the results of TT-Mars depend heavily on high-quality genome assemblies. Consequently, no effective verification tool exists for structural variations in repetitive regions of the genome.

Disclosure of Invention

In order to overcome the problems in the prior art, the present invention aims to provide a system and a method for verifying structural variation of a genomic repeat sequence, which can accurately evaluate the structural variation of the genomic repeat sequence.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

a structural variation verification system for genomic repeats, comprising:

the sequence alignment module, which realigns the sorted BAM file of sequencing reads, the fasta file of the reference genome, and the VCF file using a kmer-based hash alignment algorithm, obtains the realignment results of each sequencing read containing the structural variation against the reference genome sequence ref and against the structural variation prediction sequence pre, and sends the results to the de-duplication module;

the de-duplication module, which de-duplicates the received realignment results of each sequencing read containing the structural variation against the reference genome sequence ref and against the structural variation prediction sequence pre, and sends the de-duplicated results to the structural variation evaluation module;

and the structural variation evaluation module, which evaluates the received de-duplicated results with a distance-metric-based structural variation evaluation method, thereby verifying structural variations in genome repetitive sequences.

A structural variation verification method suitable for a genome repetitive sequence comprises the following steps:

1) Realigning the sorted BAM file of sequencing reads, the fasta file of the reference genome, and the VCF file using a kmer-based hash alignment algorithm, to obtain the realignment results of each sequencing read containing the structural variation against the reference genome sequence ref and against the structural variation prediction sequence pre;

2) De-duplicating the realignment results of each sequencing read containing the structural variation against the reference genome sequence ref and against the structural variation prediction sequence pre;

3) Evaluating the de-duplicated results with a distance-metric-based structural variation evaluation method, thereby verifying structural variations in genome repetitive sequences.

Further, the specific process of step 1) is as follows:

extracting start-stop coordinates of each structural variation in the VCF file;

according to the start-stop coordinates of a structural variation s, taking from the BAM file all m sequencing reads that cover s, together with the corresponding reference genome sequence ref, and introducing the structural variation s into the reference genome sequence ref to construct the structural variation prediction sequence pre;

then traversing the kmer sequences of the reference genome sequence ref and, each time a new kmer sequence is obtained, querying the hash table for the same kmer sequence; if it exists, advancing the matched kmer positions in the sequencing read and in ref by 1 bp at a time and checking whether the next bases are still identical, until the two bases differ; and recording the start coordinates of the matched sequence in the read and in ref, the length of the matched sequence, and its direction, thereby obtaining the realignment results of the sequencing read containing the structural variation against the reference genome sequence ref and against the structural variation prediction sequence pre.

Further, the hash table is determined by the following process:

realigning the sequencing read separately against the corresponding reference genome sequence ref and the structural variation prediction sequence pre; for CCS and ONT sequencing data, setting the kmer length to 31, traversing the sequencing read, and storing each kmer sequence in the hash table with the kmer as the key and its start coordinate as the value; and then traversing the reverse complement of the sequencing read and likewise storing its kmer sequences and their start coordinates in the hash table.

Further, the specific process of step 2) is as follows:

if no repeated sequence exists in the ref sequence of the reference genome, the result of the re-alignment is a line segment with the length equal to that of the ref sequence of the reference genome in the rectangular plane coordinate system, and the line segment is represented as a line segment along the main diagonal line in the sequence alignment chart;

if the reference genome ref sequence contains repeated sequences, the result of the re-alignment comprises a plurality of line segments with different lengths, which are represented as a plurality of line segments not along the main diagonal in the sequence alignment chart, and the coordinate range of the ref of the repeated sequences is recorded.

Further, the step 2) further comprises the following steps:

traversing, in order of the ref coordinates of the repetitive sequences, the results of realigning the sequencing read against the reference genome sequence ref; in the sequence alignment plot, for segments to the left of the structural variation, de-duplicating while retaining the segments that lie on the main diagonal; and for segments to the right of the structural variation, de-duplicating while retaining the segments that lie on the line obtained by shifting the main diagonal by an intercept equal to the structural variation length.

Further, the specific process of step 3) is as follows:

visualizing separately the segments generated by realigning the sequencing read against the reference genome sequence ref and the segments generated by realigning the read against the structural variation prediction sequence pre, to generate sequence alignment plots in which the realigned segments appear as line segments; if the sequencing read is completely identical to the reference genome sequence ref, the plot contains a single line segment lying entirely on the main diagonal; otherwise the line segments deviate from the main diagonal, and the deviation represents the degree of difference between the two sequences; calculating the average distance between all line segments in the plot and the main diagonal, to obtain, for the structural variation s, the average distance between the read and ref and the average distance between the read and pre; and normalizing these two average distances to obtain the score of the structural variation s, according to which the structural variation is evaluated.

Further, the average distance $d_{s,i,j,avg}$ in the sequence alignment plot corresponding to the structural variation s is calculated as follows:

suppose that a structural variation s covers m sequencing reads and that the sequence alignment plot corresponding to the i-th sequencing read contains n line segments in total; the distance $d_{s,i,j}$ between the j-th line segment and the main diagonal is calculated by the following formula:

$$d_{s,i,j} = \frac{1}{3}\left[\left(x_{s,i,j,start} - y_{s,i,j,start}\right) + \left(x_{s,i,j,mid} - y_{s,i,j,mid}\right) + \left(x_{s,i,j,end} - y_{s,i,j,end}\right)\right]$$

in the formula, $x_{s,i,j,start}$ and $y_{s,i,j,start}$ are the ref start coordinate and the read start coordinate of the j-th line segment; $x_{s,i,j,mid}$ and $y_{s,i,j,mid}$ are its ref midpoint coordinate and read midpoint coordinate; and $x_{s,i,j,end}$ and $y_{s,i,j,end}$ are its ref end coordinate and read end coordinate.

Further, the average distance between the sequencing read and the reference genome sequence ref and the average distance between the sequencing read and the structural variation prediction sequence pre are normalized to obtain the score $Score_{s,i}$ of the structural variation s;

when the score of the structural variation s is greater than the threshold $Score_{threshold}$, the structural variation s is considered to have passed the evaluation.

Compared with the prior art, the invention has the following beneficial effects:

First, the sequencing reads within each structural variation interval are realigned against the two reference sequences (ref and pre) using a kmer-based hash algorithm. Second, the realignment results are de-duplicated to eliminate the influence of repetitive sequences in the reference genome on the evaluation of the structural variation. Third, based on the de-duplicated realignment results, each structural variation is scored and verified with a distance-metric-based structural variation evaluation model. Finally, the evaluation result of each structural variation is output together with its visualization. Experimental tests show that the structural variation verification system can effectively evaluate and verify structural variations in repetitive sequences of the genome, and the output plots let users inspect the actual situation of each structural variation more intuitively. The system therefore helps users review the sequencing evidence for a structural variation efficiently, accurately, and comprehensively, and it strengthens and simplifies manual review.

Drawings

FIG. 1 is a schematic diagram of a structural variation verification system for genome repeats according to the present invention;

FIG. 2 shows the construction of a predicted sequence, taking a deletion as an example;

FIG. 3 is a schematic diagram of the de-duplication principle;

FIG. 4 is the ROC curve for the 21000 simulated structural variations;

FIG. 5 is a graph of recall against score threshold for the 21000 simulated structural variations;

FIG. 6 is a graph of the recall of structural variations in repetitive regions as a function of score threshold; wherein (a) shows the results for repetitive and non-repetitive regions of homozygous sequences at different sequencing depths, and (b) shows the results for repetitive and non-repetitive regions of homozygous and heterozygous sequences at the same sequencing depth;

FIG. 7 is a graph of the recall of spotsv and VaPoR on the HG002 sample as a function of score threshold;

FIG. 8 illustrates the input and output of the structural variation verification system, taking a deletion as an example; wherein (a) is the segment plot of the reference genome sequence ref against itself, (b) is the segment plot of the sequencing read against the reference genome sequence ref before de-duplication, and (c) is the segment plot (alignment plot) of the sequencing read against the reference genome sequence ref after de-duplication.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings.

The invention provides a Python-based, genome-wide structural variation verification system that is particularly suitable for structural variations in genome repetitive sequences. To further clarify the objects, advantages, and technical solutions of the invention, an embodiment is described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, the invention provides a structural variation verification system suitable for genome repetitive sequences. It consists mainly of an input module, a sequence alignment module, a de-duplication module, a structural variation evaluation module, and an output module. These five modules are responsible, respectively, for processing user input, performing kmer-based hash realignment, de-duplicating the realignment results, evaluating the structural variations, and outputting the evaluation results and visualizations. The modules are described in further detail below; the description is illustrative and does not limit the invention.

1. Input module

The input module receives the sorted BAM file of sequencing reads, the fasta file of the reference genome, and the VCF file, and sends these input files to the sequence alignment module;

the VCF file (Variant Call Format) is the standard format for storing structural variation calls; the fasta file is a text-based format for representing nucleic acid sequences, here the reference genome sequence; the BAM file is a binary file holding the sequencing alignment results.

The input module takes as input a sorted BAM file, the fasta file of the reference genome, and a VCF file output by a structural variation detection tool. It supports third-generation sequencing data such as PacBio CLR (Continuous Long Reads), PacBio CCS (Circular Consensus Sequencing), and ONT (Oxford Nanopore Technologies), and read alignment tools such as pbmm2, ngmlr, and minimap2, and it can verify and evaluate the results of mainstream structural variation detection tools such as cuteSV, SVIM, Sniffles, and pbsv.

2. Sequence alignment module

The sequence alignment module realigns the received files using a kmer-based hash alignment algorithm to obtain the start coordinates of each matched sequence in the sequencing read and in the reference genome sequence ref, the length of the matched sequence, and its direction, and thereby obtains the realignment results of each sequencing read containing the structural variation against the reference genome sequence ref (a fixed-length sequence) and against the structural variation prediction sequence pre; it then sends these results to the de-duplication module;

after the fasta file of the reference genome, the sorted BAM file, and the VCF file output by a structural variation detection tool have been input, the chromosome, start-stop coordinates, type, and length of each structural variation recorded in the VCF file are extracted one by one. To make the structural variation easier to observe and analyze, the interval is extended on both sides of the start-stop coordinates by a flanking length; the default is 1000 bp, and the actual flanking length can be adjusted automatically according to the start-stop coordinates of the sequencing reads covering the structural variation interval.
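The extraction step described above can be illustrated with a minimal sketch. It assumes that the VCF INFO column carries the commonly used `SVTYPE`, `END`, and `SVLEN` keys; the function name `parse_sv_record` and the returned dictionary layout are hypothetical and are not taken from the invention.

```python
def parse_sv_record(vcf_line, flank=1000):
    """Extract SV fields from one tab-separated VCF data line.

    Assumption: INFO contains SVTYPE, END and SVLEN with single values.
    The 1000 bp default flank follows the text; automatic adjustment of
    the flank to the covering reads is not modelled here.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, info_text = fields[0], int(fields[1]), fields[7]

    # Split "KEY=VALUE;FLAG;..." into a dictionary (flags map to "").
    info = {}
    for item in info_text.split(";"):
        key, _, value = item.partition("=")
        info[key] = value

    end = int(info.get("END", pos))
    sv_len = abs(int(info["SVLEN"])) if "SVLEN" in info else end - pos
    return {
        "chrom": chrom,
        "start": pos,
        "end": end,
        "type": info.get("SVTYPE", "NA"),
        "length": sv_len,
        "window_start": max(1, pos - flank),  # extended interval, left side
        "window_end": end + flank,            # extended interval, right side
    }


record = parse_sv_record(
    "chr1\t10000\tsv1\tN\t<DEL>\t60\tPASS\tSVTYPE=DEL;END=10500;SVLEN=-500"
)
print(record["type"], record["window_start"], record["window_end"])
```

The extended window is what would then be used to fetch the covering reads and the reference sequence ref.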

According to the start-stop coordinates of a structural variation s, all m sequencing reads covering s and the corresponding reference genome sequence ref are taken from the BAM file. In addition, the structural variation prediction sequence pre is constructed by introducing the structural variation s into the reference genome sequence ref. FIG. 2 shows the construction of the prediction sequence pre, taking a deletion as an example of a structural variation.
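A minimal sketch of how a prediction sequence pre might be built from ref is given below. The source only illustrates the deletion case (FIG. 2); the handling of insertions, duplications, and inversions, the 0-based `sv_start` offset, and the function name `build_predicted_sequence` are assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def build_predicted_sequence(ref_seq, sv_type, sv_start, sv_len, inserted_seq=""):
    """Apply one SV to ref_seq and return the predicted sequence.

    sv_start is a 0-based offset into ref_seq (hypothetical convention).
    """
    left = ref_seq[:sv_start]
    body = ref_seq[sv_start:sv_start + sv_len]
    right = ref_seq[sv_start + sv_len:]

    if sv_type == "DEL":   # remove the deleted bases
        return left + right
    if sv_type == "INS":   # place new bases at the breakpoint
        return left + inserted_seq + ref_seq[sv_start:]
    if sv_type == "DUP":   # tandem copy of the duplicated bases
        return left + body + body + right
    if sv_type == "INV":   # reverse-complement the inverted bases
        return left + body.translate(_COMPLEMENT)[::-1] + right
    raise ValueError(f"unsupported SV type: {sv_type}")


print(build_predicted_sequence("AAACCCGGGTTT", "DEL", 3, 3))
```

A read that truly carries the recorded variation should then align along the main diagonal against pre but not against ref.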

Then, the sequencing read is realigned separately against the corresponding reference genome sequence ref (the original reference genome sequence) and the structural variation prediction sequence pre (obtained by introducing the recorded structural variation into the reference genome sequence). For CCS and ONT sequencing data, the kmer length is set to 31. The sequencing read is traversed, and each kmer sequence is stored in a hash table with the kmer as the key and its start coordinate as the value. The reverse complement of the sequencing read is then traversed in the same way, and its kmer sequences and their start coordinates are stored in the hash table. The structure of the hash table is shown in Table 1.

Table 1. Hash table structure

Key	Value
read kmer_1	[kmer_1 position_1, kmer_1 position_2, ...]
read kmer_2	[kmer_2 position_1, kmer_2 position_2, ...]
…	…
read kmer_n	[kmer_n position_1, kmer_n position_2, ...]
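The hash table of Table 1 can be sketched with a Python dictionary that maps each kmer to a list of start positions. Storing the strand alongside each position is an assumption added here so that forward and reverse-complement hits can be told apart; the names `reverse_complement` and `build_kmer_table` are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_table(read_seq, k=31):
    """Map every kmer of the read (both strands) to its start positions.

    Values are (position, strand) pairs; "+" marks the read itself and
    "-" marks its reverse complement (an assumption, see lead-in).
    """
    table = defaultdict(list)
    for strand, seq in (("+", read_seq), ("-", reverse_complement(read_seq))):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((pos, strand))
    return table


# Small k for readability; the text uses k = 31 for CCS and ONT data.
table = build_kmer_table("ACGTAC", k=3)
print(table["ACG"])
```

A kmer that occurs several times in the read simply accumulates several positions in its list, which matches the list-valued entries of Table 1.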

Next, the kmer sequences of the reference genome sequence ref are traversed. Each time a new kmer sequence is obtained, the hash table is queried for the same kmer sequence. If it exists, the matched kmer positions in the sequencing read and in ref are advanced by 1 bp at a time, checking whether the next bases are still identical, until the two bases differ. The start coordinates of the matched sequence in the read and in ref, the length of the matched sequence, and its direction are then recorded (a match with a kmer of the sequencing read itself is forward; a match with a kmer of the reverse complement of the read is reverse). The realignment result is thus a series of matched start-stop coordinates, each of which appears as a line segment (segment) in a planar rectangular coordinate system. The hash realignment algorithm proceeds as follows:

1) Obtain the reverse-complement sequence read_reverse_seq by reverse-complementing the sequencing read sequence read_seq;

2) Add each kmer sequence read_kmer of the sequencing read sequence read_seq to the hash table;

3) Process the reverse-complement sequence read_reverse_seq in the same way as in step 2);

4) Traverse all kmer sequences ref_kmer of the reference genome sequence ref_seq. If ref_kmer is present in the hash table from step 2), compare the base following ref_kmer with the base following the matching read_kmer in the hash table; if they are the same, extend the matched sequence match_seq, otherwise set match_seq = ref_kmer.
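The four steps above amount to a seed-and-extend procedure, sketched below for the forward strand only (the reverse strand would repeat the same loop on the reverse complement of the read). The function name `realign`, the segment tuple layout, and the rule for skipping seeds that lie inside an already extended match are illustrative assumptions.

```python
def realign(read_seq, ref_seq, k=31):
    """Return maximal exact matches as (ref_start, read_start, length, strand)."""
    # Steps 1-3 (forward strand only): index every kmer of the read.
    table = {}
    for pos in range(len(read_seq) - k + 1):
        table.setdefault(read_seq[pos:pos + k], []).append(pos)

    segments = []
    # Step 4: look up each reference kmer and extend base by base.
    for ref_pos in range(len(ref_seq) - k + 1):
        for read_pos in table.get(ref_seq[ref_pos:ref_pos + k], []):
            # If the preceding bases also match, this seed lies inside a
            # match that was already extended from an earlier seed.
            if (ref_pos > 0 and read_pos > 0
                    and ref_seq[ref_pos - 1] == read_seq[read_pos - 1]):
                continue
            length = k
            while (ref_pos + length < len(ref_seq)
                   and read_pos + length < len(read_seq)
                   and ref_seq[ref_pos + length] == read_seq[read_pos + length]):
                length += 1
            segments.append((ref_pos, read_pos, length, "+"))
    return segments


ref = "AAAACCCCGGGGTTTT"
read = "AAAAGGGGTTTT"  # ref with "CCCC" deleted
print(realign(read, ref, k=4))
```

In this toy case the read yields one segment on the main diagonal and one segment shifted by the deletion length, which is the pattern the later modules rely on.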

3. De-duplication module

The de-duplication module de-duplicates the realignment results output by the sequence alignment module (the sequencing read containing the structural variation against the reference genome sequence ref and against the structural variation prediction sequence pre) in order to eliminate the influence of repetitive sequences in the reference genome on the structural variation, and sends the de-duplicated results to the structural variation evaluation module;

specifically, the human reference genome contains a large number of repetitive sequences, which seriously interfere with SV analysis. The invention therefore proposes a new approach that eliminates the influence of repetitive sequences in the reference genome on the structural variation region. The sequence alignment module provides the realignment results of the sequencing read containing the structural variation against ref and against pre; the de-duplication module de-duplicates them and sends the result to the structural variation evaluation module.

To obtain the coordinates of the repeated sequences within the reference genomic sequence ref, the reference genomic sequence ref is re-aligned to itself.

If the reference genome sequence ref contains no repetitive sequences, the result of this realignment is a single line segment whose length equals that of ref, which appears in the segment plot (alignment plot) as a segment along the main diagonal. If ref contains repetitive sequences, the result of the realignment includes several segments of unequal length, which appear in the segment plot as segments off the main diagonal; the ref coordinate ranges of these repetitive sequences are recorded.

The results of realigning the sequencing read against the reference genome sequence ref are then traversed in order of ref coordinate. In the segment plot (alignment plot), for segments to the left of the structural variation (SV), de-duplication retains the segments that lie on the main diagonal; for segments to the right of the SV, de-duplication retains the segments that lie on the line obtained by shifting the main diagonal by an intercept equal to the SV length. The other segments within the ref coordinate ranges of the repetitive sequences are removed, so that the remaining segments reflect only the structural variation region. When these segments are traversed in the segment plot, a breakpoint occurs between two non-collinear segments.
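The retention rule described above can be sketched as a filter on segments. This is a simplified reading, not the invention's implementation: segments are `(ref_start, read_start, length)` tuples in window-relative coordinates, `sv_offset` is the signed shift of the diagonal to the right of the SV, a segment counts as "left" only if it ends at or before `sv_ref_start`, and all names are hypothetical.

```python
def deduplicate(segments, repeat_ranges, sv_ref_start, sv_offset, tol=0):
    """Drop repeat-induced off-diagonal segments, keep the expected diagonals.

    repeat_ranges: list of (start, end) ref intervals found by aligning
    ref against itself (half-open, assumed already computed).
    """
    kept = []
    for ref_start, read_start, length in segments:
        ref_end = ref_start + length
        in_repeat = any(ref_start < r_end and r_start < ref_end
                        for r_start, r_end in repeat_ranges)
        if not in_repeat:
            # Segments outside recorded repeats are never removed.
            kept.append((ref_start, read_start, length))
            continue
        # Left of the SV: main diagonal. Right of the SV: shifted diagonal.
        expected = 0 if ref_end <= sv_ref_start else sv_offset
        if abs((ref_start - read_start) - expected) <= tol:
            kept.append((ref_start, read_start, length))
    return kept


segments = [(0, 0, 50), (10, 30, 8), (30, 10, 8), (80, 60, 40), (90, 40, 8)]
repeats = [(10, 18), (30, 38), (90, 98)]
print(deduplicate(segments, repeats, sv_ref_start=50, sv_offset=20))
```

The `tol` parameter is an added allowance for small indels near segment ends; with `tol=0` only segments exactly on the expected line survive inside repeat ranges.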

A schematic diagram of the deduplication principle is shown in fig. 3.

4. Structural variation evaluation module

The structural variation evaluation module evaluates the received de-duplicated results with a distance-metric-based structural variation evaluation method and sends the evaluation results to the output module;

specifically, based on the output of the de-duplication module, a novel distance-metric-based structural variation evaluation method is used to score and verify each structural variation.

First, the segments generated by realigning the sequencing read against the reference genome sequence ref and the segments generated by realigning the read against the structural variation prediction sequence pre are visualized separately to generate segment plots, in which the realigned segments appear as line segments. If the two sequences are identical, the segment plot contains a single line segment lying entirely on the main diagonal; otherwise the line segments deviate from the main diagonal, and the deviation represents the degree of difference between the two sequences. In the segment plot of the read against ref, the presence of such a difference shows that a structural variation occurs in the region; in the segment plot of the read against pre, the magnitude of the difference indicates how well the structural variation present in the region agrees with the one described in the VCF file. The degree of difference is quantified as the average distance between all line segments in the segment plot and the main diagonal, and the two degrees of difference are normalized to evaluate the structural variation.

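Assume that a structural variation s covers m sequencing reads and that the segment plot corresponding to the i-th sequencing read contains n line segments. The distance $d_{s,i,j}$ between the j-th line segment and the main diagonal is defined as:

$$d_{s,i,j} = \frac{1}{3}\left[\left(x_{s,i,j,start} - y_{s,i,j,start}\right) + \left(x_{s,i,j,mid} - y_{s,i,j,mid}\right) + \left(x_{s,i,j,end} - y_{s,i,j,end}\right)\right]$$

where $x_{s,i,j,start}$, $x_{s,i,j,mid}$, and $x_{s,i,j,end}$ are the ref start, midpoint, and end coordinates of the j-th line segment, and $y_{s,i,j,start}$, $y_{s,i,j,mid}$, and $y_{s,i,j,end}$ are its read start, midpoint, and end coordinates.

The average distance $d_{s,i,j,avg}$ in the segment plot corresponding to the structural variation s is defined as the average of the distances $d_{s,i,j}$ between all n line segments and the main diagonal.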
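The distance of one line segment from the main diagonal, as given by the three-point formula in this document, can be computed as in the sketch below. The averaging step is an assumption: the text states only that the average over all line segments is taken and that the result lies in [0, +∞), so absolute values are averaged here. The function names are hypothetical.

```python
def segment_distance(x_start, y_start, x_end, y_end):
    """Mean of (ref - read) offsets at the start, midpoint and end of a segment.

    x_* are ref coordinates and y_* are read coordinates, as in the text.
    """
    x_mid = (x_start + x_end) / 2
    y_mid = (y_start + y_end) / 2
    return ((x_start - y_start) + (x_mid - y_mid) + (x_end - y_end)) / 3


def average_distance(segments):
    """Average |distance| over all segments of one plot.

    Assumption: absolute values are used so the result is non-negative.
    Returns 0.0 for an empty plot.
    """
    if not segments:
        return 0.0
    return sum(abs(segment_distance(*seg)) for seg in segments) / len(segments)


# One segment on the main diagonal, one shifted by 20 bp.
plot = [(0, 0, 50, 50), (80, 60, 120, 100)]
print(segment_distance(80, 60, 120, 100), average_distance(plot))
```

For a forward segment the three offsets are equal, so the distance reduces to the difference between the ref and read start coordinates.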

Let the average distance between the sequencing read and the reference genome sequence ref be $d_{s,i,j,avg,ref}$; then $d_{s,i,j,avg,ref} \in [0, +\infty)$, and a larger value indicates that a structural variation exists in the region. Let the average distance between the sequencing read and the structural variation prediction sequence pre be $d_{s,i,j,avg,predict}$; then $d_{s,i,j,avg,predict} \in [0, +\infty)$, and a smaller value indicates that the structural variation in the region is more consistent with the one recorded in the VCF file. Normalizing the two gives the score $Score_{s,i}$ of the structural variation s for the i-th read.

Further, $Score_{s,i} \in [0, 1]$. $Score_{s,i} = 0$ indicates that no structural variation exists in the region; $Score_{s,i} = 1$ indicates that the region contains a structural variation that completely matches the VCF record. In practice, sequencing errors or incomplete de-duplication may leave some noise points in a segment plot, and this noise affects the value of $Score_{s,i}$. Therefore, the closer $Score_{s,i}$ is to 1, the more consistent the true structural variation is with the VCF record, and the closer $Score_{s,i}$ is to 0, the less consistent it is.
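The normalization formula itself is not reproduced in this text. The sketch below uses one candidate that satisfies the stated properties (a value in [0, 1], 0 when the read matches ref, 1 when the read matches pre): the ratio of the ref distance to the sum of both distances. This choice and the function name `sv_score` are assumptions, not the invention's formula.

```python
def sv_score(d_ref, d_pre):
    """Candidate normalization: d_ref / (d_ref + d_pre).

    d_ref: average distance between the read and ref (>= 0).
    d_pre: average distance between the read and pre (>= 0).
    Returns 0.0 when both distances are zero, since a read identical
    to ref shows no evidence of a structural variation.
    """
    total = d_ref + d_pre
    if total == 0:
        return 0.0
    return d_ref / total


print(sv_score(0.0, 5.0))   # read agrees with ref
print(sv_score(20.0, 0.0))  # read agrees with pre
print(sv_score(30.0, 10.0))
```

Any other monotone normalization with the same boundary behaviour would fit the description equally well.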

For a structural variation s, the scores of all m sequencing reads that it covers are calculated to form the list $[Score_{s,0}, Score_{s,1}, \ldots, Score_{s,i}, \ldots, Score_{s,m}]$, and the highest score among them is:

$$Score_{s,highest} = \max\left(\left[Score_{s,0}, Score_{s,1}, \ldots, Score_{s,i}, \ldots, Score_{s,m}\right]\right)$$

in addition, the user may specify a scoring threshold for structural variation, with Score as a default _threshold =0.8, score assuming i-th sequencing sequence read _s,i ＞Score _threshold Then the sequencing sequence read is considered as a support sequence (supporting reads) for the structural variation s.

Thus, the supporting-read ratio of a structural variation s, Supporting reads ratio_s, is:

In the formula, Score_threshold is the structural variation scoring threshold set by the user, a decimal between 0 and 1. When the score of structural variation s is greater than the threshold Score_threshold, s is considered a structural variation that has passed evaluation.
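The formula for the supporting-read ratio is not reproduced in this text. The sketch below therefore rests on two labeled assumptions: that the ratio is the number of supporting reads divided by the number of reads covering the structural variation, and that the score compared with the threshold when qualifying s is its highest per-read score. The function names are hypothetical.

```python
from typing import List


def supporting_read_ratio(scores: List[float], threshold: float = 0.8) -> float:
    """Assumed definition: supporting reads divided by all covering reads."""
    if not scores:
        return 0.0
    n_support = sum(1 for score in scores if score > threshold)
    return n_support / len(scores)


def is_qualified(scores: List[float], threshold: float = 0.8) -> bool:
    """Assumption: s passes evaluation when its highest read score exceeds the threshold."""
    return bool(scores) and max(scores) > threshold


# Hypothetical per-read scores for one structural variation s.
scores = [0.95, 0.42, 0.81, 0.80]
ratio = supporting_read_ratio(scores)
qualified = is_qualified(scores)
```

With these assumed definitions, two of the four example reads exceed the default threshold, so the ratio is one half.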

Further, the genotyping result of structural variation s, Genotyping_s, is:

5. Output module

The output module outputs the received evaluation result.

The highest score of the structural variation evaluation Score_{s,highest}, the scores of all sequencing reads covered by the structural variation [Score_{s,0}, Score_{s,1}, ..., Score_{s,i}, ..., Score_{s,m}], the supporting-read ratio, the genotyping result, and all breakpoints within the structural variation are written into appended columns of the input VCF file. In addition, a visualized sequence alignment segment plot is output for each structural variation. By default the picture after deduplication is output so that the user can inspect the real situation of each structural variation more intuitively, and the user can specify whether the pictures of the deduplication process are also output.
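As a rough illustration of writing these results into appended columns, the Python sketch below adds five tab-separated fields to a single VCF data line. The column order, number formatting, function name, and example record are all assumptions; this section does not specify the exact layout. Standard VCF reserves the columns after INFO for FORMAT and sample data, so an implementation could equally store these values as INFO keys.

```python
from typing import List


def append_evaluation_columns(
    vcf_line: str,
    highest: float,
    scores: List[float],
    support_ratio: float,
    genotype: str,
    breakpoints: List[int],
) -> str:
    """Append evaluation results to one VCF data line as extra tab-separated fields."""
    extra = [
        f"{highest:.3f}",                        # highest score
        ",".join(f"{s:.3f}" for s in scores),    # scores of all covering reads
        f"{support_ratio:.3f}",                  # supporting-read ratio
        genotype,                                # genotyping result
        ",".join(str(b) for b in breakpoints),   # breakpoints inside the variation
    ]
    return vcf_line.rstrip("\n") + "\t" + "\t".join(extra) + "\n"


# Hypothetical eight-column VCF record describing a deletion.
record = "chr1\t10000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=10500\n"
out = append_evaluation_columns(record, 0.95, [0.95, 0.42], 0.5, "0/1", [10000, 10500])
```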

To verify the correctness of the structural variation verification tool provided by the invention, experiments were carried out on both simulated samples and real samples.

1. Simulated data

In the 23 human chromosomes from chromosome 1 to chromosome X, 21000 structural variations were randomly inserted, covering ten structural variation types including deletion (Deletion), insertion (Insertion), duplication (Duplication) and inversion (Inversion). Of these, 20000 structural variations were used as positive samples and 1000 as negative samples. FIG. 4 is the ROC curve of these 21000 samples.
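For readers unfamiliar with how an ROC curve is derived from scored positive and negative samples, the Python sketch below computes the false positive rate and the true positive rate (recall) at several score cutoffs. The toy scores and labels are invented for illustration and are not the experimental data, and calling a sample positive when its score is strictly greater than the cutoff is an assumption.

```python
from typing import List, Tuple


def roc_points(
    scores: List[float], labels: List[bool], cutoffs: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (cutoff, false positive rate, true positive rate) for each cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative samples are required")
    points = []
    for cutoff in cutoffs:
        # A sample is called positive when its score exceeds the cutoff.
        tp = sum(1 for s, y in zip(scores, labels) if y and s > cutoff)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s > cutoff)
        points.append((cutoff, fp / n_neg, tp / n_pos))
    return points


# Toy data: four positive samples followed by two negative samples.
toy_scores = [0.95, 0.85, 0.60, 0.30, 0.90, 0.10]
toy_labels = [True, True, True, True, False, False]
points = roc_points(toy_scores, toy_labels, [0.2, 0.5, 0.8])
```

Plotting the true positive rate against the false positive rate over a fine grid of cutoffs gives the ROC curve, and the true positive rate alone is the recall at each cutoff.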

The structural variation verification tool provided by the invention outputs a score between 0 and 1 for each structural variation. The closer the score is to 1, the closer the structural variation result is to the true structural variation, that is, the more reliable the verification result. The recall on the simulated data under different user-specified score thresholds (SV score cutoff) is shown in FIG. 5.

The structural variation verification tool provided by the invention is also applicable to genome repetitive sequences. To verify this, the 21000 samples were screened to obtain the 405 samples located in repeat regions. FIGS. 6 (a) and (b) show how the recall of structural variations in repeat regions varies with the score threshold. It can be seen that the tool performs as well on structural variations in repetitive regions as on those in non-repetitive regions.

2. Real data

To compare the results of the structural variation verification tool provided by the invention (spotsv) with those of an existing tool (vapor), both tools were run on the HG002 structural variation dataset. FIG. 7 shows the recall of the two tools as a function of the score threshold on the HG002 sample sets. It can be seen that the recall of spotsv is far higher than that of vapor under different sequencing technologies and different sequencing depths.

FIGS. 8 (a), (b) and (c) are visualization outputs of the structural variation verification tool provided by the invention, taking one deletion (Deletion) as an example. FIG. 8 (a) is the segment plot of the reference genome against itself, which shows that the region contains a large number of repetitive sequences. FIG. 8 (b) is the segment plot of the reference genome against the sequencing read before deduplication, in which the structural variation cannot be identified because of the repetitive sequences. FIG. 8 (c) is the segment plot of the reference genome against the sequencing read after deduplication; with the influence of the repetitive sequences removed, the real situation of the structural variation can be clearly evaluated. FIG. 8 thus shows that the structural variation verification tool provided by the invention is applicable to genome repetitive sequences.

Claims

1. A structural variation verification system for genomic repeat sequences, comprising:

and a structural variation evaluation module, used for evaluating the received deduplicated result by a distance-measurement-based structural variation evaluation method, so as to verify structural variations in genome repetitive sequences.

2. A structural variation verification method suitable for a genome repetitive sequence is characterized by comprising the following steps:

3. The method for verifying the structural variation of the genome repetitive sequence according to claim 2, wherein the specific process in step 1) comprises:

extracting start-stop coordinates of each structural variation in the VCF file;

4. The method for verifying structural variation of genome repetitive sequences according to claim 3, wherein the hash table is determined by the following process:

5. The method for verifying structural variation of genome repetitive sequences according to claim 2, wherein the specific process of step 2) is as follows:

6. The method for verifying structural variation of genome repetitive sequences according to claim 5, wherein the step 2) further comprises the steps of:

traversing the result of aligning the sequencing read with the reference genome sequence ref in the order of the ref coordinates of the repetitive sequence; in the sequence alignment plot, for fragments located to the left of the structural variation, retaining the fragment on the main diagonal while removing duplicates; and for fragments located to the right of the structural variation, retaining the fragment on the line obtained by shifting the main diagonal by an intercept equal to the structural variation length while removing duplicates.

7. The method for verifying structural variation of repetitive genome sequences according to claim 2, wherein the specific process of step 3) is as follows:

8. The method of claim 7, wherein the average distance d_{s,i,j,avg} corresponding to the structural variation s in the sequence alignment plot is calculated by the following formula:

9. The method of claim 7, wherein the average distance between the sequencing read and the reference genome sequence ref and the average distance between the sequencing read and the structural variation prediction sequence pre are normalized to obtain the score Score_{s,i} of the structural variation s:

When the score of the structural variation s is greater than the threshold Score_threshold, the structural variation s is considered a structural variation that has passed evaluation.