CN113963746B - Genome structure variation detection system and method based on deep learning

Info

Publication number: CN113963746B
Application number: CN202111156180.9A
Authority: CN
Inventors: 叶凯; 蔺佳栋; 王松渤; 杨晓飞
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2023-09-19
Anticipated expiration: 2041-09-29
Also published as: CN113963746A

Abstract

The invention provides a deep-learning-based system and method for detecting genome structural variation. The method comprises four main steps: (1) extracting structural variation characteristic sequences using existing sequence alignment techniques; (2) encoding structural variation characteristic sequence similarity images as RGB images; (3) predicting the structural variations contained in those similarity images with a multi-target recognition framework; (4) systematically representing complex structural variation types with a graph data structure. The invention detects simple and complex structural variations simultaneously from images that encode sequence differences, without relying on any structural variation model.

Description

Genome structure variation detection system and method based on deep learning

Technical Field

The invention belongs to the technical field of precision medicine and relates to a deep-learning-based system and method for detecting genome structural variation.

Background

Over the past decade, large international collaborative projects based on second-generation sequencing data, such as TCGA, ICGC and HGSVC, have steadily revealed differences in genomic structural variation among populations and their close relationship with genetic diseases, tumors and other conditions. In the last five years, with the development and growing adoption of third-generation long-read sequencing, the number of known structural variations in the human germline has reached 2.5 times the number detected by second-generation sequencing, and these variations provide an important basis for subsequent research on evolution and related diseases. More importantly, further analysis has shown that more and more variations first reported as simple are in fact complex structural variations; for example, complex genomic structural variations were first described comprehensively in Nature in 2015. Complex structural variations are distinctive first in that they form by mechanisms different from those of simple structural variations and, as a part of the genome that has not yet been mined, they offer researchers new evidence for studying the repair of genomic damage. In addition, complex structural variations are strongly associated with genetic and developmental diseases, and related studies have greatly enriched researchers' understanding of them; for example, a study published in Genome Biology in 2017 found 16 different complex variation types and analyzed in depth their roles in the development of autism. However, these germline complex structural variations cannot be detected by conventional clinical approaches, and, because of the complexity of the variations and the limitations of conventional detection methods, methods based on third-generation sequencing cannot detect these complex events accurately either.
Compared with germline structural variations, tumor genomes undergo multiple rounds of rapid selection and therefore contain more large-scale complex structural variations, such as chromothripsis and chromoplexy. These complex structural variations are believed to form rapidly during tumor development and to greatly promote it.

In general, as the concept of 'big health' is promoted worldwide and population aging in China becomes more prominent, the incidence of genetic diseases, developmental diseases and cancers keeps rising. As the price of third-generation sequencing continues to fall, whole-genome testing based on third-generation sequencing will inevitably become a trend in clinical diagnosis. In this setting, large amounts of sequencing data will be generated, and interpreting these data, especially data related to clinical disease, will be critical to the development of the whole industry.

Current whole-genome structural variation detection for third-generation sequencing still follows the detection theory developed for second-generation sequencing data, which comprises three main steps: (1) modeling known genomic structural variations; (2) deducing the abnormal alignment features that each model may produce in the sequencing read alignments; (3) matching the read alignments against the abnormal alignment feature models constructed for the different structural variation types to obtain the final detection results. Tools developed on this idea, such as PBSV, Sniffles, SVIM, NanoVar and cuteSV, have been widely used for germline structural variation detection and for the analysis of a small number of disease and tumor samples. To detect complex structural variations, most detection tools resort to patching, i.e. adding to the original tool an abnormal alignment model for each new type of structural variation. The most representative is Sniffles, the first detection tool to detect two complex structural variation types by adding extra abnormal alignment models. However, even as sequencing technology develops, researchers' knowledge of genomic structural variation remains the tip of the iceberg, and patching treats the symptoms rather than the root cause: structural variation types that are still unknown cannot be explored. Moreover, because specific code must be written for each variation type, tools built on the modeling idea are complex and hard to read, which directly leads to low computational efficiency and difficult maintenance.
Furthermore, the complexity of the abnormal alignment features of complex structural variations causes large differences in detection sensitivity across size ranges and variation types. For example, as shown in fig. 1, for a simple deletion and a deletion-inversion complex structural variation, existing tools may report the complex variation as a single deletion or a single inversion, and some tools miss the event altogether. In the last two or three years, as more and more complex structural variations have been discovered through laborious manual analysis, biomedical researchers have increasingly recognized that complex structural variations play an important role in certain diseases that cannot otherwise be diagnosed; a brand-new detection system that achieves more comprehensive structural variation detection is therefore a key technology for future clinical testing. Beyond the limitations of models, repetitive sequences have long been a key factor affecting structural variation detection, and no effective solution is available to date. In addition, the representation of complex structural variations has never been unified: most studies describe a complex structural variation type as a combination of simple variations accompanied by detailed textual explanation, and the biggest problem with this practice is that complex structural variations detected in different studies are hard to compare.

In summary, over the past 10 years researchers have used genomic sequencing data to detect simple variation types and applied this information to research on human evolution, population migration and admixture, disease mechanisms and treatment protocols, greatly advancing biomedicine. However, structural variation detection theory based on modeling and patching can no longer meet the future variation detection needs of scientific research, hospitals and genetic testing service providers; in particular, it cannot support the transition from targeted testing to whole-genome testing.

Disclosure of Invention

Aiming at the problems of the existing whole genome structural variation detection technology, the invention provides a genome structural variation detection system and method based on deep learning, which realize the simultaneous detection of simple and complex structural variation from sequence difference coding images without any structural variation model.

In order to achieve the above object, the technical scheme of the present invention is as follows:

a genome structure variation detection method based on deep learning, comprising:

step 1, extracting structural variation characteristic sequences: aligning the sample sequence to a reference genome sequence to obtain a global alignment result, and extracting a structural variation characteristic sequence according to the global alignment result; the matched fragments in the structural variation characteristic sequence are called main fragments; according to the alignment features of the structural variation characteristic sequence, performing local k-mer realignment of the sequences of the unmatched fragments in the structural variation characteristic sequence against the reference genome sequence, the matched fragments obtained through the local k-mer realignment being called secondary fragments;

step 2, structural variation characteristic sequence similarity image coding: using the three-channel coding of an RGB image and combining the main fragments and the secondary fragments, coding the structural variation characteristic sequence and the reference genome sequence to obtain a similarity RGB image of the reference genome sequence and the sample sequence, and at the same time coding the reference genome sequence against itself to obtain a self-similarity image of the reference genome sequence; subtracting the two images to obtain a structural variation characteristic sequence similarity image;

step 3, structural variation characteristic sequence similarity image segmentation: combining two adjacent main fragments in the structural variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only containing single structural variation; combining adjacent primary fragments and secondary fragments in the sub-image in sequence in pairs according to the sequence of the primary fragments and the secondary fragments on the reference genome sequence to obtain a fragment of interest;

step 4, identifying structural variation characteristic sequence similarity images and carrying out structural variation characterization: identifying all interesting fragments in the sub-images containing single structural variation by using a pre-trained structural variation detection CNN model to obtain complex structural variation fragments; the complex structural variant fragments are systematically characterized and classified using a graph data structure.

Preferably, in step 1, performing local k-mer realignment of the sequences of the unmatched fragments in the structural variation characteristic sequence against the reference genome sequence according to the alignment features of the structural variation characteristic sequence specifically comprises: extracting the unmatched fragments of the structural variation characteristic sequence from its CIGAR string, and performing local k-mer realignment of the sequences of the unmatched fragments against the reference genome sequence to obtain the secondary fragments;
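The CIGAR-based split described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_cigar`, the tuple layouts and the `min_sv_len` threshold of 50 bp are assumptions introduced here, and short indels are simply absorbed into the surrounding main fragment.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def split_cigar(cigar: str, ref_start: int, min_sv_len: int = 50):
    """Split one alignment into main (matched) and unmatched fragments.

    main fragments:      (read_start, read_end, ref_start, ref_end)
    unmatched fragments: (read_start, read_end, op) for long I or S runs
    min_sv_len is a hypothetical threshold, not a value from the source.
    """
    main: List[Tuple[int, int, int, int]] = []
    unmatched: List[Tuple[int, int, str]] = []
    read_pos, ref_pos = 0, ref_start
    current = None  # open main fragment: [read_start, read_end, ref_start, ref_end]
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        read_before, ref_before = read_pos, ref_pos
        if op in "MIS=X":   # operations that consume read bases
            read_pos += n
        if op in "MDN=X":   # operations that consume reference bases
            ref_pos += n
        if op in "HP":      # hard clips and padding consume neither
            continue
        if op in "IDNS" and n >= min_sv_len:
            # A long gap closes the open main fragment.
            if current is not None:
                main.append(tuple(current))
                current = None
            if op in "IS":  # unmatched read sequence, to be realigned later
                unmatched.append((read_before, read_pos, op))
        elif op == "S":
            continue        # short soft clip: ignored in this sketch
        elif current is None:
            if op in "M=X":
                current = [read_before, read_pos, ref_before, ref_pos]
        else:
            # Matches and short indels extend the open main fragment.
            current[1], current[3] = read_pos, ref_pos
    if current is not None:
        main.append(tuple(current))
    return main, unmatched
```

For a read aligned at reference position 1000 with CIGAR `100M5D100M200I100M30S`, the sketch yields two main fragments flanking one 200 bp unmatched insertion, which is the sequence that would be passed to the local k-mer realignment.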

preferably, step 2 specifically includes:

1) RGB three-channel sequence similarity coding: coding the structural variation characteristic sequence and the reference genome sequence into a sequence matching channel (255, 0, 0), a sequence repeating channel (0, 0, 255) and a sequence reversing channel (0, 255, 0), and outputting a similarity RGB image of the reference genome sequence and the sample sequence; coding the reference genome sequence against itself into the same three channels to obtain a self-similarity image of the reference genome sequence, and recording the position information of each matching fragment;

2) Removal of reference genome sequence repeats: according to the coordinate positions, relative to the reference genome sequence, of the matching fragments in the similarity RGB image of the reference genome sequence and the sample sequence and in the self-similarity image of the reference genome sequence, searching the self-similarity image for the fragment corresponding to each fragment in the similarity RGB image; if a corresponding fragment is found, removing it from the similarity RGB image of the reference genome sequence and the sample sequence, thereby obtaining the structural variation characteristic sequence similarity image.
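The two sub-steps of step 2 can be sketched as below. Several points are assumptions made for this illustration rather than statements of the source: the `Segment` record and its field names are hypothetical, each fragment receives exactly one pure colour (blue for repeat, else green for reverse, else red for match), an image is stored as a sparse dictionary of pixels, and the sample axis of the REF-to-ALT image is assumed to be expressed in the same coordinates as the REF-to-REF image so that fragments can be compared by position.

```python
from typing import Dict, List, NamedTuple, Tuple

Pixel = Tuple[int, int]
RGB = Tuple[int, int, int]

MATCH, REPEAT, REVERSE = (255, 0, 0), (0, 0, 255), (0, 255, 0)

class Segment(NamedTuple):
    ref_start: int         # position on the reference axis
    seq_start: int         # position on the sample (or second reference) axis
    length: int
    reverse: bool = False  # aligned in the reverse direction
    repeat: bool = False   # repeated sequence
    main: bool = False     # main fragment from the global alignment

def colour(seg: Segment) -> RGB:
    """Pick one channel per fragment (priority order is an assumption)."""
    if seg.repeat:
        return REPEAT
    return REVERSE if seg.reverse else MATCH

def encode(segments: List[Segment]) -> Dict[Pixel, RGB]:
    """Draw each fragment as a diagonal (or anti-diagonal) line of pixels."""
    image: Dict[Pixel, RGB] = {}
    for seg in segments:
        for i in range(seg.length):
            offset = seg.length - 1 - i if seg.reverse else i
            image[(seg.ref_start + offset, seg.seq_start + i)] = colour(seg)
    return image

def subtract(ref_to_alt: List[Segment], ref_to_ref: List[Segment]) -> List[Segment]:
    """Drop secondary fragments that also occur in the REF-to-REF image."""
    background = {(s.ref_start, s.seq_start, s.length, s.reverse) for s in ref_to_ref}
    return [
        s for s in ref_to_alt
        if s.main or (s.ref_start, s.seq_start, s.length, s.reverse) not in background
    ]
```

Calling `encode(subtract(ref_to_alt, ref_to_ref))` then gives the structural variation characteristic sequence similarity image of this toy model: main fragments and sample-specific secondary fragments remain, while fragments that merely reflect repeats within the reference are removed.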

Preferably, step 3 specifically includes:

1) Single structural variation segmentation: combining two adjacent main fragments in the structural variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only containing single structural variation;

2) Structural variation image multi-target segmentation: sorting the main fragments and the secondary fragments by their coordinates on the structural variation characteristic sequence, and combining all fragments in the sub-image pairwise according to the sorting result, each combination of a main fragment and a secondary fragment being taken as a fragment of interest.

Preferably, the training method of the structural variation detection CNN model in step 4 specifically includes:

1) Constructing a structural variation training data set: the real data are the structural variation characteristic sequences of 2500 samples from the 1000 Genomes Project, the simulated data are noise-free training samples simulated with VISOR, and the real data and the simulated data together form the training data set;

2) Training data set encoding: encoding training data in the training data set by adopting the method in the step 2 to obtain an interested fragment of the training data set;

3) Model training: and inputting the training data set into a convolutional neural network, training the convolutional neural network, and obtaining a structural variation detection CNN model after training is completed.

Preferably, in step 4, identifying all the interesting fragments in the sub-images containing the single structural variation by using a pre-trained structural variation detection CNN model to obtain complex structural variation fragments; the complex structure variant fragments are systematically characterized and classified by using a graph data structure, and specifically:

identifying the fragments of interest in the sub-images with the structural variation detection CNN model to obtain complex structural variation fragments, constructing a structural variation characterization graph from the complex structural variation fragments, and determining from the topology of the structural variation characterization graphs whether different structural variations belong to the same type, wherein each node of the graph is a fragment contained in the fragments of interest of the sub-image, and each edge connects two fragments that are consecutive on the sample sequence.
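One way to picture the graph characterization is sketched below. It is not a general graph isomorphism test: as a simplification assumed here, nodes are relabelled by their rank along the reference, edges carry the strands of the two fragments they join, and a per-node-pair flag records whether neighbouring nodes are contiguous on the reference. The names `sv_graph` and `same_type` and the `Fragment` tuple layout are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# (ref_start, ref_end, strand), listed in the order seen on the sample sequence
Fragment = Tuple[int, int, str]

def sv_graph(path: List[Fragment]) -> Tuple[Tuple[bool, ...], FrozenSet[tuple]]:
    """Build a coordinate-free signature of the characterization graph."""
    # Nodes: distinct reference intervals, relabelled by reference order.
    nodes = sorted({(start, end) for start, end, _ in path})
    rank = {node: i for i, node in enumerate(nodes)}
    # Whether neighbouring nodes touch on the reference (a gap hints at a deletion).
    contiguous = tuple(a[1] == b[0] for a, b in zip(nodes, nodes[1:]))
    # Edges: fragments that are consecutive on the sample sequence.
    edges = frozenset(
        (rank[(s1, e1)], d1, rank[(s2, e2)], d2)
        for (s1, e1, d1), (s2, e2, d2) in zip(path, path[1:])
    )
    return contiguous, edges

def same_type(a: List[Fragment], b: List[Fragment]) -> bool:
    """Two variations share a type when their graph signatures are identical."""
    return sv_graph(a) == sv_graph(b)
```

Under these assumptions two tandem duplications at different loci and of different sizes produce the same signature, whereas an inversion and a deletion-inversion do not, because the deleted interval breaks reference contiguity.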

A deep learning-based genomic structural variation detection system comprising:

the structural variation characteristic sequence extraction module is used for aligning the sample sequence to the reference genome sequence to obtain a global alignment result, and extracting the structural variation characteristic sequence according to the global alignment result; the matched fragments in the structural variation characteristic sequence are called main fragments; according to the alignment features of the structural variation characteristic sequence, local k-mer realignment of the sequences of the unmatched fragments in the structural variation characteristic sequence is performed against the reference genome sequence, and the matched fragments obtained through the local k-mer realignment are called secondary fragments;

The structure variation characteristic sequence coding module is used for coding the structure variation characteristic sequence and the reference genome sequence by adopting a three-channel coding mode of the RGB image to obtain a similarity RGB image of the reference genome sequence and the sample sequence; meanwhile, coding the reference genome sequence to obtain a self-similarity image of the reference genome sequence; subtracting the two images to obtain a structural variation characteristic sequence similarity image;

the structure variation characteristic sequence similarity image segmentation module is used for combining two adjacent main fragments in the structure variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only comprising one structure variation; combining adjacent primary and secondary fragments in the sub-image in the order of the primary and secondary fragments on the reference genome sequence to obtain a fragment of interest;

the structural variation recognition and characterization module is used for recognizing all interesting fragments in the sub-images containing the single structural variation by using a pre-trained structural variation detection CNN model to obtain complex structural variation fragments; the complex structural variant fragments are systematically characterized and classified using a graph data structure.

Preferably, the structural variant signature sequence encoding module comprises:

the RGB three-channel sequence similarity coding module is used for coding the structural variation characteristic sequence and the reference genome sequence into a sequence matching channel (255, 0, 0), a sequence repeating channel (0, 0, 255) and a sequence reversing channel (0, 255, 0), and outputting a similarity RGB image of the reference genome sequence and the sample sequence; and for coding the reference genome sequence against itself into the same three channels to obtain a self-similarity image of the reference genome sequence, and recording the position information of each matching fragment;

the reference genome sequence repeated fragment removal module is used for searching, according to the coordinate positions relative to the reference genome sequence of the matching fragments in the similarity RGB image of the reference genome sequence and the sample sequence and in the self-similarity image of the reference genome sequence, the self-similarity image for the fragment corresponding to each fragment in the similarity RGB image; if a corresponding fragment is found, it is removed from the similarity RGB image of the reference genome sequence and the sample sequence, thereby obtaining the structural variation characteristic sequence similarity image.

Preferably, the structural variation feature sequence similarity image segmentation module includes:

the single structure variation segmentation module is used for combining two adjacent main fragments in the structure variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only comprising one structure variation;

the structural variation image segmentation module is used for sorting the main fragments and the secondary fragments by their coordinates on the structural variation characteristic sequence, combining adjacent fragments pairwise according to the sorting result to obtain combined fragments, filtering out the combined fragments formed by two secondary fragments or by two linear fragments, and taking the combined fragments formed by a main fragment and a secondary fragment as the fragments of interest.

Preferably, the structural variation recognition and characterization module specifically comprises:

the structural variation training data set module is used for forming a training data set from real data and simulated data, wherein the real data are the structural variations of 2500 samples from the 1000 Genomes Project, and the simulated data are noise-free training data simulated with VISOR;

the training data set coding module is used for coding all training data in the training data set according to the structural variation characteristic sequence coding module to obtain an interested fragment of the training data set;

The convolutional neural network training module is used for inputting the training data set into a convolutional neural network, training the convolutional neural network and obtaining a structural variation detection CNN model after training is completed;

the structural variation characterization module is used for identifying the fragments of interest in the sub-images with the structural variation detection CNN model to obtain complex structural variation fragments, constructing a structural variation characterization graph from the complex structural variation fragments, and determining from the topology of the structural variation characterization graphs whether different structural variations belong to the same type, wherein each node of the graph is a fragment contained in the fragments of interest of the sub-image, and each edge connects two fragments that are consecutive on the sample sequence.

Compared with the prior art, the invention has the following beneficial technical effects:

The invention provides, for the first time, a model-independent genome structural variation detection method for long-read sequencing, which mainly comprises four steps (fig. 2): (1) extracting structural variation characteristic sequences using existing sequence alignment techniques; (2) encoding structural variation characteristic sequence similarity images as RGB images; (3) predicting the structural variations contained in those similarity images with a multi-target recognition framework; (4) systematically representing complex structural variation types with a graph data structure. The invention optimizes existing image coding techniques and the multi-target recognition framework: in step (2) the optimized image coding encodes the similarity between the reference genome sequence and the sample sequence, and in step (3) a multi-target recognition framework based on a convolutional neural network (CNN) detects, from the structural variation characteristic sequence similarity image, the structural variations of different complexity carried by the sample sequence. First, a three-channel (RGB) coding scheme encodes the similarity between the reference genome sequence (REF) and the sample sequence (ALT) as an RGB image, defined as the REF-to-ALT image. Second, to remove the influence of background noise in the reference genome sequence (such as short tandem repeats) on structural variation detection, the self-similarity image of the reference genome sequence (REF-to-REF image) is encoded, and the background noise is then removed by image subtraction, which in particular avoids its influence on the detection of complex structural variations. Within the CNN-based multi-target recognition framework, the invention provides a two-step image segmentation method.
The first step splits the multiple structural variations contained in a structural variation characteristic sequence similarity image into sub-images that each contain only one structural variation; such a sub-image is defined as a similarity image (SI). The second step segments each similarity image into segments of interest (SOI). Finally, a pre-trained CNN model identifies all SOIs contained in the SI to obtain the final prediction. The image coding and the multi-target recognition framework used in the invention are highly extensible, so the method can be applied to many kinds of long-read sequencing data, in particular data generated by the PacBio and Oxford Nanopore sequencing platforms; the invention can also be applied to contigs assembled from long reads. Beyond detection itself, different naming schemes have been proposed in the past for the same complex structural variation type; these schemes often depend on researchers' subjective understanding, and the lack of a systematic, unified naming scheme hinders research on complex structural variation. Another innovation of the invention is that it is the first to represent complex structural variations with a graph data structure and to classify different types of complex structural variation by computing the similarity of graph topologies. With these features, the invention can greatly improve the detection rate and accuracy for structural variations of different complexity, further promote disease diagnosis and early screening based on third-generation sequencing data, and provide an efficient and reliable detection system for related fields.
In addition, because the invention identifies structural variation characteristic sequences at the sequence level, it is better suited to detecting structural variations from different clones within a tumor, and provides a brand-new detection system for studying the role of structural variation in tumor evolution.

Furthermore, because the detection rate of complex structural variations has been low, their representation and comparison have always depended on manual definitions by the researchers concerned. While improving the detection rate of complex structural variations, the invention also provides a graph-based representation of complex structural variations, which essentially encodes how the different SOIs (segments of interest) are connected relative to the reference genome sequence: the nodes of the graph are the fragments contained in the SOIs, and each edge connects two fragments that are consecutive on the sample sequence.

The invention relates to a genome structural variation detection system based on multi-target recognition. With the model-independent structural variation detection theory at its core, the system detects structural variations in long-read sequencing data without relying on any variation model, through a structural variation characteristic sequence extraction module, a structural variation characteristic sequence similarity image coding module, a structural variation characteristic sequence similarity image segmentation module and a structural variation recognition and characterization module. Sequence similarity coding with RGB images and multi-target recognition capture two fundamental properties of structural variation: first, a structural variation appears as a difference between the reference genome sequence and the sample sequence; second, a complex structural variation appears as a combination of several simple structural variations. The invention then detects structural variations of varying complexity by identifying the SOIs in the similarity image. The invention does not depend on any variation model and, for known simple structural variations, has higher sensitivity than existing model-based detection methods.

In summary, the genome structural variation detection system of the invention is a core technology for precision diagnosis. It seizes the major opportunity for precision medicine brought by third-generation sequencing, proposes a model-independent structural variation detection theory from a technical standpoint, and designs and implements a brand-new structural variation detection system according to that theory. The system maintains a high detection rate and accuracy for simple structural variations, greatly improves the detection rate of complex structural variations, and provides important technical support for the clinical application of third-generation sequencing. The invention addresses major national needs and a core problem of 'precision medicine', a national strategic emerging industry; it helps end the situation in which key core technologies in the strategically essential field of genome structural variation detection in China are controlled by others, and helps open new directions for 'precision medicine' related industries and cultivate new points of economic growth.

Drawings

FIG. 1 is a model of simple and complex structural variations constructed for long-read sequencing technologies;

FIG. 2 is the workflow of the deep-learning-based genome structural variation detection system for long reads and assembled sequences;

FIG. 3 is a comparison of simple structural variation detection results across different long-read sequencing platforms;

FIG. 4 is a comparison of detection results for simulated complex structural variations;

FIG. 5 is a comparison of detection results for structural variations of different sizes in high-fidelity (HiFi) sequencing data;

FIG. 6 is a comparison of the detection results of the invention on sequencing reads and on assembled contigs;

FIG. 7 is a comparison of structural variation detection results for phased genomes.

Detailed Description

The invention will now be described in further detail with reference to specific examples, which are intended to be illustrative of the invention rather than limiting.

The invention proposes a new theory of model-independent genome structural variation detection and, based on it, designs a genome structural variation detection system for long-read sequencing.

The deep-learning-based genome structural variation detection system and method of the invention rest on the observation that, for all long-read sequencing technologies, the fundamental property of a structural variation is a difference between a sample sequence (a sequencing read or an assembled sequence) and the reference genome sequence. The invention therefore encodes the differences between the sample sequence and the reference genome sequence as images, detects structural variations of different complexity in the similarity images of the reference genome sequence and the sample sequence with a multi-target recognition framework, and systematically characterizes and classifies the different types of complex structural variation with a graph data structure, thereby detecting and characterizing structural variations. The core of the algorithm designed on this theory comprises: (1) extraction of structural variation characteristic sequences; (2) coding of structural variation characteristic sequence similarity images; (3) segmentation of structural variation characteristic sequence similarity images; (4) recognition of structural variation characteristic sequence similarity images and characterization of structural variations.

The core method related by the invention mainly comprises the following steps:

Step 1, extracting a structural variation characteristic sequence: locating the regions where potential structural variation exists according to the sequencing data of a sample, and extracting the structural variation characteristic sequences supporting the structural variation; the method specifically comprises the following steps:

The sample sequence is aligned to the reference genome sequence using existing alignment techniques. After the alignment result is obtained, abnormally aligned reads are extracted, and the read sequences spanning the structural variation site are extracted from these reads; these read sequences are used as structural variation characteristic sequences either after sequence assembly or directly. Existing alignment techniques generally provide two main abnormal alignment features: (1) insertions and deletions recorded in the CIGAR string of the alignment result; (2) supplementary alignments retained in the alignment result.
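As an illustration of feature (1), the following minimal sketch scans a CIGAR string for large insertions and deletions. The function name `large_indels`, the tuple layout and the `min_size` default are assumptions made for this example and are not part of the invention; only the CIGAR operation semantics follow the SAM specification.

```python
import re

# CIGAR operations that consume reference and/or read bases (SAM specification).
CONSUMES_REF = set("MDN=X")
CONSUMES_READ = set("MIS=X")

def large_indels(cigar, ref_start, min_size=50):
    """Return (op, ref_pos, read_pos, length) for each I/D longer than min_size.

    Hypothetical helper: `min_size` mirrors the 50-base cutoff mentioned in the
    description, but the exact comparison used by the invention is an assumption.
    """
    signals = []
    ref_pos, read_pos = ref_start, 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "ID" and length > min_size:
            signals.append((op, ref_pos, read_pos, length))
        if op in CONSUMES_REF:
            ref_pos += length
        if op in CONSUMES_READ:
            read_pos += length
    return signals

print(large_indels("100M60I200M75D50M5I10M", ref_start=1000))
# [('I', 1100, 100, 60), ('D', 1300, 360, 75)]
```

The 5-base insertion near the end of the example is ignored because it falls below the size cutoff.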

K-mer-based realignment of variation characteristic sequences: for the structural variation characteristic sequences obtained in step 1 with existing techniques, the matched and unmatched fragments between the sample sequence and the reference genome sequence are recorded in the CIGAR string, and this result is regarded as the global alignment result. First, the sequences of the unmatched fragments are extracted from the global alignment result; second, these sequences are realigned to the reference genome sequence and the new alignment result is recorded. During realignment, whether a fragment is repeated and in which direction it aligns are recorded, and the alignment is performed in both directions, from the 5' end to the 3' end and from the 3' end to the 5' end of the DNA. Next, the matched fragments recorded in the CIGAR string are defined as main fragments, and the fragments obtained by realignment are defined as secondary fragments. The main and secondary fragments are collectively called the matched fragments of the sample sequence and the reference genome sequence, and each fragment carries three features: alignment position, alignment direction, and whether it is a repeated sequence.
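The realignment described above can be sketched as a k-mer hash lookup followed by exact extension, run once on the unmatched sequence and once on its reverse complement. This is a simplified illustration and not the invention's implementation: the function names, the dictionary fields, the choice of the first k-mer hit, and the exact-match extension are all assumptions.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def index_kmers(ref, k):
    """Hash every k-mer of the reference to its list of start positions."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index

def realign(unmatched, ref, k=5):
    """Return secondary fragments found by k-mer seeding and exact extension.

    For strand '-', `query_start` refers to the reverse-complemented query.
    """
    index = index_kmers(ref, k)
    fragments = []
    for strand, query in (("+", unmatched), ("-", revcomp(unmatched))):
        i = 0
        while i <= len(query) - k:
            hits = index.get(query[i:i + k])
            if not hits:
                i += 1
                continue
            ref_pos, length = hits[0], k  # assumption: use the first hit only
            while (i + length < len(query) and ref_pos + length < len(ref)
                   and query[i + length] == ref[ref_pos + length]):
                length += 1
            fragments.append({"query_start": i, "ref_start": ref_pos,
                              "length": length, "strand": strand,
                              "repeat": len(hits) > 1})
            i += length
    return fragments

ref = "GATTACAGGCTTCAACGGTA"
inverted = revcomp(ref[5:15])  # an unmatched piece that is an inverted copy
print(realign(inverted, ref))
```

In this toy case the unmatched piece has no forward k-mer hit, so the only fragment reported is on the reverse strand, carrying the three features named in the text (position, direction, repeat flag).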

Step 2, structural variation characteristic sequence similarity image coding: according to the structural variation characteristic sequence obtained in step 1, together with the alignment result of step 1 and the k-mer-based realignment result, a similarity RGB image between the structural variation characteristic sequence and the reference genome sequence is encoded; meanwhile, the same coding scheme is used to encode a self-similarity image of the reference genome sequence (the REF-to-REF image), and the denoised REF-to-ALT image is then obtained by image subtraction.

The step 2 specifically comprises the following steps:

1) RGB three-channel coding of the sequence alignment result: according to the three features carried by each fragment, a fragment is encoded into a sequence matching channel (255, 0, 0), a sequence repetition channel (0, 0, 255) and a sequence inversion channel (0, 255, 0). Each channel can encode sequences of different lengths by setting a minimum fragment threshold; each pixel in the image can reach single-base resolution, and each pixel represents a pair of identical bases in the reference genome sequence and the sample sequence. With this three-channel coding, the step outputs the similarity RGB image of the reference genome sequence and the sample sequence together with the position of each fragment in the image. Meanwhile, the realignment method of step 1 is invoked again to align the reference genome sequence against itself; a single-base-resolution alignment result is obtained by sliding the k-mer continuously during alignment, and the REF-to-REF image is then obtained with the same three-channel coding.

2) Denoising of the similarity RGB image of the reference genome sequence and the sample sequence: given the similarity RGB image and the REF-to-REF image obtained in 1), first, the start and end coordinates of each fragment in the two images relative to the reference genome sequence are extracted, and the fragments extracted from each image are sorted by their coordinates on the reference genome sequence; second, each fragment extracted from the similarity RGB image of the reference genome sequence and the sample sequence is checked, one by one, for overlap with the fragments of the REF-to-REF image, and because the fragments are sorted, each lookup can be completed in O(log n) time. Once the overlap of a fragment is known, a preset threshold determines whether the fragment should be removed from the similarity RGB image of the reference genome sequence and the sample sequence. Subtracting the two images in this way yields the structural variation characteristic sequence similarity image.
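The overlap check in 2) can be illustrated with a binary search over fragments sorted by reference coordinate. The sketch below is a simplification under stated assumptions: fragments are half-open `(start, end)` intervals on the reference axis only, the REF-to-REF fragments do not overlap each other, and the `threshold` value and function names are hypothetical.

```python
from bisect import bisect_right

def covered_fraction(frag, ref_frags, ref_starts):
    """Fraction of `frag` covered by REF-to-REF fragments.

    Binary search finds the first candidate in O(log n); the short scan that
    follows visits only fragments that can still overlap `frag`.
    """
    start, end = frag
    i = max(bisect_right(ref_starts, start) - 1, 0)
    covered = 0
    while i < len(ref_frags) and ref_frags[i][0] < end:
        s, e = ref_frags[i]
        covered += max(0, min(end, e) - max(start, s))
        i += 1
    return covered / (end - start)

def denoise(sample_frags, ref_frags, threshold=0.5):
    """Keep sample fragments whose covered fraction stays below the threshold."""
    ref_frags = sorted(ref_frags)
    ref_starts = [s for s, _ in ref_frags]
    return [f for f in sample_frags
            if covered_fraction(f, ref_frags, ref_starts) < threshold]

repeats_in_reference = [(100, 200), (300, 400)]
sample = [(0, 90), (120, 180), (190, 310), (350, 500)]
print(denoise(sample, repeats_in_reference))
# [(0, 90), (190, 310), (350, 500)]
```

Only the fragment `(120, 180)` lies entirely inside a reference self-repeat, so it is the one removed.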

Step 3, structural variation characteristic sequence similarity image segmentation: the structural variation characteristic sequence similarity image obtained in step 2 is first divided, according to the arrangement and combination of the main fragments in the image, into similarity images (SI) that each carry only a single structural variation; on the basis of each SI, the similarity image is then segmented into different segments of interest (SOI) according to the combination of main fragments and secondary fragments;

The step 3 specifically comprises the following steps:

1) Single structural variation segmentation: the method is used for separating a plurality of different structural variations appearing in the structural variation characteristic sequence similarity image, and avoids predicting and characterizing continuously appearing simple structural variations as complex structural variations. Firstly, arranging main fragments in a structural variation characteristic sequence similarity image in sequence, secondly, combining the main fragments in pairs in sequence, and simultaneously extracting the image containing the main fragments from the original structural variation characteristic sequence similarity image according to the coordinate positions of the main fragments to finally obtain a similarity image containing single structural variation;

2) Multi-target segmentation of the structural variation similarity image: the similarity image is segmented to obtain a plurality of SOIs to be identified. First, the main fragments and secondary fragments in the similarity image are sorted by their order relative to the reference genome sequence; second, each pair of adjacent fragments in the sorted result is combined, and the sub-image containing the two fragments is extracted and called an SOI. During extraction, SOIs formed by two secondary fragments are filtered out first; next, it is determined whether the two fragments form a single linear fragment, mainly by checking whether the difference between the slopes of the two fragments in two-dimensional space lies within a fixed range, where the range is derived from the minimum structural variation size the user wants to detect. Because each similarity image has different dimensions, the slope difference range also differs from image to image. After these two filtering steps, the remaining SOIs in each similarity image are used for subsequent identification;
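The two filters in 2) can be sketched as follows. The text does not give the exact slope formula, so this example assumes one possible reading: two forward fragments are treated as linear when the line joining the start of the first to the end of the second deviates from the diagonal by less than `min_sv_size / ref_span`, which makes the tolerance depend on the image dimensions. All names and fields are hypothetical.

```python
def is_linear(a, b, min_sv_size):
    """Assumed collinearity test for two fragments (see lead-in)."""
    if a["strand"] != "+" or b["strand"] != "+":
        return False
    ref_span = b["ref_end"] - a["ref_start"]
    read_span = b["read_end"] - a["read_start"]
    if ref_span <= 0:
        return False
    return abs(read_span / ref_span - 1.0) < min_sv_size / ref_span

def extract_sois(fragments, min_sv_size=50):
    """Pair adjacent fragments (sorted on the reference) and apply both filters."""
    ordered = sorted(fragments, key=lambda f: f["ref_start"])
    sois = []
    for a, b in zip(ordered, ordered[1:]):
        if not a["main"] and not b["main"]:
            continue  # filter 1: two secondary fragments
        if is_linear(a, b, min_sv_size):
            continue  # filter 2: the pair forms one linear fragment
        sois.append((a["name"], b["name"]))
    return sois

def frag(name, ref_start, ref_end, read_start, read_end, main, strand="+"):
    return {"name": name, "ref_start": ref_start, "ref_end": ref_end,
            "read_start": read_start, "read_end": read_end,
            "main": main, "strand": strand}

fragments = [
    frag("p1", 0, 1000, 0, 1000, True),
    frag("p2", 1010, 2000, 1000, 1990, True),   # 10-base gap: below min size
    frag("p3", 2500, 3500, 1990, 2990, True),   # 500-base deletion before p3
    frag("s1", 4000, 4100, 3000, 3100, False),
    frag("s2", 4200, 4300, 3100, 3200, False),
]
print(extract_sois(fragments))
# [('p2', 'p3'), ('p3', 's1')]
```

Under this reading, a gap of size d between two otherwise collinear fragments is filtered exactly when d is smaller than the minimum structural variation size.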

Step 4, identifying the structural variation characteristic sequences and characterizing the structural variation: for the multiple targets (namely the SOIs) in the similarity image obtained in step 3, a convolutional neural network (CNN) trained in advance is used to identify the different complex structural variations in the similarity image; after identification, a graph is used to systematically represent and classify the different complex structural variation types.

The step 4 specifically comprises the following steps:

1) Constructing and encoding a structural variation training data set: the structural variations used for training mainly comprise four simple structural variation types, namely deletion (DEL), insertion (INS), inversion (INV) and duplication (DUP). The real data consist of structural variations from 2500 samples of the 1000 Genomes Project, and the simulated data are noise-free training samples generated with VISOR; together they form the training data set. The training data set is then encoded in the manner described in steps 2 and 3;

2) Convolutional neural network (CNN) training: the training set encoded in 1) is input into a convolutional neural network, and an AlexNet neural network is trained using existing techniques. The AlexNet training uses transfer learning: the best-performing parameters from the ImageNet competition are used as the initial model parameters, and cross-validation is used to select the optimal model as the final structural variation detection model.

3) Constructing the complex structural variation characterization graph: the structural variation detection CNN model identifies the SOIs in each sub-image; first, a complex structural variation characterization graph is built from the identification results, and second, whether different complex structural variations belong to the same type is determined from the topology of their characterization graphs.

Constructing a structural variation characterization graph: first, the matched fragments are obtained from all SOIs in a similarity image, where each fragment carries its start and end coordinates, its direction, and, for a repeated fragment, the source of the repeat; each fragment becomes one node of the structural variation characterization graph, and each node has a direction. Second, each edge in the graph connects two fragments that are adjacent on the sample sequence, and for a duplicated fragment of known origin, a duplication edge connects the fragment to its origin.

Topological similarity calculation for structural variation characterization graphs: first, it is checked whether the numbers of nodes and edges of the two graphs match; second, for two graphs with the same numbers of nodes and edges, the 5'-to-3' path and the 3'-to-5' path of each graph are obtained from the edge connections and node directions, and if the two structural variation characterization graphs have symmetrical paths, the two structural variations are considered to belong to the same type.
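The comparison above can be sketched with a small position-independent graph encoding. The interpretation of "symmetrical paths" used here is an assumption: a graph's 5'-to-3' path is the list of nodes in sample order, its 3'-to-5' path is the reversed list with flipped directions, and two graphs match when one path of the first equals a path of the second. Duplication edges are omitted for brevity, and all names are hypothetical.

```python
def build_graph(fragments):
    """fragments: list of (ref_start, strand) in sample order.

    Nodes are (rank on the reference, strand) so that graphs from different
    genomic loci become comparable; edges join neighbours on the sample.
    """
    order = sorted(range(len(fragments)), key=lambda i: fragments[i][0])
    rank = {idx: r for r, idx in enumerate(order)}
    nodes = [(rank[i], fragments[i][1]) for i in range(len(fragments))]
    edges = [(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
    return nodes, edges

def reverse_path(nodes):
    """3'-to-5' traversal: reversed order with flipped node directions."""
    flip = {"+": "-", "-": "+"}
    return [(r, flip[s]) for r, s in reversed(nodes)]

def same_type(g1, g2):
    (n1, e1), (n2, e2) = g1, g2
    if len(n1) != len(n2) or len(e1) != len(e2):
        return False  # condition 1: node and edge counts must match
    return n1 == n2 or n1 == reverse_path(n2)  # condition 2: matching paths

inv_a = build_graph([(100, "+"), (500, "-"), (900, "+")])
inv_b = build_graph([(7000, "+"), (7600, "-"), (8100, "+")])
inv_b_other_strand = build_graph([(900, "-"), (500, "+"), (100, "-")])
dup_like = build_graph([(100, "+"), (500, "+"), (300, "+")])

print(same_type(inv_a, inv_b))               # True
print(same_type(inv_a, inv_b_other_strand))  # True
print(same_type(inv_a, dup_like))            # False
```

Ranking nodes by reference position rather than storing raw coordinates is a design choice of this sketch; it lets events of the same shape at different loci compare equal.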

The specific implementation process of the invention is as follows, and the flow is shown in FIG. 2:

Step 1, aligning the long-read sequencing data of the sample (the sample sequence) to the reference genome sequence using existing techniques, determining the coordinates of each read on the reference genome, and extracting the portions of the alignment result in which the sample sequence and the reference genome sequence do not match, to obtain reads with abnormal alignment features; these reads are used as structural variation characteristic sequences either after sequence assembly or directly.

Long-read alignment, as used in step 1, has been developed over several years and is well studied; it is typically performed with a seed-and-extend strategy combined with dynamic programming. The alignment comprises seed generation and extension, and the extension is mainly implemented by dynamic programming. Representative mainstream tools include minimap2, NGMLR and pbmm2.

After the alignment result is obtained, the key step in extracting structural variation characteristic sequences is to find sites of abnormal alignment. If the sample sequence contains no structural variation of any form at a site, the reads at that site are indistinguishable from the reference genome sequence and align normally; otherwise, the reads aligned to that site may carry a variation signature. The structural variations addressed by the invention leave several kinds of signatures, mainly: (1) Unmatched sequences: if the sample sequence has an insertion or deletion of a short sequence relative to the reference genome sequence at a site, multiple reads aligned to that site carry unmatched sequences, which appear mainly as insertions (I) and deletions (D) in the alignment result. Here, the invention only considers unmatched sequences longer than 50 bases in long reads. For these unmatched sequences, the invention uses a k-mer-based local hash alignment method to refine the source of the inserted fragments and, at the same time, to discover variation features other than insertion (I) and deletion (D). (2) Supplementary alignments: because of a structural variation at a site of the sample, when a read spans that site the alignment software splits the long read into several subsequences and aligns each of them to the reference genome sequence, producing supplementary alignment results. However, because the reference genome itself is imperfect, the invention by default does not process reads that produce 4 or more supplementary alignments; this parameter can be adjusted according to user needs.
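The supplementary-alignment filter in (2) can be illustrated by parsing the `SA` tag that SAM-format aligners attach to split reads. The tag layout (`rname,pos,strand,CIGAR,mapQ,NM;` repeated) follows the SAM optional-fields specification; the function names and the `max_supplementary` parameter are assumptions for this sketch, with the default chosen so that reads with 4 or more supplementary alignments are skipped.

```python
def parse_sa_tag(sa_value):
    """Split an SA:Z value into one dict per supplementary alignment."""
    entries = []
    for item in sa_value.strip(";").split(";"):
        if not item:
            continue
        rname, pos, strand, cigar, mapq, nm = item.split(",")
        entries.append({"rname": rname, "pos": int(pos), "strand": strand,
                        "cigar": cigar, "mapq": int(mapq), "nm": int(nm)})
    return entries

def keep_read(sa_value, max_supplementary=3):
    """Hypothetical filter: keep reads with at most `max_supplementary` entries."""
    return len(parse_sa_tag(sa_value)) <= max_supplementary

two_parts = "chr1,1500,+,500S300M,60,2;chr1,9000,-,300S500M,60,1;"
four_parts = two_parts + "chr2,100,+,100M700S,20,0;chr5,42,-,700S100M,3,4;"
print(keep_read(two_parts), keep_read(four_parts))
# True False
```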

Step 2, coding similarity between the structural variation characteristic sequence and the reference genome sequence by using a three-channel RGB diagram:

After the structural variation characteristic sequences are obtained from the above features, each structural variation characteristic sequence is converted, according to its alignment result, into segment features comprising matched segments and unmatched segments, and segment features with the same attributes are clustered. For the segment features in each cluster, the invention encodes the CIGAR string and k-mer alignment result of the structural variation characteristic sequence from step 1, together with the reference genome sequence, into a sequence matching channel (255, 0, 0), a sequence repetition channel (0, 0, 255) and a sequence inversion channel (0, 255, 0); each channel can encode different sequences by setting a minimum segment threshold. With this three-channel coding, the step outputs the RGB coding matrix of the sample sequence and the reference genome sequence and the position of each matched segment, and the similarity RGB image of the reference genome sequence and the sample sequence is constructed from this RGB coding matrix. Meanwhile, the reference genome sequence is encoded against itself with the same three channels to obtain the REF-to-REF image; subtracting the two images removes the influence of repeated fragments in the reference genome sequence on subsequent structural variation detection and yields the structural variation characteristic sequence similarity image.
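A minimal sketch of the three-channel coding is given below. It assumes the channel values are the red, blue and green triplets (255, 0, 0), (0, 0, 255) and (0, 255, 0), that each segment is drawn in exactly one colour (inversion taking precedence over repetition), and that the image is a plain list of pixel rows with the reference on the horizontal axis and the sample on the vertical axis. None of these choices is confirmed by the text; they serve only to make the idea concrete.

```python
MATCH, REPEAT, INVERT = (255, 0, 0), (0, 0, 255), (0, 255, 0)
BACKGROUND = (0, 0, 0)

def encode_image(segments, ref_len, read_len, min_len=1):
    """Draw each segment as a diagonal (or anti-diagonal) run of pixels."""
    image = [[BACKGROUND] * ref_len for _ in range(read_len)]
    for seg in segments:
        if seg["length"] < min_len:
            continue  # minimum segment threshold
        if seg["strand"] == "-":
            color = INVERT
        elif seg["repeat"]:
            color = REPEAT
        else:
            color = MATCH
        for offset in range(seg["length"]):
            row = seg["read_start"] + offset
            if seg["strand"] == "-":
                col = seg["ref_start"] + seg["length"] - 1 - offset
            else:
                col = seg["ref_start"] + offset
            image[row][col] = color  # one pixel per matching base pair
    return image

segments = [
    {"ref_start": 0, "read_start": 0, "length": 3, "strand": "+", "repeat": False},
    {"ref_start": 3, "read_start": 3, "length": 2, "strand": "-", "repeat": False},
]
image = encode_image(segments, ref_len=5, read_len=5)
for row in image:
    print("".join("M" if p == MATCH else "I" if p == INVERT else "." for p in row))
```

The printed grid shows the forward segment on the main diagonal and the inverted segment running along an anti-diagonal, which is the visual cue a classifier can pick up.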

Step 3, segmenting the structural variation characteristic sequence similarity image of step 2 with a multi-target recognition framework to obtain similarity images that each contain a single structural variation and carry SOIs as multiple targets:

As sequencing read lengths keep increasing, a single read is likely to span more than one structural variation. The invention therefore first performs single structural variation segmentation on the structural variation characteristic sequence similarity image; the main purpose is to separate the different structural variations appearing in the image, so that consecutive simple structural variations are not predicted and characterized as a complex structural variation. Specifically, the main segments are arranged in order and combined: in a structural variation characteristic sequence similarity image, the sub-image formed by every two adjacent main segments is considered to contain one structural variation, so combining two adjacent main segments yields a similarity image containing only a single structural variation. The main segments and secondary segments that are adjacent in order within the similarity image are then combined to obtain the SOIs.

Step 4, identifying the structural variation characteristic sequences and characterizing the structural variation: the multiple targets in the similarity image obtained in step 3 are predicted and classified with a CNN model trained in advance, and the different structural variation types are finally represented with a graph:

The invention first constructs a structural variation training data set; the structural variations used for training mainly comprise four simple structural variation types, namely deletion (DEL), insertion (INS), inversion (INV) and duplication (DUP). The real data consist of structural variations from 2500 samples of the 1000 Genomes Project, and the simulated data are noise-free training samples generated with VISOR; together they form the training data set. The structural variations in the training data set are encoded in the manner described in step 2. The invention selects the existing AlexNet neural network and trains it with the encoded structural variations; the training uses transfer learning, with the best-performing parameters from the ImageNet competition as the initial model parameters, and cross-validation is used to select the optimal model as the final structural variation detection CNN model. The targets in the sub-images are identified by the structural variation detection CNN model, so that structural variations of different complexity are identified without depending on any structural variation model. In addition, in view of the multi-breakpoint nature of complex structural variations, the invention uses a graph data structure to represent the different types of complex structural variation. To build the graph, the fragments contained in all SOIs of the structural variation characteristic sequence similarity image are first extracted as the nodes of the graph; second, each edge in the graph connects two fragments that are adjacent on the sample sequence.
Meanwhile, based on this graph representation, the invention classifies different complex structural variations by judging the topological similarity of their graphs. The specific steps comprise: 1) checking whether the two structural variation characterization graphs have the same numbers of nodes and edges; 2) for characterization graphs with the same numbers of edges and nodes, judging whether the two graphs have symmetrical paths. If both conditions are met, the two compared complex structural variations are considered to belong to the same type.

A deep learning-based genomic structural variation detection system comprising:

The structural variation characteristic sequence extraction module specifically comprises:

the extraction module is used for aligning the sample sequencing data to a reference genome sequence to obtain an alignment result, and for extracting structural variation characteristic sequences according to the alignment result;

the low-quality variation signal removing module is used for removing structural variation characteristic sequences of insufficient quality, mainly those with an alignment quality lower than 20 and those producing more than 3 supplementary alignments;

the feature clustering and filtering module is used for clustering the structural variation characteristic sequences by feature similarity to obtain feature clusters, filtering out the feature clusters whose feature values are smaller than a preset threshold, and retaining the structural variation characteristic sequences and CIGAR strings in the remaining feature clusters. Signals that potentially support the same structural variation are grouped into one class by hierarchical clustering according to a similarity measure; classes that do not meet the minimum cluster size are removed after clustering, and the screened clustering result is used to locate the potential variation regions;

the k-mer-based local alignment module is used for aligning the specific unmatched fragments in a structural variation characteristic sequence according to the CIGAR string of the structural variation characteristic sequence and recording the k-mer alignment result.
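The feature clustering and filtering module described above can be illustrated with a one-dimensional stand-in. The sketch assumes each variation signal is reduced to a single breakpoint position and that similarity is plain genomic distance; under those assumptions, splitting the sorted positions wherever the gap exceeds `max_gap` gives the same groups as single-linkage hierarchical clustering cut at that distance. The parameter names and default values are hypothetical.

```python
def cluster_signals(positions, max_gap=100, min_cluster_size=5):
    """Group nearby breakpoint positions and drop clusters that are too small."""
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_cluster_size]

signals = [1000, 1005, 1010, 1020, 1030,            # well supported
           5000, 5010,                              # too few supporting signals
           9000, 9001, 9002, 9003, 9004, 9050]      # well supported
for cluster in cluster_signals(signals):
    print(cluster[0], cluster[-1], len(cluster))
# 1000 1030 5
# 9000 9050 6
```

The surviving clusters mark candidate regions; the two-signal cluster around position 5000 is discarded for lack of support.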

The structural variation characteristic sequence coding module comprises:

the RGB three-channel sequence similarity coding module encodes the structural variation characteristic sequence and the reference genome sequence into a sequence matching channel (255, 0, 0), a sequence repetition channel (0, 0, 255) and a sequence inversion channel (0, 255, 0); each channel can encode different sequences by setting a minimum fragment threshold, and with this three-channel coding the module outputs the similarity RGB image of the reference genome sequence and the sample sequence. The module also encodes the reference genome sequence against itself with the same three channels to obtain the self-similarity image of the reference genome sequence, and records the position of each matched fragment. In the coding process, the main fragments are encoded from the CIGAR string of the structural variation characteristic sequence, and the unmatched sequences present in the alignment are encoded by k-mer-based realignment.

the reference genome repeated-segment removing module, according to the coordinates, relative to the reference genome sequence, of the matched segments in the similarity RGB image of the reference genome sequence and the sample sequence and in the self-similarity image of the reference genome sequence, finds for each segment of the similarity RGB image the corresponding segment in the self-similarity image, and removes the segment from the similarity RGB image once a corresponding segment is found, thereby obtaining the structural variation characteristic sequence similarity image. The module addresses the low detection rate and low accuracy of structural variation detection caused by repeated sequences in the reference genome. It is the first to propose subtracting the RGB coding matrix of the reference genome sequence against itself from the RGB coding matrix of the sample sequence, which removes the influence of repeated sequences on structural variation detection and restores the true information of the sample sequence in the region.

the segment classification module is used for marking the main segments and secondary segments in the structural variation characteristic sequence similarity image after the repeated segments are removed; the main segments are derived from the fragments marked M in the CIGAR string of the structural variation characteristic sequence, and the secondary segments are derived from the k-mer-based local alignment.

The structural variation characteristic sequence similarity image segmentation module comprises:

the single structure variation segmentation module is used for combining two adjacent main fragments in the structure variation characteristic sequence similarity image according to the sequence of the main fragments on the reference genome sequence to obtain a sub-image only comprising one structure variation; the method is used for separating a plurality of different structural variations appearing in the structural variation characteristic sequence similarity image, and avoiding predicting and characterizing continuous appearing simple structural variations as complex structural variations;

the structural variation image segmentation module is used for sorting the main segments and secondary segments by their coordinates on the structural variation characteristic sequence, combining each pair of adjacent segments in the sorted result to obtain combined segments, filtering out the combined segments formed by two secondary segments and those formed by two segments that are linear with each other, and taking the combined segments formed by a main segment and a secondary segment as segments of interest. The module is used for identifying structural variations of different complexity, and in particular for identifying the internal structure contained in a structural variation. Compared with traditional multi-target recognition in computer vision, the RGB image designed in the invention is sparse, so the traditional sliding-window approach is not suitable.

The structural variation recognition and characterization module specifically comprises:

the structural variation training data set construction module: the structural variations used for training mainly comprise four simple structural variation types, namely deletion (DEL), insertion (INS), inversion (INV) and duplication (DUP). The real data consist of structural variations from 2500 samples of the 1000 Genomes Project, but because DEL and INS account for the vast majority of the real data, the resulting training data set is unbalanced. The invention therefore adds to the training data set noise-free structural variations simulated with the open-source software VISOR;

the convolutional neural network training module is used for training the AlexNet neural network; the training uses transfer learning, with the best-performing parameters from the ImageNet competition as the initial model parameters, and during training the parameters are adjusted by back propagation and gradient descent so as to minimize a cross-entropy objective function. Cross-validation is used to select the optimal model as the final structural variation detection CNN model;

the structural variation characterization module is used for identifying the segments of interest in the sub-images with the structural variation detection CNN model to obtain complex structural variation fragments, constructing a structural variation characterization graph from these fragments, and determining from the topology of the characterization graphs whether different structural variations belong to the same type. The nodes of the structural variation characterization graph represent the matched segments contained in all SOIs of the structural variation characteristic sequence similarity image, arranged in ascending order of their coordinates on the sample sequence, and each edge connects two matched segments that are adjacent on the sample sequence. The structural variation characterization graph is stored in a file in the Graphical Fragment Assembly (GFA) format; in the output file, S lines represent the matched segments obtained from the structural variation characteristic sequence similarity image, and L lines represent the connections between segments.
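A characterization graph can be written in GFA with a few lines of code. The sketch below uses the GFA version 1 record layout (`H` header, `S` segment lines with `*` for an omitted sequence and an `LN` length tag, `L` link lines with orientations and an overlap field); the node names, the `0M` overlap value and the input tuple layout are assumptions chosen for illustration.

```python
def to_gfa(segments):
    """segments: list of (name, strand, length) in sample-sequence order."""
    lines = ["H\tVN:Z:1.0"]
    for name, _, length in segments:
        # S line: one matched segment; the sequence itself is omitted ('*').
        lines.append(f"S\t{name}\t*\tLN:i:{length}")
    for (a, strand_a, _), (b, strand_b, _) in zip(segments, segments[1:]):
        # L line: connection between two segments adjacent on the sample.
        lines.append(f"L\t{a}\t{strand_a}\t{b}\t{strand_b}\t0M")
    return "\n".join(lines)

# Hypothetical event: a matched segment, an inverted segment, a matched segment.
print(to_gfa([("M1", "+", 1000), ("I1", "-", 300), ("M2", "+", 800)]))
```

Because the file is plain GFA, the resulting graphs can be inspected with generic graph viewers that accept this format.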

Simulation instance

The simulation experiments evaluate the present invention and compare it with four existing detection tools: cuteSV, pbsv, Sniffles and SVIM. Both the invention and the existing tools are run with default parameters, with a structural variation support threshold of 5, and only reads with an alignment quality of at least 20 are used.

Experiment I: because the invention does not rely on any model, structural variations of various degrees of complexity are detected solely by identifying the sequence differences encoded in the image. The purpose of this experiment is to examine the simple structural variation detection capability of the invention on different sequencing platforms. For the two current mainstream long-read sequencing platforms, PacBio and Oxford Nanopore, the experiment uses sample HG002 as the standard set, following the detection software evaluation procedure of the Genome in a Bottle (GIAB) consortium, and evaluates the overall detection capability of the invention on sequencing data from different platforms and with different amounts of sequencing data. The results are shown in FIG. 3: without depending on any model, the invention achieves results comparable to the current most advanced detection software. The results further reflect the generality of the invention, which can be applied to the detection of both simple and complex structural variations.

Experiment II: because the invention addresses a frontier problem in this field, there is no industry-accepted standard set of complex structural variations for evaluating different methods. Therefore, based on the manually screened and validated complex structural variations published in Nature in 2015, the experiment simulates 10 different complex structural variation types, each containing 300 complex variation events. The performance of the invention is compared with existing detection techniques on this simulated complex structural variation data set. In addition, unlike simple structural variations, complex structural variations often contain multiple breakpoints of different types, called subcomponents, so two criteria are used to evaluate the performance of the different detection methods: 1) region matching; 2) complete matching. In short, region matching requires a detection method to accurately detect the genomic region containing the complex structural variation, whereas complete matching requires accurate detection of both the region of the complex structural variation and its subcomponents. The detection performance is shown in FIG. 4. For the easier region matching, the recall of the invention is 93%, while the second-highest recall, that of cuteSV, is only 70%. For complete matching, Sniffles is the only other tool capable of detecting complex structural variations; the recall of the invention is 92%, twice that of Sniffles, and the invention still maintains 90% accuracy at this high recall, which ensures the accuracy of complex structural variation detection. The other detection methods cannot resolve the complex structural variations and only find the genomic regions where they are located.
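The difference between the two criteria can be made concrete with a small recall calculation. The sketch below is illustrative only: the interval overlap rule, the comparison of subcomponent types as unordered lists, and the toy truth and call sets are all assumptions and do not reproduce the evaluation data of the experiment.

```python
def evaluate(truth, calls):
    """Return (region-match recall, complete-match recall).

    truth / calls: lists of (start, end, subcomponent_types).
    """
    region_hits = complete_hits = 0
    for t_start, t_end, t_types in truth:
        overlapping = [c for c in calls if c[0] < t_end and t_start < c[1]]
        if overlapping:
            region_hits += 1  # the variant region was found
            if any(sorted(c[2]) == sorted(t_types) for c in overlapping):
                complete_hits += 1  # region and all subcomponents agree
    return region_hits / len(truth), complete_hits / len(truth)

# Hypothetical toy data, not taken from the experiment.
truth = [(100, 500, ("DEL", "INV")),
         (1000, 1500, ("DUP", "INV")),
         (3000, 3300, ("DEL", "INS"))]
calls = [(120, 480, ("DEL", "INV")),   # region and subcomponents correct
         (1100, 1400, ("DEL",)),       # region found, structure wrong
         (5000, 5100, ("INS",))]       # unrelated call

region_recall, complete_recall = evaluate(truth, calls)
print(round(region_recall, 3), round(complete_recall, 3))
# 0.667 0.333
```

A tool that reports a complex event as a single simple variation scores under region matching but not under complete matching, which is why the two recalls can differ sharply.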

Experiment III: This experiment examines the robustness of the sequence image encoding and recognition framework of the invention using sequence alignment results with different read lengths. The invention uses the structural variations of sample HG00733 published in Science in 2021 as the standard set. First, the high-fidelity (HiFi) sequencing reads of this sample from the PacBio platform are selected as the short DNA sequence input dataset. Then, to obtain a longer DNA sequence input dataset, the experiment uses contig sequences assembled from the PacBio HiFi reads. The sequences of these two lengths are each aligned to the human reference genome by existing sequence alignment techniques to obtain the input files. The detection performance on short DNA sequences is compared first; the results, shown in FIG. 5, indicate that the invention outperforms model-based detection methods in detecting structural variations of different lengths. Because the other detection methods cannot process the long DNA sequence dataset, the invention only compares its own performance with short and long DNA sequences as input. The results in FIG. 6 show that the detection performance of the invention is positively correlated with the length of the input DNA sequences, which also meets the future need for structural variation detection on longer and more accurate sequencing data.

Experiment IV: With the development of genome assembly and phasing technologies over the past two years, more and more phased genome assemblies have become available. This experiment evaluates whether the invention meets the future need for structural variation detection on phased genomes. The invention again uses the phased assemblies of three samples, HG00733, HG00514 and NA19240, published in Science in 2021, together with the phased structural variation standard set, to evaluate different detection techniques. The phased assemblies of the three samples are each aligned to the human reference genome by existing alignment techniques. Because humans are diploid organisms, the standard set is divided into H1 and H2 for comparison. The results are shown in FIG. 7: the F-score of the invention is higher than that of the prior art on both haplotypes. This experiment shows that the invention is well suited to detecting genomic structural variations from phased assemblies.
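For reference, the F-score reported per haplotype is conventionally the harmonic mean of precision and recall. The sketch below assumes this standard definition; the function names and the counts are hypothetical and are not results from the experiment.

```python
def f_score(tp, fp, fn):
    """F-score (F1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_haplotype_f_scores(counts):
    """Compute one F-score per haplotype from {haplotype: (tp, fp, fn)}."""
    return {hap: f_score(*c) for hap, c in counts.items()}


# Made-up counts for illustration only.
example_counts = {"H1": (90, 10, 10), "H2": (80, 20, 20)}
scores = per_haplotype_f_scores(example_counts)
```

Evaluating H1 and H2 separately, as in the experiment, keeps a good result on one haplotype from masking a poor result on the other.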

Claims

1. A genome structure variation detection method based on deep learning, characterized by comprising the following steps:

step 4, identifying structural variation feature sequence similarity images and characterizing structural variations: identifying all fragments of interest in the sub-images containing a single structural variation by using a pre-trained structural variation detection CNN model to obtain complex structural variation fragments; and systematically characterizing and classifying the complex structural variation fragments by using a graph data structure;

The step 2 specifically comprises the following steps:

2. The deep learning-based genome structure variation detection method according to claim 1, wherein in step 1, according to the alignment features of the structural variation feature sequence, the sequences of the unmatched fragments in the structural variation feature sequence are locally realigned to the reference genome sequence by k-mer, specifically: extracting the unmatched fragments from the CIGAR string of the alignment between the structural variation feature sequence and the reference genome sequence, and performing local k-mer realignment of the sequences of the unmatched fragments against the reference genome sequence to obtain secondary fragments.
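The extraction of unmatched fragments from a CIGAR string and their local k-mer realignment, as recited in claim 2, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: treating long insertions (`I`) and soft clips (`S`) as the unmatched fragments, the `min_len` and `k` defaults, and the function names are all hypothetical. Only the rule that `M`, `I`, `S`, `=` and `X` operations consume read bases follows the SAM format.

```python
import re
from collections import defaultdict

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def unmatched_fragments(cigar, read_seq, min_len=50):
    """Return (read_offset, sequence) for each long insertion or soft clip.

    Assumption: 'unmatched fragments' are I and S operations >= min_len.
    """
    fragments = []
    read_pos = 0
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "IS" and length >= min_len:
            fragments.append((read_pos, read_seq[read_pos:read_pos + length]))
        if op in "MIS=X":  # operations that consume read bases
            read_pos += length
    return fragments


def kmer_hits(fragment, reference, k=11):
    """Find shared k-mers as (fragment_offset, reference_offset) pairs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    hits = []
    for j in range(len(fragment) - k + 1):
        for i in index.get(fragment[j:j + k], ()):
            hits.append((j, i))
    return hits


# Toy example with short sequences and small thresholds.
read = "AAAAA" + "CGTA" + "TTTTT"
frags = unmatched_fragments("5M4I5M", read, min_len=3)
hits = kmer_hits(frags[0][1], "GGCGTAGG", k=4)
```

Chaining collinear hits into the secondary fragments mentioned in the claim is left out of this sketch, since the text here does not describe how that step is performed.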

3. The method for detecting genomic structural variation based on deep learning according to claim 1, wherein step 3 specifically comprises:

4. The deep learning-based genomic structural variation detection method according to claim 1, wherein the structural variation detection CNN model training method in step 4 specifically comprises:

5. The deep learning-based genome structure variation detection method according to claim 1, wherein in step 4, identifying all fragments of interest in the sub-images containing a single structural variation by using a pre-trained structural variation detection CNN model to obtain complex structural variation fragments, and systematically characterizing and classifying the complex structural variation fragments by using a graph data structure, specifically comprises:

6. A deep learning-based genomic structural variation detection system comprising:

the structural variation recognition and characterization module is used for identifying all fragments of interest in the sub-images containing a single structural variation by using a pre-trained structural variation detection CNN model to obtain complex structural variation fragments, and for systematically characterizing and classifying the complex structural variation fragments by using a graph data structure;

the structural variation characteristic sequence coding module comprises:

7. The deep learning based genomic structural variation detection system according to claim 6, wherein the structural variation feature sequence similarity image segmentation module comprises:

8. The deep learning based genomic structural variation detection system of claim 6, wherein the structural variation identification and characterization module specifically comprises: