CN115641910B

CN115641910B - Combined detection method for structural variation of third generation group genome

Info

Publication number: CN115641910B
Application number: CN202211287317.9A
Authority: CN
Inventors: 姜涛; 曹舒淇; 刘博; 王亚东
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2023-05-12
Anticipated expiration: 2042-10-20
Also published as: CN115641910A

Abstract

The invention provides a joint detection method for structural variation (SV) in third-generation sequencing population genomes. It relates to the technical field of genetic variation detection and addresses the low speed of large-scale population variation detection in the prior art. The method effectively avoids over-merging during SV integration, improves the diversity of the detected population SVs while maintaining the accuracy of joint detection, partitions the population SVs with a block-division strategy, makes reasonable use of multi-core computing resources, and greatly improves the speed of joint SV detection for large-scale populations.

Description

Combined detection method for structural variation of third generation group genome

Technical Field

The invention relates to the technical field of genetic variation detection, in particular to a three-generation group genome structure variation joint detection method.

Background

Third-generation sequencing technologies such as PacBio (Pacific Biosciences) and ONT (Oxford Nanopore Technologies) use the idea of sequencing while synthesizing: different light is emitted when different bases are added during base pairing, and the type of the incorporated base is determined from the wavelength and peak of that light. Compared with other sequencing technologies, third-generation sequencing produces long reads, with an average length of up to 10 kbp and some reads exceeding 1 Mbp, so that a read can cover a long genomic region. Longer reads allow more structural variations to be detected. Studies have shown that, when detecting genomic variation in human individuals, third-generation sequencing data can detect twice as many structural variations as second-generation sequencing data, which provides a great opportunity for the development of structural variation detection.

Structural variations (SVs) are changes to a continuous stretch of DNA sequence, such as deletion (Deletion), insertion (Insertion), duplication (Duplication), inversion (Inversion), and translocation (Translocation), where the affected length is generally greater than 50 base pairs. Related studies have shown that SVs are longer and structurally more complex than single nucleotide variations (SNVs) and short insertion/deletion variations (Indels). Related studies also indicate that each human genome contains about twenty thousand SVs on average. Although SVs are fewer than SNVs and Indels, the DNA intervals they span are larger, so their spatial footprint on the genome is the widest. Accurate detection of structural variation is therefore of great significance for genome research.

With the successful completion of the international 1000 Genomes Project, many countries have launched their own genome projects. Their aim is to map the genomes of large populations of the country, to understand the genetic variation characteristics of that population at the genome level, to promote the development of genome science, and to lay a foundation for precision medicine. As an important part of large-scale genome research, accurate and efficient detection of population structural variation is a focus and a difficulty of current research. Current population SV detection tools suffer from over-merging: variations that do not belong to the same population SV are easily identified as the same one, so the diversity of the detection results is lost. In addition, for large-scale populations, variation detection requires a large amount of time and memory and is slow, which hinders the practical application of population SV detection.

Disclosure of Invention

The purpose of the invention is to provide a third-generation population genome structural variation joint detection method that addresses the low detection speed of large-scale population variation in the prior art.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A third-generation population genome structural variation joint detection method comprises the following steps:

Step one: obtaining structural variation information of a plurality of individuals, extracting structural variations (SVs) from that information, and grouping the SVs by the chromosome interval in which the variation is located and by variation type to obtain a plurality of SV sets to be merged;

Step two: for each SV set to be merged, sorting the SVs with the genome coordinate site as the first key and the SV length as the second key to obtain an ordered SV set to be merged;

Step three: merging the SVs in the ordered SV set to be merged in two rounds, as follows:

in the first round of merging, a threshold is defined on the genome coordinate sites of the SVs, and adjacent SVs in the ordered SV set to be merged whose coordinate sites differ by less than the threshold are collected together;

in the second round of merging, the result of the first round is divided using a bipartite graph maximum matching algorithm to obtain a plurality of candidate SV sets;

Step four: taking the median of the coordinate sites and the median of the SV lengths of the SVs in each candidate SV set as the coordinate site and length of the population SV;

Step five: after all SV sets to be merged have been processed, obtaining the set of coordinate sites and lengths of the population SVs, which completes the joint detection.

Further, the specific steps of step one are as follows:

first, extracting all SVs within a 10 Mbp region of one chromosome from the structural variation information of an individual; each subsequent extraction starts 10 kbp before the end of the previous region and extends 10 Mbp forward;

during extraction, a class is defined and, from each line of the individual's structural variation information, the chromosome where the SV is located, the variation type of the SV, the genome coordinate site of the SV and the length of the SV are extracted; the individual to which the SV belongs is also recorded, and these five items together are taken as the variation feature;

after all SVs in the region have been collected, the SVs are stored by variation type to form a plurality of SV sets to be merged.

Further, the specific steps of step two are as follows:

first, establishing a min-heap whose elements are variation features and whose sort key is the genome coordinate site of the SV;

initializing the min-heap so that its size equals the number of individuals in the SV set to be merged: traversing each individual in the set and adding that individual's SV with the smallest genome coordinate site to the heap. If the top element of the heap is not the SV with the smallest genome coordinate site, it is exchanged with the SV in the heap that has the smallest genome coordinate site. The SV with the smallest genome coordinate site is then removed from the heap and appended to a sorted list, and the SV with the next smallest genome coordinate site from the same individual is added to the heap. When that individual has no remaining SVs, the size of the heap is reduced by 1. These steps are repeated until the SVs of all individuals have been processed; the resulting sorted list is the ordered SV set to be merged.

Further, the specific steps of the first round of merging are as follows:

traversing the ordered SV set to be merged, judging whether the position coordinates of two adjacent SVs differ by less than 1500 bp, and collecting consecutive SVs whose position coordinates differ by less than 1500 bp into the same subset.

Further, the specific steps of the second round of merging are as follows:

within each subset, the SVs are divided by the individual to which they belong, and the individuals are sorted by their number of SVs. The individual with the most SVs is selected as the initial candidate merging result; then, repeatedly, the individual with the most SVs among the remaining individuals in the subset is merged with the candidate merging result, until all individuals have been merged into it. When all subsets have been processed, a plurality of candidate SV sets is obtained.

Further, the specific steps of selecting the individual with the most SVs among the remaining individuals in the subset and merging it with the candidate merging result are as follows:

first, establishing a bipartite graph in which each node represents a variation feature; for two nodes, one from the individual to be merged and one from the candidate merging result, an edge is connected between them if their variation features satisfy the following formula:

wherein bp₁ and bp₂ represent the genome coordinate sites of the two SVs, len₁ and len₂ represent the lengths of the two SVs, bp'₁ and bp'₂ represent the sites of a translocation variation on the other chromosome, and p represents the edge weight;

after the graph is built, a bipartite graph is obtained; it is preprocessed into a new bipartite graph, maximum matching is performed on the new bipartite graph using the KM (Kuhn-Munkres) algorithm to obtain a matching result, and SVs are merged according to the information of each match.

Further, the specific steps of performing maximum matching on the new bipartite graph using the KM algorithm and merging SVs according to each match are as follows:

first, the obtained bipartite graph is preprocessed as follows:

negating the weights of all edges in the bipartite graph to obtain a new bipartite graph; connecting an edge between every pair of nodes on the two sides of the new bipartite graph that are not yet connected and setting its weight to positive infinity; adding virtual SV nodes to the individual to be merged so that its total number of nodes equals the number of SVs in the candidate merging result, and setting the weights of the edges connected to these virtual nodes to positive infinity;

then performing maximum matching on the new bipartite graph using the KM algorithm to obtain a matching result, and merging SVs according to each match as follows:

if the weight of the edge between two matched nodes is not positive infinity, the two nodes are a best match, and the SV corresponding to the node of the individual to be merged is merged with the SV corresponding to the node of the candidate merging result, that is, it is added to that node of the candidate merging result;

if the weight of the edge between two matched nodes is positive infinity, the two nodes cannot be matched; in that case, if the corresponding SV in the individual to be merged is not a virtual node and no SV in the candidate merging result can be merged with it, that SV is added directly to the candidate merging result as a new candidate SV.

Further, the edge weight p is expressed as:

Further, the specific steps of step four are as follows:

after two rounds of merging, the candidate merging results of the SVs are obtained and the final population SV information is generated from them: for each candidate SV set, the median of the genome coordinate sites and the median of the SV lengths of all the individual SVs it contains are taken as the coordinate site and length of the population SV.

The beneficial effects of the invention are as follows:

The invention provides a population genome SV joint detection method based on third-generation sequencing. It takes the VCF files of a plurality of individuals as input, extracts and sorts the variation features in the individual VCF files, and merges similar SVs from different individuals in two rounds of merging based on a bipartite graph maximum matching algorithm, thereby generating a population VCF that represents the population SVs. The method effectively avoids over-merging during SV integration, improves the diversity of the detected population SVs while maintaining the accuracy of joint detection, partitions the population SVs with a block-division strategy, makes reasonable use of multi-core computing resources, and greatly improves the speed of joint SV detection for large-scale populations.

Drawings

Fig. 1 is an overall flow chart of the present application.

Detailed Description

It should be noted in particular that, without conflict, the various embodiments disclosed herein may be combined with each other.

The first embodiment is as follows: referring to fig. 1, the embodiment specifically illustrates a third generation group genome structure variation joint detection method according to the embodiment, which includes the following steps:

combining by two rounds;

Joint detection is an important part of population genome variation detection. It considers the variation of all individuals simultaneously, so that a high-precision population genome variation map can be drawn and the patterns of population genome variation can be discovered.

The population genome SV joint detection method establishes a population genome structural variation integration algorithm based on bipartite graph matching, and clusters individual structural variation sites by genome spatial position according to specific integration rules to obtain a set of joint detection candidate sites. The algorithm takes a plurality of VCF files representing different individuals as input, clusters structural variations from different individuals that have high spatial similarity, analyzes the clusters to obtain the final population structural variation sites, and, after sorting, outputs them in VCF format.

The method mainly comprises the following four steps:

1. Extract structural variation feature signals from the structural variation information of all input individuals, and classify them by the chromosome interval in which the variation is located and by variation type to obtain a plurality of SV sets to be merged. This guarantees that no merging occurs across different sets and makes it possible to process different sets in parallel.

2. Sort the individual structural variation feature tuples in each set to be merged. Because the SVs within each individual are already ordered, a multi-way merge is applied to sort the SVs of all individuals with the genome coordinate site as the first key and the SV length as the second key, giving an ordered SV set to be merged.

3. Integrate the SVs in the ordered SV set to be merged, ensuring that SVs from the same individual are not merged together while merging the most similar SVs correctly as far as possible. To obtain higher integration accuracy, a two-round merging method is adopted: the first round makes a coarse division by genome coordinate site, and the second round also considers SV length and individual information to make a finer, more accurate division.

4. Calculate the population structural variation sites from the integration information, using the median to obtain more accurate population SV information.

Extraction of structural variation characteristic signals

First, a divide-and-conquer strategy is adopted: the SVs of each single sample are collected and processed in parallel, and the structural variation information in each individual input file is read in parallel. Specifically, all SVs within a specified range on each sample (typically a 10 Mbp region of one chromosome) are extracted first; the next round of extraction then starts 10 kbp before the end of the previous region and extends 10 Mbp forward, which effectively recalls SVs located at the boundary between adjacent intervals.
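The overlapping-window scan described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `extraction_windows` and its parameters are hypothetical, coordinates are treated as half-open intervals, and each new window is assumed to span 10 Mbp measured from its backtracked start (the text does not say whether the 10 Mbp is counted from the backtracked start or from the previous end).

```python
def extraction_windows(chrom_length, window=10_000_000, overlap=10_000):
    """Yield (start, end) half-open windows along one chromosome.

    Assumption: each new window starts `overlap` bp before the end of the
    previous one and spans at most `window` bp from that start.
    """
    start = 0
    while True:
        end = min(start + window, chrom_length)
        yield start, end
        if end >= chrom_length:
            break
        # Backtrack by `overlap` so SVs at the boundary are seen again.
        start = end - overlap


windows = list(extraction_windows(25_000_000))
```

For a hypothetical 25 Mbp chromosome this produces three windows, with each consecutive pair sharing a 10 kbp overlap.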

During extraction, a class is defined and, from each line of the input VCF file, the chromosome where the SV is located, the variation type of the SV, the start position of the SV and the length of the SV (for a translocation, the position of the variation on the other chromosome) are extracted; the sample to which the SV belongs is recorded as well, and together these form the variation feature. After all SVs in a region have been collected, they are stored by variation type to form a plurality of SV sets to be merged; each set obtained in this way contains all variations of one specific type within one specific chromosome interval across all individuals. Once all types of SVs over the whole chromosome have been scanned, the SV sets to be merged cover the whole chromosome and all variation types.
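A minimal sketch of such a feature class and of the grouping by variation type is shown below. The class name `SVFeature`, its field names, and the helper `group_by_type` are assumptions made for illustration; VCF parsing is deliberately left out, and the records are hand-written examples rather than real data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SVFeature:
    sample: str   # individual the SV was called in
    chrom: str    # chromosome where the SV is located
    svtype: str   # variation type, e.g. "DEL", "INS"
    pos: int      # genome coordinate site (start position)
    length: int   # SV length; for a translocation, the site on the other chromosome


def group_by_type(features):
    """Group features of one interval into sets keyed by (chromosome, type)."""
    groups = defaultdict(list)
    for feature in features:
        groups[(feature.chrom, feature.svtype)].append(feature)
    return dict(groups)


features = [
    SVFeature("A", "chr1", "DEL", 100, 50),
    SVFeature("B", "chr1", "DEL", 105, 52),
    SVFeature("A", "chr1", "INS", 400, 300),
]
groups = group_by_type(features)
```

Because each group holds a single variation type within a single interval, groups can be handed to separate workers without any risk of merging across groups.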

Ordering of individual structural variation feature tuples

Next, each SV set to be merged is processed separately. Because the extracted SVs come from the VCFs of different individuals, the SVs from all individuals are sorted before merging. Let S be the number of individuals to be merged. The SVs are sorted with the coordinate position of the SV on the reference genome as the first key and the SV length (for a translocation, the position of the variation on the other chromosome) as the second key. According to the VCF format rules, the SVs within each individual are already ordered, so the sorting operation amounts to merging S ordered SV sequences such that the merged sequence remains ordered. A multi-way merge is therefore used: first, a min-heap is established whose elements are the SV signal information extracted in the previous step and whose sort key is the coordinate site of the SV on the genome. By the min-heap property, the smallest element is at the top of the heap, so the top element is the SV with the smallest genome coordinate site in the heap.

When initializing the heap, each individual to be merged is traversed and its SV with the smallest genome coordinate (i.e., the first SV in that individual's sequence) is added to the heap, so the size of the heap is the number S of individuals to be merged. By the min-heap property, the top element is necessarily the SV with the smallest genome coordinate among the SVs of all individuals. This SV is removed from the heap and appended to the sorted result list; the next SV from the same individual is then added to the heap, and the heap is adjusted so that the smallest element is again at the top. During this operation the heap holds S elements (fewer once an individual has no SVs left): for each individual, the SV with the smallest genome coordinate among those not yet appended to the sorted result list.

Following this strategy, the SV with the smallest genome coordinate is taken out each time and appended to the sorted result list, and a new SV is added to the heap. Each SV appended to the sorted result list is therefore the smallest remaining one, and the final result is an ordered SV list containing the SVs of all individuals to be merged for the current interval and the current variation type.
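The multi-way merge above can be sketched with Python's standard `heapq` module. The function name `merge_sorted_svs` and the tuple layout `(pos, length)` are assumptions for illustration; each individual's list is assumed to be sorted already, as the text states for VCF input.

```python
import heapq


def merge_sorted_svs(per_individual):
    """Merge per-individual sorted (pos, length) lists into one ordered list.

    per_individual maps a sample name to its already sorted SV list.
    Heap entries are (pos, length, sample, index), so ties on position
    are broken by length, i.e. the second sort key.
    """
    heap = []
    for sample, svs in per_individual.items():
        if svs:  # individuals without SVs never enter the heap
            pos, length = svs[0]
            heap.append((pos, length, sample, 0))
    heapq.heapify(heap)

    ordered = []
    while heap:
        pos, length, sample, idx = heapq.heappop(heap)
        ordered.append((pos, length, sample))
        nxt = idx + 1
        if nxt < len(per_individual[sample]):
            npos, nlen = per_individual[sample][nxt]
            heapq.heappush(heap, (npos, nlen, sample, nxt))
        # Otherwise this individual is exhausted and the heap shrinks by one.
    return ordered


ordered = merge_sorted_svs({
    "A": [(100, 50), (900, 60)],
    "B": [(100, 40), (500, 70)],
    "C": [],
})
```

With S individuals and N SVs in total, the heap never holds more than S entries, so the merge costs O(N log S) comparisons.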

Integration of structural variant feature tuples

The SVs in the ordered list are then merged, which is the key step of SV joint detection: SVs shared by different individuals are merged into one population SV, i.e., similar SVs are integrated. During integration, SVs from the same individual must not be merged, because they represent two distinct variations in that individual; merging them would reduce two existing variations to one and reduce the diversity of the population SVs.

To ensure that SVs from the same individual are not merged while the most similar SVs are merged together as far as possible, a two-round merging method is adopted. In the first round, variations within a relatively close interval are collected together according to the coordinate sites of the SVs on the genome. In the second round, the SV length and the individual of origin are also considered, and the result of the first round is divided more finely, so that only SVs from different individuals with similar coordinate sites and similar lengths are merged together.

In the first round of merging, the ordered SV list obtained in the sorting step is traversed. When the coordinates of two adjacent SVs differ by less than 1500 bp, they are collected into the same set; otherwise the previous set is closed and the next SV starts a new set. The first round applies a relaxed merging rule: only the genome coordinate sites of the SVs are considered, and SVs with nearby sites are collected together, so a relatively large threshold is chosen.
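The first-round rule is a single pass over the sorted coordinates. In the sketch below the function name `first_round_clusters` is hypothetical and only positions are tracked, since this round ignores SV length and sample of origin.

```python
def first_round_clusters(sorted_positions, threshold=1500):
    """Split sorted coordinates into runs whose adjacent gaps are < threshold."""
    clusters = []
    for pos in sorted_positions:
        if clusters and pos - clusters[-1][-1] < threshold:
            clusters[-1].append(pos)   # close enough to the previous SV
        else:
            clusters.append([pos])     # previous set is closed; start a new one
    return clusters


clusters = first_round_clusters([100, 900, 2300, 4000, 5499, 7000])
```

Note that the comparison is chained between neighbours, so a set can span far more than 1500 bp in total; the second round is what separates such loosely connected SVs.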

In the second round of merging, each SV set collected in the first round is divided further using a matching-based pairwise merging method. First, the individuals in the set are sorted by their number of SVs, and the individual containing the most SVs is selected as the initial candidate merging result. Then, greedily, the individual containing the most SVs among the remaining individuals is merged with the candidate merging result each time, until all individuals have been merged into the candidate merging result.
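The greedy ordering of individuals can be sketched as below. The function name `greedy_merge_order` and the `(sample, pos, length)` tuple layout are assumptions, and breaking ties by sample name is an added assumption, since the text does not say how equal counts are ordered.

```python
from collections import defaultdict


def greedy_merge_order(cluster):
    """Return (order, by_sample) for one first-round set.

    cluster is a list of (sample, pos, length) tuples. `order` lists the
    samples by descending SV count; the first one seeds the candidate
    merging result and the rest are merged into it in turn.
    """
    by_sample = defaultdict(list)
    for sample, pos, length in cluster:
        by_sample[sample].append((pos, length))
    # Assumption: ties on count are broken alphabetically by sample name.
    order = sorted(by_sample, key=lambda s: (-len(by_sample[s]), s))
    return order, dict(by_sample)


order, by_sample = greedy_merge_order([
    ("A", 100, 50), ("B", 105, 52), ("A", 900, 60),
    ("C", 120, 49), ("B", 910, 58), ("A", 1500, 70),
])
```

Starting from the individual with the most SVs means that every later individual has no more SVs than the seed individual, which is the property the matching step relies on.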

When an individual is merged with the candidate merging result, a bipartite-graph maximum matching method is used, and the SVs of the two are merged by matching. Specifically, a bipartite graph is created first, in which each node represents the information of one SV, which may be an SV of a single individual or the result of merging several SVs. Edges are then connected between SVs that can be merged. Because SVs from the same individual cannot be merged, an edge can only connect two nodes that come from the individual and from the candidate merging result respectively; and because the individuals already merged into the candidate merging result do not include the individual currently being merged, individual information need not be considered when connecting edges. When two SVs satisfy formula (1), an edge representing their similarity is connected between them:

wherein bp₁ and bp₂ represent the genome coordinate sites of the two SVs, len₁ and len₂ represent the lengths of the two SVs, and bp'₁ and bp'₂ represent the sites of a translocation variation on the other chromosome. The edge weight is calculated by formula (2), which ensures that more similar SVs have a smaller edge weight:

after the graph is built, a bipartite graph can be obtained, two sides are respectively the to-be-combined individual and the candidate combination result, the sides connecting the bipartite graph represent the similarity between the SVs at two sides, and the process of combining the to-be-combined individual and the candidate combination result is to find a matching mode of the bipartite graph, so that the sum of the weights of the matched sides is minimum, namely the SVs which are more similar are matched. In matching, a best matching mode is obtained by applying a KM algorithm of biggest matching of the bipartite graph, preprocessing is firstly carried out on the bipartite graph to be matched according to an application scene of the KM algorithm, and the biggest matching is obtained by searching an augmented path through the KM algorithm, so that all edge weights in the graph are obtained by the opposite number, a new bipartite graph is obtained, and the biggest matching found in the new bipartite graph is the smallest matching of the original bipartite graph; meanwhile, the KM algorithm requires that the bipartite graph is a complete graph, so that for vertex pairs which are not connected on two sides of the bipartite graph, namely SV which cannot be combined according to a combination rule, one edge is also connected, but the weight of the SV is set to be positive infinity, and at the moment, if the vertex pairs are required to be matched, huge cost is required, and therefore the situation that the two vertices cannot be combined can be simulated. 
Meanwhile, according to a greedy merging strategy, the number of SVs in an individual to be merged is smaller than or equal to the number of SVs in a candidate merging result, so that some virtual SV nodes are added in the individual to be merged so that the total number of the virtual SV nodes is the same as the number of SVs in the candidate merging result, and as above, the edge weights connected by the nodes are set to be positive infinity, which means that the points cannot be matched. After pretreatment is carried out, the obtained bipartite graph can be subjected to maximum matching by applying a KM algorithm.
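The preprocessing and matching can be sketched as follows, under several stated assumptions. The sketch solves the equivalent minimum-cost form directly with the Hungarian (Kuhn-Munkres) algorithm instead of negating weights and maximizing; a large finite constant `BIG` stands in for the positive-infinity weight; and the edge weights are hypothetical inputs, because formulas (1) and (2) are not reproduced here. The names `pad_costs` and `min_cost_assignment` are illustrative only.

```python
BIG = 10**9  # assumption: finite stand-in for the "positive infinity" weight


def pad_costs(costs, n_candidates):
    """Complete and square the cost matrix.

    costs[i][j] is the edge weight between SV i of the individual and
    candidate j, or None when the merging rule gives no edge.
    """
    rows = [[BIG if c is None else c for c in row] for row in costs]
    while len(rows) < n_candidates:
        rows.append([BIG] * n_candidates)  # virtual SV node
    return rows


def min_cost_assignment(cost):
    """Hungarian algorithm on a square matrix; returns match[row] = column."""
    n = len(cost)
    u = [0] * (n + 1)      # row potentials
    v = [0] * (n + 1)      # column potentials
    p = [0] * (n + 1)      # p[j] = row matched to column j (1-based, 0 = free)
    way = [0] * (n + 1)    # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [float("inf")] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], float("inf"), 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augmenting path found
                break
        while j0:            # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match


# Two SVs in the individual, three candidates; weights are made up.
padded = pad_costs([[3, None, 8], [None, 2, 4]], n_candidates=3)
match = min_cost_assignment(padded)
```

Using a finite `BIG` keeps the arithmetic on potentials well defined; it should simply be much larger than any real edge weight so that a real edge is always preferred when one exists.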

After the maximum matching is completed, the matching result is obtained and SV merging information is extracted from each matched vertex pair. If the edge weight between a matched vertex pair is not infinite, the two vertices are a best match: the SV represented by the vertex in the individual to be merged is merged with the SV represented by the vertex in the candidate merging result, and the SV information in the candidate merging result is updated. If the edge weight between a matched vertex pair is infinite, the two vertices cannot be matched; if the corresponding SV in the individual to be merged is not a virtual node, no SV in the candidate merging result can be merged with it, and it is added directly to the candidate merging result as a new candidate SV.
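How a matching result might be applied is sketched below. The function name `apply_matching`, the data layout, and the finite constant `BIG` used in place of infinity are assumptions for illustration; the matching and cost matrix are given by hand here rather than computed.

```python
BIG = 10**9  # assumption: finite stand-in for an infinite edge weight


def apply_matching(candidates, individual_svs, match, cost):
    """Fold one individual's SVs into the candidate merging result.

    candidates     : list of candidate SV sets (each a list of SVs)
    individual_svs : real SVs of the individual; rows beyond this list
                     in `cost` are virtual nodes and are skipped
    match[i]       : candidate column matched to row i
    cost           : padded square cost matrix used for the matching
    """
    for i, sv in enumerate(individual_svs):
        j = match[i]
        if cost[i][j] < BIG:
            candidates[j].append(sv)   # best match: merge into that candidate
        else:
            candidates.append([sv])    # no mergeable candidate: new candidate SV
    return candidates


candidates = [[("A", 100, 50)], [("A", 900, 60)]]
individual = [("B", 105, 52), ("B", 3000, 80)]
cost = [[4, BIG], [BIG, BIG]]
result = apply_matching(candidates, individual, match=[0, 1], cost=cost)
```

Appending unmatched SVs as new candidates is what lets the candidate merging result grow, so it always has at least as many entries as any individual merged later.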

Calculation of population structural variation sites

After two rounds of merging, the candidate merging results of the SVs are obtained, and the final population SV information is generated from them. For each candidate SV set, the median of the genome coordinate sites and the median of the SV lengths of all the individual SVs it contains are taken as the coordinate site and length of the population SV, which preserves the SV information to the greatest extent.
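The median step can be sketched with the standard `statistics` module. The function name `population_sv` and the `(sample, pos, length)` tuple layout are assumptions; note also that `statistics.median` averages the two middle values for an even number of SVs, a choice the text does not specify.

```python
from statistics import median


def population_sv(candidate_set):
    """Return (site, length) of the population SV for one candidate SV set."""
    positions = [pos for _, pos, _ in candidate_set]
    lengths = [length for _, _, length in candidate_set]
    return median(positions), median(lengths)


site, length = population_sv([("A", 100, 50), ("B", 104, 52), ("C", 110, 58)])
```

The median is less sensitive than the mean to a single individual whose breakpoint or length estimate is far from the others.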

It should be noted that the detailed description is merely for explaining and describing the technical solution of the present invention, and the scope of protection of the claims should not be limited thereto. All changes which come within the meaning and range of equivalency of the claims and the specification are to be embraced within their scope.

Claims

1. The three-generation group genome structural variation joint detection method is characterized by comprising the following steps of:

combining by two rounds;

step five: after the processing of the multiple groups of SV sets to be combined is finished, acquiring a set of coordinate sites and lengths of the group SVs, namely finishing the joint detection;

the specific steps of the first step are as follows:

after all the SVs in the region are collected, storing the SVs according to different mutation types to form a plurality of groups of SV sets to be combined;

the specific steps of the second step are as follows:

2. The method for joint detection of genomic structural variation of a third generation population according to claim 1, wherein the threshold in the first round of merger is 1500bp.

3. The method for the joint detection of genomic structural variation of three generation groups according to claim 2, wherein the specific steps of the first round of merging are as follows:

4. The method for joint detection of genomic structural variation of three generation populations according to claim 3, wherein the specific steps of the second round of merging are as follows:

5. The method for joint detection of genomic structural variation of three generation groups according to claim 4, wherein the specific step of selecting the individual with the largest number of SVs among the remaining individuals in the subset to be combined with the initial candidate combination result comprises the following steps:

firstly, establishing a bipartite graph, wherein each node in the bipartite graph represents a variation characteristic, and connecting two variation characteristics respectively from an individual to be combined and a candidate combination result, namely two nodes respectively from the individual to be combined and the candidate combination result, and the variation characteristics corresponding to the two nodes meet the following formula:

6. The method for joint detection of genomic structural variation of three generation groups according to claim 5, wherein the specific steps of performing maximum matching on the new bipartite graph by using the KM algorithm to obtain a matching result, and performing SV combination according to information of each matching result are as follows:

7. The joint detection method for structural variation of genome of three generation groups according to claim 6, wherein the side length p is expressed as:

8. The method for joint detection of genomic structural variation of a third generation population according to claim 7, wherein the specific steps of the fourth step are as follows: