CN115631789A

CN115631789A - Pangenome-based group joint variation detection method

Info

Publication number: CN115631789A
Application number: CN202211313819.4A
Authority: CN
Inventors: 刘亚东; 刘博�; 李高阳; 张安琪; 王亚东
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2023-01-20
Anticipated expiration: 2042-10-25
Also published as: CN115631789B

Abstract

A group joint variation detection method based on pan-genome relates to the technical field of gene biology. The invention aims to solve the problems of low comparison efficiency, weak mutation detection sensitivity and inaccurate detection result of the existing group combined mutation detection method. The invention comprises the following steps: acquiring a reference genome and a detected sample genome, identifying candidate variation sites on the detected sample genome by using the reference genome and the detected sample genome, and classifying the candidate variation sites; constructing a local haplotype sequence by using the candidate variant sites according to the types of the candidate variant sites, and integrating the constructed local haplotype sequence among samples to obtain an integrated haplotype sequence set; detecting the genotype of the variation site of the detected sample genome by using the integrated haplotype sequence; traversing all detected sample genomes in the population, and repeatedly executing the steps from one step to three to obtain all variant genotypes in the population. The invention is used for group genome joint variation detection.

Description

Pangenome-based group joint variation detection method

Technical Field

The invention relates to the technical field of gene biology, in particular to a population joint variation detection method based on pan-genome.

Background

Mutation detection, i.e., the detection of nucleotide differences between an individual sample and a reference genome, is the basis of numerous genomic studies, enabling the discovery of associations between genomic variations and important phenotypes and diseases. In the genome, there are a large number of variations with small variations in base sequence, such as SNV and Indel, which are the most common types of variations among human heritable variations, accounting for 90% or more of the known variations. The genome sequences of different individuals have 99.99% homology, and SNV/Indel plays a decisive role in the individualization characteristics and phenotype of organisms.

With the progress of high throughput sequencing technologies, more and more bioinformatics tools are being developed for detecting genomic variants, and among them, a series of variants detection methods represented by GATK are widely used and have good performance in detecting single-sample variants. Different populations have certain differences in the aspects of heredity, variation and the like, so that different genetic characteristics are shown, and combined variation detection needs to be carried out on the basis of large-scale populations in order to depict the differences among different populations and the occurrence frequency of variation in the populations. The essence of large-scale genome planning and personal genome sequencing tasks is to analyze the inherent heredity and variation information of individual and group genomes by means of a series of high-reliability and high-precision data analysis algorithms, depict differences among different groups, find pathogenic genes or susceptibility genes of various diseases, reveal molecular mechanisms of disease occurrence and development, and realize accurate diagnosis and treatment of diseases.

At present, a joint variation detection method is mainly adopted in large-scale genome data analysis, a current mainstream algorithm is a BWA + GATK detection workflow developed by scientific research institutions of America, english and other countries, and a closed algorithm ecology is formed. However, this method still has problems in alignment efficiency, mutation detection sensitivity and accuracy, and is increasingly difficult to adapt to large-scale genome projects of increasing scale.

Disclosure of Invention

The invention aims to solve the problems of low comparison efficiency, weak mutation detection sensitivity and inaccurate detection result of the existing group combined mutation detection method, and provides a group combined mutation detection method based on a pan-genome.

A group joint variation detection method based on pan-genome comprises the following specific processes:

step one, acquiring a reference genome and a detected sample genome, identifying candidate variation sites on the detected sample genome by using the reference genome and the detected sample genome, and classifying the candidate variation sites:

firstly, carrying out interval block division on a reference genome, and comparing the reference genome subjected to interval block division with a measured sample genome to obtain a comparison information set; stacking the comparison information according to the coordinate sequence of the comparison information in the reference genome to obtain a stacking result; then, counting stacking results, and identifying the category of the candidate mutation sites on the detected sample genome by using the stacking results;

the classes of candidate mutation sites on the tested sample genome include: high confidence candidate variant sites, low complexity region candidate variant sites;

secondly, constructing a local haplotype sequence by utilizing the candidate variation sites according to the types of the candidate variation sites, and integrating the constructed local haplotype sequence among samples to obtain an integrated haplotype sequence set;

step three, detecting the genotype of the variation site of the detected sample genome by using the integrated haplotype sequence:

firstly, carrying out non-gap comparison on a sequence fragment of a haploid sequence set and a comparison information set to obtain the sum of the mass of accumulated mismatched bases compared by each sequence fragment non-gap as S, and then calculating the genotype of a variation point by using the sum of the mass of accumulated mismatched bases compared by each sequence fragment non-gap as S;

and step four, traversing all detected sample genomes in the population, and repeatedly executing the steps one to three to obtain all variant genotypes in the population.

Further, the step one of acquiring a reference genome and a measured sample genome, identifying candidate mutation sites on the measured sample genome by using the reference genome and the measured sample genome, and classifying the candidate mutation sites includes the following steps:

step one, acquiring a reference genome, and carrying out interval block division on the reference genome to obtain a divided reference genome B = B ₁ ，B ₂ ，...，B _n ；

Wherein, B ₁ ，B ₂ ，...，B _n Respectively representing each divided interval block, wherein n is the total number of the interval blocks;

the interval block partitioning is according to "chromosome: dividing the format of the starting position-the ending position';

there is no overlap between the interval blocks;

step two, obtaining a detected sample genome, comparing the detected sample genome with divided reference genomes to obtain a detected sample genome B _i The inner comparison information set;

step one, step three, the obtained product in step two falls on B _i Stacking all the comparison information in the reference genome according to the coordinate sequence of the comparison information in the reference genome to obtain a stacking result;

and step four, counting the stacking results obtained in the step three in each interval block of the divided reference genome, recording sequence information of the stacking results in each interval block through dictionary variables, identifying candidate variation sites of the tested sample according to the stacking result sequence information recorded by the dictionary, and classifying the candidate variation sites.

Further, the falling point B obtained in the second step in the first step and the third step _i All alignment information in the reference genome is stacked according to the coordinate sequence of the alignment information in the reference genome, and the method comprises the following steps:

firstly, filtering comparison information which is not compared, is subjected to suboptimal comparison and split comparison according to the value of a comparison information 'flag' field in a comparison information set; filtering comparison information with the 'mapq' domain value smaller than 10 to obtain a filtered comparison information set;

wherein the value of the unmatched comparison information 'flag' field is 4; the value of the 'flag' field of the suboptimal comparison information is 256; the value of the 'flag' field of the split comparison information is 2048;

then, stacking the filtered comparison information according to the coordinate sequence of the comparison information in the reference genome by using a comparison result analysis tool to obtain a stacking result;

the stacking result comprises: reference sequence name, genome coordinates, reference base, number of sequencing fragments covering the current position, base sequence of the current position, base quality ASCII code sequence of the current position.

Furthermore, in the first step four, the stacking results obtained in the third step are counted in each interval block of the divided reference genome, the sequence information of the stacking results in each interval block is recorded through the dictionary variables, the candidate variant loci of the tested sample are identified according to the stacking result sequence information recorded by the dictionary, and the candidate variant loci are classified, which comprises the following steps:

step one, counting stacking results obtained in the step one and step three in each interval block of the divided reference genome, and recording the base appearing at each genome position in each interval block, the sequence information of insertion and deletion and the frequency information of each base sequence through dictionary variables:

firstly, recording base sequence information of base appearing at each genome position in a block, and base sequence information of insertion and deletion through dictionary variables;

the dictionary variables include: pileup _ fact;

the pileup _ dit is used for acquiring the occurrence times of a base sequence, an inserted base sequence and a deleted base sequence;

then, the frequency of occurrence of each base was obtained from the number of occurrences of each base sequence recorded in the pileup _ dit dictionary variable:

P _j1∈keys ＝C _j1 /T

wherein T is the total number of occurrences of all bases, keys = { A, C, G, T, I, D } is a set of bases, j1 is a base in keys, C _j1 Indicates the number of occurrences of base j 1;

step one, step two, obtaining the occurrence frequency of each base by utilizing the step one, step four, and obtaining the candidate variation site:

if at least one non-reference sequence base occurs at a genomic position _j1∈keys ＞T _alt If the current position is the candidate mutation site;

wherein, T _alt Is a frequency threshold;

the candidate mutation sites are represented by 5-tuple, and comprise: genomic position, reference base, non-reference base, frequency of occurrence of non-reference base sequence;

and step four and step three, classifying the candidate variation sites obtained in the step four and step two.

Further, the step one, four and three of classifying the candidate mutation sites obtained in the step one, four and two of the steps specifically includes:

candidate variation sites are classified into the following three categories: high confidence candidate variation sites HCs, low confidence candidate variation sites LCs and low complex region candidate variation sites LCCs;

HCs are determined when the candidate mutation sites satisfy the following three characteristics:

(1) Frequency of occurrence P of a certain non-reference base _HCs Greater than a second frequency threshold T _HCs I.e. P _HCs ＞T _HCs ；

(2) Taking the position of the current candidate mutation site on the detected sample genome as the center, and not containing other candidate mutation sites in the windows extending w-bp at the upstream and the downstream;

(3) The position of the current candidate mutation site on the detected sample genome does not belong to a low complexity region, and comprises the following steps: the GAP region after genome splicing constitutes a heterochromatin domain; segment repetition; a pseudo-autosomal region of a sex chromosome; short tandem repeats;

LCs are determined when the candidate mutation site satisfies one of the following four characteristics:

(1) Frequency of occurrence P of a certain non-reference base _LCs At a second frequency threshold T set by the user _HCs A third frequency threshold T preset by the user _LCs In time between, i.e. T _LCs ＜P _LCs ＜T _HCs ；

(2) The position of the current candidate variation site is a candidate multi-allelic variation site;

(3) Other candidate mutation sites are contained in windows which respectively extend w-bp upstream and downstream by taking the position of the current candidate mutation site on a detected sample genome as a center;

(4) The current position belongs to a non-low complexity area;

when none of the candidate mutation sites satisfies the judgment conditions of HCs and LCs, the candidate mutation sites are classified as LCCs.

Further, the second step of constructing a local haplotype sequence by using the candidate mutation sites according to the types of the candidate mutation sites, and integrating the constructed local haplotype sequence among samples to obtain an integrated haplotype sequence set includes the following steps:

step two, dividing each interval block of the divided reference genome into windows with fixed length, traversing all candidate mutation points in the interval block, and distributing all the candidate mutation points in the interval block into an optimal window:

firstly, dividing each interval block of a divided reference genome into a plurality of windows with the length W, and setting the overlapping length between adjacent windows;

then, all candidate variant sites are traversed within each bin of the partitioned reference genome and assigned to the optimal window:

when the candidate mutation site is offset from the window start position

In between, thenThe current window is the optimal window of the current candidate variation site;

secondly, constructing a haplotype sequence of the sample to be detected in each window according to the type of the candidate variation sites distributed to the optimal window;

step two, integrating the haplotype sequences of the tested sample in the window to obtain an integrated haplotype sequence set:

firstly, traversing all the tested sample haplotype sequences in a window, and carrying out duplication elimination according to the variation contained in the sequences to finally generate a non-repetitive sample haplotype sequence set in the window;

then, go through all windows and generate the haplotype sequence set h = { h) of each window in turn ₁ ，h ₂ ，...，h _n }。

Further, the second step of constructing a haplotype sequence of the sample to be tested in each window according to the candidate variant locus types allocated to the optimal window comprises the following steps:

step two, aiming at each window, judging whether each window contains a candidate variation site or not, and if not, determining that the reference genome sequence is the haplotype sequence of the current window; if the candidate variation sites are contained, judging whether the window contains the low confidence coefficient candidate variation sites, if the window contains the low confidence coefficient candidate variation sites, executing the second step to obtain the constructed tested sample haplotype sequence, and if the window only contains the high confidence coefficient candidate variation sites, executing the second step to obtain the constructed tested sample haplotype sequence;

and step two, obtaining the haplotype sequence of the tested sample by using a de Bruijn graph-based splicing method in a window:

extracting all sequencing fragments in the window, and filtering out the sequencing fragments with the overlapping degree of less than 10 percent of the window;

setting an initial k-mer =41bp, constructing a local de Bruijn graph by using the remaining sequencing fragments, identifying all possible consistency sequences from the local de Bruijn graph through a depth-first search algorithm, sequencing the support degrees of all possible consistency sequences according to the sequencing fragments, and selecting a plurality of contigs with the highest support degree;

all contigs generated at k-mer =41 were added to the sequencing fragment, k-mer =46 was set, the local deBruijn graph was reconstructed, and contigs were regenerated:

setting the step length to be 5bp, finishing k-mer =75bp, and repeating the operation in sequence until contigs, namely the haplotype sequence of the sample to be detected, is generated in the last step;

extracting all sequencing fragments covering the current window from the comparison information set by using a comparison analysis tool, and acquiring a sequencing fragment ID supporting each candidate variation site in the comparison information set; taking intersection of the sequencing fragment IDs supporting any two candidate variation sites, if the number of the sequencing fragments contained in the intersection is not less than a number threshold, considering that the variation corresponding to the two candidate variation sites appears on the same haplotype sequence, and replacing the base at the corresponding position on the reference sequence with a variation base to construct a haplotype sequence to obtain the haplotype sequence; if two candidate variation sites are not supported by any identical sequenced fragment, they are considered to be from different haplotype sequences, and haplotype sequences are constructed based on the candidate variation and reference genomic sequences contained on the different haplotypes, respectively.

Further, the step three of detecting the genotype of the variation site of the detected sample genome by using the integrated haplotype sequence comprises the following steps:

step three, carrying out non-gap comparison on the haploid sequence set obtained in the step two and the sequencing fragments of the comparison information set to obtain the sum S of the quality of the accumulated mismatched bases of each sequencing fragment non-gap comparison:

step three, grouping the candidate variant sites and haplotype sequences of each window:

firstly, all haplotype sequences containing candidate variation sites are selected from a haplotype sequence set and are marked as H; the rest haplotype sequences do not contain candidate variation and are marked as R;

then, grouping candidate variant sites contained in the tested sample: if the distance between the two candidate variant loci is less than the distance threshold, then treating the two candidate variant loci as a group;

step three or two, the sum S of the quality of the accumulated mismatched bases of each sequencing fragment non-gap comparison is obtained by using the candidate variant sites grouped by each window and the grouped haplotype sequence:

firstly, for each group of candidate variation sites obtained in the third step, all haplotype sequences containing the group of variation are selected from H and recorded as H, and all sequencing fragments in a window are extracted from a comparison information set;

then, respectively mapping complete matching between all sequencing fragments and the haplotype sequence and the reference sequence to obtain a complete matching block with the longest length, and determining the initial specific position as p;

then, starting from the initial position of the alignment, respectively extending single bases to the left and the right, and accumulating the base quality of mismatched bases when encountering base mismatching in the extension process; stopping base extension when the sum of the qualities of the extended mismatched bases at the two ends of the sequence or the accumulated quality of the mismatched bases is larger than the base quality threshold;

finally, selecting a non-repeated haplotype sequence from R as R, and respectively carrying out base extension of non-gap to obtain the sum of the quality of the accumulative mismatched bases compared with each sequencing fragment non-gap as S;

and step two, acquiring the genotype of the mutation point by utilizing the sum S of the accumulated mismatched base quality compared by each sequencing fragment non-gap obtained in the step one.

Further, in the second step, the genotype of the mutation point is obtained by using the sum S of the accumulated mismatched base quality of each sequenced fragment non-gap alignment obtained in the first step, and the method comprises the following steps:

step three, step two, obtaining the optimal comparison probability set of each sequencing fragment after comparing all sequencing fragments with the monomer sequences in h and r respectively by utilizing the accumulated mismatch base quality sum S of each sequencing fragment non-gap comparison obtained in step three

Obtained by the following method: let the haplotype sequence list containing the mutation h = { h = { h ₁ ，h ₂ ，...，h _n’ The set of probabilities of aligning the sequencing fragment to the haplotype is p = { p ₁ ，p ₂ ，...，p _n′ Selecting the maximum probability p _max As the probability of the alignment of the currently sequenced fragment to the haplotype containing the variation, the corresponding haplotype is h _max ；

Wherein k belongs to [1, m ] is the label of the probability in the optimal alignment probability set of each sequencing fragment; m is the total number of probabilities obtained, n' is the number of haplotypes within the window;

p _j obtained by the following formula:

wherein j ∈ [1, n' ] is the index of the probability that the sequencing fragment aligns to n haplotypes;

and step three and two, calculating the likelihood probability of different genotypes by utilizing the P and Pr obtained in the step three and two, and taking the maximum likelihood function as the variant genotype in the current window, wherein the maximum likelihood function is as follows:

further, in the fourth step, all the detected sample genomes in the population are traversed, and the first step to the third step are repeatedly executed to obtain all the variant genotypes in the population, specifically:

traversing candidate variation sites in all windows, and aiming at a certain variation site, if all samples have no variation signal at the site or have variation signals but the genotype of the detected variation is 0/0, determining that all individuals of the current population have no variation, and not outputting; and if the genotype of the sample at the position is 0/1 or 1/1, sequentially traversing all the samples, outputting the genotype of each detected sample at the position, and if a certain position is a multiallelic gene, splitting the multiallelic gene into a plurality of biallelic genes and sequentially outputting the multiallelic genes.

The invention has the beneficial effects that:

according to the method, a large amount of local haplotype information in a group genome is utilized to carry out sequencing fragment re-comparison, and group information is utilized to carry out characteristic statistical analysis, so that the sensitivity and accuracy of mutation detection are effectively improved; in addition, the invention utilizes the high-speed comparison characteristic of the local haplotypes, and obviously improves the efficiency of mutation detection.

Drawings

FIG. 1 is a schematic diagram of the present invention.

Detailed Description

The first embodiment is as follows: as shown in fig. 1, the population joint variation detection method based on pan-genome according to the present embodiment comprises the following specific processes:

the method comprises the following steps of firstly, obtaining a reference genome and a detected sample genome, identifying candidate variation sites on the detected sample genome by using the reference genome and the detected sample genome, and classifying the candidate variation sites, wherein the method comprises the following steps:

step one, acquiring a reference genome, and performing interval block division on the reference genome to obtain a divided reference genome B = { B = { (B) } ₁ ，B ₂ ，...，B _n }：

The length of interval block division of the reference genome is 10Mbps;

the interval block division is according to "chromosome: dividing the format of the starting position-the ending position';

there is no overlap between the interval blocks;

wherein, B ₁ ，B ₂ ，...，B _n Each representing each partitioned interval block, n being the total number of interval blocks, i = 1., n;

step two, obtaining a detected sample genome, comparing the detected sample genome with the divided reference genome, and obtaining the detected sample genome falling on B _i Inner alignment position, thereby obtaining the genome of the test sample falling in B _i The comparison information set in the database;

the comparison information is stored in the following three file forms: SAM, BAM, CRAM;

step one, step three, the obtained product in step two falls on B _i Stacking all the comparison information according to the coordinate sequence of the comparison information in the reference genome to obtain a stacking result:

firstly, before stacking (projection), the comparison information of the unmatched (flag = 4), the suboptimal (flag = 256) and the split comparison (flag = 2048) is filtered according to the value of the 'flag' field in the comparison information; filtering the comparison information with the 'mapq' domain value smaller than 10 to obtain a filtered comparison information set;

then, inputting the filtered comparison information set into an mpileup functional module in an existing comparison result analysis tool SAMtools to stack (pileup) according to a coordinate sequence of the mpileup functional module in a reference genome, namely projecting a spatial position to obtain a stacking result;

the stacking result includes the base coverage and corresponding base quality of each position in the reference genome, and there are 6 columns in total, which are: reference sequence name, genome coordinates, reference base, number of sequencing fragments covering the current position, base sequence of the current position, base quality ASCII code sequence of the current position.

The sample genome with candidate variation is mainly embodied in the base sequence of the current position, and comprises related information such as matching, mismatching, insertion, deletion, strand comparison, comparison quality and the like, and the structure is relatively complex. The explanation is as follows: 1) ' represents a positive strand match to a reference sequence; 2) ',' represents a negative strand match to the reference sequence; 3) 'ATCGN' represents a mismatch on the positive strand; 4) 'atcgn' represents a mismatch on the minus strand; 5) ' denotes ambiguous bases; 6) 'Λ' represents that the matched base is the beginning of a read; ' following the next ascii code minus 33 represents the quality of the alignment; these two symbols modify the following base, the immediately following base (ATCGATcggNn) representing the first base of the read; 7) '$' represents the end of a read, the symbol modifying the base preceding it; 8) The canonical formula '\ + [0-9] + [ ACGTNacgtn ] +' represents the inserted base after the site, e.g., '+3agg' represents the insertion of 3 bases here; 9) The canonical formula '- [0-9] + [ ACGTNacgtn ] +' represents the missing base after the site, e.g. '-4 CTGA'. Represents a deletion of 4 bases here;

note that the column five is given "^ arbitrary characters" + "

“-[0-9]+[ACGTNacgtn]+”、“+[0-9]+[ACGTNacgtn]+ "deletion, the number of bases is identical to the number of aligned quality ASCII characters in column six.

Step four, counting the stacking results obtained in the step three in each interval block of the divided reference genome, recording sequence information of the stacking results in each interval block through dictionary variables, identifying candidate variation sites of the tested sample according to the stacking result sequence information recorded by the dictionary, and classifying the candidate variation sites;

step four, counting stacking results obtained in the step three in each interval block of the divided reference genome, and recording base appearing at each genome position in each interval block, sequence information of insertion and deletion and frequency information of each base sequence through two dictionary variables;

first, information on the base sequence of occurrence, insertion, and deletion at each genome position within a block is recorded by dictionary variables:

the dictionary variable is pileup _ ditt, is used for acquiring the occurrence frequency of the base sequence, the inserted base sequence and the deleted base sequence, and specifically comprises the following steps: key (Key Value) is a genome position, value (Value) is all possible base sequences covering the current position; all possible base sequences covering the current position include matches, mismatches ("ACGTN"), insertions ("I") and deletions ("D"), and the number of occurrences of each base sequence is recorded. For example: pileup _ fact = { "a": ca, "C": cc, "G": cg, "T": ct, "N": cn, "I": ci, "D": cd, wherein A, C, G and T represent four bases, I and D are respectively represented by insertion and deletion marks, and N represents an unknown base. Ca, cc, cg, ct, ci, cd, cn respectively represent the number of times that the above base sequences appear at the current position.

P _j1∈keys ＝C _j1 /T

wherein, T = ∑ Σ _j1∈keys C _j1 Is the total number of occurrences of all bases, C _j1 Represents the number of occurrences of the base j1, keys = { a, C, G, T, I, D }, j1 is a certain base in keys;

step one, step two, obtaining candidate variation sites by using the frequency of each base obtained in the step one, step two:

if at least one non-reference sequence base is present at a genomic position that satisfies the frequency P of its occurrence _j1∈keys ＞T _alt (default: 0.2), where T _alt And if the frequency threshold is preset for the user, the current position is a candidate variation site, and the current position is represented by a 5-tuple, and the genome position (ctgpos), the reference base (refB), the non-reference base (altB) and the frequency (P) of appearance of the non-reference base sequence are respectively recorded _alt )。

The base group comprises: a reference base and a non-reference base, the non-reference base representing a possible variation;

it is noteworthy that if there are multiple non-reference sequence bases present, the frequency of occurrence is greater than the frequency threshold T _alt Then, the current position is called a candidate multi-allelic variation site (multi-allelic site), and a plurality of possible sites of the site are recorded at the same timeEnergy non-reference base information.

Step one, step three, classifying the candidate variation sites obtained in the step one, step two:

candidate mutation sites are classified into the following three categories: high confidence candidate variant Sites (HCs), low confidence candidate variant Sites (LCs), and low complex region candidate variant Sites (LCCs);

a candidate mutation site is classified as HCs when it has the following three features:

(1) Frequency of occurrence P of non-reference base sequence classified into HCs _HCs Is greater than a second frequency threshold T preset by the user _HCs I.e. P _HCs ＞T _HCs (default: 0.5);

(2) The candidate mutation sites are not included in windows extending w-bp upstream and downstream from the position of the current candidate mutation site on the genome of the sample to be tested (default w =100 bp). When at least two candidate variant sites appear in the same window, the position deviation and the deviation of variant sequences of the two candidate variant sites with closer distance appear due to the systematic deviation of the alignment frameworks or the scoring systematic deviation of Smith-Waterman.

(3) The current genomic position does not belong to low-complexity regions, including: the GAP region after genome splicing constitutes a heterochromatin domain (constitutive heterocyclic domains); segment repetition (segment repetition); pseudo-autosomal region of sex chromosome (the) pseudo-autol regions soft se chroma somes); short tandem repeats (short tandem repeats);

the reference genome is obtained by splicing sequencing fragments, but is not a complete sequence, and a plurality of position regions exist in the middle, namely GAP;

the GAP region after the genome splicing comprises: centromeres, telomeres, telomes.

A candidate mutation site is classified as LC when it has one of the following four characteristics _s ：

(1) Frequency of occurrence P of non-reference base _LCs At a second frequency threshold T set by the user _HCs A third frequency threshold T preset by the user _LCs In time between, i.e. T _LCs ＜P _LCs ＜T _HCs ，T _LCs Default to 0.2.

(2) The position of the current candidate mutation site is a candidate multiallelic mutation site (multi-allelic site).

(3) The current candidate mutation site is included in a window extending by w-bp (default w =100 bp) upstream and downstream from the current candidate mutation site on the genome of the sample to be tested.

(4) The current location does not belong to a low complexity region.

Otherwise, when one candidate variable locus does not meet the judgment conditions of HCs and LCs, the candidate variable locus is divided into LCCs.

Step two, constructing a local haplotype sequence according to the type of the candidate variable point, and integrating the constructed local haplotype sequence among samples to obtain an integrated haplotype sequence, wherein the method comprises the following steps of:

step two, dividing each interval block of the divided reference genome into windows with fixed length, traversing all candidate variable sites in the interval block, and distributing all the candidate variable sites in the interval block into an optimal window:

because the reference genome sequence has a large block length and includes genome structure variation, partial region variation is complex, and the construction of a haplotype sequence is difficult. Therefore, the invention divides the interval block into a plurality of windows with fixed length W =300bp, the overlapping length between adjacent windows is 150bp, and the local haplotype sequence in the W window is constructed for local genome variation detection.

Traversing all candidate mutation sites in the interval block and distributing the candidate mutation sites into an optimal window, namely, the candidate mutation sites fall at the center position of the window as much as possible, specifically, when the offset of the candidate mutation sites compared with the starting position of the window is within the range

In the middle ofThe current window is considered to be the optimal window.

step two, aiming at each window, judging whether each window contains a candidate variation site or not, and if not, determining the reference genome sequence in the window as the haplotype sequence of the current window; if the candidate variation sites are contained, judging whether the window contains the low confidence candidate variation sites, if the window contains the low confidence candidate variation sites, executing the second step to obtain the constructed detected sample haplotype sequence, and if the window only contains the high confidence candidate variation sites, executing the second step to obtain the constructed detected sample haplotype sequence;

step two, when the window contains the candidate variant sites with low confidence coefficient, generating a consistent sequence (contigs) of a sequencing sequence around the high-quality candidate variant sites by using a splicing method based on a de Bruijn graph in the window, wherein the consistent sequence (contigs) is a possibly existing haplotype sequence:

first, all sequencing fragments within the window were extracted and sequencing fragments that overlapped the window by less than 10% were filtered out.

Setting initial k-mer =41bp, constructing a local de Bruijn graph, identifying all possible consistency sequences from the graph through a depth-first search algorithm, sequencing the support degrees of the consistency sequences according to sequencing fragments, and selecting a plurality of contigs (default 2) with the highest support degree;

adding all contigs generated when the k-mer =41 into a sequencing fragment, setting the k-mer =46, reconstructing a local deBruijn graph, and regenerating contigs according to the steps;

setting the step length to be 5bp, finishing the k-mer =75bp, and repeating the operation in sequence until contigs are generated in the last step.

And step two and three, when only high-confidence candidate sites are contained in the window, extracting all sequencing fragments covering the current window from the comparison information set through a view function module in an SAMtools analysis tool, and acquiring the sequencing fragment ID supporting each candidate variant site in the comparison information set (when the comparison information set contains the same base sequence as the candidate site, the sequencing fragment is considered to support the candidate variant site). Taking intersection of sequencing fragment IDs supporting any two candidate mutation sites, if the number of sequencing fragments contained in the intersection is not less than a user-defined number threshold (default is 2), considering that the mutation corresponding to the two candidate mutation sites appears on the same haplotype sequence, and replacing the base at the corresponding position on the reference sequence with the variant base to construct the haplotype sequence, so as to obtain the haplotype sequence; otherwise, if the haplotype sequences are considered to be from different haplotype sequences, then the haplotype sequences are constructed according to the candidate variation and the reference genome sequence contained in different haplotypes.

first, in a window, all the tested sample haplotype sequences are traversed, and deduplication is performed according to the candidate variations contained in the sequences, so as to finally generate a non-repetitive sample haplotype sequence set in the window. Traversing all windows and sequentially generating a haplotype sequence set h = { h } of each window ₁ ，h ₂ ，...，h _n }。

The de-weighting specifically comprises the following steps: if the haplotype sequences for both samples contained the same variation, only one haplotype sequence was retained. Within a window, different sites may contain different types of variation,

step three, detecting the genotype of the variation site of the genome of the population sample according to the integrated haplotype sequence, comprising the following steps:

step three, carrying out non-gap comparison (non-gap permission comparison) on the haploid sequence set obtained in the step two and the sequencing fragments of the comparison information set, and obtaining that the sum of the quality of the accumulated mismatched bases of each sequencing fragment non-gap comparison is S:

step three, grouping the candidate variant loci and haplotype sequences of each window:

firstly, in a window, sequentially traversing the candidate variation site information of each tested sample, and selecting all haplotype sequences containing the candidate variation sites from the haplotype sequence set according to the tested candidate variation site information, wherein the haplotype sequences are marked as H, and the rest haplotypes do not contain the candidate variation and are marked as R.

Then, grouping candidate variant sites contained in the tested sample: two candidate mutation sites are considered as a group if the distance between them is smaller than the user-defined distance threshold (default: 5 bp).

firstly, for each group of candidate variation sites obtained in the third step, all haplotype sequences containing the group of variation are selected from H and recorded as H, and all sequencing fragments in a window are extracted from an alignment information set.

Then, by mapping all the sequencing fragments and the perfect matches (exact matches) between the haplotype sequences and the reference sequence in the local region respectively, the maximum perfect match, i.e. the perfect match block with the longest length, is obtained, and the initial specific position is determined as p. Starting from the alignment starting position, single base extension was performed to the left and right sides, respectively, and when a base mismatch was encountered during extension, the base quality of the mismatched base (ASCII value-33) was accumulated. When the base extension is extended to both ends of the sequence or the sum of the cumulative mismatched base masses is larger than the base mass threshold (default: 300) set by the user, the base extension is stopped. Similarly, the haplotype sequences which did not overlap with each other were selected from R and denoted as R, and non-gap bases were extended. The sum of the cumulative mismatched base masses obtained by aligning non-gap of each sequenced fragment is recorded as S.

Step two, acquiring the genotype of the mutation point by utilizing the sum S of the accumulated mismatched base quality compared by each sequencing fragment non-gap obtained in the step one:

step three, step two, the sum S of the accumulated mismatched base quality of each sequencing fragment non-gap comparison is utilized to obtain the comparison of all sequencing fragments with the monomer type sequences in h and r respectively, and each sequencing fragment is used for comparisonOptimal alignment probability set of sequence segments

Obtained by the following method: let h = { h ₁ ，h ₂ ，...，h _n′ Represents a list of haplotype sequences containing variations, and the set of probabilities for aligning the sequencing fragment to n haplotypes is p = { p ₁ ，p ₂ ，...，p _n′ Selecting p with the highest probability _max As the probability of the current sequencing fragment aligned to the haplotype containing the variation, the corresponding haplotype is h _max 。

p _j obtained by the following formula:

wherein j ∈ [1, n '] is the index of the probability that the sequencing fragment aligns to n' haplotypes;

and thirdly, calculating the likelihood probability of three possible genotypes (0/0, 0/1, 1/1) based on P and Pr, and taking the maximum likelihood function as the variant genotype in the current window, wherein the formula is as follows:

and if the genotype is 0/1 or 1/1, the optimal haplotype in h is presumed according to each probability in P, and each variant genotype contained in the optimal haplotype is set to be 0/1 or 1/1.

Step four, traversing all detected sample genomes in the population, and repeatedly executing the steps one to three to obtain all variant genotypes in the population:

traversing candidate variation sites in all windows, and aiming at a certain variation site, if all samples have no variation signal at the site or have variation signals but the genotype of the detected variation is 0/0, determining that all individuals of the current population have no variation, and outputting the variation; and if the genotype of the sample at the position is 0/1 or 1/1, sequentially traversing all the samples, outputting the genotype of each tested sample at the position, and if a certain position is a multiallelic gene, splitting the multiallelic gene into a plurality of biallelic genes and sequentially outputting the multiallelic gene.

Claims

1. A population joint variation detection method based on pan-genome is characterized in that the method comprises the following specific processes:

firstly, carrying out interval block division on a reference genome, and comparing the reference genome subjected to interval block division with a measured sample genome to obtain a comparison information set; stacking the comparison information according to the coordinate sequence of the comparison information in the reference genome to obtain a stacking result; then, counting the stacking result, and identifying the category of the candidate mutation sites on the tested sample genome by using the stacking result;

the classes of candidate variation sites on the tested sample genome include: high confidence candidate variant sites, low complex region candidate variant sites;

firstly, carrying out non-gap comparison on a haploid sequence set and comparison information set sequencing fragments to obtain the sum of the accumulative mismatched base quality of non-gap comparison of each sequencing fragment as S, and then obtaining the genotype of a mutation point by utilizing the sum of the accumulative mismatched base quality of non-gap comparison of each sequencing fragment as S;

2. The pan-genome-based population joint variation detection method of claim 1, wherein: the first step comprises the following steps:

step one, acquiring a reference genome, and performing interval block division on the reference genome to obtain a divided reference genome B = B ₁ ,B ₂ ,…,B _n ；

Wherein, B ₁ ,B ₂ ，…,B _n Respectively representing each divided interval block, wherein n is the total number of the interval blocks;

the interval block partitioning is according to "chromosome: dividing the format of the starting position-the ending position;

there is no overlap between the interval blocks;

step two, obtaining a detected sample genome, comparing the detected sample genome with divided reference genomes to obtain a detected sample genome B _i The comparison information set in the database;

step one, step three, the obtained product in step two falls on B _i All the comparison information in the reference base according to the comparison informationStacking the coordinates in the factor group in sequence to obtain a stacking result;

3. The method of claim 2, wherein the genome-wide based population-associated variation detection method comprises: the first step three comprises the following steps:

wherein, the value of the unmatched comparison information 'flag' field is 4; the value of the 'flag' field of the suboptimal comparison information is 256; the value of the 'flag' field of the split comparison information is 2048;

4. The pan-genome-based population joint variation detection method of claim 3, wherein: counting the stacking result obtained in the step one by step three in each interval block of the divided reference genome in the step one, recording the sequence information of the stacking result in each interval block through a dictionary variable, identifying candidate variation sites of the tested sample according to the stacking result sequence information recorded by the dictionary, and classifying the candidate variation sites, wherein the method comprises the following steps:

the dictionary variables include: pileup _ dit;

P _j1∈keys ＝C _j1 /T

wherein T is the total number of occurrences of all bases, keys = { A, C, G, T, I, D } is the set of bases, j1 is a base in keys, C _j1 Indicates the number of occurrences of base j 1;

step one, step two, the frequency of each base obtained in the step one, step four is utilized to obtain a candidate variation site:

wherein, T _alt Is a frequency threshold;

the candidate mutation sites are represented by 5-tuple, including: genomic position, reference base, non-reference base, frequency of occurrence of non-reference base sequence;

5. The pan-genome-based population joint variation detection method of claim 4, wherein: in the step one, the step three, the candidate mutation sites obtained in the step one, the step two are classified, and the method specifically comprises the following steps:

LCs are determined when the candidate mutation sites satisfy one of the following four characteristics:

(3) Other candidate variant loci are contained in windows which respectively extend by w-bp at the upstream and the downstream by taking the position of the current candidate variant locus on the detected sample genome as a center;

(4) The current position belongs to a non-low complexity area;

and when the candidate variable locus does not meet the judgment conditions of the HCs and the LCs, the candidate variable locus is divided into the LCCs.

6. The pan-genome-based population joint variation detection method of claim 5, wherein: in the second step, a local haplotype sequence is constructed by using the candidate mutation sites according to the types of the candidate mutation sites, and the constructed local haplotype sequence is integrated among samples to obtain an integrated haplotype sequence set, and the method comprises the following steps:

then, all candidate mutation sites are traversed in each range block of the divided reference genome, and the candidate mutation sites are allocated to the optimal window:

when the candidate mutation site is offset from the window start position

In the meantime, the current window is the optimal window of the current candidate mutation site;

secondly, constructing a haplotype sequence of the tested sample in each window according to the candidate variation site types distributed to the optimal window;

then, all windows are traversed, and a haplotype sequence set h = { h } for each window is generated in turn ₁ ,h ₂ ,…,h _n }。

7. The method of claim 6, wherein the genome-wide based population-associated variation detection method comprises: and in the second step, a haplotype sequence of the sample to be detected is constructed in each window according to the candidate variant locus type distributed to the optimal window, and the method comprises the following steps:

step two, aiming at each window, judging whether each window contains a candidate variation site or not, and if not, determining that the reference genome sequence is the haplotype sequence of the current window; if the candidate variation sites are contained, judging whether the window contains the low confidence candidate variation sites, if the window contains the low confidence candidate variation sites, executing the second step to obtain the constructed detected sample haplotype sequence, and if the window only contains the high confidence candidate variation sites, executing the second step to obtain the constructed detected sample haplotype sequence;

extracting all sequencing fragments in the window, and filtering out the sequencing fragments with the overlapping degree of less than 10% of the window;

adding all contigs generated at k-mer =41 to the sequencing fragments, setting k-mer =46, reconstructing the local deBruijn graph, and regenerating contigs;

extracting all sequencing fragments covering the current window from the comparison information set by using a comparison analysis tool, and acquiring a sequencing fragment ID supporting each candidate variation site in the comparison information set; taking intersection of the sequencing fragment IDs supporting any two candidate variation sites, if the number of the sequencing fragments contained in the intersection is not less than a number threshold, considering that the variation corresponding to the two candidate variation sites appears on the same haplotype sequence, and replacing the base at the corresponding position on the reference sequence with the variation base to construct the haplotype sequence to obtain the haplotype sequence; if two candidate variation sites are not supported by any identical sequencing fragment, they are considered to be from different haplotype sequences, and haplotype sequences are constructed based on the candidate variation and reference genomic sequence contained on the different haplotypes, respectively.

8. The pan-genome-based population joint variation detection method of claim 7, wherein: in the third step, the integrated haplotype sequence is used for detecting the genotype of the variation site of the detected sample genome, and the method comprises the following steps:

firstly, all haplotype sequences containing candidate variation sites are selected from the haplotype sequence set and are marked as H; the rest haplotype sequences do not contain candidate variation and are marked as R;

then, starting from the initial position of the comparison, respectively extending single bases to the left side and the right side, and accumulating the base quality of mismatched bases when the bases are mismatched in the extending process; stopping base extension when the sum of the qualities of the extended and accumulated mismatched bases is larger than the base quality threshold value;

finally, selecting a non-repeated haplotype sequence from R as R, and respectively carrying out base extension of non-gap to obtain the sum of the quality of the accumulated mismatched bases compared by each sequencing fragment non-gap as S;

9. The method of claim 8, wherein the genome-wide based population-associated variation detection method comprises: in the second step, the genotype of the mutation point is obtained by utilizing the sum S of the accumulated mismatched base quality compared by each sequencing fragment non-gap obtained in the first step, and the method comprises the following steps:

Obtained by the following method: let list of haplotype sequences containing mutations h = { h = { h } ₁ ,h ₂ ,…,h _n’ The set of probabilities of aligning the sequencing fragment to the haplotype is p = { p ₁ ,p ₂ ,…,p _n ' }, selecting the maximum probability p among them _max As the probability of the alignment of the currently sequenced fragment to the haplotype containing the variation, the corresponding haplotype is h _max ；

p _j obtained by the following formula:

and step III, calculating the likelihood probability of different genotypes by utilizing the P and Pr obtained in the step III, and taking the maximum likelihood function as the variant genotype in the current window, wherein the variant genotype is as follows:

10. the method of claim 9, wherein the genome-wide based population-associated variation detection method comprises: traversing all the detected sample genomes in the population in the fourth step, and repeatedly executing the first step to the third step to obtain all the variant genotypes in the population, wherein the steps are as follows: