CN112669906B

CN112669906B - Detection method, device, terminal device and computer-readable storage medium for measuring genome instability

Info

Publication number: CN112669906B
Application number: CN202011360888.1A
Authority: CN
Inventors: 焦阳; 孟培; 王春丽; 陈冬菊; 蔡宇航; 邵明辉; 宋程程; 石太平
Original assignee: Huada Biotechnology Wuhan Co ltd; Tianjin Medical Laboratory Bgi; BGI Shenzhen Co Ltd
Current assignee: Huada Biotechnology Wuhan Co ltd; Tianjin Medical Laboratory Bgi; BGI Shenzhen Co Ltd
Priority date: 2020-11-25
Filing date: 2020-11-27
Publication date: 2021-09-28
Anticipated expiration: 2040-11-27
Also published as: CN112669906A

Abstract

The invention provides a detection method for measuring genome instability, which comprises the following steps: (1) obtaining sequencing data; (2) determining a BAF value and an LRR value; (3) removing outliers; (4) at least one round of segmentation treatment; (5) combining breakpoint sets; (6) merging the second-level sorted fragments; (7) filtering and cutting the third-level sorted fragments, (8) carrying out same distribution inspection; (9) determining a BAF mean value and a copy number mean value; (10) performing plane coordinate axis clustering mapping on the BAF mean value and copy number mean value percentage ordinal of the four-level sorted fragments so as to determine the tumor purity of the tumor sample in the sample to be detected; (11) determining a ploidy correction factor of a sample to be detected; (12) carrying out BAF numerical correction and copy number correction; (13) determining the genotype A parameter value and the genotype B parameter value; (14) and determining the type of the variation.

Description

Detection method, device, terminal device and computer-readable storage medium for measuring genome instability

Technical Field

The invention belongs to the field of biotechnology, and particularly relates to a detection method, equipment, terminal equipment and a computer-readable storage medium for measuring genome instability.

Background

Most cancer cells are characterized by genomic instability, a tendency for genomic changes during cell division. Cancer is usually caused by the inactivation of multiple genes that control cell division and cancer suppressor genes. Studies have shown that genome integrity is closely monitored by a variety of monitoring mechanisms, including DNA damage checkpoints, DNA repair mechanisms, mitotic checkpoints, and the like. Defects in the regulation of either of these mechanisms often lead to genomic instability, rendering the cell susceptible to malignant transformation. For example, post-translational modifications of histone tails are closely related to the regulation of cell cycle and chromatin structure, and DNA methylation status is related to genome integrity. Genomic instability provides individuals with shorter cell cycles and/or the advantage of bypassing the intracellular and immune control systems, thereby giving cancer cells growth advantages and being selected as malignantly transformed cells. Genomic instability includes small structural variations such as increased base pair mutation frequencies, microsatellite instability (MSI), and significant structural variations such as changes in chromosome number or structure. At present, the mechanism for the unstable origin of the genome is not yet clear.

Chromosomal structural abnormalities resulting from genomic instability are a key feature of cancer cells. Single nucleotide polymorphisms or copy number variations of oncogenes or tumor suppressor genes are direct drivers of tumorigenesis and progression. Genomic instability can cause genome-wide genetic variations that confer clonal growth and genetic evolution advantages to tumor cells, ultimately leading to drug resistance and tumor recurrence. Thus, there are multiple evolutionarily conserved pathways within cells that respond to these errors by initiating DNA repair processes or initiating apoptosis. At present, several DNA repair pathways are known to be activated after DNA damage, and can be generally classified into Nucleotide Excision Repair (NER), Base Excision Repair (BER), mismatch repair (MMR), and DNA Double Strand Break Repair (DSBR).

In DNA double-strand damage, DNA double-strand breaks are repaired mainly by two pathways, homologous recombination (HRR) and non-homologous end joining (NHEJ). NHEJ does not require sequence homology at the ends of the break, repairing DSBs that are frequent and less complex throughout the cell cycle; HRR frequencies are much lower than NHEJ, mainly mediating complex or more dangerous DSB repair such as replication fork disruption. The DNA damage repair mechanism plays an important role in maintaining the stability of the genome.

Recent studies have shown that three approaches to assessing DNA-based genomic instability, loss of heterozygosity (LOH), large segment switching (LST) and trans-Telomeric Allelic Imbalance (TAI), reflect the extent of potential homologous recombination DNA repair defects in tumors.

However, the analysis of genomic instability is still in need of improvement at present.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. Thus, the present invention provides a sequencing data analysis method, a sequencing data analysis apparatus, a terminal apparatus and a computer-readable storage medium that can be used for genome instability such as copy number abnormality detection.

According to a first aspect of the invention, there is provided a method of sequencing data analysis or a detection method for measuring genomic instability, the method comprising, according to an embodiment of the invention: (1) obtaining sequencing data from a plurality of capture regions for a sample to be tested, each of the plurality of capture regions containing at least one SNP site; (2) determining, for each of the plurality of capture regions, a BAF value and an LRR value for the capture region, the BAF value characterizing a median SNP site mutation genotype frequency in the capture region, the LRR value determined by the formula Log2 (sequencing depth of test sample/sequencing depth of control sample); (3) performing outlier removal on the plurality of capture areas based on the BAF values and the LRR values of the capture areas to obtain the plurality of capture areas subjected to filtering; (4) performing at least one round of segmentation processing on at least one chromosome based on the BAF value and the LRR value of the capture region to obtain BAF primary sorted fragments and LRR primary sorted fragments, respectively; (5) aiming at the BAF primary sorting segment and the LRR primary sorting segment, performing breakpoint set combination, and determining a secondary sorting segment based on the combination of breakpoints; (6) performing iterative combination processing on the secondary sorting fragments based on a preset combination threshold value so as to obtain tertiary sorting fragments; (7) filtering and cutting the tertiary sorted fragments based on a predetermined fragment length threshold value so as to obtain a plurality of quaternary sorted fragments, wherein the length of the quaternary sorted fragments is 3-11 Mb, and optionally, the length of the quaternary sorted fragments is 4.5-5.5 Mb; (8) performing homodistribution test on BAF values of the four-level sorted fragments and BAF values of corresponding regions in a control sample based on BAF values of capture regions contained in the four-level sorted fragments, and classifying each four-level sorted fragment into balanced fragments, unbalanced fragments, homozygous fragments or non-homozygous fragments based on a predetermined test threshold and a density ratio of the four-level sorted fragments before and after the removal of BAF outliers, wherein the density ratio of the four-level sorted fragments before and after the removal of BAF outliers is smaller than a predetermined homozygous threshold, the four-level sorted fragments are homozygous fragments, and the four-level sorted fragments are non-homozygous fragments if the density ratio of the four-level sorted fragments before and after the removal of BAF outliers is not smaller than the predetermined homozygous threshold; (9) determining, for each of the plurality of four-level sorted segments, a BAF mean, a copy number mean, and an ordering of the BAF mean of the four-level sorted segment across all four-level sorted segments based on the BAF value and the LRR value of the capture region included in the four-level sorted segment; (10) performing plane coordinate axis clustering mapping on the BAF mean value and the copy number mean value percentage ordinal number of the four-level sorting fragment so as to determine the tumor purity of the sample to be detected; (11) performing cluster analysis on the copy number mean and copy percentage ordinal of the equilibrium fragment based on the tumor purity so as to determine a ploidy correction factor of the sample to be tested; (12) performing BAF numerical correction and copy number correction on each of the plurality of tertiary sorted fragments respectively based on the tumor purity of the sample to be tested and the ploidy correction factor; (13) determining a genotype A parameter value and a genotype B parameter value based on the corrected BAF value and copy number value; (14) determining a type of variation of the tertiary sorted fragments based on the A and B genotype parameter values and the length of the tertiary sorted fragments, the type of variation comprising: loss of heterozygosity, transition in long fragment copy number state, and cross-telomeric allele imbalance.

According to the embodiment of the present invention, with the above method, sequencing data obtained by sequencing a genome can be effectively analyzed to obtain corresponding genomic instability such as copy number abnormality, and the like, and the type of variation can be determined, such as the type of variation including: loss of heterozygosity, transition in long fragment copy number state, and cross-telomeric allele imbalance.

In a second aspect of the invention, the invention also provides a sequencing data analysis apparatus or a detection apparatus for measuring genomic instability, the sequencing data analysis apparatus comprising, according to an embodiment of the invention: the sequencing data acquisition module is used for acquiring sequencing data from a plurality of capture areas aiming at a sample to be detected, wherein each capture area comprises at least one SNP locus; a BAF-LRR value determination module for determining, for each of the plurality of capture regions, a BAF value and an LRR value for the capture region, the BAF value characterizing a median SNP site mutation genotype frequency in the capture region, the LRR value determined by the formula Log2 (sequencing depth of test sample/sequencing depth of control sample); an outlier removal module configured to perform outlier removal on the plurality of capture areas based on the BAF values and the LRR values of the capture areas so as to obtain the plurality of capture areas subjected to filtering processing; a segmentation processing module for performing at least one round of segmentation processing on at least one chromosome based on the BAF value and the LRR value of the capture region to obtain a BAF primary sorted fragment and an LRR primary sorted fragment, respectively; a breakpoint set merging module, configured to perform breakpoint set merging on the BAF primary sorting segment and the LRR primary sorting segment, and determine a secondary sorting segment based on a combination of breakpoints; the second-level sorting segment merging module is used for carrying out iterative merging processing on the second-level sorting segments on the basis of a preset merging threshold value so as to obtain third-level sorting segments; a fourth-level sorting fragment obtaining module, configured to filter and cut the third-level sorting fragments based on a predetermined fragment length threshold, so as to obtain a plurality of fourth-level sorting fragments, where the length of each fourth-level sorting fragment is 3-11 Mb, and optionally, the length of each fourth-level sorting fragment is 4.5-5.5 Mb; a fourth-level sorting segment classifying module, configured to perform a homodistribution test on BAF values of capture regions included in the fourth-level sorting segment and BAF values of corresponding regions in a control sample, and classify each fourth-level sorting segment into a balanced segment, an unbalanced segment, a homozygous segment, or a non-homozygous segment based on a predetermined test threshold and a density ratio before and after BAF outliers in each fourth-level sorting segment are removed, where a density ratio before and after BAF outliers in the fourth-level sorting segment are removed is smaller than a predetermined homozygous threshold, the fourth-level sorting segment is a homozygous segment, and a density ratio before and after BAF outliers in the fourth-level sorting segment are not smaller than the predetermined homozygous threshold, the fourth-level sorting segment is a non-homozygous segment; a fourth-level sorted segment parameter determination module for determining, for each of the plurality of fourth-level sorted segments, a BAF mean, a copy number mean, and an ordering of the BAF mean of the fourth-level sorted segments among all the fourth-level sorted segments, respectively, based on the BAF value and the LRR value of the capture region included in the fourth-level sorted segment; the tumor purity determination module is used for performing plane coordinate axis clustering mapping on the BAF mean value and copy number mean value percentage ordinal number of the four-level sorted fragments so as to determine the tumor purity of the sample to be detected; a ploidy correction factor determination module, configured to perform cluster analysis on the copy number mean and the copy percentage number of the equilibrium fragments based on the tumor purity, so as to determine a ploidy correction factor of the sample to be tested; a third-level sorting segment correction module, configured to perform BAF value correction and copy number correction on each of the plurality of third-level sorting segments, respectively, based on the tumor purity of the sample to be detected and the ploidy correction factor; a genotype parameter value determination module for determining a genotype parameter value A and a genotype parameter value B based on the corrected BAF value and copy number value; a variation type determination module for determining a variation type of the tertiary sorted fragments based on the A genotype parameter value and the B genotype parameter value and the length of the tertiary sorted fragments, the variation type comprising: loss of heterozygosity, transition in long fragment copy number state, and cross-telomeric allele imbalance.

According to an embodiment of the present invention, with the sequencing data analysis apparatus, the above-described sequencing data analysis method can be effectively implemented, corresponding genomic instability such as copy number abnormality and the like can be effectively obtained by analyzing sequencing data obtained by sequencing a genome, and a mutation type such as a mutation type including: loss of heterozygosity, transition in long fragment copy number state, and cross-telomeric allele imbalance.

In a third aspect of the present invention, the present invention also provides a terminal device, which includes: a memory and a processor; the memory is used for storing program codes; the processor, invoking the program code, when executed, is configured to perform the method as described above.

In a fourth aspect of the invention, the invention also proposes a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method as described above, according to an embodiment of the invention.

It is to be noted that the features and advantages described herein for the detection method for measuring genomic instability or the detection device for measuring genomic instability can be understood by those skilled in the art to be applicable to other aspects as well, and are not described herein again.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flow diagram of a method of sequencing data analysis according to one embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a sequencing data analysis apparatus according to another embodiment of the present invention;

FIG. 3 is a schematic diagram of outlier removal according to another embodiment of the present invention;

FIG. 4 is a schematic illustration of a segmentation method and a reference schematic illustration of a ploidy correction according to another embodiment of the present invention;

FIG. 5 is a segmentation result using the public software PureCN according to another embodiment of the present invention;

FIG. 6 is a comparison of the S1 sample with PureCN (left panel) and the detection method for measuring genomic instability of the present invention (also sometimes referred to herein as the "GSA algorithm") (right panel) sample chromosome 1 segment according to another embodiment of the present invention;

fig. 7 is a comparison of regional LRR distributions according to another embodiment of the present invention, in order: chromosome 1, the 0-109M region (top left), and the 0-3.7M region (top right), the 16-22M region (bottom left), and the 74-78M region (bottom right);

FIG. 8 is a comparison of the distribution of regional BAF according to another embodiment of the present invention, the left being the region of chromosomes 0-109M No. 1 and the right being the region of chromosomes 74-78M No. 1;

FIG. 9 is a comparison of the S1 sample chromosome 2 segment using PureCN (left panel) and the GSA algorithm of the present invention (right panel) according to another embodiment of the present invention;

FIG. 10 is a comparison graph of chromosome 3 segments of a sample S1 using PureCN (left panel) and GSA algorithm (right panel) according to another embodiment of the present invention;

FIG. 11 is a comparison of chromosome 3 end segments of a sample S1 using PureCN (left panel) and GSA algorithm (right panel) according to another embodiment of the present invention; (ii) a

FIGS. 12-16 are comparisons of the segmentation effect of the S1 sample using PurenCN (left panel) and GSA algorithm of the present invention (right panel) for chromosome 4-22 ends according to another embodiment of the present invention;

FIGS. 17-22 are comparisons of the segmentation effect of the S2 sample on different chromosomes using PureCN (left panel) and the GSA algorithm of the present invention (right panel) according to another embodiment of the present invention;

FIG. 23 is a comparison of calculated purity versus theoretical purity using PureCN and the GSA algorithm of the present invention, in accordance with another embodiment of the present invention;

FIG. 24 is a comparison of ploidy calculated using the PurenN and GSA algorithms of the present invention, according to another embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

In a first aspect of the invention, the invention provides a method of sequencing data analysis, according to an embodiment of the invention, with reference to figure 1, the method comprising:

s10: obtaining sequencing data

In this step, sequencing data is obtained for a test sample from a plurality of capture regions, each of the plurality of capture regions containing at least one SNP site.

According to embodiments of the present invention, the capture region may be obtained by analyzing the position of a known SNP site, thereby achieving a balance between efficiency and accuracy. According to embodiments of the present invention, a novel method for determining capture regions proposed by the inventors can be employed to partition a reference genome into several capture regions (sometimes also referred to herein as "capture regions"). Thus, the capture probes are designed for the respective capture regions in order to construct sequencing data from these capture regions.

Further, the person skilled in the art can perform informatics analysis on the off-set data (sequencing data) for the specific region by using a conventional SNP analysis method, and can obtain a sequencing result at a known SNP site for a sample to be tested. For example, sequencing data can be quality controlled and filtered before being processed using known algorithms, including but not limited to: SOAPsnp, atlas SNP2, GATK, and SAMtols calls. And will not be described in detail herein.

According to an embodiment of the present invention, the sequencing data of the capture region can be obtained by using the kit of the present invention, and the kit for measuring genomic instability comprises a probe set, wherein the probe set is determined by the following steps:

(1) dividing the reference genomic sequence into a plurality of primary regions based on at least one of a predetermined number of expected intervals and expected intervals, the primary regions containing at least one known SNP site;

(2) performing SNP filtering for each of the plurality of primary regions so as to obtain a primary high-frequency SNP, and dividing the primary region into a primary region containing the primary high-frequency SNP and a primary region not containing the primary high-frequency SNP, wherein the primary high-frequency SNP is a SNP with the highest frequency in the primary region and the frequency of the primary high-frequency SNP is not lower than a predetermined frequency;

(3) selecting a region where the distance between adjacent first-level high-frequency SNPs exceeds a predetermined threshold as a gap region, the predetermined threshold being not less than 1.5 times the expected interval;

(4) dividing the gap region into at least one secondary region based on the expected spacing;

(5) respectively carrying out secondary high-frequency SNP search aiming at each of the at least one secondary region, wherein the frequency of the secondary high-frequency SNP is not lower than the preset frequency:

(a) extending from the central point of the secondary area to two sides at least once, wherein the extending treatment extends from two sides of the central point for a preset length; and

(b) searching for the SNP with the highest frequency in the obtained region after the at least one extension for a predetermined length, and selecting the SNP as the secondary high-frequency SNP if the frequency of the SNP is not less than 10%, wherein the extension process is stopped when the secondary high-frequency SNP is obtained or the extension process is stopped when the length of the region after the extension process exceeds the predetermined threshold,

(6) determining a tertiary region based on the primary high-frequency SNP and the secondary high-frequency SNP as a starting point SNP, wherein the tertiary region is determined by extending a predetermined length on two sides of the starting point SNP; and

(7) based on the tertiary region, a probe is constructed that specifically recognizes the tertiary region, the probe being suitable for determining genomic instability.

According to the embodiment of the invention, the probe obtained by adopting the method can effectively acquire the data of the corresponding capture area, and can control the number of the capture areas and improve the analysis efficiency.

S20: determining BAF and LRR values

In this step, for each of the plurality of capture regions, a BAF value and an LRR value are determined for the capture region, the BAF value characterizing the mutation genotype frequency of the median SNP site in the capture region, the LRR value being determined by the formula Log2 (sequencing depth of test sample/sequencing depth of control sample).

Herein, BAF refers to BAF having the median SNP site mutation genotype frequency within the bed region of each capture region as the region. The LRR is determined by the following steps: the mean depth within each capture region (sometimes also referred to as the bed region) was counted and corrected for GC content and capture region, normalized to the mean sequencing depth of the sample, and LRR calculated from log2 (tumor tissue sample depth/control sample depth). The control sample depth is the mean depth of the control sample (the term "control sample" as used herein should be understood in a broad sense, and can be other tissue cells of the individual from the same source as the sequenced sample, such as blood cells, or biological samples from other individuals with known pathological conditions, such as normal human samples, and specifically, can be the sequencing data of historical blood cell samples) after correction normalization.

According to an embodiment of the invention, the BAF value may be determined for each of a plurality of the capture areas by:

(2-1) determining the minor allele frequency of each of said SNP sites in said capture region so as to obtain the minor allele frequencies of all sites within the capture region;

(2-2) determining the median value of the minor allele frequencies of all loci in each capture region as the BAF value of said capture region.

According to the embodiments of the present invention, the skilled person can obtain the frequency of the minor allele at each site in a conventional manner. Preferably, the sub-allele frequency can be determined for a given SNP by counting the number of sequencing reads (reads) supporting each SNP type, for example, for a given SNP site, there are 100 reads supporting (aligned uniquely with) that SNP site, where the number of sequencing reads supporting bases A, T, G and C is 50, 40, 5 and 5, respectively, then T is the sub-allele and then the sub-allele frequency is 40%. Further, the frequencies of the minor alleles of all the SNP sites involved in the capture region are counted, and the median of the frequencies of the minor alleles is determined, so that the BAF value of the capture region can be obtained.

According to an embodiment of the present invention, for each of the capture areas, the LRR value may be determined by:

(2-i) determining the normalized sequencing depth of the sample to be tested and the normalized sequencing depth of the control sample for the capture region; and

(2-ii) determining the LRR value of said capture area according to the following formula:

log2 (sequencing depth of test sample/sequencing depth of control sample).

S30: performing outlier removal

In this step, outlier removal is performed on the plurality of capture areas based on the BAF values and the LRR values of the capture areas to obtain the plurality of capture areas subjected to the filtering process.

According to embodiments of the present invention, it is possible to visually remove clearly deviating outliers based on the resulting mapping results by mapping, for example, the positional variation of the captured region on the chromosome with the corresponding BAF value or LRR value. The data points that are significantly deviated can be judged visually by manual operation. In addition, according to an embodiment of the present invention, automatic identification and processing of outliers may also be performed by the following steps.

Firstly, before the outlier removal, an abnormal BAF value and an abnormal LRR value are corrected in advance by using a method of near linear interpolation, and the correction includes:

determining a capture region in which the BAF value or the LRR value is abnormal (for example, a value whose displayed value magnitude is obviously inconsistent with other data or a value displayed as infinite inf according to an embodiment of the invention), extending no less than 5 capture regions before and after the capture region, and taking the average BAF value or LRR value of the regions obtained after extension as a correction value of the abnormal value. One skilled in the art will appreciate that the number of extended capture regions is sufficient such that the extended corrected value is no longer an outlier. Note that BAF and LRR are both in units of capture regions, one capture region for each value.

Referring to fig. 3, additionally, according to an embodiment of the present invention, the BAF value outliers are determined by: the value points with BAF values less than 0.05 or greater than 0.95 were selected as BAF value outliers. According to an embodiment of the present invention, the LRR value outlier is determined by: aiming at each capture area, respectively constructing an LRR value-position index binary normal distribution taking the capture area as a center; for a predetermined capture area, extending a predetermined number of adjacent capture areas forward and backward, preferably 100 adjacent capture areas, with the predetermined capture area as a center, to obtain an LRR numerical analysis area; determining the probability density of each LRR value-position index binary normal distribution at the position of the preset capturing area based on the corresponding LRR value-position index binary normal distribution of the capturing area contained in the LRR value analysis area; calculating the probability densities all at the predetermined capture area location in superposition to determine a total value of the probability densities for the predetermined capture area; determining the predetermined capture area as an LRR numerical outlier if the total probability density value is below a predetermined outlier threshold.

According to the embodiment of the invention, each point is firstly assigned with a two-dimensional normally distributed density function, the covariance of the density function is 0, and the x-direction variance refers to the total paragraph length. For each point, the probability density of the adjacent points is calculated by superposition, and for the point close to the paragraph edge, the density of the point without the symmetrical position needs to be doubled. The sum of the densities in all points below the threshold or below a certain fraction (e.g., 30%) will be removed.

Thus, the filtered capture area data can be obtained by the above processing.

S40: segmented to obtain first-stage sorted fragments

In this step, at least one round of segmentation processing is performed on at least a part of at least one chromosome based on the BAF value and the LRR value of the capture region (as understood by those skilled in the art, the capture region processed here refers to the capture region filtered through step S30) to obtain BAF-sorted fragments and LRR-sorted fragments, respectively.

With reference to figure 4 of the drawings,

in step (4), at least a portion of the complete chromosomal sequence (preferably, the complete chromosomal full-length sequence, e.g., other chromosomes other than the Y chromosome, is used according to an embodiment of the present invention) is selected as a starting analysis region, and the segmentation process is performed based on the LRR value and BAF value, respectively, based on the starting analysis region as a root byte,

wherein the segmentation process further comprises:

(4-1) and calculating the average value of the parameters of the area to be segmented and processed

The parameter is a BAF value or an LRR value;

(4-2) determining a difference between the parameter of each of the capture areas and the average value of the parameter for a plurality of the capture areas in the processing area to be segmented

Wherein i represents the number of capture regions;

(4-3) for the processing area to be segmented, starting from the first one of the capturing areas, in the extending direction of the processing area to be segmented, performing one cumulative addition of the difference values for each of the capturing areas

Wherein,

indicating the number zone for the i-th numberA field for arithmetically adding the difference values of the No. 1 to i capture areas;

(4-4) in the to-be-segmented processing region, determining the cumulatively added extreme points, selecting the extreme points, the distances between which and one of the endpoints of the to-be-segmented processing region both exceed a predetermined distance threshold value, as segmentation points, and segmenting the initial analysis region into a plurality of sub-regions by using the segmentation points;

(4-5) repeating steps (4-1) - (4-4) at least once for said plurality of sub-regions until the newly generated passage is no longer less than the predetermined length threshold, so as to obtain a plurality of passages containing at least 50 of said capture region, so as to obtain said sorted fragments.

The predetermined length threshold is sufficient to include no less than 50 of the capture areas.

According to the embodiment of the present invention, it is further possible to include:

(4-6) combining adjacent fragments having a length smaller than a predetermined length threshold (e.g., a length of 50 capture regions, such as 3Mb) or a number of SNPs smaller than a threshold (e.g., a number of SNPs included in an average of 50 capture regions), and taking the combined fragments as the sorted fragments,

the merging further comprises:

and performing cycle statistical test on the same distribution of adjacent sorted fragments based on BAF and LRR values respectively, and merging adjacent paragraphs with distribution deviation smaller than the threshold of the same distribution test.

According to an embodiment of the present invention, in this step, a paragraph tree structure is built and traversed in a recursive manner: firstly, the whole paragraph is taken as a root node, all the segmented sub-paragraph of the paragraph is taken as a child node, and the paragraph after the child node is segmented again is taken as a child node of the child node.

When the following conditions are met, recursive judgment needs to be initiated, and the method comprises the following steps: and judging whether the paragraph is a root node, if so, ending the recursive calling process, and if not, backtracking the last level child node of the node to recursively call the segmentation process. If there are no unprocessed flat child nodes, the backtracking root node recursively invokes the partitioning process.

For a paragraph to be segmented, the segmentation process is as follows: firstly, judging whether the paragraph is segmented or not, if so, initiating recursive judgment. If the paragraph is not segmented but the number of data points included is less than or equal to the threshold value, then the recursive judgment is initiated. If the node is not segmented and the number of data points is greater than the threshold, marking the segment as segmented and calculating the run length of the data deviating from the mean

And selecting a run maximum value and a run minimum value point, and calculating the difference value between the maximum value point and the minimum value.

If the maximum value point and the minimum value point (namely the breakpoint) are respectively arranged at the two ends of the paragraph, or the difference value between the point and the two ends of the paragraph is smaller than the threshold value, the 2 points are not recorded, the recursion judgment is initiated, and the next cycle is started from the middle part of the two points to generate a child node. If the distance from one point to the two ends of the paragraph is larger than the threshold value, the point is recorded, the whole sequence is divided into two parts (2 sub-nodes are generated), and the respective starting point and the end point of the two sub-nodes are recorded. If the distance between two points and the two ends of the paragraph is larger than the threshold value, three segments are recorded (3 child nodes are generated), and the respective starting point and the end point of the three child nodes are recorded. And processing the last child node according to the method.

If the length of the divided child node is smaller than the threshold value, backtracking upwards, processing the penultimate child node, otherwise segmenting again, continuously generating a new child node, processing the last child node, and processing the level node of the father node if all the child nodes are processed completely; and ending until the root node is finally backtracked and no new child node exists.

And merging the coarsely divided paragraphs. The merging includes two steps: the small paragraphs are merged firstly, and merging can be carried out simply according to the fact that the paragraph length does not meet the threshold value, or the length does not meet the threshold value and the point number does not meet the threshold value, and the threshold value needs to be adjusted by referring to the expected resolution. Subsequently, the adjacent fragments are subjected to statistical tests of the same distribution. The test indexes are as follows:

sum of differences of probability density functions (usually obtained using kernel density estimation) of anterior and posterior segment point distributions

The difference between the accumulated distribution functions (obtained by numerically integrating the nuclear density curve) of the front-stage and back-stage point distributions and the difference between the total standard deviation after the two segments are merged and the weighted average of the standard deviations of the two segments.

Merging is carried out in a circulating mode, firstly, two adjacent segments which are closest to the same distribution are merged according to a certain detection index, the indexes of the newly merged paragraph and the paragraphs before and after the paragraph are recalculated, and the process is circulated until all the indexes reach the threshold value.

S50: performing breakpoint set combination to determine secondary sorting segment

In this step, break point set merging is performed for the BAF primary sort segment and the LRR primary sort segment, and a secondary sort segment is determined based on a combination of the break points.

For example, as an example, the fragment sites of the BAF primary sorting fragments are 1-10 and 20-30 respectively, the LRR primary sorting fragments are 5-15 and 25-35 respectively, the breakpoints are 1,5,10,15,20,25,30 and 35 respectively, and the secondary sorting fragments are 1-5, 5-10, 10-15, 15-20, 20-25, 25-30 and 30-35 respectively. This example is for illustrative purposes only and does not limit the invention in any way.

S60: merging the second-stage sorted fragments to obtain third-stage sorted fragments

In this step, the secondary sorted fragments are subjected to a merging process based on a predetermined merging threshold (for example, secondary sorted fragments having a fragment length of not more than 1M or a fragment length of not more than 2M and a SNP number of less than 50 or a SNP number of less than 5) so as to obtain tertiary sorted fragments.

S70: filtering and cutting the third-level sorted segment to obtain a fourth-level sorted segment

In the step, the three-level sorting fragments are filtered and cut based on the length so as to obtain a plurality of four-level sorting fragments, and the length of each four-level sorting fragment is 3-11 Mb.

S80: sorting four-stage sorted fragments

In this step, based on BAF values of capture regions contained in the four-level sorted fragments, performing a homodistribution test on BAF values of corresponding regions in the four-level sorted fragments and a control sample, and classifying each four-level sorted fragment into a balanced fragment, an unbalanced fragment, a non-homozygous fragment, or a homozygous fragment based on a predetermined test threshold and a density ratio of BAF outliers contained in each four-level sorted fragment before and after removal, wherein the density ratio of BAF outliers contained in the four-level sorted fragments before and after removal is less than a predetermined homozygous threshold, the four-level sorted fragment is a homozygous fragment, and the four-level sorted fragment is a non-homozygous fragment if the density ratio of BAF outliers contained in the four-level sorted fragment before and after removal is not less than the predetermined homozygous threshold.

According to the embodiment of the invention, after the segmentation is completed, all segments with the length exceeding the threshold (e.g. 10MB) are selected. The segments are cut according to the threshold length, and the segments with insufficient length at two ends are discarded. Paragraph information is then calculated for all remaining paragraphs, including the BAF mean of the paragraph, the percentage ordering of the BAF mean across all of the paragraphs, the mean of the copy number (2 to the LRR power multiplied by 2), and the percentage ordering of the copy number across all of the paragraphs. The BAF distribution of the segment is then tested for co-distribution with a control reference (which, as previously described, may be a historical blood cell reference baseline or blood cell sequencing data of the test subject), and the segment is labeled as balanced if the indicator is below a threshold for balance and unbalanced if above a threshold for imbalance. A paragraph is marked as homozygous if the density ratio of BAF loci before and after outlier removal in the paragraph is below the threshold of homozygosity.

According to an embodiment of the present invention, in the step, further comprising:

(8-1) determining, for each of a plurality of said four-level sorted fragments, at least one of the following parameters, respectively:

(a) the four-level sorted segment comprises the mean value of the BAF values of the capture area and the ranking of the mean value of the BAF values among the BAF values of the capture area of all the four-level sorted segments; and

(b) the four-level sorted fragment comprises the average copy number of the capture region and a ranking of the average copy number in the capture region copy numbers of all the four-level sorted fragments;

(8-2) separately performing, for each of the four-level sorted fragments, a homodistribution test of all BAF values in the four-level sorted fragment between a test sample and a control sample, and obtaining a test indicator value of the four-level sorted fragment, classifying the four-level sorted fragment as a balanced fragment if the test indicator value is lower than a threshold of balance, and classifying the four-level sorted fragment as an unbalanced fragment if the test indicator value is higher than a threshold of unbalance; and

(8-3) for each of the four-level sorted fragments, classifying the four-level sorted fragment as a homozygous fragment if a density ratio before and after removal of BAF outliers included in the four-level sorted fragment is less than a predetermined homozygous threshold, and classifying the secondary sorted fragment as a non-homozygous fragment otherwise.

The term "threshold" is used in many places herein, unless otherwise specified, and can be determined by one skilled in the art by performing parallel assays on samples of known interrelationships, e.g., by performing parallel assays on known homozygous fragments, a homozygous threshold can be determined.

S90: determining BAF mean, copy number mean, and rank of four-stage sorted fragments

In this step, a BAF mean, a copy number mean, and an ordering of the BAF mean of the four-level sorted segments among all the four-level sorted segments are determined for each of the plurality of four-level sorted segments, respectively, based on the BAF value and the LRR value of the capture region included in the four-level sorted segment.

In the step (9), further comprising:

(9-1) performing a plane coordinate axis clustering plot in the four-level sorted fragments with the average copy number percentage ordinal number of the fragments as a first axis and the average BAF of the fragments as a second axis, so as to determine the tumor purity in a sample to be tested; and

(9-2) selecting all balanced fragments in the four-level sorted fragments, taking the copy number average value percentage ordinal number of the balanced fragments as a first axis, taking the copy number average value of the balanced fragments as a second axis, and carrying out plane coordinate axis clustering to draw a graph; and

and (9-3) determining a ploidy correction factor in the sample to be detected based on the tumor purity calculation result of the four-level sorting fragment and the plane coordinate axis cluster map obtained in the step (9-2).

S100: determination of tumor purity

In this step, a plane coordinate axis clustering plot is performed on BAF mean, copy number mean percentage ordinal of the four-level sorted fragments to determine the tumor purity of the sample to be tested.

S110: determining a ploidy correction factor

In this step, based on the tumor purity, the copy number mean and copy percentage number of the equilibrium fragments are subjected to cluster analysis to determine a ploidy correction factor for the sample to be tested.

The theoretical values (CN, B) for copy number and B allele content and the copy number and BAF experimental values (CN, BAF) satisfy the following relations:

CN^*·scale_factor＝CN·purity+2(1-purity)

(1)

wherein scale _ factor is the ploidy correction factor and purity is the tumor purity. For regions of known copy number and B allele frequency, there are:

for fragments of the same genotype, the calculated purity can be approximated as a normal distribution with the theoretical value of purity as the mean, so:

and (3) finding a point cluster corresponding to a specific genotype by clustering the BAF-CN percentage ordinal two-dimensional data, calculating the BAF-average value and substituting the BAF-average value into a formula to calculate the purity.

Knowing the purity, the copy number and copy percentage ordinal number of all segments labeled as "balanced" genotypes can be clustered, and a ploidy correction factor (sometimes referred to herein as a "ploidy correction factor," these two terms being used interchangeably) calculated. If the number of sites marked as equilibrium is insufficient, the gradient is increased by a threshold value, and the number of comparison points is changed. If no obvious change exists, the output is abnormal.

There are generally three karyotypes for conventional samples: the double type is dominant, the double type and the quadruple type are balanced, the quadruple type is dominant, and the other situations are judged to be unrecognizable. The situation that the purity is low is usually not recognized, so the program defaults to the double-type as the dominant treatment and reports the abnormality.

Genotyping regions are divided according to the density distribution of the balanced genotypes. When the requirement that the balanced genotype density has only one distribution peak in the X direction in a larger range and the genotype density in a region with smaller value of X has only one distribution peak in the y direction is met, the diploid is considered as dominant. In this case, the region with the equilibrium genotype density higher than the threshold value is designated as a diploid region, the region with the lower copy number is designated as a haplotype region, the purity is calculated as the haplotype region, and the ploidy correction factor is calculated as the diploid region.

When the equilibrium genotype density has two or more peaks, the diploid and tetraploid equilibria are designated. And determining the regional positions of the haplotype and the triploid according to the positions of the two distribution peaks, and calculating the purity by taking a point cluster with the point number meeting the requirement in the B genotype, the BB genotype or the ABB genotype as a reference. The ploidy correction factor was calculated for the region with a large number of AB genotypes or AABB genotypes.

When the requirement of balanced genotype density is met, only one distribution peak is arranged in the X direction in a larger range, but a plurality of distribution peaks are arranged in the y direction in a smaller X region, the quadruple type is considered as dominant. The program tries to divide the haplotype into the triploid region according to the position relation of the clustering point clusters and the valley positions of the density distribution, and calculates the purity by taking the point clusters with the point numbers meeting the requirements in the B genotype, the BB genotype or the ABB genotype as the reference. Ploidy correction factors were calculated as AABB genotype regions.

S120: BAF value correction and copy number correction are carried out on the three-level primary selection segment

In this step, BAF number correction and copy number correction are performed separately for each of the plurality of tertiary primary fragments based on the purity of the mutant sample and the ploidy correction factor.

S130: determination of A genotype parameter values and B genotype parameter values

In this step, the values of the A genotype parameters and the B genotype parameters are determined based on the corrected BAF values and copy number values.

S140: determining variant types of tertiary sorted fragments

In this step, based on the a and B genotype parameter values and the length of the tertiary sorted fragments, the variation types of the tertiary sorted fragments are determined, the variation types including:

in this step, heterozygosity loss, long fragment copy number state transition and cross-telomeric allele imbalance.

According to an embodiment of the present invention, for each segment determined by the paragraph segmentation procedure, purity corrected BAF, copy number can be calculated and marked whether its genotype is balanced or homozygous.

If the marker is homozygous, the genotype A is 0, and the genotype B is the corrected copy number rounded.

If the mark is balanced, whether the corrected copy number is larger than 2 is judged, if not, A is designated as 1, B is designated as 1, if yes, an even number closest to the value is searched, and A and B are designated as half of the even number.

If the two cases do not exist, the corrected copy number and the corrected B value are calculated first, and the range of B ═ max (B-10, 1) to B + min (B +1, 8), a ═ max (CN-B-8, 0) to min (CN-B +8, B) is enumerated, and the difference between BAF and corrected BAF of the paragraph is calculated in each case.

For the previous list, the option marked as clearly unbalanced is removed, but a ═ B, the option with | a + B-CN | greater than a · CN + B is removed, a and B are inferred from historical data detection, and defaults are 0.15 and 0.5.

And screening the options that the difference value of the correction value of the calculated CN and the actual CN in the section is larger than the threshold value, the difference value of the correction value of the calculated BAF and the actual BAF in the section is larger than the threshold value, or the product of the two difference values is larger than the threshold value.

Of the remaining options, the one that has the smallest difference between the BAF and the actual BAF correction value is calculated as the value of A, B

As in the screening step above, if the number of remaining options after screening is 0, a group in which the BAF difference value and the CN difference value are multiplied by the minimum before screening is selected as the value of A, B.

After the calculation of all the A, B values of the segments is completed, the segments with the same adjacent A, B values are merged, and the final detection result is output.

According to the mutation detection result, various indexes for measuring the instability of the genome can be counted. Here, the common indicators in the industry with certain clinical scientific significance, loss of heterozygosity (LOH), long fragment copy number state transition (LST), and trans-Telomere Allelic Imbalance (TAI) are exemplified. Loss of heterozygosity LOH is defined herein as: a is 0 and B! 0 and the interval length is more than 15M. According to some reports, the inventors provide the option to turn on or off the filtering for the above condition, when the entire chromosome has only one fragment and the above condition is met, without noting the loss of heterozygosity.

The long fragment copy number state transition LST is defined as A + B! 2 and the length is more than 10M. All fragments with a length of less than 3M need to be merged first before computation.

The trans-telomeric allele imbalance TAI is defined as the statistics for each chromosome over the telomere and not over a segment of the centromere, if A! B and a length greater than the 11M threshold, count once.

In addition, the sample global genomic ploidy value is a weighted average of the copy number of each fragment region of the chromosome.

In a second aspect of the present invention, the present invention also provides a sequencing data analysis apparatus which may be used to implement the above-described sequencing analysis method, according to an embodiment of the present invention, the sequencing data analysis apparatus comprising: the kit comprises a sequencing data acquisition module 10, a BAF-LRR numerical value determination module 20, an outlier removal module 30, a segmentation processing module 40, a breakpoint set merging module 50, a secondary sorting segment merging module 60, a quaternary sorting segment acquisition module 70, a quaternary sorting segment classification module 80, a quaternary sorting segment parameter determination module 90, a tumor purity determination module 100, a ploidy correction factor determination module 110, a tertiary sorting segment correction module 120, a genotype parameter value determination module 130 and a mutation type determination module 140.

Specifically, according to an embodiment of the present invention, the sequencing data analysis apparatus includes:

a sequencing data acquisition module 10 for acquiring sequencing data from a plurality of capture regions for a sample to be tested, each of the plurality of capture regions containing at least one SNP site;

a BAF-LRR value determination module 20 for determining, for each of the plurality of capture regions, a BAF value and an LRR value for the capture region, the BAF value characterizing the mutation genotype frequency of the median SNP site in the capture region, the LRR value being determined by the formula Log2 (sequencing depth of test sample/sequencing depth of control sample);

an outlier removal module 30 for outlier removal of said plurality of captured regions based on the BAF values and the LRR values of said captured regions to obtain said filtered plurality of captured regions;

a segmentation processing module 40 for performing at least one round of segmentation processing on at least one chromosome based on said BAF value and said LRR value of said capture region to obtain BAF-primary sorted fragments and LRR-primary sorted fragments, respectively;

a breakpoint set merging module 50, configured to perform breakpoint set merging on the BAF primary sorting segment and the LRR primary sorting segment, and determine a secondary sorting segment based on a combination of the breakpoints;

a secondary sorted segment merging module 60 for performing an iterative merging process on the secondary sorted segments based on a predetermined merging threshold to obtain tertiary sorted segments;

a fourth-level sorted fragment obtaining module 70, configured to filter and cut the third-level sorted fragments based on a predetermined fragment length threshold, so as to obtain a plurality of fourth-level sorted fragments, where the length of each fourth-level sorted fragment is 3-11 Mb, and optionally, the length of each fourth-level sorted fragment is 4.5-5.5 Mb;

a fourth-level sorting segment classifying module 80, configured to perform a same distribution test on BAF values of corresponding regions in a control sample based on BAF values of capture regions included in the fourth-level sorting segments, and classify each fourth-level sorting segment into a balanced segment, an unbalanced segment, a homozygous segment, or a non-homozygous segment based on a predetermined test threshold and a density ratio before and after BAF outliers are removed in each fourth-level sorting segment, where a density ratio before and after BAF outliers are removed in each fourth-level sorting segment is smaller than a predetermined homozygous threshold, so that the fourth-level sorting segment is a homozygous segment, and a density ratio before and after BAF outliers are removed in each fourth-level sorting segment is not smaller than the predetermined homozygous threshold, so that the fourth-level sorting segment is a non-homozygous segment;

a fourth-level sort segment parameter determination module 90 for determining, for each of the plurality of fourth-level sort segments, a BAF mean, a copy number mean, and an ordering of the BAF mean of the fourth-level sort segments among all the fourth-level sort segments, respectively, based on the BAF value and the LRR value of the capture region included in the fourth-level sort segment;

a tumor purity determination module 100, configured to perform plane coordinate axis clustering mapping on BAF mean and copy number mean percentage ordinal of the four-level sorted fragments so as to determine tumor purity of the sample to be tested;

a ploidy correction factor determination module 110, configured to perform cluster analysis on the copy number mean and the copy percentage number of the equilibrium fragments based on the tumor purity, so as to determine a ploidy correction factor of the sample to be tested;

a tertiary sorting segment correction module 120 for performing BAF numerical correction and copy number correction on each of the plurality of tertiary sorting segments, respectively, based on the tumor purity of the sample to be tested and the ploidy correction factor;

a genotype parameter value determination module 130 for determining a genotype A parameter value and a genotype B parameter value based on the corrected BAF value and copy number value;

a variant type determination module 140 for determining variant types for the tertiary sorted fragments based on the a and B genotype parameter values and the lengths of the tertiary sorted fragments, the variant types comprising:

loss of heterozygosity, transition in long fragment copy number state, and cross-telomeric allele imbalance.

According to an embodiment of the invention, the BAF-LRR value determining module is arranged to determine the BAF value for each of a plurality of the capture areas by:

(2-2) determining the median value of the frequencies of the minor alleles of all loci in each of said capture regions as the BAF value of said capture region,

and/or

The BAF-LRR value determining module is configured to determine the LRR value for each of the capture areas by:

log2 (sequencing depth of test sample/sequencing depth of control sample).

According to an embodiment of the invention, the outlier removal module is arranged to perform the filtering by:

according to an embodiment of the present invention, before outlier removal is performed, correction processing is performed in advance on an abnormal BAF value and an abnormal LRR value, the correction processing including:

determining a capture area with abnormal BAF value or LRR value, respectively extending no less than 5 capture areas before and after the capture area, and taking the average BAF value or LRR value of the area obtained after extension as the correction value of the abnormal value.

According to an embodiment of the invention, the BAF value outliers are determined by:

selecting a value point with a BAF value smaller than 0.05 or larger than 0.95 as a BAF value outlier to remove, and performing mirror folding by taking BAF as 0.5 as a center, namely | BAF value-0.5 | + 0.5; (ii) a

And/or

The LRR value outliers are determined by:

aiming at each capture area, respectively constructing an LRR value-position index binary normal distribution taking the capture area as a center;

for a predetermined capture area, extending a predetermined number of adjacent capture areas forward and backward, preferably 100 adjacent capture areas, with the predetermined capture area as a center, to obtain an LRR numerical analysis area;

determining the probability density of each LRR value-position index binary normal distribution at the position of the preset capturing area based on the corresponding LRR value-position index binary normal distribution of the capturing area contained in the LRR value analysis area;

calculating all the probability densities in a superposition manner so as to determine a probability density total value of the preset capture area;

determining the predetermined capture area as an LRR numerical outlier if the total probability density value is below a predetermined outlier threshold.

According to an embodiment of the invention, the segmentation processing module is arranged to be adapted to:

selecting a complete chromosomal sequence as a starting analysis region and performing the segmentation process based on the LRR value and the BAF value, respectively, based on the starting analysis region as a root byte,

wherein the segmentation process further comprises:

The parameter is a BAF value or an LRR value;

Wherein i represents the number of capture regions;

Wherein,

means for arithmetically summing the difference values of the No. 1 to No. i capture areas for the No. i number area;

(4-5) repeating steps (4-1) - (4-4) at least once for said plurality of sub-regions until the newly generated paragraph no longer exceeds the predetermined length threshold, so as to obtain a plurality of paragraphs containing at least 50 of said capture region, so as to obtain said sorted fragments,

wherein the predetermined length threshold is sufficient to include no less than 50 of the capture areas.

According to an embodiment of the present invention, after the step (4-5), further comprising:

(4-6) combining adjacent paragraphs having a length less than a predetermined length threshold or a number of SNPs less than a threshold, and taking the combined fragments as the sorted fragments,

the merging further comprises:

adjacent sorted fragments are subjected to a round-robin statistical test of the same distribution based on BAF and LRR values, respectively, and at least one round of merging is performed on adjacent segments having a distribution deviation less than a threshold of the same distribution test.

According to an embodiment of the invention, the fourth order sorted segment classification module is adapted to:

(a) the four-level sorted segment comprises the mean value of the BAF values of the capture area and a ranking of the mean value of the BAF values among the BAF values of the capture area of all of the four-level sorted segments; and

(8-2) performing, for each of the four-level sorted fragments, a homodistribution test between all BAF value row test samples and control samples in the four-level sorted fragment, and obtaining a test indicator value of the four-level sorted fragment, classifying the four-level sorted fragment as a balanced fragment if the test indicator value is lower than a balance threshold, and classifying the four-level sorted fragment as an unbalanced fragment if the test indicator value is higher than an unbalance threshold; and

(8-3) for each of the quaternary sorted fragments, classifying the quaternary sorted fragment as a homozygous fragment if a density ratio before and after removal of BAF outliers included in the quaternary sorted fragment is less than a predetermined homozygous threshold.

According to an embodiment of the invention, the four-stage sorted-fragment-parameter determination module is adapted to:

By utilizing the sequencing data analysis equipment, the sequencing data analysis method can be effectively implemented, so that sequencing data obtained by sequencing a genome can be effectively analyzed, and corresponding genomic instability such as copy number abnormality and the like can be obtained.

Design examples of Probe

Prospective Probe Density assessment

Considering that the chip must contain a sufficient number of regions, the number of data points per unit length of genome is maintained to ensure the accuracy of mutation detection. Higher probe density means higher resolution, and decreasing the density entails a loss of resolution, but also brings a cost advantage. Therefore, in this embodiment, the density of the probe is evaluated based on the WGS actually measured sample, specifically as follows:

WGS was divided into about 113 ten thousand intervals in 2500bp units. Extracting the regions to the target number by a uniform interval extraction algorithm, and specifically operating as follows: the starting point (defaults to the first point), the total number of regions (113 ten thousand) and the expected number of regions are input, and the segmentation interval stepSize is the total number of regions/expected number of regions.

The number of regions is expected by setting different gradients, such as: 60W, 50W, 40W, 30W, 20W, 10W and 5W, and then evaluating the actual resolving power of each chromosome region based on the SNP variation sites detected by the extracted different gradient regions, wherein through the actual measurement analysis of multiple cases of WES samples, the chromosome genotype resolving power is considered to be greatly reduced when the expected region is less than 10 ten thousand. Thus, we obtained a desired probe density of 10W.

Candidate SNP site screening

The SNP site selection is mainly based on a frequency database of a population, and in the example, a thousand-person database and a dbSNP database are adopted as references. The screening condition is mainly the crowd frequency, and the east Asia crowd frequency is used by default to improve the success rate of site selection. The screening threshold can be set to multiple gears, and defaults are high frequency: not less than 10%, medium-high frequency: not less than 5 percent. And only SNP sites on autosomes and X chromosomes are screened.

A candidate SNP file can be preliminarily obtained based on the crowd frequency, and the candidate SNP file mainly comprises: chromosome, chromosome coordinates, population frequency, frequency scale.

Secondary screening of candidate SNP sites based on expected probe density

Based on the uniformly distributed regions (10W) on the genome obtained in the first step, the expected intervals can be calculated. Then filtering and screening high-frequency SNP sites in the region (candidate high-frequency crowd mutation sites in the database obtained in the second step), and for the region in which the high-frequency sites are not found, expanding the search range (increasing a plurality of bp on the left and right, defaulting to 1000) and searching again. Subsequently, the region without the high-frequency SNP site is deleted, and the region with the high-frequency SNP site is extended by a certain region (default left extension 60, right extension 59) from the left to the right, with the site with the highest population frequency as the center.

Filling gap area

And (3) calculating the distance between adjacent SNP sites based on the SNP sites screened secondarily, defining the SNP sites as a gap region if the distance between the SNP sites and the adjacent SNP sites is more than 1.5 times of the expected interval, gradually expanding the search range (expanding 500BP every time by default) from the central position of the gap region to search for high-frequency SNP sites, stopping searching if the expansion reaches the threshold value of the expected interval, and keeping the region empty. For all non-empty regions, the selected SNP sites in the region are extended by a certain region (default left extension 60, right extension 59).

Submitting a probe design area

Combining all SNP sites in the 3 rd and 4 th steps, extending a certain region (default left extension 60 and right extension 59) from left to right to serve as a final candidate probe region, and submitting a probe supplier to perform probe synthesis.

Example 1 copy number detection and purity correction Module in the method of the invention (GSA method)

To reflect the optimization of the inventors for large-fragment genotype testing, the inventors selected public software with better industry evaluation but no targeted optimization for comparison. PureCN is a relatively perfect implementation of the ABSOLUTE algorithm, which is the mainstream method for performing purity correction according to copy number data at present. PureCN software updates are active and the inventors currently use the stable version 1.8.1 for evaluation. In addition to the copy number calculation itself, the superiority and inferiority of the method of the present invention and PureCN purity calculation is also an indicator in this comparison.

Here the inventors will show the calculated scatter plot of BAF and LRR for each capture area. The paragraph division given by the software is embodied by three background colors of red, green and blue in sequence, and the colors are only used for distinguishing adjacent paragraphs and have no special meaning. This embodiment uses two pairs of paired samples, and the case where the samples are mixed proportionally with the pairs. The inventors will compare several aspects from segmentation, copy number detection and purity correction results.

The segmentation tendency is as follows:

before comparing chromosome segments, the inventors first defined the following: the PureCN is not specially optimized for the detection of long-segment CNV at the chromosome level, so that short CNV and long CNV are not distinguished, and the PureCN can be divided for local few-point data fluctuation, so that the result is segmented and finely crushed, and the whole trend of a chromosome region cannot be well reflected. The following figure is an example, the left figure shows all segments of the chromosome, and the right figure hides the segments with the length less than 3M. In actual comparison, the inventors only compared paragraphs where PurenN is greater than 3M in length. The results are shown in fig. 5, where the left image is the whole segment, and the right image is the segment after 3M is removed, the red, green and blue colors of the background have no practical significance and are only used for easy differentiation and counting.

Referring to FIGS. 6-16 and 17-22, the inventors first compared the difference in copy number detection between the two methods, using sample S1 as an example. Each chromosome shows two segmentation maps, left and right, with PurenN results on the left and GSA results on the right, unless otherwise specified. First the inventors observed chromosome one and seen that there was some data fluctuation in copy number of the 0-50M region, possibly true copy number variation, or subclones, and possibly test bias from NGS data. Both software were able to remove the effects of these outliers to some extent, but there were some differences in the concrete performance. First, the copy number distributions of the 0-3.7M region, the 16-22M region, and the 74-78M region are different from the periphery (as shown in FIG. 3), and it is easy to observe that the BAF distributions of the 74-78M region are also different from the periphery (as shown in FIG. 4), the segmentation procedure of PureCN does not distinguish these segments, and the GSA method can distinguish them. From the subsequent tables, it can be seen that PureCN gives 3 copies of the conclusion for the whole region from 0 to 109M, while observing its BAF, with a distribution centered around 0.5, indicating that its AB gene ratio is close to 1: 1, it is obviously impossible to have 3 copies, which should be the result of it not dividing the whole fragment and erroneously averaging it. While GSA gives results that most of the regions 0-120M are AABB type, 4 copies, 0-3.7M region gives 10 copies, and 16-22M region gives 6 copies. BAF in the 74-78M region does not support an AB genotype ratio of 1: 1, so 5 copies are concluded. The conclusions that the GSA approach gives can be considered more reasonable.

TABLE 1 comparison of the positions and copy numbers of the individual segments of chromosome 1 in PureCN and GSA methods

Similarly, it is easy to see from the scatter diagram that the copy number of the 164-178M region is different from that of the periphery, and the ratio AB of the periphery region is 1: 1, where the copy number is lower, it is obviously unreasonable that the GSA calculates the copy number of the 150-. Taken together, GSA can correctly partition chromosomal regions and give more reasonable copy number calculations.

When chromosome 2 is analyzed next, the biggest segmentation problem of PureCN is that the 76M position is not divided as seen from a scatter diagram, the algorithm of the invention can make correct segmentation, and PureCN is disconnected at 21M, 90-92M and 147M, and no difference exists before and after data segmentation, and the inventor considers that the disconnection is not more reasonable. It should be noted that the GSA method is to perform special treatment on the gap region of chromosome centromere, and to perform merging treatment if the distribution of the gap regions is the same before and after the gap. However, although PureCN segmented chromosomes, four segments gave a conclusion of copy number 3, the inventors judged the first segment as 3 copies and the subsequent larger segments were all designated as 4 copies. Based on the reason that BAF distribution is close to 0.5 and copy number distribution as a whole is larger, it can be considered that the copy number calculation of the GSA method is more reasonable.

TABLE 2 comparison of the positions and copy numbers of the individual segments of chromosome 2 in PureCN and GSA methods

Next, analyzing the case of chromosome 3, PureCN considers 0-100MB as a whole segment, but actually it can be seen that in some regions of the chromosome, such as 22-25MB and 31-35MB, 64-67MB, the distribution of copy number is obviously different from that in other regions, which is not local fluctuation, but the whole segment has the behavior of consistency trend. The GSA algorithm can correctly identify these segments and give results at higher copy numbers (ABBB genotype, 4 copies), while PureCN calculates the entire segment as 3 copies. In addition, PureCN did not correctly process GAP near centromere, and GSA method did. Of course, there are some misjudgments in the second half of the GSA, such as the failure to correctly distinguish 175M and the break points in the following regions. The following shows the enlargement of the BAF scatter diagram in this area, and in fact, from the data, the transition trend between these sections is more moderate, and it is not so easy to find a clear breakpoint. But in practice the segmentation threshold may still be adjusted to obtain a finer partitioning. The default parameters adopted at present are actually the result of making a certain balance for the conventional data, and if the data is abnormal, some adjustment can be actually made according to the actual situation.

TABLE 3 comparison of the positions and copy numbers of the individual segments of chromosome 3 in PureCN and GSA methods

The remaining paragraphs generally follow similar trends. Overall, the GSA algorithm gives more reliable results than PureCN. Only the comparison results of each chromosome are listed here and will not be discussed in detail.

TABLE 4 comparison of the positions and copy numbers of the various segments of chromosome 4-22 by PureCN and GSA methods

The sample S2 segments performed similarly, with the case of each chromosome segment being shown as prueCN (with segments less than 3M in length removed) on the left and GSA on the right.

The performance of the GSA program in segmentation is more obvious than that of PureCN. Overall, GSA segmentation is more likely to reflect the global trend at the chromosome level, and data fluctuations (possibly due to differences in local copy number, and possibly also due to artifacts due to sequencing quality) are eliminated by outlier removal procedures. PureCN is fragmented more sporadically, such as the long fragment of chromosome 1, 107-243MB (across the centromere), although there is some data fluctuation, it is observed that the whole fragment shows uniform genotype characteristics, but the fragment is fragmented under the fragment of PureCN. Meanwhile, some breakpoints are not correctly identified in the PureCN result, such as about 76M position of chromosome 1, and a plurality of positions of chromosome 3 obviously have genotype conversion, and the GSA program correctly identifies the breakpoint, but PureCN has no way to identify the breakpoint. Similarly, PureCN has many breaks in the integrity of chromosome 2, but a genotype in the 48-54M region that is significantly different from the preceding and following paragraphs is not correctly identified. While the segmentation reliability of GSA software is significantly better than PureCN. For chromosome gap treatment, the genotype before and after chromosome 4 centromere gap can be observed to be the same, but it is not merged by PureCN. The genotypes of the chromosome 5 before and after gap are different, PureCN does not perform correct segmentation, and GSA software can correctly process the situation. Other chromosomes behave similarly and are not repeated.

TABLE 5 PureCN and GSA methods S2 comparison of position and copy number of each chromosome paragraph

2.2.2 purity calculation:

here, the gradient of purity is achieved by mixing a high purity sample into a control sample. For diluted samples, correct calibration can achieve consistent results. It should be noted, however, that for the NGS method, blending is not a simple proportional division, but rather a change in dup rate needs to be considered.

Assuming that the total reads number in a sample is N, it can be divided into two parts, the effective reads number Ne1 and the repeat reads number Nd 1. The former is the number of kinds of non-repeated reads in N, and the latter is the difference between N and the former, which indicates that there is at least one repeated read number. The total dup rate r1 is Nd/N. Actually, the Nd1 may also obtain a unique set Ne2, and accordingly, there is a corresponding dup set Nd2, and the dup rate r2 of the second order is Nd2/Nd 1. An independent set Ne3 may exist in Nd2, corresponding to a repeated set Nd3 and a third-order dup rate r 3. And so on. Thus, a finite number of stages (Nei, Ndi, ri) can be constructed. Theoretically, some reads are easier to generate dup than other reads, so the higher the order, ri will be larger, and it can be seen that in the case of a low first-order dup rate, the proportion of Nei in N will decrease rapidly, and the effective order will not be too many.

When the inventors extracted reads of Nc from N, it is necessary to consider its composition. Nc comes in part from Ne1, which is not repeated, but in part from Nd1, which is repeated with the entire set of Ne1, but in part may be repeated with an unselected part (Ne1-Ne1c), among which non-repeated set Ne2c is not actually repeated with Ne1 c. Similarly, Ne3c does not repeat with Ne1c and Ne2c, and the dup rate rc should be equal to (Nc-Ne1c-Ne2c-Ne3 c-)/Nc. However, it is not difficult to find that the higher order neuc fraction is actually lower, and the higher order terms are practically negligible without any particular rigorous discussion.

To simplify the calculation, the inventors here neglect terms above third order, and when extracting Nc from the total reads number N, the effective reads number can be roughly calculated using the following formula:

the actual purity can be calculated as follows:

the inventors can calculate the theoretical purity after mixing based on the pathological purity of the sample itself and the calculated mixture ratio. The inventors can then compare the software calculated purity with the theoretical value. The PureCN algorithm tries to calculate the goodness of fit between theoretical data and measured data under different purities and ploidies, and results higher than a threshold value are reserved. The problem is that the result of the best fit is not necessarily the most reasonable, and the algorithm does not provide judgment per se, and needs manual comprehensive judgment according to various conditions. The GSA algorithm tries to find the optimal purity ploidy through clustering, and the optimal result can be directly found.

TABLE 6 calculation of pureCN and GSA algorithms for purity and ploidy of mixed samples

From the results in table 6, it can be seen that, in the case of only half of the eight samples of the two samples and the three respective mixed samples, the first choice calculated by PureCN is closer to the theoretical value, but not the first choice, and it is likely that the later choices are closer, which is more difficult to perform in practical operations. The GSA algorithm gives a choice that is closer to the theoretical frequency overall. Of course, the purity actually calculated may not be exactly the same as the pathological purity, and a more accurate representation should be of the corresponding cell content of the master clone. However, it can be seen from the results of fig. 23 that the purity calculated by the method of the present inventors and the theoretical purity obtained by dilution maintain a good linear relationship, which indicates that the calculation can accurately reflect the change due to dilution, and the results of PureCN do not have such self-consistency.

For the case of ploidy, as shown in fig. 24, the fluctuation of the ploidy of each sample given by the GSA algorithm is small because the ploidy is an essential property of the sample and should not change with the change of the number of mixed normal samples, and thus the smaller the fluctuation, the more the correction algorithm can function correctly. It should be noted that the lower the purity, the more difficult the calibration of the sample. It can be seen that for the S1 sample, the ploidy calculations for the latter two mixed samples PureCN are biased greatly, while the third GSA, although biased, still performs better than PureCN. S1 is generally a sample with high ploidy, with average ploidy close to 3, while PureCN in the third sample mixture has calculated close to 2, indicating that the correction failed. For S2, since the sample average ploidy is close to 2, the correction difficulty is much smaller, and both algorithms can obtain results with smaller fluctuation. But in general the fluctuations given by GSA are still smaller. Taken together, the inventors' GSA algorithm is apparently due to PureCN in terms of purity calculation and correction.

Concrete embodiment 2-genome instability and homologous group Gene repair Gene mutation detection comprehensive evaluation kit (1) sample selection

3 commercial cell line samples are randomly selected, and the method and the kit designed by the invention are utilized to carry out embodiment verification.

TABLE 7

Sample name
	HCC1395
HCC1954
	MDA-MB-436

(2) Sample fragmentation

The kit is compatible with 100-400 ng DNA samples, and the volume of the kit is filled to 75 mu l by using ultrapure water (NF water) without nuclease. The liquid was transferred into the interrupt tube and placed into a nucleic acid interrupt instrument (Bioruptor pico). Pre-cooling to 4 ℃ by using an interrupt instrument, wherein interrupt parameters are as follows:

TABLE 8

Size of target fragment (bp)	Interrupting parameters	Number of cycles
			200	Break 30 s/stop 30s	16

(3) Tip repair

1) The terminal buffer was removed beforehand, thawed at room temperature, vortexed and centrifuged briefly. The enzymatic reagent is removed prior to use;

2) adding the components according to the following table, uniformly blowing and sucking, and placing on an ice box;

TABLE 9

3) Setting a terminal repair program on the PCR instrument, and placing a terminal repair reaction system in the PCR instrument for reaction;

watch 10

4) After the end repair procedure, 1.5 times of magnetic beads were purified (magnetic plate purification, either manual or automated). Mu.l of magnetic bead A was added and finally dissolved back in a 0.2ml tube with 28.1. mu.l of DNA lysis solution.

(4) Terminal end with 'A'

1) Remove plus ` A ` buffer, thaw at room temperature, vortex, and centrifuge briefly. Taking out the enzyme reagent before use;

TABLE 11

Adding the component of' A	Volume of each reaction (μ l)
		End-modified purified redissolved DNA	28
Adding 'A' buffer	5
		Adding 'A' enzyme	2
Total volume	35

3) Setting a program of adding 'A' on a PCR instrument, and placing a reaction system of adding 'A' in the PCR instrument;

TABLE 12

Add 'A' program (Hot lid 50 ℃ C.)
	37℃，30min
4℃，Hold

4) After the procedure of adding 'A' is finished, impurities are not purified, and the next reaction is directly carried out;

(5) joint connection

1) The adapter junction buffer was removed, the adapter thawed at room temperature, vortexed and briefly centrifuged. Taking out the enzyme reagent before use;

2) adding the components according to the following table, uniformly mixing by blowing and sucking, and placing on an ice box;

watch 13

The sample and linker number correspondence is as follows:

TABLE 14

Sample name	Joint name
		HCC1395	Joint
1
	HCC1954	Joint	2
MDA-MB-436				Joint	3

The linker sequence is as follows:

watch 15

3) Setting a joint connection program on the PCR instrument, and placing a joint connection reaction system in the PCR instrument;

TABLE 16

Joint connection procedure (Hot lid 50 ℃ C.)
	23℃，20min
4℃，Hold

4) After the linker ligation procedure is completed, magnetic bead a is purified (plate pure, either by manual or automated purification). Add 30. mu.l nuclease-free water and 50. mu.l magnetic bead A for purification, finally re-dissolve with 41. mu.l DNA lysis solution and transfer to a new 0.2ml PCR tube.

(6) Library amplification

1) Taking out the label primer, unfreezing the PCR reaction solution at 4 ℃, whirling, centrifuging for a short time, and placing in an ice box;

2) preparing a reaction system according to the following table, and temporarily storing the reaction system in an ice box;

TABLE 17

Library amplification Components	Volume of each reaction (μ l)
		Purification of DNA in the previous step	40
PCR reaction solution	50
		Label primer x	10
Total volume	100

3) Setting a program on a PCR instrument, placing the reaction system in the previous step into the PCR instrument, and starting reaction;

watch 18

The sample and adaptor primer correspondence is as follows:

watch 19

Sample name	Joint name	Primer name
			HCC1395	Joint
1	Primer 1
		HCC1954	Joint	2	Primer 2
MDA-MB-436	Joint					3	Primer 3

The primer sequences are as follows:

watch 20

4) After the PCR procedure is completed, the magnetic bead A is purified (plate-pure, which can be purified manually or automatically). Add 100 u l magnetic bead A purification, finally use 32 u l nuclease free water back dissolved in 0.2ml PCR tube for preservation.

(7) Library mixes (Pooling)

Library mixing was performed based on the Qubit results as shown in the following table:

TABLE 21

Sample name	Input amount of library (ng)
		AN3CA	500
HCC38	500
		MDA-MB-361	500

(8) Hybrid library preparation

1) And taking out the needed amount of the magnetic beads B in advance, and balancing at room temperature for at least 30 min.

2) Add 20. mu.L of hybridization reagent 1 to a new 1.5mL centrifuge tube and add the hybridization substrate prepared in the previous step.

3) Add 130. mu.L of magnetic bead B, vortex for 10s and mix well.

4) Standing at room temperature for 10min, centrifuging instantaneously, and standing on a magnetic frame for 1min until the liquid is clear.

5) Carefully aspirate and discard the supernatant, add 190 μ L of 80% ethanol, and let stand for 30s until the liquid is clear.

6) The supernatant was carefully aspirated and discarded. Keeping the centrifuge tube on the magnetic frame, opening the tube cover, standing at room temperature for 5min, and air drying.

7) 2 mul of hybridization reagent 2 and 11.4 mul of nuclease-free water are respectively added into each tube, and after being taken off from the magnetic frame, the mixture is fully vortexed and mixed.

(9) Chip hybridization

1) Taking out the hybridization reagent 3 and the hybridization reagent 4 in advance, and thawing at normal temperature; the HRD probe was removed from the freezer at-20 deg.C and thawed on ice. Hybridization mix was configured as follows;

TABLE 22

Note: the above components can be mixed together in advance and stored, and named as "hybridization solution".

2) mu.L of each prepared hybridization reaction solution was added to 13.4. mu.L of the hybridization substrate (with magnetic beads) in the previous step. Mixing, standing at room temperature for 2min, and standing on magnetic frame until the liquid is clear.

3) The hybridization probe was removed and thawed on ice. And taking out a new PCR tube corresponding to the number of the hybridization reactions, marking the new PCR tube as the number of the current hybridization library, and respectively adding 4 mu L of probes.

4) Transfer 56.4. mu.L of supernatant from step (2) to step (3) PCR tube containing 4. mu.L HRD probe. Vortex for 10s and mix well. Note the check mark.

5) After transient centrifugation, the mixture was placed on a PCR and the hybridization reaction program of the following table was run.

TABLE 23

Temperature of	Time
		95℃	5min
55℃	16～20h
		55℃	hold

Note that: the thermal lid of the PCR instrument is set to 105 ℃.

(10) Preparation before elution

1) Hybridization eluent preparation

The thermostated metal bath was adjusted to 55 ℃ in advance and 100. mu.l 1. multidot. Wash Buffer I and 400. mu.l 1. multidot. Stringent Wash Buffer were prepared for each hybridization reaction and preheated.

Watch 24

2) Preparation of magnetic beads for hybridization

a) A1 Xmagnetic bead C wash was prepared as follows:

TABLE 25 magnetic bead C Wash preparation

b) And taking out the magnetic bead C, and balancing at room temperature for 30 min. Vortex for 15s before use and mix well.

c) A new 1.5mL centrifuge tube was added with 50. mu.L of magnetic bead C per tube. The solution was placed on a magnetic stand for 1min until the solution was clear, and the supernatant was carefully aspirated and discarded.

d) Add 100. mu.L of 1 Xmagnetic bead C wash to each tube. Taking down the magnetic frame, and mixing completely. The tube was returned to the magnetic rack and allowed to stand for at least 1min until the liquid was clear, and the supernatant was carefully aspirated and discarded. This procedure was repeated once for a total of 2 washes.

e) Add 50. mu.L of 1 Xmagnetic bead C wash to each tube. The magnetic frame is removed, and the mixture is vortexed for 10s to mix thoroughly. Transfer to a new 1.5mL centrifuge tube.

f) Placing the centrifuge tube on a magnetic frame, standing for at least 1min until the liquid is clear, carefully sucking up and discarding the supernatant.

g) Note that: the next step should be immediately entered. Avoid the magnetic beads from being dried thoroughly too long.

3) Hybrid library with magnetic bead C binding

And after hybridization, continuously keeping the hybridization mixed solution at 55 ℃, after the magnetic bead C is cleaned, quickly transferring the hybridization reaction mixed solution in the second step to the magnetic bead C prepared in the last step, covering a tube cover, whirling for 10 seconds, fully mixing uniformly, and placing on a PCR instrument for incubation at 55 ℃ for 15min (hot cover 105 ℃).

(11) Elution Process

a) After 15min incubation, the hybridization mixture and 1X eluent 1 were removed from the PCR instrument.

b) To each centrifuge tube containing 60.4. mu.L of hybridization mixture-beads C, 100. mu.L of 1 Xeluent 1 preheated to 55 ℃ was added. Vortex for 10s and mix well. Place on magnetic rack until the liquid is clear, carefully aspirate and discard the supernatant.

c) 200 μ L of 1 Xeluent S pre-heated to 55 ℃ was added to each tube. After removal from the magnetic stand, vortexed for 10s and mixed well, and placed back in the PCR instrument to incubate at 55 ℃ for 5min (hot lid 105 ℃). After taking out, the solution was placed on a magnetic stand until the solution was clear, and the supernatant was carefully aspirated and discarded. This step was repeated once.

d) Add 200. mu.L of 1 Xeluent 1 (RT) to each tube, vortex for 10s and mix well. Standing at room temperature for 1min, centrifuging instantly, and standing on a magnetic frame until the liquid is clear. The supernatant was carefully aspirated and discarded.

e) Add 200. mu.L of 1 Xelution 2 to each tube, vortex for 10s and mix well. Transfer to a new 1.5mL centrifuge tube after transient centrifugation. Standing at room temperature for 1min, and placing on a magnetic frame until the liquid is clear. The supernatant was carefully aspirated and discarded.

f) Add 200. mu.L of 1 Xelution 3 to each tube, vortex for 10s and mix well. Standing at room temperature for 1min, centrifuging instantly, and standing on a magnetic frame until the liquid is clear. The supernatant was carefully aspirated and discarded.

g) mu.L of PCR-grade water/nuclease-free water was added to each tube, and the mixture was thoroughly pipetted and mixed. After transient centrifugation, the liquid was transferred in its entirety to a new PCR tube (containing magnetic beads).

(12) Capture library PCR reaction and purification

1) The post-capture PCR reaction mixture was prepared and reacted according to the following ratios:

watch 26

2) Magnetic bead purification of PCR products

Preparation work: balancing the magnetic bead B at room temperature for 30min in advance, and fully shaking and uniformly mixing the vortex 15s for later use;

purify with 1.0 Xmagnetic beads (50ul) and finally dissolve in 32. mu.l ddH 2O.

The primer sequences are as follows:

watch 27

Name (R)	Sequence 5-3	5' modification
			Flowcell primers F(10uM)	GAACGACATGGCTACGA	Phosphorylation of
Flowcell primers R(10uM)	TGTGAGCCAAGGAGTTG

(13) Sequencing and data analysis

And (3) carrying out sequencing on the library constructed in the previous step by a Huada MGISEQ-2000 sequencer after the library is qualified through electrophoresis detection. And (3) taking the constructed single-chain circular DNA library to perform DNA nanosphere preparation and MGISEQ-2000 on-machine sequencing. The sequencing process is carried out on the computer strictly according to the standard operation flow of MGISEQ-2000.

And (3) filtering, comparing, removing the weight and controlling the off-line data, completing the variation analysis of the genome snv, and obtaining a homologous recombination defect score (HRDScore) by measuring and calculating the heterozygosity (LOH), the Telomere Allelic Imbalance (TAI) and the large-scale state transition (LST) to evaluate the instability of the genome.

Watch 28

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. An assay for measuring genomic instability, comprising:

(1) obtaining sequencing data from a plurality of capture regions for a sample to be tested, each of the plurality of capture regions containing at least one SNP site;

(2) determining, for each of the plurality of capture regions, a BAF value and an LRR value for the capture region, the BAF value characterizing a median SNP site mutation genotype frequency in the capture region, the LRR value determined by the formula Log2 (sequencing depth of test sample/sequencing depth of control sample);

(3) performing outlier removal on the plurality of capture areas based on the BAF values and the LRR values of the capture areas to obtain the plurality of capture areas subjected to filtering;

(4) performing at least one round of segmentation processing on at least one chromosome based on the BAF value and the LRR value of the capture region to obtain BAF primary sorted fragments and LRR primary sorted fragments, respectively;

(5) aiming at the BAF primary sorting segment and the LRR primary sorting segment, performing breakpoint set combination, and determining a secondary sorting segment based on the combination of breakpoints;

(6) performing iterative combination processing on the secondary sorting fragments based on a preset combination threshold value so as to obtain tertiary sorting fragments;

(7) filtering and cutting the tertiary sorted fragments based on a preset fragment length threshold value so as to obtain a plurality of quaternary sorted fragments, wherein the length of each quaternary sorted fragment is 3-11 Mb;

(8) performing homodistribution test on BAF values of the four-level sorted fragments and BAF values of corresponding regions in a control sample based on BAF values of capture regions contained in the four-level sorted fragments, and classifying each four-level sorted fragment into balanced fragments, unbalanced fragments, homozygous fragments or non-homozygous fragments based on a predetermined test threshold and a density ratio of the four-level sorted fragments before and after the removal of BAF outliers, wherein the density ratio of the four-level sorted fragments before and after the removal of BAF outliers is smaller than a predetermined homozygous threshold, the four-level sorted fragments are homozygous fragments, and the four-level sorted fragments are non-homozygous fragments if the density ratio of the four-level sorted fragments before and after the removal of BAF outliers is not smaller than the predetermined homozygous threshold;

(9) determining, for each of the plurality of four-level sorted segments, a BAF mean, a copy number mean, and an ordering of the BAF mean of the four-level sorted segment across all four-level sorted segments based on the BAF value and the LRR value of the capture region included in the four-level sorted segment;

(10) performing plane coordinate axis clustering mapping on the BAF mean value and the copy number mean value percentage ordinal number of the four-level sorting fragment so as to determine the tumor purity of the sample to be detected;

(11) performing cluster analysis on the copy number mean and copy percentage ordinal of the equilibrium fragment based on the tumor purity so as to determine a ploidy correction factor of the sample to be tested;

(12) performing BAF numerical correction and copy number correction on each of the three-stage sorting fragments respectively based on the tumor purity of the sample to be detected and the ploidy correction factor;

(13) determining a genotype A parameter value and a genotype B parameter value based on the corrected BAF value and copy number value;

(14) determining a type of variation of the tertiary sorted fragments based on the A and B genotype parameter values and the length of the tertiary sorted fragments, the type of variation comprising:

2. The method according to claim 1, wherein in step (7), the four-stage sorted fragments have a length of 4.5 to 5.5 Mb.

3. The method of claim 1, wherein in step (2), further comprising: determining the BAF value for each of a plurality of the capture regions by:

(2-1) determining a sub-allele frequency of each of said SNP sites in said capture region;

(2-2) determining the median value of the frequency of the minor alleles of all loci in each of said capture regions as the BAF value of said capture region.

4. The method of claim 1, wherein in step (2), further comprising:

for each of the capture areas, determining the LRR value by:

log2 (sequencing depth of test sample/sequencing depth of control sample).

5. The method of claim 4, wherein prior to performing the outlier removal, performing a correction process on the abnormal BAF value and the abnormal LRR value, the correction process comprising:

determining a capture area with abnormal BAF values or abnormal LRR values, respectively extending not less than 5 capture areas before and after the capture area until the number of non-abnormal values in the extended area is not less than 10, and taking the average BAF value or the LRR value of the area obtained after extension as a correction value of the abnormal values.

6. The method of claim 4, wherein the BAF value outliers are determined by:

selecting a value point with a BAF value smaller than 0.05 or larger than 0.95 as a BAF value outlier, removing, and adopting a value of | BAF-0.5 | +0.5 as a derivative BAF value of the rest BAF values;

and

the LRR value outliers are determined by:

for a predetermined capture area, extending a predetermined number of adjacent capture areas forward and backward with the predetermined capture area as a center to obtain an LRR numerical analysis area;

7. The method of claim 6, wherein 100 said adjacent capture areas are extended forward and backward for a predetermined capture area, centered on said predetermined capture area, to obtain an LRR numerical analysis area.

8. The method of claim 1, wherein in step (4), a complete chromosome sequence of a non-Y chromosome is selected as a start analysis region, and the segmentation process is performed based on LRR values and BAF values, respectively, based on the start analysis region as a root byte,

wherein the segmentation process further comprises:

(4-1) calculating the parameter average value of the area to be segmented

The parameter is a BAF value or an LRR value;

Wherein i denotes the number of the capture area, X_iA parameter indicating the capture area of number i;

Wherein,

(4-4) determining the cumulatively added extreme points in the to-be-segmented processing region, selecting the extreme points, the distances of which from one of the endpoints of the to-be-segmented processing region both exceed a predetermined distance threshold value, as segmentation points, and segmenting the initial analysis region into a plurality of sub-regions by using the segmentation points;

(4-5) repeating steps (4-1) - (4-4) at least once for said plurality of sub-regions to obtain a plurality of paragraphs containing at least 50 of said capture regions to obtain said sorted fragments.

9. The method of claim 8, wherein the predetermined length threshold is sufficient to include no less than 50 of the capture regions.

10. The method of claim 9, wherein the capture region is obtained using a kit for measuring genomic instability, the kit comprising a collection of probes determined by:

(7) constructing a probe specifically recognizing the tertiary region based on the tertiary region, the probe being adapted to determine genomic instability, wherein the tertiary region constitutes the capture region, the probe being used to acquire sequencing data from the capture region.

11. The method of claim 8, further comprising:

the merging further comprises:

adjacent sorted segments are subjected to a same distribution cycle statistical test based on the BAF value and the LRR value, respectively, and at least one adjacent segment having a distribution deviation less than a same distribution test threshold is merged.

12. The method of claim 1, wherein in step (8), further comprising:

(a) the four-level sorted segment comprises the BAF value average of the capture area and the BAF value average of the capture area is sorted by the BAF value of the capture area in all the four-level sorted segments; and

(8-3) for each of the quaternary sorted fragments, classifying the quaternary sorted fragment as a homozygous fragment if a density ratio before and after removal of BAF outliers included in the quaternary sorted fragment is less than a predetermined homozygous threshold, and classifying the quaternary sorted fragment as a non-homozygous fragment otherwise.

13. The method of claim 12, wherein in step (9), further comprising:

(9-3) determining a ploidy correction factor in the sample to be tested based on the tumor purity calculation result of the four-level sorting fragment and the plane coordinate axis cluster map obtained in the step (9-2).

14. A test device for measuring genomic instability, comprising:

the sequencing data acquisition module is used for acquiring sequencing data from a plurality of capture areas aiming at a sample to be detected, wherein each capture area comprises at least one SNP locus;

a BAF-LRR value determination module for determining, for each of the plurality of capture regions, a BAF value and an LRR value for the capture region, the BAF value characterizing a median SNP site mutation genotype frequency in the capture region, the LRR value determined by the formula Log2 (sequencing depth of test sample/sequencing depth of control sample);

an outlier removal module configured to perform outlier removal on the plurality of capture areas based on the BAF values and the LRR values of the capture areas so as to obtain the plurality of capture areas subjected to filtering processing;

a segmentation processing module for performing at least one round of segmentation processing on at least one chromosome based on the BAF value and the LRR value of the capture region to obtain a BAF primary sorted fragment and an LRR primary sorted fragment, respectively;

a breakpoint set merging module, configured to perform breakpoint set merging on the BAF primary sorting segment and the LRR primary sorting segment, and determine a secondary sorting segment based on a combination of breakpoints;

the second-level sorting segment merging module is used for carrying out iterative merging processing on the second-level sorting segments on the basis of a preset merging threshold value so as to obtain third-level sorting segments;

the four-level sorting fragment obtaining module is used for filtering and cutting the three-level sorting fragments based on a preset fragment length threshold so as to obtain a plurality of four-level sorting fragments, and the length of each four-level sorting fragment is 3-11 Mb;

a fourth-level sorting segment classifying module, configured to perform a homodistribution test on BAF values of capture regions included in the fourth-level sorting segment and BAF values of corresponding regions in a control sample, and classify each fourth-level sorting segment into a balanced segment, an unbalanced segment, a homozygous segment, or a non-homozygous segment based on a predetermined test threshold and a density ratio before and after BAF outliers in each fourth-level sorting segment are removed, where a density ratio before and after BAF outliers in the fourth-level sorting segment are removed is smaller than a predetermined homozygous threshold, the fourth-level sorting segment is a homozygous segment, and a density ratio before and after BAF outliers in the fourth-level sorting segment are not smaller than the predetermined homozygous threshold, the fourth-level sorting segment is a non-homozygous segment;

a fourth-level sorted segment parameter determination module for determining, for each of the plurality of fourth-level sorted segments, a BAF mean, a copy number mean, and an ordering of the BAF mean of the fourth-level sorted segments among all the fourth-level sorted segments, respectively, based on the BAF value and the LRR value of the capture region included in the fourth-level sorted segment;

the tumor purity determination module is used for performing plane coordinate axis clustering mapping on the BAF mean value and copy number mean value percentage ordinal number of the four-level sorted fragments so as to determine the tumor purity of the sample to be detected;

a ploidy correction factor determination module, configured to perform cluster analysis on the copy number mean and the copy percentage number of the equilibrium fragments based on the tumor purity, so as to determine a ploidy correction factor of the sample to be tested;

a third-level sorting segment correction module, configured to perform BAF value correction and copy number correction on each of the plurality of third-level sorting segments, respectively, based on the tumor purity of the sample to be detected and the ploidy correction factor;

a genotype parameter value determination module for determining a genotype parameter value A and a genotype parameter value B based on the corrected BAF value and copy number value;

a variation type determination module for determining a variation type of the tertiary sorted fragments based on the A genotype parameter value and the B genotype parameter value and the length of the tertiary sorted fragments, the variation type comprising:

15. The apparatus of claim 14, wherein the four-stage sorted fragments are 4.5-5.5 Mb in length.

16. The apparatus of claim 14, wherein the BAF-LRR value determining module is configured to determine the BAF value for each of a plurality of the capture areas by:

17. The apparatus of claim 14, wherein the BAF-LRR value determining module is configured to determine the LRR value for each of the capture areas by:

log2 (sequencing depth of test sample/sequencing depth of control sample).

18. The apparatus of claim 17, wherein prior to performing the outlier removal, performing a correction process on the abnormal BAF value and the abnormal LRR value, the correction process comprising:

19. The apparatus of claim 17, wherein the BAF value outliers are determined by:

and

the LRR value outliers are determined by:

20. The apparatus of claim 19, wherein for a predetermined capture area, 100 said adjacent capture areas are extended back and forth centered on said predetermined capture area to obtain an LRR numerical analysis area.

21. The apparatus of claim 14, wherein the segmentation processing module is arranged and adapted to:

selecting a complete chromosomal sequence of a non-Y chromosome as a start analysis region, and performing the segmentation process based on the LRR value and the BAF value, respectively, based on the start analysis region as a root byte,

wherein the segmentation process further comprises:

(4-1) calculating the parameter average value of the area to be segmented

The parameter is a BAF value or an LRR value;

Wherein,

22. The apparatus of claim 21, further comprising, after step (4-5):

the merging further comprises:

23. The apparatus of claim 14, wherein the fourth stage sorted segment classification module is adapted to:

24. The apparatus of claim 23, wherein the four-stage sort segment parameter determination module is adapted to:

25. A terminal device, comprising: a memory and a processor; the memory is used for storing program codes; the processor, calling the program code, when executed, is configured to perform the method of any of claims 1 to 13.

26. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any of claims 1-13.