CN117746982A - Method for detecting dominant clones based on exogenous DNA insertion mutation
- Publication number
- CN117746982A (application CN202311733844.2A)
- Authority
- CN
- China
- Prior art keywords
- dominant
- sequence
- clustering
- similarity
- scores
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a method for detecting dominant clones caused by exogenous DNA insertion mutation, comprising: step 1, preprocessing viral integration site (IS) information data by positive/negative strand correction and redundancy merging; step 2, after preprocessing is completed, identifying IS caused by nonspecific amplification, so as to remove false-positive IS detection results caused by nonspecific PCR amplification; step 3, merging IS at adjacent positions by clustering them with a site clustering algorithm; step 4, analyzing the cloning plane according to the clustering result and characterizing the dominant clones. The invention uses a PCR primer similarity ranking algorithm to identify and remove nonspecific amplification that may be caused by similarity to the PCR primers, which markedly reduces false positives in IS identification in practical applications and improves accuracy.
Description
Technical Field
The invention relates to the analysis of second-generation sequencing data in the field of bioinformatics, and in particular to a method for detecting dominant clones caused by exogenous DNA insertion mutation.
Background
Gene therapy is a therapeutic approach that uses molecular biology techniques to introduce exogenous DNA into the genome of genetically defective cells in order to restore normal cell function. Safety, however, has been one of the major issues hampering the development of such therapies. An integrating vector is a type of vector DNA sequence commonly used in gene therapy to carry an exogenous DNA fragment and integrate it into the host genome by insertion. Among these, lentiviral vectors (LVs), adeno-associated virus (AAV) vectors and the like are ideal tool vectors for gene therapy because of their high gene transfer efficiency or their ability to sustain stable expression in target cells.
When integrated into the host genome, an integrating vector may cause instability and rearrangement of the host cell genome and may disrupt gene expression in the host cell, allowing the cell to evade host immune recognition and survive long term, which can eventually lead to cancer. In preclinical studies and in some clinical trials using viral vectors, oncogenic events attributable to vector integration into the genome have been observed. In an AAV gene therapy trial in dogs, researchers found that some of the therapeutic gene fragments carried by AAV had integrated near growth-controlling genes on the dog chromosomes, and that some liver cells of these dogs divided faster than others and formed cell subpopulations with the potential to induce cancer. Monitoring the risk of cancer induced by vector insertion after a gene therapy product reaches the market is therefore an important part of gene therapy. However, during sustained cell therapy, the onset of such potential neoplasia tends to be significantly delayed and is difficult to predict. There is currently no good method for quantitatively evaluating the tumorigenicity of an insertion mutation. Most existing IS detection methods identify IS by PCR amplification of the ends of the inserted fragment; in practical applications, when the target fragment is present at a relatively low level, nonspecific amplification occurs and produces false-positive IS identifications.
Disclosure of Invention
The invention aims to solve the following problems in the prior art: during sustained cell therapy, the onset of potential tumorigenesis is often significantly delayed and difficult to predict, and there is currently no good method for quantitatively evaluating the tumorigenicity of insertion mutations; and, in practical applications, when the target fragment is present at a relatively low level, nonspecific amplification occurs and produces false-positive IS identifications.
In order to solve the technical problems, the invention provides the following technical scheme:
The invention relates to a method for detecting dominant clones caused by exogenous DNA insertion mutation, which comprises the following steps:
step 1, preprocessing the viral IS information data by positive/negative strand correction and redundancy merging;
step 2, after preprocessing is completed, identifying IS caused by nonspecific amplification, so as to remove false-positive IS detection results caused by nonspecific PCR amplification;
step 3, merging IS at adjacent positions by clustering them with a site clustering algorithm;
step 4, analyzing the cloning plane according to the clustering result and characterizing the dominant clones.
As a preferred technical scheme of the invention, the viral IS information data in step 1 is a collection of viral IS, wherein each IS indicates an integration site on the host genome.
As a preferred technical scheme of the invention, the positive/negative strand correction of the viral IS information data in step 1 is performed as follows: if the gene in which the IS is located is on the negative strand and the IS read is on the negative strand, the IS is converted to the positive strand.
As a preferred technical scheme of the invention, the redundancy merging of the viral IS information data is performed as follows: IS whose sites and strand information are completely identical are merged into one record; the site and chromosome information corresponding to each IS is then encoded and reduced to a single number, which serves as the basis for the site clustering algorithm.
As a preferred technical scheme of the invention, the method for identifying IS caused by nonspecific amplification in step 2 comprises the following three steps:
step A, extracting all PCR primer sequences and computing their reverse complements to obtain a primer sequence library S1; extracting a number of base pairs upstream and downstream of each IS, according to its position, to form a library S2 of sequences to be matched; randomly selecting a sufficient number of sites on the target genome to form a background sequence library S3;
step B, matching and scoring: each sequence in S2 and S3 is matched against each primer sequence in the primer sequence library and against its reverse complement; for each primer, the higher of the two matching scores (primer and reverse complement) is taken as the similarity score between the sequence and that primer; after matching is completed, each sequence in S2 and S3 has a similarity score vector whose length equals the number of primers; for each primer, the corresponding similarity scores of all sequences in S2 and S3 are sorted and the ranks are converted to percentiles, for example, scores ranked in the top 1% are all converted to 1, and so on; after the conversion, each sequence has a percentile score vector that represents the rank of its similarity to each primer among all sequences;
step C, judging nonspecific amplification: in the absence of nonspecific amplification, the sequences in S2 and S3 are both effectively random, and the corresponding similarity ranking scores are, in theory, uniformly distributed; in the presence of nonspecific amplification, the ranking distribution of the scores in S2 is biased toward the top ranks; the ranking vector of each IS in S2 is used to obtain a probability for that IS, computed by multiplying together the entries of its percentile score vector; if the probability is smaller than a threshold T, the IS is marked as nonspecifically amplified; IS marked as nonspecifically amplified are removed from subsequent analysis.
As a preferred technical scheme of the invention, in step 3, IS at adjacent positions are merged by a site clustering algorithm as follows: the IS are grouped by chromosome; for the IS on each chromosome, an IS distance matrix is constructed from the distances between the genomic positions of the IS, and the distance matrix is clustered with a hierarchical clustering algorithm; after hierarchical clustering is completed, IS clusters are extracted from the clustering result according to a distance threshold, and each such cluster is called a UIS; the representative site of a UIS is the site with the highest read support among all sites forming the cluster.
As a preferred technical scheme of the invention, in step 4, the cloning plane is analyzed according to the clustering result and the dominant clones are characterized as follows: using the proportion of read support of the top-10 ranked IS of the sample, the dominant clonality of the sample is calculated and evaluated along two dimensions, namely the diversity of the cell populations carrying independent integration sites and the uniformity of the proportions of those populations, that is, clonal diversity and clonal uniformity.
The beneficial effects of the invention are as follows:
The dominant clone detection method based on exogenous DNA insertion mutation uses a PCR primer similarity ranking algorithm to identify and remove nonspecific amplification that may be caused by similarity to the PCR primers, which markedly reduces false positives in IS identification in practical applications and improves accuracy. The invention detects dominant clones by evaluating the IS diversity and IS uniformity of a sample, so that tumorigenesis caused by insertion mutation is evaluated more reliably. By calculating the dominant clone plane, the method quantitatively evaluates the disordered cell replication caused by IS insertion, predicts the tumorigenicity of the insertion mutation, and increases the practical value of IS detection at the clinical level.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.
In the drawings:
FIG. 1 is a flow chart showing a method for detecting a dominant clone based on an insertion mutation of an exogenous DNA according to the present invention;
FIG. 2 is a schematic diagram showing selection of dominant clones based on insertion mutation of exogenous DNA in example 2 of the present invention;
FIG. 3 is a graph showing the results of the diversity and uniformity calculation of 9 samples based on the method for detecting dominant clones by insertion mutation of exogenous DNA in example 2 of the present invention;
FIG. 4 is a schematic diagram showing the results of prediction of cloning plane of 9 samples based on the method for detecting dominant clones caused by insertion mutation of exogenous DNA in example 2 of the present invention;
FIG. 5 is a diagram showing the comparison of the similarity of primer sequences and random locus ranking distribution in the method for detecting dominant clones based on insertion mutation of exogenous DNA in example 2 of the present invention;
FIG. 6 is a graph showing the IS sequence distribution of a sample after filtering, based on the method for detecting dominant clones by insertion mutation of exogenous DNA in example 2 of the present invention;
FIG. 7 is a schematic diagram showing the results of cloning plane calculation based on the method for detecting dominant clones by insertion mutation of exogenous DNA in example 2 of the present invention;
FIG. 8 is a schematic diagram showing the proportion of the top-10 IS in each of the 9 samples, based on the method for detecting dominant clones by insertion mutation of exogenous DNA in example 2 of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Example: as shown in FIG. 1, the method for detecting dominant clones based on insertion mutation of exogenous DNA of the present invention comprises the following steps:
step 1, preprocessing the viral IS information data by positive/negative strand correction and redundancy merging;
step 2, after preprocessing is completed, identifying IS caused by nonspecific amplification, so as to remove false-positive IS detection results caused by nonspecific PCR amplification;
step 3, merging IS at adjacent positions by clustering them with a site clustering algorithm;
step 4, analyzing the cloning plane according to the clustering result and characterizing the dominant clones.
The viral IS information data in step 1 is a collection of viral IS, wherein each IS identifies an integration site on the host genome.
The positive/negative strand correction of the viral IS information data in step 1 is performed as follows: if the gene in which the IS is located is on the negative strand and the IS read is on the negative strand, the IS is converted to the positive strand.
The redundancy merging of the viral IS information data is performed as follows: IS whose sites and strand information are completely identical are merged into one record; the site and chromosome information corresponding to each IS is then encoded and reduced to a single number, which serves as the basis for the site clustering algorithm.
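The preprocessing in step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not specify how chromosome and site are reduced to one number, so the `CHROMS` order, the `OFFSET` constant, and the choice to sum read counts when merging identical records are all hypothetical.

```python
from collections import defaultdict

# Hypothetical chromosome order and offset (assumptions, not from the text).
CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
OFFSET = 10**9  # larger than any human chromosome length


def correct_strand(is_strand, gene_strand):
    """Rule from the text: a negative-strand IS read inside a
    negative-strand gene is converted to the positive strand."""
    if gene_strand == "-" and is_strand == "-":
        return "+"
    return is_strand


def encode_site(chrom, pos):
    """Reduce (chromosome, position) to a single integer key."""
    return CHROMS.index(chrom) * OFFSET + pos


def merge_redundant(records):
    """records: iterable of (chrom, pos, is_strand, gene_strand, reads).

    Returns {(site_code, strand): reads}. Records whose site and strand
    are identical are merged; summing their reads is an assumption.
    """
    merged = defaultdict(int)
    for chrom, pos, is_strand, gene_strand, reads in records:
        strand = correct_strand(is_strand, gene_strand)
        merged[(encode_site(chrom, pos), strand)] += reads
    return dict(merged)


records = [
    ("chr1", 100, "-", "-", 5),  # corrected to "+", then merged with the next record
    ("chr1", 100, "+", "+", 3),
    ("chr2", 50, "-", "+", 2),   # left unchanged: the gene is on the positive strand
]
merged = merge_redundant(records)
```

Because the encoded key preserves position order within a chromosome, it can be sorted directly before the site clustering step.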
The method for identifying IS caused by nonspecific amplification in step 2 comprises the following three steps:
step A, extracting all PCR primer sequences and computing their reverse complements to obtain a primer sequence library S1; extracting a number of base pairs upstream and downstream of each IS, according to its position, to form a library S2 of sequences to be matched; randomly selecting a sufficient number of sites on the target genome to form a background sequence library S3;
step B, matching and scoring: each sequence in S2 and S3 is matched against each primer sequence in the primer sequence library and against its reverse complement; for each primer, the higher of the two matching scores (primer and reverse complement) is taken as the similarity score between the sequence and that primer; after matching is completed, each sequence in S2 and S3 has a similarity score vector whose length equals the number of primers; for each primer, the corresponding similarity scores of all sequences in S2 and S3 are sorted and the ranks are converted to percentiles, for example, scores ranked in the top 1% are all converted to 1, and so on; after the conversion, each sequence has a percentile score vector that represents the rank of its similarity to each primer among all sequences;
step C, judging nonspecific amplification: in the absence of nonspecific amplification, the sequences in S2 and S3 are both effectively random, and the corresponding similarity ranking scores are, in theory, uniformly distributed; in the presence of nonspecific amplification, the ranking distribution of the scores in S2 is biased toward the top ranks; the ranking vector of each IS in S2 is used to obtain a probability for that IS, computed by multiplying together the entries of its percentile score vector; if the probability is smaller than a threshold T, the IS is marked as nonspecifically amplified; IS marked as nonspecifically amplified are removed from subsequent analysis.
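Steps A to C can be sketched as below. The text does not name the matching algorithm, so `best_ungapped_match` (the largest number of identical bases over all ungapped placements of the primer) is a stand-in assumption, as are the tie handling in `percentile_ranks` and the division of each percentile by 100 so that the product behaves like a probability.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def best_ungapped_match(seq, primer):
    """Stand-in similarity (assumption): maximum count of identical
    bases over every ungapped placement of primer inside seq."""
    best = 0
    for start in range(len(seq) - len(primer) + 1):
        window = seq[start:start + len(primer)]
        best = max(best, sum(a == b for a, b in zip(window, primer)))
    return best


def similarity(seq, primer):
    """Step B: keep the better of primer and reverse-complement scores."""
    return max(best_ungapped_match(seq, primer),
               best_ungapped_match(seq, reverse_complement(primer)))


def percentile_ranks(scores):
    """Convert scores to percentile bins (1 = top 1%); ties share the best bin."""
    n = len(scores)
    first_index = {}
    for i, s in enumerate(sorted(scores, reverse=True)):
        first_index.setdefault(s, i)
    return [first_index[s] * 100 // n + 1 for s in scores]


def flag_nonspecific(s2, s3, primers, threshold):
    """Step C: rank S2 and S3 together per primer, multiply each S2
    sequence's percentiles (scaled to 0..1), and flag products below T."""
    pooled = list(s2) + list(s3)
    columns = [percentile_ranks([similarity(seq, p) for seq in pooled])
               for p in primers]
    flags = []
    for i in range(len(s2)):
        product = 1.0
        for column in columns:
            product *= column[i] / 100
        flags.append(product < threshold)
    return flags


primers = ["ACGTACGT"]                           # toy primer, not from the source
s2 = ["TTTTACGTACGTTTTT", "GGGGGGGGGGGGGGGG"]    # first flank contains the primer
s3 = ["C" * 16, "A" * 16, "T" * 16, "G" * 16] * 2
flags = flag_nonspecific(s2, s3, primers, threshold=0.05)
```

In this toy input the first flanking sequence contains the primer exactly, lands in the top percentile, and is the only one flagged.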
In step 3, IS at adjacent positions are merged by a site clustering algorithm as follows: the IS are grouped by chromosome; for the IS on each chromosome, an IS distance matrix is constructed from the distances between the genomic positions of the IS, and the distance matrix is clustered with a hierarchical clustering algorithm; after hierarchical clustering is completed, IS clusters are extracted from the clustering result according to a distance threshold, and each such cluster is called a UIS; the representative site of a UIS is the site with the highest read support among all sites forming the cluster.
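A minimal sketch of this clustering step follows. The text does not state the linkage criterion; the sketch assumes single linkage, in which case cutting the hierarchy at a distance threshold is equivalent, for positions on a line, to splitting the sorted positions wherever the gap exceeds the threshold, so no explicit distance matrix is needed. The `max_gap` parameter and the summed `reads` field are illustrative assumptions.

```python
from collections import defaultdict


def summarise(chrom, cluster):
    """Pick the representative site: the member with the most supporting reads."""
    rep_pos, _ = max(cluster, key=lambda site: site[1])
    return {
        "chrom": chrom,
        "pos": rep_pos,
        "reads": sum(reads for _, reads in cluster),  # summing is an assumption
        "size": len(cluster),
    }


def cluster_sites(sites, max_gap):
    """sites: list of (chrom, pos, reads). Returns one dict per UIS.

    Assumes single-linkage clustering cut at max_gap, implemented as a
    gap split over sorted positions within each chromosome.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, reads in sites:
        by_chrom[chrom].append((pos, reads))

    uis = []
    for chrom in sorted(by_chrom):
        members = sorted(by_chrom[chrom])
        cluster = [members[0]]
        for site in members[1:]:
            if site[0] - cluster[-1][0] <= max_gap:
                cluster.append(site)
            else:
                uis.append(summarise(chrom, cluster))
                cluster = [site]
        uis.append(summarise(chrom, cluster))
    return uis


sites = [("chr1", 100, 3), ("chr1", 103, 10), ("chr1", 500, 1), ("chr2", 100, 4)]
uis = cluster_sites(sites, max_gap=5)
```

With a different linkage (complete or average), the gap-split shortcut no longer applies and a full hierarchical clustering routine would be needed.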
In step 4, the cloning plane is analyzed according to the clustering result and the dominant clones are characterized as follows: using the proportion of read support of the top-10 ranked IS of the sample, the dominant clonality of the sample is calculated and evaluated along two dimensions, namely the diversity of the cell populations carrying independent integration sites and the uniformity of the proportions of those populations, that is, clonal diversity and clonal uniformity.
The diversity index is calculated as the Simpson index (Simpson's Diversity Index):
$$D = 1 - \sum_{i} \left(\frac{n_i}{N}\right)^2$$
where D is the Simpson index, n_i is the read support number of the i-th detected UIS, and N is the total read support number over all detected UIS, so that n_i/N is the proportion of reads belonging to the i-th UIS.
The uniformity index is calculated as the Shannon index:
$$H = -\sum_{i=1}^{S} p_i \log_2 p_i$$
where H is the Shannon index, p_i is the proportion of the i-th IS among all IS, log_2 denotes the base-2 logarithm, and S is the total number of IS.
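The two indices can be computed from per-UIS read support as sketched below. The function names are illustrative, and the sketch assumes that the proportion of each IS is its share of the total read support.

```python
import math


def simpson_diversity(read_counts):
    """D = 1 - sum((n_i / N)^2), with N taken as the total read support."""
    total = sum(read_counts)
    return 1.0 - sum((n / total) ** 2 for n in read_counts)


def shannon_index(read_counts):
    """H = -sum(p_i * log2(p_i)), skipping sites with zero reads."""
    total = sum(read_counts)
    return -sum((n / total) * math.log2(n / total)
                for n in read_counts if n > 0)


even = [1, 1, 1, 1]      # four equally supported sites
skewed = [97, 1, 1, 1]   # one site holds almost all reads
d_even, h_even = simpson_diversity(even), shannon_index(even)
d_skewed, h_skewed = simpson_diversity(skewed), shannon_index(skewed)
```

A sample dominated by one site scores low on both indices, while an evenly distributed sample scores high on both.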
On a coordinate plane, diversity and uniformity are taken as the x-axis and y-axis metrics, respectively, so that each sample can be projected as a point. The method uses more than 210 samples from a local database to build a machine learning discrimination model based on a support vector machine: the diversity index and uniformity index of each sample are the independent variables and clonality is the dependent variable. Training the support vector machine finds a separating hyperplane in the sample space that divides samples with dominant clones from samples without dominant clones (as shown in FIG. 4). Once obtained, this hyperplane, referred to as the cloning plane, is used to evaluate the dominant clonality of new samples. The decision rule is as follows: if the sample lies below the cloning plane, it carries a dominant clone; otherwise it does not. If the cloning plane indicates that no dominant clone has arisen in the cells, the sample is polyclonal; if it indicates that a dominant clone has arisen, the sample is oligoclonal.
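Applying a trained separating line to a new sample can be sketched as follows. The coefficients `w` and `b` here are placeholder values chosen for illustration only; the trained values are not given in the text, and the sketch assumes a linear kernel so that the boundary is a straight line in the diversity/uniformity plane.

```python
def decision_value(diversity, uniformity, w, b):
    """Signed value of the linear boundary w[0]*x + w[1]*y + b."""
    return w[0] * diversity + w[1] * uniformity + b


def classify(diversity, uniformity, w, b):
    """Negative side ("below" the plane when w[1] > 0) means a dominant
    clone is present, so the sample is called oligoclonal."""
    if decision_value(diversity, uniformity, w, b) < 0:
        return "oligoclonal"
    return "polyclonal"


# Placeholder coefficients (hypothetical, not the patent's trained model).
W, B = (1.0, 1.0), -1.5

low_sample = classify(0.1, 0.5, W, B)    # low diversity and uniformity
high_sample = classify(0.9, 3.0, W, B)   # high diversity and uniformity
```

In practice the coefficients would come from fitting a support vector machine to labelled samples, as the text describes.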
The invention uses a PCR primer similarity ranking algorithm to identify and remove nonspecific amplification that may be caused by similarity to the PCR primers, which markedly reduces false positives in IS identification in practical applications and improves accuracy. The invention detects dominant clones by evaluating the IS diversity and IS uniformity of a sample, so that tumorigenesis caused by insertion mutation is evaluated more reliably. By calculating the dominant clone plane, the method quantitatively evaluates the disordered cell replication caused by IS insertion, predicts the tumorigenicity of the insertion mutation, and increases the practical value of IS detection at the clinical level.
Example 2: as shown in FIGS. 2 to 8, 4 samples with dominant clones and 5 samples without dominant clones were selected, and dominant clone detection was performed on the 9 samples using the present method (the top-10 IS proportions of the 9 samples show that samples No. 2, 4, 5 and 6 have obvious dominant clones). Testing on the samples of this example shows how false-positive IS generated by nonspecific amplification affect the true result; after processing by the method, false-positive IS are markedly reduced, which demonstrates that the method effectively removes false-positive IS caused by nonspecific amplification. First, the method extracts all PCR primer sequences used in the experiment and computes their reverse complements to form the primer sequence library S1. Then the method extracts the 100 bp flanking sequences upstream and downstream of each IS in 2 samples to form the library S2 of sequences to be matched, so that S2 contains IS flanking sequences from both samples. Meanwhile, 1000 sites are randomly selected on the target genome to form the random-site background sequence library S3.
After the three sequence libraries S1, S2 and S3 are obtained, each sequence in the sample sequence library (S2) and the random-site sequence library (S3) is matched against the 3 primer sequences in S1 and their respective reverse complements. With a and b denoting the similarity scores of a sequence against a primer and against its reverse complement, M = max(a, b) is taken as the similarity score of that sequence to the primer, and the scores M1, M2, M3 for the three primers form the final score vector, as shown in Table 2. After scoring is completed, every sequence in S2 and S3 has a similarity score vector. For each primer, the corresponding scores of the sequences in S2 and S3 are sorted and the ranks are converted to percentiles: for example, scores in the top 1% are converted to 1, scores in the top 1%-2% are converted to 2, and so on. After the conversion, each sequence in S2 and S3 has a rank vector of length 3 that gives the rank, among all sequences, of its best similarity score to each of the 3 primers.
From the density distribution it can be seen that, compared with random sequences, the ranking distribution of the 2 example samples is clearly biased toward the top ranks, indicating that the sequences near the IS in the 2 samples are more similar to the PCR primers. FIG. 5 compares the primer sequence similarity ranking distribution with that of random sites for the two samples (sample 1 and sample 2) in which nonspecific amplification produced significant false positives.
One obvious feature of nonspecific amplification is that the same IS occurs repeatedly in different samples, whereas an IS randomly integrated into the human genome is unlikely to occur in both samples. We found a number of IS that were detected in both samples. Comparing the scores of these IS with those of the other IS, we found that the score density distribution of the IS occurring in both samples is even more strongly biased toward the top ranks. As can be seen in FIG. 6, similarity to the primers is highest in sample 1 and sample 2, and after filtering by the method the similarity rank distribution of the IS to the primers is closer to the distribution of random sites. This shows that the method is effective in reducing false positives in IS identification and improving accuracy, and it also indirectly supports the correlation between the score and nonspecific amplification.
FIG. 7 shows the results of the cloning plane calculation, where the x-axis represents sample diversity and the y-axis represents sample uniformity. It can be seen that samples with dominant cloning lie below the cloning plane, while samples without dominant cloning lie above it, indicating that the cloning plane defined by the method can accurately distinguish samples carrying dominant clones.
Finally, the method multiplies the ranking scores of each IS in S2 across the three primers to obtain a probability for that IS, marks IS whose probability is lower than the threshold T as nonspecifically amplified sequences, and removes them. After the IS marked as nonspecifically amplified are removed, the bias of the overall score distribution is markedly reduced and the distribution is essentially uniform, which demonstrates that the method identifies nonspecifically amplified IS well.
After the diversity and uniformity indices of each sample are obtained, the method sets up coordinate axes with diversity on the x-axis and uniformity on the y-axis and plots each sample as a point whose x and y coordinates are its diversity and uniformity values. The cloning plane determined by machine learning is then drawn on the same axes and the position of each sample relative to the cloning plane is observed (as shown in FIG. 7): if the sample is below the cloning plane, it carries a dominant clone; otherwise it does not. As can be seen in FIG. 7, samples No. 2, 4, 5 and 6, which show dominant cloning, all lie below the cloning plane and are represented by dots, whereas samples No. 1, 3, 7, 8 and 9, which show no dominant cloning, lie above the cloning plane and are represented by triangles. The judgment results are shown in Table 5. They show that the method correctly determines the clonality of all samples.
Taken together, the above embodiments show that the method uses a PCR primer similarity ranking algorithm to identify and remove nonspecific amplification that may be caused by similarity to the PCR primers, which markedly reduces false positives in IS identification in practical applications and improves the accuracy of IS identification. In addition, the method detects dominant clones by evaluating the IS diversity and IS uniformity of a sample, so that tumorigenesis caused by insertion mutation is evaluated more reliably and the practical value of IS detection at the clinical level is increased.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and does not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or replace some of their technical features with equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (7)
1. A method for detecting dominant clones based on exogenous DNA insertion mutation, characterized by comprising the following steps:
step 1, preprocessing the viral IS information data by positive/negative strand correction and redundancy integration;
step 2, after preprocessing is completed, identifying IS caused by nonspecific amplification, so as to remove false-positive IS detection results caused by nonspecific PCR amplification;
step 3, merging IS at adjacent positions according to a site clustering algorithm, and clustering;
step 4, analyzing the cloning plane according to the clustering result, and characterizing the dominant clone.
2. The method of claim 1, wherein the viral IS information data in step 1 is a collection of viral IS, each IS indicating an integration position on the host genome.
3. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein the positive/negative strand correction of the viral IS information data in step 1 is specifically: if the gene in which an IS is located is on the negative strand and the IS read is on the negative strand, the IS is converted to the positive strand.
4. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 3, wherein the redundancy integration of the viral IS information data is specifically: if the site and the positive/negative strand information of several IS are completely identical, they are merged into one record; the site and chromosome information of each IS are then encoded into a single number, which serves as the basis of the site clustering algorithm.
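The redundancy integration of claim 4 can be sketched as below, under stated assumptions. The claim does not say how read counts are combined when records are merged, nor how chromosome and site are encoded into one number; summing the read counts and using `chromosome index * offset + position` are illustrative choices, and all names and values are hypothetical.

```python
from collections import Counter

# Assumed offset, chosen to exceed the length of any single chromosome.
CHROM_OFFSET = 10**9


def merge_redundant(records):
    """Collapse records with identical (chrom, pos, strand); read counts are summed (assumption)."""
    merged = Counter()
    for chrom, pos, strand, reads in records:
        merged[(chrom, pos, strand)] += reads
    return dict(merged)


def encode_site(chrom_index, pos):
    """Encode a chromosome index and a position as a single integer."""
    return chrom_index * CHROM_OFFSET + pos


# Made-up records: (chromosome, position, strand, supporting reads).
records = [
    ("chr1", 100, "+", 3),
    ("chr1", 100, "+", 2),  # identical site and strand: merged with the record above
    ("chr1", 100, "-", 1),  # same site, different strand: kept separate
]
merged = merge_redundant(records)
```

Because the offset is larger than any position, two sites on different chromosomes can never receive the same code.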
5. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein identifying the IS caused by nonspecific amplification in step 2 comprises the following three steps:
step A, extracting all PCR primer sequences and computing their reverse complements to obtain a primer sequence library S1; extracting a number of base pairs upstream and downstream of each IS, according to its position, to form a library S2 of sequences to be matched; and randomly selecting a sufficient number of sites on the target genome to form a background sequence library S3;
step B, matching and scoring: each sequence in S2 and S3 is matched against each primer sequence in the primer sequence library and its reverse complement; for each primer, the highest score among the matches to the primer and to its reverse complement is taken as the similarity score between the sequence and that primer; after matching is completed, each sequence in S2 and S3 has a similarity score vector whose length equals the number of primers; for each primer, the corresponding similarity scores of all sequences in S2 and S3 are sorted and the ranks are converted into percentile scores, for example, all scores ranked in the top 1% are converted to 1, and so on; after the conversion, each sequence has a percentile score vector that represents the rank of its similarity to each primer sequence among all sequences;
step C, judging nonspecific amplification: if there is no nonspecific amplification, the sequences in S2 and S3 are completely random, and the corresponding similarity ranking scores are theoretically uniformly distributed; if there is nonspecific amplification, the ranking distribution of the scores in S2 is biased toward the top ranks; the probability of nonspecific amplification is calculated for each IS in S2 from its ranking vector by multiplying together the elements of its percentile score vector; if the resulting probability is smaller than a threshold T, the IS is marked as nonspecific amplification; IS marked as nonspecific amplification are removed from subsequent analysis.
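Steps A and B of claim 5 can be sketched as follows. The claim does not name a matching algorithm, so this sketch assumes a simple ungapped comparison that counts identical bases over every placement of the primer within the flanking sequence; the percentile score is taken as the fraction of sequences with an equal or higher similarity, so small values correspond to the top ranks. Function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def best_match_score(seq, primer):
    """Highest number of identical bases over all ungapped placements of the
    primer, or of its reverse complement, within seq (assumed scoring scheme)."""
    best = 0
    for p in (primer, reverse_complement(primer)):
        for start in range(len(seq) - len(p) + 1):
            window = seq[start:start + len(p)]
            best = max(best, sum(a == b for a, b in zip(window, p)))
    return best


def percentile_scores(similarities):
    """Map each similarity to the fraction of sequences scoring at least as high."""
    n = len(similarities)
    return [sum(other >= s for other in similarities) / n for s in similarities]


# Made-up flanking sequence and primer, for illustration only.
score = best_match_score("GGCGTTGG", "AACG")  # matched through the reverse complement CGTT
ranks = percentile_scores([5, 3, 8, 3])
```

Running `percentile_scores` once per primer over the pooled S2 and S3 similarities yields the percentile score vector of each sequence, one element per primer.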
6. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein merging IS at adjacent positions according to the site clustering algorithm and clustering in step 3 is specifically: grouping the IS by chromosome; within each chromosome, constructing an IS distance matrix from the distances between the genomic sites of the IS, and clustering the distance matrix with a hierarchical clustering algorithm; after hierarchical clustering is completed, cutting the resulting cluster tree at a given distance threshold to extract IS clusters, each of which is called a UIS; the representative site of a UIS is the site with the highest number of supporting reads among all sites forming the IS cluster.
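The clustering of claim 6 can be sketched as below for a single chromosome. The claim does not specify the linkage criterion; this sketch assumes single linkage, for which cutting the cluster tree at a distance threshold is equivalent, for positions on a line, to sorting the sites and splitting wherever the gap between neighbours exceeds the threshold. This avoids building the full distance matrix. The function names, the threshold and the sites are hypothetical.

```python
def cluster_sites(sites, max_gap):
    """Group (position, read_count) pairs of one chromosome into clusters (UIS).

    Equivalent to single-linkage hierarchical clustering cut at max_gap,
    which is an assumption about the linkage used.
    """
    clusters = []
    for pos, reads in sorted(sites):
        if clusters and pos - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append((pos, reads))  # close enough to the previous site
        else:
            clusters.append([(pos, reads)])    # start a new cluster
    return clusters


def representative_site(cluster):
    """Position with the highest number of supporting reads in the cluster."""
    return max(cluster, key=lambda site: site[1])[0]


# Made-up sites on one chromosome: (position, supporting reads).
sites = [(103, 20), (100, 5), (5000, 7), (105, 2)]
uis = cluster_sites(sites, max_gap=10)
representatives = [representative_site(c) for c in uis]
```

With another linkage criterion, such as complete or average linkage, the clusters can differ, and a general hierarchical clustering routine operating on the distance matrix would be needed.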
7. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein analyzing the cloning plane according to the clustering result and characterizing the dominant clone in step 4 is specifically: using the proportions of supporting reads of the top-10 ranked IS of the sample, the dominant clone of the sample is calculated and evaluated in two dimensions, namely the diversity and the uniformity of the cell types carrying independent integration sites, that is, the clonal diversity and the clonal uniformity, respectively.
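The two indices of claim 7 can be illustrated as follows. The claim does not give the formulas, so this sketch substitutes two widely used measures as stand-ins: the Shannon index for diversity and Pielou's evenness for uniformity, both computed from the read proportions of the top-10 IS. Normalising the proportions within the top 10, rather than over all reads of the sample, is also an assumption, and the read counts are made up.

```python
from math import log


def top_fractions(read_counts, top_n=10):
    """Read proportions of the top_n IS, normalised within the top_n (assumption)."""
    top = sorted(read_counts, reverse=True)[:top_n]
    total = sum(top)
    if total == 0:
        raise ValueError("no supporting reads")
    return [count / total for count in top]


def shannon_diversity(fractions):
    """Shannon index, used here as a stand-in for the diversity index."""
    return -sum(p * log(p) for p in fractions if p > 0)


def pielou_uniformity(fractions):
    """Pielou's evenness, used here as a stand-in for the uniformity index."""
    k = len(fractions)
    return shannon_diversity(fractions) / log(k) if k > 1 else 0.0


# Made-up read counts: an even sample and a sample dominated by one IS.
even = top_fractions([10] * 10)
skewed = top_fractions([97, 1, 1, 1])
```

A sample dominated by a single IS, such as `skewed`, yields a uniformity close to 0, whereas the evenly distributed sample yields a uniformity of 1.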
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311733844.2A CN117746982A (en) | 2023-12-15 | 2023-12-15 | Method for detecting dominant clone based on exogenous DNA insertion mutation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117746982A (en) | 2024-03-22 |
Family
ID=90260281
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311733844.2A (Pending), CN117746982A (en) | 2023-12-15 | 2023-12-15 | Method for detecting dominant clone based on exogenous DNA insertion mutation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117746982A (en) |
- 2023-12-15: CN application CN202311733844.2A (patent CN117746982A, en), status: Pending (active)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||