CN117746982A - Method for detecting dominant clone based on exogenous DNA insertion mutation

Info

Publication number: CN117746982A
Application number: CN202311733844.2A
Authority: CN
Inventors: 倪帅; 孔华磊; 阚科佳; 朱凤娇; 何峰; 吴宁; 侯宇宸
Original assignee: Shanghai Weike Biotechnology Co ltd
Current assignee: Shanghai Weike Biotechnology Co ltd
Priority date: 2023-12-15
Filing date: 2023-12-15
Publication date: 2024-03-22

Abstract

The invention discloses a method for detecting dominant clones caused by exogenous DNA insertion mutation, comprising: step 1, preprocessing the viral integration site (IS) information data by positive/negative strand correction and redundancy merging; step 2, after preprocessing is completed, identifying IS caused by nonspecific amplification so as to remove false-positive IS detection results caused by nonspecific PCR amplification; step 3, merging IS at adjacent positions by clustering them with a site clustering algorithm; step 4, analyzing the cloning plane according to the clustering result and characterizing the dominant clones. The invention uses a PCR primer similarity ranking algorithm to identify and remove nonspecific amplification that may arise from similarity to the PCR primers, thereby markedly reducing false positives in IS identification in practical use and improving accuracy.

Description

Method for detecting dominant clone based on exogenous DNA insertion mutation

Technical Field

The invention relates to the analysis of second-generation sequencing data in the field of bioinformatics, and in particular to a method for detecting dominant clones caused by exogenous DNA insertion mutation.

Background

Gene therapy refers to a therapeutic approach that uses molecular biology techniques to introduce exogenous DNA into the genome of genetically deficient cells so as to restore normal cell function. However, safety has remained one of the major issues hindering the development of gene therapy. An integrating vector is a type of vector DNA sequence commonly used in gene therapy to carry exogenous DNA fragments and integrate them into the host genome by insertion. Among these, lentiviral vectors (LVs), adeno-associated virus (AAV) vectors and the like are ideal tool vectors for gene therapy because of their high gene transfer efficiency or their stable expression in the target cell genome.

When integrated into the host genome, an integrating vector may cause instability and rearrangement of the host cell genome and may disrupt gene expression in the host cell, allowing the cell to evade host immune recognition and survive over the long term, which may eventually lead to cancer. Oncogenic events caused by vector integration into the genome have been observed in preclinical studies and in some clinical trials using viral vectors. In an AAV gene therapy trial in dogs, researchers found that some of the therapeutic gene fragments carried by AAV had integrated near growth-controlling genes on the dogs' chromosomes, and that some liver cells of these dogs divided faster than others and formed cell subpopulations with the potential to induce cancer. Monitoring the risk of cancer induced by vector insertion after a gene therapy product reaches the market is therefore an important part of gene therapy. However, during sustained cell therapy, the onset of such potential tumorigenesis tends to be considerably delayed and is difficult to predict, and there is currently no good method for quantitatively evaluating the tumorigenicity of insertional mutations. Moreover, most existing integration site (IS) detection methods identify IS by PCR amplification of the ends of the inserted fragment; in practice, when the amount of target fragment is relatively low, nonspecific amplification occurs and produces false-positive IS identifications.

Disclosure of Invention

The invention aims to overcome the following defects of the prior art: during sustained cell therapy, the onset of potential tumorigenesis is often considerably delayed and difficult to predict, and no good method currently exists for quantitatively evaluating the tumorigenicity of insertional mutations; and, in practice, when the amount of target fragment is relatively low, nonspecific amplification occurs and leads to false-positive IS identification.

In order to solve the technical problems, the invention provides the following technical scheme:

The invention relates to a method for detecting dominant clones caused by exogenous DNA insertion mutation, comprising the following steps:

step 1, preprocessing the viral integration site (IS) information data by positive/negative strand correction and redundancy merging;

step 2, after preprocessing is completed, identifying IS caused by nonspecific amplification so as to remove false-positive IS detection results caused by nonspecific PCR amplification;

step 3, merging IS at adjacent positions by clustering them with a site clustering algorithm;

step 4, analyzing the cloning plane according to the clustering result and characterizing the dominant clones.

In a preferred embodiment of the invention, the viral IS information data in step 1 is a collection of viral IS, where each IS indicates an integration site on the host genome.

In a preferred embodiment of the invention, strand correction of the viral IS information data in step 1 is performed as follows: if the gene in which the IS is located is on the negative strand and the IS read is on the negative strand, the IS is converted to the positive strand.

In a preferred embodiment of the invention, redundancy merging of the viral IS information data is performed as follows: IS whose site and strand information are completely identical are merged into one record; the site and chromosome information of each IS are then encoded and reduced to a single number, which serves as the basis for the site clustering algorithm.
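The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the record layout `(chromosome, position, read_strand, gene_strand, reads)`, the chromosome ordering, the `CHROM_OFFSET` encoding, and the choice to sum read counts when merging duplicates are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed chromosome naming and ordering (hypothetical, human-like).
CHROM_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y"]
# Assumed offset, chosen larger than any chromosome length.
CHROM_OFFSET = 10**10


def correct_strand(read_strand, gene_strand):
    """Convert the IS to '+' when both the gene and the IS read are on '-'."""
    if gene_strand == "-" and read_strand == "-":
        return "+"
    return read_strand


def encode_site(chrom, pos):
    """Reduce chromosome and position to a single number (assumed scheme)."""
    return (CHROM_ORDER.index(chrom) + 1) * CHROM_OFFSET + pos


def preprocess(records):
    """Strand-correct each record, then merge records with identical site and strand.

    Summing read counts on merge is an assumption; the source only states
    that identical records are combined into one.
    """
    merged = defaultdict(int)
    for chrom, pos, read_strand, gene_strand, reads in records:
        strand = correct_strand(read_strand, gene_strand)
        merged[(encode_site(chrom, pos), strand)] += reads
    return dict(merged)
```

Encoding chromosome and position as one integer keeps sites from different chromosomes far apart numerically, which is convenient for a later distance-based clustering step.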

In a preferred embodiment of the invention, the method for identifying IS caused by nonspecific amplification in step 2 comprises the following 3 steps:

step A, extracting all PCR primer sequences and computing their reverse complements to obtain a primer sequence library S1; extracting a number of base pairs upstream and downstream of each IS, according to its position, to form a sequence library S2 to be matched; and selecting a sufficient number of random sites on the target genome to form a background sequence library S3;

step B, matching and scoring, in which each sequence in S2 and S3 is matched against each primer sequence in the primer sequence library and against its reverse complement; of the matching results for a primer and its reverse complement, the highest matching score is taken as the similarity score between the sequence and that primer; after matching is completed, each sequence in S2 and S3 has a similarity score vector whose length equals the number of primers; for each primer, the corresponding similarity scores in S2 and S3 are sorted and the ranks are converted to percentiles, for example, scores ranked in the top 1% are all converted to 1, and so on; after conversion, each sequence has a percentile score vector that represents the rank of its similarity to each primer sequence among all sequences;

step C, judging nonspecific amplification; if there is no nonspecific amplification, the sequences in S2 and S3 are completely random and the corresponding similarity ranking scores are, in theory, uniformly distributed; if nonspecific amplification is present, the ranking distribution of the scores in S2 is biased toward the top-ranked positions; the ranking vector of each IS in S2 is used to obtain the probability relating that IS to nonspecific amplification, calculated as the product of the entries of its percentile score vector; if this probability is smaller than a given threshold T, the IS is marked as nonspecifically amplified; IS marked as nonspecifically amplified are removed from subsequent analysis.
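The judgment in step C can be sketched as below. It is an illustrative sketch only: it assumes percentile scores are integers from 1 to 100 and divides them by 100 before multiplying, and the threshold value used in the usage note is a hypothetical placeholder, since the source does not state a value for T.

```python
import math


def rank_product(percentile_scores):
    """Multiply the percentile scores of one IS, expressed as fractions in (0, 1].

    Scaling by 1/100 is an assumption made so the product behaves like a
    probability; the source only says the vector entries are multiplied.
    """
    return math.prod(score / 100.0 for score in percentile_scores)


def flag_nonspecific(rank_vectors, threshold):
    """Return {is_id: True/False}, True meaning 'marked as nonspecific amplification'."""
    return {
        is_id: rank_product(scores) < threshold
        for is_id, scores in rank_vectors.items()
    }
```

For example, with a hypothetical threshold of `1e-4`, an IS ranked in the top 1%, 2% and 1% for three primers has a product of 2e-6 and would be flagged, while an IS with mid-range ranks would be kept.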

In a preferred embodiment of the invention, in step 3 the IS at adjacent positions are merged by a site clustering algorithm as follows: the IS are grouped by chromosome; for the IS within each chromosome, an IS distance matrix is constructed from the distances between the genomic positions of the IS, and the distance matrix is clustered with a hierarchical clustering algorithm; after hierarchical clustering is completed, clusters are extracted from the clustering result at a given distance threshold, each resulting IS cluster is called a UIS, and the representative site of a UIS is the site with the highest read support among all sites forming the cluster.

In a preferred embodiment of the invention, in step 4 the cloning plane is analyzed according to the clustering result and the dominant clones are characterized as follows: using the proportion of read support held by the top 10 ranked IS of the sample, the dominant clonality of the sample is calculated and evaluated along two dimensions, namely the diversity of cell types carrying independent integration sites and the uniformity of the proportions of those cell types, i.e. clonal diversity and clonal uniformity.
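A minimal sketch of the site clustering step follows. It assumes single-linkage hierarchical clustering, which the source does not specify; for positions on one chromosome, cutting a single-linkage tree at a distance threshold is equivalent to chaining sorted sites whose neighboring gap does not exceed that threshold, so no explicit distance matrix is built here. The input layout `(chrom, pos, reads)`, the parameter name `max_gap`, and the summing of reads per cluster are assumptions for illustration.

```python
from collections import defaultdict


def _summarize(chrom, cluster):
    """Build one UIS record; the representative is the site with the most reads."""
    rep_pos, _ = max(cluster, key=lambda member: member[1])
    return {
        "chrom": chrom,
        "representative": rep_pos,
        "reads": sum(reads for _, reads in cluster),  # assumed aggregation
        "size": len(cluster),
    }


def cluster_sites(sites, max_gap):
    """Group IS by chromosome and merge neighbors closer than max_gap into UIS."""
    by_chrom = defaultdict(list)
    for chrom, pos, reads in sites:
        by_chrom[chrom].append((pos, reads))

    uis = []
    for chrom in sorted(by_chrom):
        members = sorted(by_chrom[chrom])  # sort by genomic position
        cluster = [members[0]]
        for pos, reads in members[1:]:
            if pos - cluster[-1][0] <= max_gap:
                cluster.append((pos, reads))
            else:
                uis.append(_summarize(chrom, cluster))
                cluster = [(pos, reads)]
        uis.append(_summarize(chrom, cluster))
    return uis
```

Other linkage criteria (complete or average linkage) would give different clusters and would need the full distance matrix; the choice here is only for brevity.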

The beneficial effects of the invention are as follows:

The method for detecting dominant clones caused by exogenous DNA insertion mutation uses a PCR primer similarity ranking algorithm to identify and remove nonspecific amplification that may arise from similarity to the PCR primers, markedly reducing false positives in IS identification in practical use and improving accuracy. The invention detects dominant clones by evaluating the IS diversity and IS uniformity of a sample, so that tumorigenesis caused by insertional mutation is assessed more reliably. By calculating the dominant clone plane, the method quantitatively evaluates the disordered cell replication caused by IS insertion and predicts the tumorigenicity of the insertional mutation, increasing the practical value of IS detection at the clinical level.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.

In the drawings:

FIG. 1 is a flow chart showing a method for detecting a dominant clone based on an insertion mutation of an exogenous DNA according to the present invention;

FIG. 2 is a schematic diagram showing selection of dominant clones based on insertion mutation of exogenous DNA in example 2 of the present invention;

FIG. 3 is a graph showing the results of the diversity and uniformity calculation of 9 samples based on the method for detecting dominant clones by insertion mutation of exogenous DNA in example 2 of the present invention;

FIG. 4 is a schematic diagram showing the results of prediction of cloning plane of 9 samples based on the method for detecting dominant clones caused by insertion mutation of exogenous DNA in example 2 of the present invention;

FIG. 5 is a diagram showing the comparison of the similarity of primer sequences and random locus ranking distribution in the method for detecting dominant clones based on insertion mutation of exogenous DNA in example 2 of the present invention;

FIG. 6 is a graph showing the IS sequence distribution after filtering of a sample based on a method for detecting a dominant clone by an insertion mutation of an exogenous DNA in example 2 of the present invention;

FIG. 7 is a schematic diagram showing the results of cloning plane calculation based on the method for detecting dominant clones by insertion mutation of exogenous DNA in example 2 of the present invention;

FIG. 8 is a schematic diagram showing the proportion of the top 10 IS in the 9 samples, based on the method for detecting dominant clones caused by insertion mutation of exogenous DNA in example 2 of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

Examples: as shown in FIG. 1, the method for detecting dominant clones based on insertion mutation of exogenous DNA of the present invention comprises the following steps:

The viral IS information data in step 1 is a collection of viral IS, where each IS identifies an integration site on the host genome.

Strand correction of the viral IS information data in step 1 is performed as follows: if the gene in which the IS is located is on the negative strand and the IS read is on the negative strand, the IS is converted to the positive strand.

Redundancy merging of the viral IS information data is performed as follows: IS whose site and strand information are completely identical are merged into one record; the site and chromosome information of each IS are then encoded and reduced to a single number, which serves as the basis for the site clustering algorithm.

The method for identifying IS caused by nonspecific amplification in step 2 comprises the following 3 steps:

In step 3, the IS at adjacent positions are merged by a site clustering algorithm as follows: the IS are grouped by chromosome; for the IS within each chromosome, an IS distance matrix is constructed from the distances between the genomic positions of the IS, and the distance matrix is clustered with a hierarchical clustering algorithm; after hierarchical clustering is completed, clusters are extracted from the clustering result at a given distance threshold, each resulting IS cluster is called a UIS, and the representative site of a UIS is the site with the highest read support among all sites forming the cluster.

In step 4, the cloning plane is analyzed according to the clustering result and the dominant clones are characterized as follows: using the proportion of read support held by the top 10 ranked IS of the sample, the dominant clonality of the sample is calculated and evaluated along two dimensions, namely the diversity of cell types carrying independent integration sites and the uniformity of the proportions of those cell types, i.e. clonal diversity and clonal uniformity.
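The share of read support held by the top 10 IS can be computed with a few lines. This is a sketch under the assumption that each IS (or UIS) is summarized by a single read support count; the function name and the handling of an empty sample are illustrative choices, not taken from the source.

```python
def top_n_read_fraction(read_counts, n=10):
    """Fraction of total read support held by the n most supported IS."""
    total = sum(read_counts)
    if total == 0:
        return 0.0  # assumed convention for a sample with no reads
    top = sorted(read_counts, reverse=True)[:n]
    return sum(top) / total
```

A value close to 1 means a handful of integration sites account for nearly all reads, which is the pattern expected when a dominant clone is present.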

The diversity index is calculated as the Simpson index (Simpson's Diversity Index):

$$D = 1 - \sum_{i} \left(\frac{n_i}{N}\right)^2$$

where D is the Simpson index, n_i is the read support count of the i-th UIS actually detected, and N is the total over all UIS, so that n_i/N is the proportion contributed by the i-th UIS.
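As a worked illustration, the Simpson index can be computed from per-UIS read support counts as sketched below. The sketch assumes that the total in the denominator is the sum of the read counts, so that each term is a proportion; the function name is hypothetical.

```python
def simpson_diversity(read_counts):
    """Simpson index D = 1 - sum((n_i / N)^2), with N = sum of all n_i (assumed)."""
    total = sum(read_counts)
    if total <= 0:
        raise ValueError("read_counts must contain at least one positive count")
    return 1.0 - sum((n / total) ** 2 for n in read_counts)
```

Four equally supported UIS give D = 0.75, while a sample consisting of a single UIS gives D = 0, the lowest possible diversity.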

The uniformity index is calculated from the Shannon index as follows:

$$H = -\sum_{i=1}^{S} p_i \log_2(p_i)$$

where H is the Shannon index, p_i is the proportion of the i-th IS relative to all IS, log_2 denotes the base-2 logarithm, and S is the total number of IS.
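The Shannon index in the formula above can be evaluated as in the following sketch, which assumes the proportions are derived from per-IS read support counts and skips zero counts (their contribution to the sum is taken as zero). The function name is hypothetical.

```python
import math


def shannon_index(read_counts):
    """Shannon index H = -sum(p_i * log2(p_i)) over IS with nonzero support."""
    total = sum(read_counts)
    if total <= 0:
        raise ValueError("read_counts must contain at least one positive count")
    proportions = [n / total for n in read_counts if n > 0]
    return -sum(p * math.log2(p) for p in proportions)
```

Four equally supported IS give H = 2 bits; as one IS comes to dominate the reads, H falls toward 0.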

In a coordinate system, diversity and uniformity are taken as the x-axis and y-axis metrics, respectively, so that each sample can be projected as a point. The method uses more than 210 samples from a local database to build a machine learning discriminant model based on a support vector machine: the diversity index and uniformity index of each sample are the independent variables and clonality is the dependent variable. Training the support vector machine finds a separating hyperplane in the sample space that separates samples with dominant clones from samples without dominant clones (as shown in FIG. 4). Once the separating hyperplane is obtained, the method uses it to evaluate the dominant clonality of new samples. The plane determines whether dominant cloning has occurred in the cells according to the following rule: if the sample lies below the cloning plane, it has a dominant clone; otherwise it does not. If the cloning plane indicates that no dominant cloning has occurred in the cells, the sample is polyclonal; if it indicates that dominant cloning has occurred, the sample is oligoclonal.

The invention uses a PCR primer similarity ranking algorithm to identify and remove nonspecific amplification that may arise from similarity to the PCR primers, markedly reducing false positives in IS identification in practical use and improving accuracy. The invention detects dominant clones by evaluating the IS diversity and IS uniformity of a sample, so that tumorigenesis caused by insertional mutation is assessed more reliably. By calculating the dominant clone plane, the method quantitatively evaluates the disordered cell replication caused by IS insertion and predicts the tumorigenicity of the insertional mutation, increasing the practical value of IS detection at the clinical level.
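Applying a trained separating plane to a new sample can be sketched as follows. This only illustrates the decision rule for a linear plane of the form w1*x + w2*y + b = 0; the weights `w` and bias `b` below are hypothetical placeholders, not values from the source, which obtains the plane by training a support vector machine on more than 210 samples. Treating a negative score as "below the plane" is also an assumed sign convention.

```python
def classify_sample(diversity, uniformity, w=(1.0, 1.0), b=-3.0):
    """Classify a sample by which side of a linear plane it falls on.

    w and b are hypothetical placeholders standing in for a trained model.
    A negative score is treated as 'below the cloning plane' (assumed convention).
    """
    score = w[0] * diversity + w[1] * uniformity + b
    return "oligoclonal" if score < 0 else "polyclonal"
```

With these placeholder coefficients, a sample with low diversity and low uniformity is labeled oligoclonal (dominant clone present), and a sample with high values on both axes is labeled polyclonal.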

Example 2: as shown in FIGS. 2 to 8, 4 samples with dominant clones and 5 samples without dominant clones were selected, and dominant clone detection was then performed on the 9 samples using the present method (see the top 10 IS proportions of the 9 samples, in which samples No. 2, 4, 5 and 6 have obvious dominant clones). Testing of the samples in this example shows the effect of false-positive IS generated by nonspecific amplification on the true result; after processing with the method, false-positive IS are markedly reduced, which demonstrates that the method removes false-positive IS caused by nonspecific amplification effectively. First, the method extracts all PCR primer sequences used in the experiment and computes their reverse complements to form the primer sequence library S1. Next, the method extracts the 100 bp flanking sequences upstream and downstream of each IS in 2 samples to form the sequence library S2 to be matched, so that S2 contains IS flanking sequences from both samples. Meanwhile, 1000 random sites are selected on the target genome to form the random-site background sequence library S3.

After the three sequence libraries S1, S2 and S3 are obtained, each sequence in the sample sequence library (S2) and the random-site sequence library (S3) is matched against the 3 primer sequences in S1 and their respective reverse complements. With a and b denoting the similarity scores of a sequence against a primer and against its reverse complement, m = max(a, b) is taken as the similarity score of that sequence for the primer, and the similarity scores M1, M2, M3 for the three primers form the final score vector, as shown in Table 2. After scoring is complete, every sequence in S2 and S3 has a similarity score vector. For each primer, the corresponding similarity scores in S2 and S3 are sorted and the ranks are converted to percentiles: for example, scores in the top 1% are converted to 1, scores in the top 1% to 2% are converted to 2, and so on. After conversion, every sequence in S2 and S3 has a rank vector of length 3, giving the rank, among all sequences, of its best similarity score for each of the 3 primer sequences.
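The scoring and percentile conversion described here can be sketched as follows. The source does not state how similarity is measured, so this example assumes a simple ungapped score (the largest number of matching bases over all placements of the primer along the flanking sequence); a real pipeline might use a local alignment score instead. All function names are hypothetical, and ties in the ranking are broken by input order.

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def ungapped_similarity(primer, flank):
    """Best count of matching bases over all ungapped placements (assumed metric)."""
    best = 0
    for start in range(len(flank) - len(primer) + 1):
        window = flank[start:start + len(primer)]
        best = max(best, sum(p == f for p, f in zip(primer, window)))
    return best


def primer_score(primer, flank):
    """m = max(a, b): score against the primer and against its reverse complement."""
    a = ungapped_similarity(primer, flank)
    b = ungapped_similarity(reverse_complement(primer), flank)
    return max(a, b)


def percentile_ranks(scores):
    """Convert scores to percentiles 1..100, where the best score gets the smallest value."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])
    ranks = [0] * n
    for position, idx in enumerate(order):
        ranks[idx] = math.ceil((position + 1) * 100 / n)
    return ranks
```

In use, `primer_score` would be evaluated for every sequence in S2 and S3 against each primer, and `percentile_ranks` would then be applied per primer to the pooled scores to obtain each sequence's rank vector.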

FIG. 5 compares the primer-similarity ranking distribution with that of random sites in two samples (sample 1 and sample 2) in which marked nonspecific amplification led to false positives. The density distributions show that the rankings of the 2 example samples are clearly biased toward the top-ranked positions compared with the random sequences, indicating that the sequences near the IS in the 2 samples are more similar to the PCR primers.

One obvious feature of nonspecific amplification is recurrence across different samples, whereas an IS randomly integrated into the human genome is unlikely to occur in both samples. A number of IS were detected in both samples simultaneously. Comparing the scores of these IS with those of the other IS shows that the score density distribution of the IS occurring in both samples is even more strongly biased toward the top ranks. As can be seen in FIG. 6, these IS have the highest similarity to the primers in sample 1 and sample 2, and after filtering by the method the similarity-rank distribution of the IS is closer to that of random sites. This shows that the method is effective in reducing false positives in IS identification and improving accuracy, and it indirectly supports the association between the score described above and nonspecific amplification.

FIG. 7 shows the results of cloning plane calculations. Where the x-axis represents sample diversity and the y-axis represents sample uniformity. It can be seen that samples with dominant cloning phenomena are below the cloning plane, while samples without dominant cloning phenomena are above the cloning plane, indicating that the cloning plane defined by the method can accurately distinguish samples carrying dominant clones.

Finally, the method multiplies the ranking scores of each IS in S2 across the three primers, converting them into a probability relating the IS to nonspecific amplification by the three primers; IS whose probability is below the threshold T are marked as nonspecifically amplified sequences and removed. After the IS marked as nonspecifically amplified are removed, the bias in the overall score distribution is markedly reduced and the distribution is essentially uniform, which demonstrates that the method identifies nonspecifically amplified IS effectively.

After the diversity and uniformity indices of each sample are obtained, the method sets up a coordinate system with diversity on the x-axis and uniformity on the y-axis, and plots each sample as a point whose x and y coordinates are its diversity and uniformity values. The cloning plane determined by machine learning is then drawn in the same coordinate system and the position of each sample relative to the plane is observed (as shown in FIG. 7): if the sample lies below the cloning plane it has a dominant clone; otherwise it has none. As can be seen in FIG. 7, samples No. 2, 4, 5 and 6, which show dominant cloning, all lie below the cloning plane and are drawn as dots, whereas samples No. 1, 3, 7, 8 and 9, which show no dominant cloning, lie above the cloning plane and are drawn as triangles. The results are shown in Table 5 and indicate that the method correctly determines the clonality of all samples.

Taken together, the above embodiments show that the method uses a PCR primer similarity ranking algorithm to identify and remove nonspecific amplification that may arise from similarity to the PCR primers, markedly reducing false positives in IS identification in practical use and improving its accuracy. In addition, the method detects dominant clones by evaluating the IS diversity and IS uniformity of a sample, so that tumorigenesis caused by insertional mutation is assessed more reliably and the practical value of IS detection at the clinical level is increased.

Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention and does not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims

1. The method for detecting the dominant clone based on the insertion mutation of the exogenous DNA is characterized by comprising the following steps:

2. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein the viral IS information data in step 1 is a collection of viral IS, and each IS indicates an integration site on the host genome.

3. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein strand correction of the viral IS information data in step 1 is performed by converting the IS to the positive strand if the gene in which the IS is located is on the negative strand and the IS read is on the negative strand.

4. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 3, wherein redundancy merging of the viral IS information data is performed by merging IS whose site and strand information are completely identical into one record, and by encoding the site and chromosome information of each IS and reducing them to a single number, which serves as the basis for the site clustering algorithm.

5. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein the method for identifying IS caused by nonspecific amplification in step 2 comprises the following 3 steps:

6. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein in step 3 the IS at adjacent positions are merged by a site clustering algorithm as follows: the IS are grouped by chromosome; for the IS within each chromosome, an IS distance matrix is constructed from the distances between the genomic positions of the IS, and the distance matrix is clustered with a hierarchical clustering algorithm; after hierarchical clustering is completed, clusters are extracted from the clustering result at a given distance threshold, each resulting IS cluster is called a UIS, and the representative site of a UIS is the site with the highest read support among all sites forming the cluster.

7. The method for detecting dominant clones based on exogenous DNA insertion mutation according to claim 1, wherein in step 4 the cloning plane is analyzed according to the clustering result and the dominant clones are characterized as follows: using the proportion of read support held by the top 10 ranked IS of the sample, the dominant clonality of the sample is calculated and evaluated along two dimensions, namely the diversity of cell types carrying independent integration sites and the uniformity of the proportions of those cell types, i.e. clonal diversity and clonal uniformity.