CN114743598B - Method for detecting recombination among new coronavirus pedigrees based on information theory - Google Patents
- Publication number
- CN114743598B
- Authority
- CN
- China
- Prior art keywords
- recombination
- sequence
- pedigree
- lineage
- site
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Engineering & Computer Science (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Analytical Chemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Chemical & Material Sciences (AREA)
- Evolutionary Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for detecting recombination among novel coronavirus lineages based on information theory, comprising the following steps: constructing a consensus sequence for the lineage to be queried, and recording the most abundant base at each site together with its proportion; merging the query lineage sequences with the other reference lineage sequences and performing multiple sequence alignment to obtain an alignment sequence file; choosing whether to retain gap sites in the alignment sequence file, and extracting the polymorphic sites; calculating, at each polymorphic site, the recombination contribution value that each lineage provides to the query lineage; calculating the average recombination contribution value of each lineage over the sites within a sliding window, and detecting recombinant fragments from potential parents; splicing contiguous and discrete recombinant fragments on each potential parent into recombination regions, and checking whether each recombination region is a false positive; and searching for potential recombination breakpoints at the boundaries of the recombination regions with reference to the polymorphic sites and recombination contribution values of the lineage sequences.
Description
Technical Field
The invention relates to the technical field of biological gene detection, and in particular to a method for detecting recombination between novel coronavirus lineages based on information theory.
Background
The novel coronavirus (SARS-CoV-2), which is circulating worldwide, poses great challenges to public health security and social stability. SARS-CoV-2 is an RNA virus; because of its prolonged circulation in the human population, it evolves and mutates rapidly and continually gives rise to new lineages. The Omicron and Delta lineage variants, currently designated by the World Health Organization (WHO) as requiring close attention, are still circulating worldwide, as are previously prevalent variants requiring attention such as the Alpha, Beta, Gamma, and Epsilon lineages.
Related research shows that SARS-CoV-2 variants from different lineages undergo recombination of genome fragments, which generates new strains. Such recombination accelerates viral variation and may even confer new properties on the virus. For example, the XD variant is a recombinant of Delta subtype AY.4 and Omicron subtype BA.1, the XE variant is a recombinant of Omicron subtypes BA.1 and BA.2, and the community transmission advantage of the XE variant is about 10% over its parent BA.1. More recombinant strains are still being tracked.
Two software packages, RDP and SimPlot, are generally used to identify recombination in viruses. However, because the genomic differences between different SARS-CoV-2 lineages, and between different subtypes of the same lineage, are small, RDP and SimPlot have difficulty identifying recombination events among highly similar SARS-CoV-2 strains. Furthermore, although RDP has multiple built-in algorithms and programs, it is mainly used to detect recombination among individual sequences and is not suited to detecting recombination between lineages that each comprise multiple sequences. SimPlot can detect recombination among pre-grouped sequences, but it mainly draws a similarity plot based on the overall similarity of sequences within a sliding window, which makes it difficult to identify recombination events among highly similar SARS-CoV-2 lineages.
Disclosure of Invention
The invention provides a method for detecting recombination among novel coronavirus lineages based on information theory. The aim is to introduce information theory and Shannon information entropy into the identification of viral recombination: fragment transfer events in viral recombination are treated as a process of transmitting 'information', and the lineage-weighted information content is used as a contribution value to quantify the recombination contribution and to measure the probability that recombination occurred, so that recombination events and recombination regions among highly similar sequences of different SARS-CoV-2 lineages can be identified sensitively and efficiently.
In order to achieve the above object, the present invention provides a method for detecting recombination between new coronavirus pedigrees based on information theory, comprising:
the recombination contribution value WIC is equal to the proportion of the most abundant base of the query lineage sequence at the site, multiplied by the proportion of that base in the reference lineage sequence, and multiplied by the base information content of the site in the reference lineage sequence;
Wherein, step 4 includes:
extracting a sequence corresponding to each reference pedigree in the alignment sequence file according to the pedigree name;
Calculating the base information content IC of each site;
the weighted information amount WIC for each site is calculated, WIC representing the recombination contribution value.
The base information content IC is

$$IC = EIC - Ent$$

where EIC is the total expected base information content and Ent is the Shannon information entropy of the base site.
Wherein, step 5 includes:
each reference lineage sequence is scanned using the same sliding window w and step size s, the average recombination contribution value $\overline{WIC}$ of all sites in each sliding window is calculated, and all points (X, S, $\overline{WIC}$) are plotted, where X represents the lineage name and S represents the position of the point on the aligned sequence in the sliding window. The sliding window with the largest average recombination contribution value $\overline{WIC}$ is labeled as a potential recombinant fragment, the lineage sequence of that sliding window is labeled as a potential parent, and (X, S, $\overline{WIC}$) is stored to F.
The total size of the recombination region on each potential parent was calculated and the pedigree sequence with the largest cumulative interval of recombination regions was labeled as potential major parent, designated major (M).
Wherein, step 6 includes: splicing contiguous or discrete recombinant fragments located on the same minor parent X into recombination regions, proofreading each recombination boundary during splicing, setting a maximum recombination region size, and generating for each minor parent X a recombination region R, denoted d(X, R).
Wherein, the step 6 comprises the following steps:
step 61, extracting from F the fragments that belong to the same lineage sequence X, sorting them by S value, adding them to Z, and marking the first fragment in Z as the initial fragment;
Step 62, combining each of the other fragments in Z with the initial fragment to form a candidate recombination region, where the left and right boundaries of each candidate region lie within the alignment, All being the total site count of the alignment sequence file, and the size of each candidate region is limited by the maximum recombination region;
Step 63, calculating the contribution value of each lineage sequence within each candidate recombination region, and judging whether the contribution value of lineage sequence X is the largest and whether its ratio is greater than a specified threshold ratio, the threshold ratio taking values from 0 to 1;
The site recombination contribution values WIC of each recombination region R from the minor parent and from the major parent are tested for a significant difference using the Mann-Whitney rank sum test with two-tailed probability.
The method for acquiring the recombination breakpoint in the step 7 comprises the following steps:
step 71, running a sliding window over all potential parents, starting from the first polymorphic site; for the regions on either side of the point S at the center of the window, calculating the P value of the site recombination contribution values WIC under the Mann-Whitney rank sum test; taking the negative base-10 logarithm of the P value, $-\log_{10}P$, and recording the result as b(X, S, $-\log_{10}P$), where X represents the lineage sequence name and S represents the position of the point on the aligned sequence in the window;
step 72, plotting all points b(X, S, $-\log_{10}P$); the peak with the highest $-\log_{10}P$ value corresponds to the recombination breakpoint.
The scheme of the invention has the following beneficial effects:
the information theory and the Shannon information entropy are introduced into the identification of virus recombination, the fragment transfer event in the virus recombination is regarded as the transmission process of 'information', the contribution value of pedigree weighted information is used for quantifying the recombination contribution and measuring the occurrence probability of recombination, and the recombination event and recombination region of the new coronavirus existing in different highly similar pedigrees can be sensitively and efficiently identified.
Other advantages of the present invention will be described in detail in the detailed description that follows.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of the calculation of the contribution value WIC of site recombination according to the embodiment of the present invention;
FIG. 3 is a graph showing the distribution of the recombination contribution values WIC of all polymorphic sites in an example of the present invention;
FIG. 4 (a) is a scatter plot of the distribution of the recombination contribution values WIC from BA.1 at all polymorphic sites, taking the recombinant XE variant as an example of the present invention; FIG. 4 (b) is a scatter plot of the distribution of the recombination contribution values WIC from BA.2 at all polymorphic sites, taking the recombinant XE variant as an example of the present invention; FIG. 4 (c) is a scatter plot of the distribution of the recombination contribution values WIC from Delta at all polymorphic sites, taking the recombinant XE variant as an example of the present invention;
FIG. 5 is a graph showing the calculation results of the mean recombination contribution value WIC of polymorphic sites within a sliding window for each lineage sequence according to the example of the present invention;
FIG. 6 (a) is a distribution graph of each polymorphic site and the corresponding $-\log_{10}P$ value for BA.1, taking the recombinant XE variant as an example of the present invention; FIG. 6 (b) is a distribution graph of each polymorphic site and the corresponding $-\log_{10}P$ value for BA.2, taking the recombinant XE variant as an example of the present invention; FIG. 6 (c) is a distribution graph of each polymorphic site and the corresponding $-\log_{10}P$ value for Delta, taking the recombinant XE variant as an example of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted", "connected" and "connected" are to be understood broadly, for example, as being either a locked connection, a detachable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Aiming at the existing problems, the invention provides a method for detecting recombination among new coronavirus pedigrees based on information theory, which introduces the information theory and Shannon information entropy into the identification of virus recombination, takes fragment transfer events in the virus recombination as the transmission process of 'information', and uses the contribution value from pedigree weighted information quantity to quantify the recombination contribution and measure the occurrence probability of recombination.
As shown in fig. 1, an embodiment of the present invention provides a method for detecting recombination between new coronavirus lineages based on information theory, comprising:
extracting, according to lineage name, the sequences corresponding to each reference lineage from the alignment sequence file; counting the base types x at each site of each reference lineage sequence and their corresponding proportions; calculating the base information content IC (information content) of each site; and calculating the recombination contribution value WIC of each site.
Specifically, the base information content IC is

$$IC = EIC - Ent$$

where EIC is the total expected base information content and Ent is the Shannon information entropy of the base site.
If step 3 retains the gap sites of the alignment sequence file, the total expected base information content EIC of the site is equal to $\log_2 5$; if the gap sites of the alignment sequence file are deleted in step 3, EIC is equal to $\log_2 4$, i.e. equal to 2.
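The information content of a single site can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `information_content`, the `keep_gaps` flag, and the choice to return 0.0 for an empty column are assumptions made for the example, and the entropy is the standard Shannon entropy in bits computed from the observed base proportions of one lineage at one site.

```python
import math
from collections import Counter


def information_content(column, keep_gaps=False):
    """Return IC = EIC - Ent for one alignment column of one lineage.

    column    -- iterable of single-character symbols (A, C, G, T, '-')
    keep_gaps -- hypothetical flag: True treats '-' as a fifth symbol
                 (EIC = log2 5), False drops gaps (EIC = log2 4 = 2)
    """
    symbols = [b.upper() for b in column]
    if not keep_gaps:
        symbols = [b for b in symbols if b != "-"]
    if not symbols:
        return 0.0  # assumption: an empty column carries no information

    eic = math.log2(5) if keep_gaps else math.log2(4)
    total = len(symbols)
    counts = Counter(symbols)
    # Shannon entropy (bits) of the observed symbol proportions
    ent = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return eic - ent
```

A fully conserved site gives the maximum value (2 bits without gaps), while a site where all four bases are equally frequent gives 0, so conserved lineage-specific bases weigh most heavily in the later WIC calculation.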
In particular, the recombination contribution value WIC is equal to the proportion $p_q(x)$ of the base x that is most abundant in the query lineage sequence at the site, multiplied by the proportion $p_r(x)$ of base x in the reference lineage sequence, and then multiplied by the base information content IC of the site in the reference lineage sequence, i.e.

$$WIC = p_q(x) \cdot p_r(x) \cdot IC$$
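As a hedged sketch of the product described above, the following Python computes WIC for one site from the query lineage's column and one reference lineage's column. The names `base_proportions` and `wic`, the gap-deleted default `eic=2.0`, and the tie-breaking rule (the first most-abundant base encountered wins) are assumptions for illustration only.

```python
import math
from collections import Counter


def base_proportions(column):
    """Proportion of each base in a column, ignoring gap characters."""
    bases = [b.upper() for b in column if b != "-"]
    total = len(bases)
    return {b: n / total for b, n in Counter(bases).items()} if total else {}


def wic(query_column, reference_column, eic=2.0):
    """Weighted information content one reference lineage contributes at a site.

    WIC = (proportion of the query's most abundant base)
          * (proportion of that base in the reference lineage)
          * (IC of the site in the reference lineage, IC = EIC - Ent)
    """
    q = base_proportions(query_column)
    r = base_proportions(reference_column)
    if not q or not r:
        return 0.0  # assumption: sites with no usable bases contribute nothing

    # Most abundant base of the query lineage and its proportion
    x_max, p_query = max(q.items(), key=lambda kv: kv[1])
    p_ref = r.get(x_max, 0.0)

    ent = -sum(p * math.log2(p) for p in r.values())
    return p_query * p_ref * (eic - ent)
```

For example, a query column `"AAAC"` against a reference column `"AACC"` gives 0.75 × 0.5 × 1 = 0.375, while a reference lineage that never carries the query's dominant base contributes 0.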
specifically, each reference lineage sequence is scanned using the same sliding window w and step size s, the average recombination contribution value $\overline{WIC}$ of all sites in each sliding window is calculated, and all points (X, S, $\overline{WIC}$) are plotted, where X represents the lineage name and S represents the position of the point on the aligned sequence in the sliding window. The sliding window with the largest average recombination contribution value $\overline{WIC}$ is labeled as a potential recombinant fragment, the lineage sequence of that sliding window is labeled as a potential parent, and (X, S, $\overline{WIC}$) is stored to F. It should be noted that a larger window size reduces interference but also reduces the sensitivity of the analysis, whereas a smaller window size increases sensitivity but also increases the probability of false positives.
The total size of the recombination region on each potential parent was calculated and the pedigree sequence with the largest cumulative interval of recombination regions was labeled as potential major parent, designated major (M).
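The sliding-window averaging and the choice of a major parent described above can be sketched as follows. This is an illustrative simplification under stated assumptions: S is taken as the index of the window center, only full windows are evaluated, and the cumulative size of a parent's fragments is approximated by the number of windows it wins. The function names are hypothetical.

```python
from collections import Counter


def window_means(values, w, s):
    """Return (center_index, mean) for every full window of size w, step s."""
    out = []
    for start in range(0, len(values) - w + 1, s):
        window = values[start:start + w]
        out.append((start + w // 2, sum(window) / w))
    return out


def potential_fragments(wic_by_lineage, w, s):
    """For each window, keep the lineage with the largest mean WIC.

    wic_by_lineage -- dict mapping lineage name X to its per-site WIC list
    Returns F as a list of (X, S, mean_WIC) tuples.
    """
    means = {x: window_means(v, w, s) for x, v in wic_by_lineage.items()}
    if not means:
        return []
    n_windows = min(len(m) for m in means.values())
    fragments = []
    for i in range(n_windows):
        best = max(means, key=lambda x: means[x][i][1])
        position, mean = means[best][i]
        fragments.append((best, position, mean))
    return fragments


def major_parent(fragments):
    """Lineage that wins the most windows (proxy for cumulative region size)."""
    counts = Counter(x for x, _, _ in fragments)
    return counts.most_common(1)[0][0]
```

With `w=100` and `s=20`, as in the embodiment, each lineage is summarized by one mean per 20-site shift; the lineages other than the major parent that still win some windows are the candidates for minor parents.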
specifically, multiple contiguous or discrete recombinant fragments that may exist on the same minor parent X are spliced into recombination regions, each recombination boundary is proofread during splicing, a maximum recombination region size is set, and a recombination region R, denoted d(X, R), is generated for each minor parent X by the following steps and algorithm:
step 61, extracting from F the fragments that belong to the same lineage sequence X, sorting them by S value, adding them to Z, and marking the first fragment in Z as the initial fragment;
Step 62, combining each of the other fragments in Z with the initial fragment to form a candidate recombination region, where the left and right boundaries of each candidate region lie within the alignment, All being the total site count of the alignment sequence file, and the size of each candidate region is limited by the maximum recombination region;
Step 63, calculating the contribution value of each lineage sequence within each candidate recombination region, judging whether the contribution value of lineage sequence X is the largest and whether its ratio is greater than a specified threshold ratio, the threshold ratio ranging from 0 to 1, and marking each candidate recombination region as True or False accordingly;
I) if all candidate recombination regions are marked False, the fragment following the initial fragment is set as the new initial fragment, and the procedure returns to step 62;
II) if any candidate recombination region is marked True, the candidate with the greatest value among those marked True is searched for and marked as the recombination region R, labeled d(X, R); the fragment following R is then set as the new initial fragment, and the procedure returns to step 62;
step 64, iterating steps 62 and 63 and storing all d(X, R) to D, until the initial fragment is the last fragment in Z.
The site recombination contribution values WIC of each recombination region R from the minor parent and from the major parent are tested for a significant difference using the Mann-Whitney rank sum test with two-tailed probability; if the P value is less than 0.05, the difference is significant.
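A two-tailed Mann-Whitney rank sum test of the kind named above can be sketched in pure Python. This version uses the normal approximation with tie and continuity corrections, which is a common large-sample form of the test; whether the patented method uses an exact or an approximate P value is not stated in the text, so that choice, along with the function names, is an assumption.

```python
import math


def rank_with_ties(values):
    """Return average 1-based ranks and the tie term sum(t^3 - t)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_term = 0.0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term


def mann_whitney_p(a, b):
    """Two-tailed P value for samples a and b (normal approximation)."""
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks, tie_term = rank_with_ties(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0  # all values identical: no evidence of a difference
    # Continuity-corrected z score, clamped at zero
    z = max((abs(u1 - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return math.erfc(z / math.sqrt(2))
```

In use, `a` would hold the per-site WIC values of the minor parent inside a recombination region and `b` those of the major parent over the same sites; a result below 0.05 corresponds to the significance criterion stated in the text.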
If D' is empty, it is reported that no possible recombination event was detected. Otherwise, the possible major parents M and all minor parents X and recombination regions R are reported and a statistical test value is provided for each recombination region.
Specifically, the step and algorithm for obtaining the recombination breakpoint includes:
step 71, running a sliding window over all potential parents, starting from the first polymorphic site; for the regions on either side of the point S at the center of the window, calculating the P value of the site recombination contribution values WIC under the Mann-Whitney rank sum test; taking the negative base-10 logarithm of the P value, $-\log_{10}P$, and recording the result as b(X, S, $-\log_{10}P$), where X represents the lineage sequence name and S represents the position of the point on the aligned sequence in the window;
step 72, plotting all points b(X, S, $-\log_{10}P$); the peak with the highest $-\log_{10}P$ value corresponds to the recombination breakpoint.
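The final peak-picking part of the breakpoint search can be illustrated with a short Python sketch. It assumes the P values for each window center have already been computed by a rank sum test; the function names, the input layout of `(lineage, position, p_value)` tuples, and the floor used to avoid taking the logarithm of zero are all assumptions made for the example.

```python
import math


def neg_log10(p, floor=1e-300):
    """Return -log10(p), clamping p away from zero to keep the log finite."""
    return -math.log10(max(p, floor))


def find_breakpoint(points):
    """Pick the point with the highest -log10(P).

    points -- iterable of (lineage, position, p_value) tuples
    Returns (lineage, position, -log10(p_value)) for the highest peak.
    """
    scored = [(x, s, neg_log10(p)) for x, s, p in points]
    return max(scored, key=lambda item: item[2])
```

A smaller P value means the WIC values on the two sides of the window center differ more strongly, so the position with the largest transformed value is the most likely breakpoint for that lineage.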
This example demonstrates the invention using SARS-CoV-2. Early-sampled Delta lineage and Omicron lineage sequences were downloaded from the GISAID database (https://www.gisaid.org/) as reference lineages: 93 Delta lineage strains, 75 Omicron lineage BA.1 subtype strains, and 70 Omicron lineage BA.2 subtype strains. Two recently sampled recombinant XE strains were used as the recombinant query lineage.
The Delta lineage, the Omicron BA.1 subtype, the Omicron BA.2 subtype, and the recombinant query lineage XE described above were labeled by lineage, with the lineage name written in front of each sequence name. MAFFT was then called with the automatic (--auto) option to perform the alignment and generate an alignment sequence file.
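The labeling step, in which the lineage name is written in front of each sequence name, could look like the sketch below before the merged file is passed to the aligner. The function name `label_fasta` and the `|` separator are hypothetical; the text does not specify how the lineage name and the original sequence name are joined.

```python
def label_fasta(lines, lineage, sep="|"):
    """Prefix every FASTA header line with a lineage name.

    lines   -- iterable of FASTA lines without trailing newlines
    lineage -- lineage label to write in front of each sequence name
    sep     -- hypothetical separator between label and original name
    """
    labeled = []
    for line in lines:
        if line.startswith(">"):
            labeled.append(">" + lineage + sep + line[1:])
        else:
            labeled.append(line)  # sequence lines are left unchanged
    return labeled
```

Keeping the lineage in the header lets later steps group aligned sequences by lineage name without a separate metadata table.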
For the alignment sequence file, deletion of the gap sites of the aligned sequences was selected. Because the differences between SARS-CoV-2 lineages are small, this embodiment detects recombination between lineages using the recombination contribution value WIC at polymorphic sites; the non-polymorphic sites in the aligned sequences were therefore deleted, leaving 5288 polymorphic sites in the alignment sequence file.
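Extraction of polymorphic sites from an alignment can be sketched as follows. The example assumes that a site counts as polymorphic when more than one distinct symbol occurs in its column, and that deleting gap sites means dropping every column containing at least one gap; both readings, and the function name, are assumptions rather than details given in the text.

```python
def polymorphic_sites(sequences, keep_gaps=False):
    """Return 0-based indices of polymorphic columns in an alignment.

    sequences -- list of equal-length aligned sequence strings
    keep_gaps -- False drops any column containing '-', True treats '-'
                 as an ordinary symbol
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must have equal length")

    sites = []
    for i in range(length):
        column = {seq[i].upper() for seq in sequences}
        if not keep_gaps and "-" in column:
            continue  # gap site removed before polymorphism is assessed
        if len(column) > 1:
            sites.append(i)
    return sites
```

Restricting the analysis to these columns removes the conserved background that dominates alignments of closely related lineages.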
Following the algorithm shown in FIG. 2, the recombination contribution value WIC that each reference lineage in the alignment file contributes to the recombinant query lineage XE was calculated at the 5288 polymorphic sites; the WIC values of the first 29 polymorphic sites are shown in Table 1 below:
The distribution curve of the recombination contribution values WIC of all polymorphic sites is shown in FIG. 3; the scatter plots of the WIC distribution of all polymorphic sites are shown in FIG. 4.
For the 5288 polymorphic sites described above, a sliding window of 100 polymorphic sites with a step size of 20 polymorphic sites was slid over the polymorphic sites of each lineage, and the average recombination contribution value $\overline{WIC}$ of the sites in each window was calculated for each lineage. There were 261 windows in total; the corresponding average recombination contribution values $\overline{WIC}$ are shown in FIG. 5. If, within a sliding window, the average recombination contribution value $\overline{WIC}$ of a lineage is larger than that of the major parent, the window is considered a potential recombinant fragment.
For the potential recombinant fragments of each lineage described above, the maximum recombination region range parameter was set to 1200 bp (number of polymorphic sites), and the threshold for the ratio of the average recombination contribution value $\overline{WIC}$ within a recombination region was set to 95%, in an attempt to splice these potential recombinant fragments into reasonable recombination regions. The lineage with the largest cumulative recombination region was judged to be the major parent, and the other lineages containing recombination regions were potential minor parents. These minor parents were then tested for significance in the corresponding recombination regions using the Mann-Whitney rank sum test with two-tailed probability. The P value is calculated by comparing the recombination contribution values WIC of the major parent and the minor parent at the sites of the recombination region; the difference is significant when the P value is less than 0.05. Whether a recombination region is a possible false positive can be judged from the P value. The recombination identification results, recombination regions, and P values for query lineage XE are shown in Table 2 below:
Based on the test results and P values, the major parent of the recombinant query lineage XE was BA.2, and BA.1 was the most likely minor parent; the recombination regions were located at positions 181 to 11291 and 19347 to 21323 of the aligned sequences, which is essentially consistent with previously reported results.
For the alignment file of 5288 polymorphic sites in this example, a sliding window of 200 polymorphic sites with a step size of 1 polymorphic site was used. The sliding window was divided evenly into two sub-windows, the degree of difference between the site recombination contribution values WIC of the two sub-windows was tested using the rank sum test with two-tailed probability, the P value was calculated, and then $-\log_{10}P$ was calculated. The distribution of each polymorphic site and its corresponding $-\log_{10}P$ value is shown in FIG. 6, where the higher the peak, i.e. the larger the $-\log_{10}P$ value, the smaller the P value and the higher the probability of a recombination breakpoint. In addition, the greater the probability that the region between peaks is a recombination region.
The $-\log_{10}P$ values of BA.1, BA.2 and Delta at sites near the recombination region 181 to 11291 are shown in the following table:
In the table above, the entries for recombinant query lineage XE:BA.2 and recombinant query lineage XE:BA.1 show that the highest peak for the recombination region 181 to 11291 is near position 11300, indicating that this position and nearby positions are the most likely recombination breakpoints.
The embodiment of the invention introduces information theory and Shannon information entropy into the identification of virus recombination, treats the fragment transfer event in the virus recombination as the transmission process of 'information', quantifies the recombination contribution and measures the occurrence probability of recombination by using the contribution value from the pedigree weighted information quantity, and can sensitively and efficiently identify the recombination event and recombination region of the new coronavirus existing among different highly similar pedigrees.
The invention is used for testing recombination identification among lineages, is not limited to the novel coronavirus, and is also applicable to other viruses.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (9)
1. A method for detecting recombination between new coronavirus lineages based on information theory, which comprises the following steps:
step 1, constructing a consensus sequence of the lineage sequence to be queried, and marking the most abundant base at each site of the consensus sequence and its corresponding proportion;
step 2, merging the pedigree sequence to be inquired and other reference pedigree sequences, and then performing multiple sequence comparison to obtain an alignment sequence file;
step 3, selecting whether to reserve gap sites in the alignment sequence file or not, and extracting polymorphic sites;
step 4, respectively calculating the recombination contribution value WIC of each reference pedigree sequence provided for the pedigree sequence to be inquired at each polymorphic site;
the recombination contribution WIC is equal to the ratio of the corresponding bases of the query pedigree sequence with the most content in the loci, multiplied by the ratio of the bases in the reference pedigree sequence, and multiplied by the information content of the bases of the loci in the reference pedigree sequence;
step 5, calculating the average recombination contribution value $\overline{WIC}$ of each reference lineage sequence over the sites within a sliding window, and detecting recombinant fragments from potential parents;
step 6, splicing continuous and discrete recombination fragments on each potential parent into a recombination region, and checking whether the recombination region is false positive;
step 7, searching potential recombination breakpoints at the boundaries of the recombination region by referring to polymorphic sites in the lineage sequence and the recombination contribution value WIC.
2. The method for detecting recombination between new coronavirus lineages according to claim 1, wherein the step 4 comprises:
extracting a sequence corresponding to each reference pedigree in the alignment sequence file according to the pedigree name;
Calculating the base information quantity IC of each site;
the weighted information amount WIC for each locus is calculated, which WIC represents the recombination contribution value.
4. The method for detecting recombination between new coronavirus lineages according to claim 1, wherein the step 5 comprises:
each reference lineage sequence is scanned using the same sliding window w and step size s, the average recombination contribution value $\overline{WIC}$ of all sites in each sliding window is calculated, and all points (X, S, $\overline{WIC}$) are plotted, where X represents the lineage name and S represents the position of the point on the aligned sequence in the sliding window; the sliding window with the largest average recombination contribution value $\overline{WIC}$ is labeled as a potential recombinant fragment, the lineage sequence of that sliding window is labeled as a potential parent, and (X, S, $\overline{WIC}$) is stored to F.
5. The method for detecting recombination between new coronavirus lineages according to claim 4, wherein the total size of recombination regions on each potential parent is calculated, and the lineage sequence with the largest cumulative area of recombination regions is labeled as the potential major parent and designated as Major (M).
6. The method for detecting recombination between new coronavirus lineages according to claim 4, wherein the step 6 comprises:
splicing contiguous or discrete recombinant fragments located on the same potential parent X into recombination regions, proofreading each recombination boundary during splicing, setting a maximum recombination region size, and generating for each potential parent X a recombination region R, denoted d(X, R).
7. The method for detecting recombination between new coronavirus lineages according to claim 6, which comprises:
Step 61, extracting from F the fragments that belong to the same lineage sequence, sorting them by increasing S value, and placing them in Z; the first fragment in Z is then marked as the initial fragment;
Step 62, the other fragments in Z, together with the first fragment, constitute a recombination region, which is characterized by its left boundary, its right boundary, and its size, with reference to the total site count of the aligned sequence file;
Step 63, calculating the recombination contribution of the lineage sequences within each recombination region, and judging whether the recombination contribution value of the lineage sequence is the maximum and greater than a specified threshold ratio.
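Steps 61 and 62 can be illustrated with a small Python sketch. The boundary formulas are not recoverable from this text, so the sketch simply assumes that a candidate region spans from the S value of the initial fragment to the S value of each later fragment. That span, the function name `candidate_regions`, and the lineage labels are assumptions made for illustration only.

```python
from collections import defaultdict

def candidate_regions(F):
    """Group fragments by lineage, sort by S, pair the first fragment with each later one.

    F: list of (lineage, S, average contribution) fragments.
    """
    by_lineage = defaultdict(list)
    for lineage, s_pos, avg in F:
        by_lineage[lineage].append((s_pos, avg))

    regions = {}
    for lineage, fragments in by_lineage.items():
        Z = sorted(fragments)  # Step 61: increasing S
        first = Z[0]           # the initial fragment
        # Step 62 (assumed form): one region per later fragment, (left, right)
        regions[lineage] = [(first[0], other[0]) for other in Z[1:]]
    return regions

F = [("X1", 40, 0.7), ("X2", 10, 0.6), ("X1", 5, 0.9), ("X1", 20, 0.8)]
regions = candidate_regions(F)
```

Each candidate region would then be scored as in Step 63 and kept only if the lineage's contribution is the maximum and exceeds the chosen threshold ratio.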
8. The method for detecting recombination between new coronavirus lineages based on information theory according to claim 7, wherein a Mann-Whitney rank-sum test with a two-tailed probability is used to test whether the site recombination contribution values WIC of the minor parent and of the major parent within each recombination region R differ significantly.
9. The method for detecting recombination between new coronavirus lineages based on the information theory as claimed in claim 1, wherein the method for obtaining recombination breakpoints in the step 7 comprises:
Step 71, running a sliding window over all potential parents, starting from the first polymorphic site; calculating, under the Mann-Whitney rank-sum test, the P value of the site recombination contribution values WIC of the regions on the two sides of the point S at the center of the window; and taking the negative base-10 logarithm of the P value, recorded as b (X, S,), where X denotes the lineage sequence name and S denotes the position on the aligned sequence of the point in the window;
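The breakpoint scan of Step 71 can be sketched in Python as below. The sketch assumes a two-sided Mann-Whitney test computed with the normal approximation (tie-corrected, with continuity correction) rather than an exact test, treats S as a 0-based index into the list of polymorphic sites, and uses a fixed half-window `half` on each side of S. The function names and the toy WIC values are hypothetical.

```python
import math

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney P value via the normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks, tie_term, i = [0.0] * len(pooled), 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2.0  # average rank of a tied group
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = max((abs(u1 - n1 * n2 / 2.0) - 0.5) / math.sqrt(var), 0.0)
    return min(math.erfc(z / math.sqrt(2)), 1.0)

def breakpoint_scores(wic, half):
    """Return (S, -log10 P) comparing the half-windows left and right of S."""
    scores = []
    for s_pos in range(half, len(wic) - half + 1):
        p = mann_whitney_p(wic[s_pos - half:s_pos], wic[s_pos:s_pos + half])
        scores.append((s_pos, -math.log10(max(p, 1e-300))))  # guard against log(0)
    return scores

# Toy WIC profile for one potential parent: low values, then high values.
wic = [0.1, 0.2, 0.15, 0.12, 0.18, 0.11, 0.9, 0.8, 0.85, 0.95, 0.88, 0.92]
scores = breakpoint_scores(wic, half=4)
best = max(scores, key=lambda t: t[1])
```

The peak of the score marks the candidate breakpoint; here it falls where the profile switches from low to high values. If SciPy is available, `scipy.stats.mannwhitneyu(x, y, alternative="two-sided")` could replace the hand-written test.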
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210665351.9A CN114743598B (en) | 2022-06-14 | 2022-06-14 | Method for detecting recombination among new coronavirus pedigrees based on information theory |
PCT/CN2022/137508 WO2023240947A1 (en) | 2022-06-14 | 2022-12-08 | Method for detecting recombination between sars-cov-2 lineages on the basis of information theory |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210665351.9A CN114743598B (en) | 2022-06-14 | 2022-06-14 | Method for detecting recombination among new coronavirus pedigrees based on information theory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114743598A | 2022-07-12 |
CN114743598B | 2022-09-02 |
Family
ID=82287387
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210665351.9A Active CN114743598B (en) | 2022-06-14 | 2022-06-14 | Method for detecting recombination among new coronavirus pedigrees based on information theory |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114743598B (en) |
WO (1) | WO2023240947A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114743598B (en) * | 2022-06-14 | 2022-09-02 | 湖南大学 | Method for detecting recombination among new coronavirus pedigrees based on information theory |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2004239716A1 (en) * | 2003-05-07 | 2004-11-25 | Novasite Pharmaceuticals, Inc. | Multiplexed multitarget screening method |
CN102367490A (en) * | 2008-12-12 | 2012-03-07 | 深圳华大基因科技有限公司 | Method for detecting viruses |
SG11201707909YA (en) * | 2015-04-02 | 2017-10-30 | Jackson Lab | Method for detecting genomic variations using circularised mate-pair library and shotgun sequencing |
WO2018204764A1 (en) * | 2017-05-05 | 2018-11-08 | Camp4 Therapeutics Corporation | Identification and targeted modulation of gene signaling networks |
KR20220036908A (en) * | 2018-08-28 | 2022-03-23 | 보르 바이오파마 인크. | Genetically engineered hematopoietic stem cells and uses thereof |
CN111117975A (en) * | 2020-01-03 | 2020-05-08 | 西南民族大学 | Bovine coronavirus HE gene recombinant strain, inactivated vaccine prepared from same and application of inactivated vaccine |
CN111004869B (en) * | 2020-02-08 | 2023-05-23 | 吉林大学 | Fluorescent quantitative PCR (polymerase chain reaction) primer and reference standard for identifying genetic evolutionary lineages of H1N1 subtype influenza viruses |
CN114743598B (en) * | 2022-06-14 | 2022-09-02 | 湖南大学 | Method for detecting recombination among new coronavirus pedigrees based on information theory |
2022
- 2022-06-14 CN CN202210665351.9A patent/CN114743598B/en active Active
- 2022-12-08 WO PCT/CN2022/137508 patent/WO2023240947A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2023240947A1 (en) | 2023-12-21 |
CN114743598A (en) | 2022-07-12 |
Similar Documents
Publication | Title |
---|---|
CN107423578B (en) | Device for detecting somatic cell mutation |
CN114743598B (en) | Method for detecting recombination among new coronavirus pedigrees based on information theory |
CN104794371B (en) | The method and apparatus for detecting retrotransponsons insertion polymorphism |
CN111326212B (en) | Structural variation detection method |
RU2014134175A (en) | METHOD AND SYSTEM FOR IDENTIFICATION OF VARIATION OF THE NUMBER OF COPIES IN THE GENOME |
CN113948151B (en) | Processing method of low-depth WGS (WGS) offline data |
CN108256292A (en) | A kind of copy number variation detection device |
KR101936933B1 (en) | Methods for detecting nucleic acid sequence variations and a device for detecting nucleic acid sequence variations using the same |
CN108595912B (en) | Method, device and system for detecting chromosome aneuploidy |
CN113113152A (en) | Disease data set sample acquisition processing method, system, device, processor and storage medium thereof for novel coronavirus pneumonia |
CN107451422A (en) | A kind of gene sequence data analysis and online interaction visualization method |
CN113205857B (en) | Method and device for identifying non-homologous regions of genomic chromosomes |
CN111696622B (en) | Method for correcting and evaluating detection result of mutation detection software |
CN113096737A (en) | Method and system for automatically analyzing pathogen types |
CN117275585A (en) | Method for constructing lung cancer early-screening model based on LP-WGS and DNA methylation and electronic equipment |
CN114758720B (en) | Method, apparatus and medium for detecting copy number variation |
Harrison et al. | A survey of glass fragments recovered from clothing of persons suspected of involvement in crime |
CN116825193A (en) | Method, device and storage medium for correcting mitochondrial genome sequencing mutation |
US20210130888A1 (en) | Method, apparatus, and system for detecting chromosome aneuploidy |
CN115856092A (en) | Method for determining rock crack initiation stress based on acoustic emission data and stress data |
CN110970089B (en) | Pretreatment method and pretreatment device for fetal concentration calculation and application of pretreatment device |
CN116209777A (en) | Genetic relationship judging method and device based on noninvasive prenatal gene detection data |
CN108733982B (en) | Pregnant woman NIPT result correction method and device, and computer-readable storage medium and equipment |
CN110533190B (en) | Data object analysis method and device based on machine learning |
CN114305442B (en) | Detection method of atrial fibrillation occurrence start and stop points based on sliding window coding |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |