CN114743598B - Method for detecting recombination among new coronavirus pedigrees based on information theory - Google Patents


Info

Publication number
CN114743598B
Authority
CN
China
Prior art keywords
recombination
sequence
pedigree
lineage
site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210665351.9A
Other languages
Chinese (zh)
Other versions
CN114743598A (en)
Inventor
葛行义
周秩建
叶生宝
邱烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202210665351.9A
Publication of CN114743598A
Application granted
Publication of CN114743598B
Priority to PCT/CN2022/137508 (WO2023240947A1)
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an information-theory-based method for detecting recombination among novel coronavirus lineages, comprising the following steps: constructing a consensus sequence of the query lineage sequences, and recording the most abundant base at each site together with its proportion; merging the query lineage sequences with the reference lineage sequences and performing multiple sequence alignment to obtain an alignment file; choosing whether to retain gap sites in the alignment file, and extracting the polymorphic sites; calculating, at each polymorphic site, the recombination contribution value that each reference lineage provides to the query lineage; calculating, within a sliding window, the average recombination contribution value of each lineage over the sites in the window, and detecting recombinant fragments from potential parents; splicing the continuous and discrete recombinant fragments on each potential parent into recombination regions, and checking whether each recombination region is a false positive; and searching for potential recombination breakpoints at the boundaries of the recombination regions with reference to the polymorphic sites and recombination contribution values of the lineage sequences.

Description

Method for detecting recombination among new coronavirus pedigrees based on information theory
Technical Field
The invention relates to the technical field of biological gene detection, and in particular to an information-theory-based method for detecting recombination between novel coronavirus lineages.
Background
The novel coronavirus (SARS-CoV-2), which is circulating worldwide, poses great challenges to public health security and social stability. It is an RNA virus; because of its long-term circulation in the human population, it evolves and mutates rapidly and continually produces new lineages. The Omicron and Delta lineage variants, currently designated by the World Health Organization (WHO) as variants requiring close attention, are still circulating worldwide, as are the previously prevalent Alpha, Beta, Gamma, Epsilon and other lineage variants that require attention.
Related research shows that SARS-CoV-2 variants from different lineages can recombine gene fragments and thereby generate new strains. Such recombination accelerates viral variation and can even confer new properties on the virus. For example, the XD variant arose by recombination between Delta sublineage AY.4 and Omicron sublineage BA.1, and the XE variant arose by recombination between Omicron sublineages BA.1 and BA.2; the community transmission advantage of the XE variant is about 10% higher than that of its parent BA.1. More recombinant strains are currently still being tracked.
For identifying viral recombination, two software packages, RDP and SimPlot, are generally used. However, because the genomic differences between different SARS-CoV-2 lineages, and between sublineages of the same lineage, are small, RDP and SimPlot have difficulty identifying recombination events among highly similar SARS-CoV-2 strains. Furthermore, although RDP has multiple built-in algorithms and programs, it is mainly used to detect recombination between individual sequences and is not suited to detecting inter-lineage recombination in which each lineage comprises multiple sequences. SimPlot can detect recombination for sequences in specified groups, but it mainly plots similarity based on the overall sequence similarity within a sliding window, which makes it difficult to identify recombination events among highly similar SARS-CoV-2 lineages.
Disclosure of Invention
The invention provides an information-theory-based method for detecting recombination among novel coronavirus lineages. It introduces information theory and Shannon entropy into the identification of viral recombination: the transfer of fragments during recombination is treated as a transmission of 'information', and a lineage-derived weighted information content is used as a contribution value to quantify the recombination contribution and measure the probability that recombination has occurred. The aim is to identify, sensitively and efficiently, recombination events and recombination regions among highly similar sequences of different novel coronavirus lineages.
To achieve the above object, the present invention provides an information-theory-based method for detecting recombination between novel coronavirus lineages, comprising:
step 1, constructing a consensus sequence of the query lineage sequences, and recording the most abundant base at each site of the consensus sequence together with its proportion;
step 2, merging the query lineage sequences with the reference lineage sequences and performing multiple sequence alignment to obtain an alignment file;
step 3, choosing whether to retain gap sites in the alignment file, and extracting the polymorphic sites;
step 4, calculating, at each polymorphic site, the recombination contribution value WIC that each reference lineage provides to the query lineage sequences;
the recombination contribution value WIC is equal to the proportion of the most abundant base at the site in the query lineage sequences, multiplied by the proportion of that same base at the site in the reference lineage sequences, multiplied by the base information content of the site in the reference lineage sequences;
step 5, calculating, within a sliding window, the average recombination contribution value WIC of each reference lineage over the sites in the window, and detecting recombinant fragments from potential parents;
step 6, splicing the continuous and discrete recombinant fragments on each potential parent into recombination regions, and checking whether each recombination region is a false positive;
step 7, searching for potential recombination breakpoints at the boundaries of the recombination regions with reference to the polymorphic sites and recombination contribution values WIC of the lineage sequences.
Wherein, step 4 includes:
extracting from the alignment file, by lineage name, the sequences corresponding to each reference lineage;
counting, at each site of each reference lineage, the base types x and their corresponding proportions $p_x$;
calculating the base information content IC of each site;
calculating the weighted information content WIC of each site, WIC being the recombination contribution value.
The base information content IC is
$IC = EIC - Ent$
$Ent = -\sum_{x} p_x \log_2 p_x$
where EIC is the expected total base information content and Ent is the Shannon entropy of the site.
Wherein, step 5 includes:
sliding along each reference lineage using the same window size w and step size s, calculating the average recombination contribution value $\overline{WIC}$ of all sites in each sliding window, and plotting all points (X, S, $\overline{WIC}$), where X is the lineage name and S is the position on the aligned sequence of the point representing the sliding window; the sliding window with the largest $\overline{WIC}$ is labeled a potential recombinant fragment, the lineage to which that window belongs is labeled a potential parent, and the fragment is recorded as [formula image 005] and stored in F.
The total size of the recombination regions on each potential parent is calculated, and the lineage with the largest cumulative recombination interval is labeled the potential major parent, denoted major (M).
Wherein, step 6 includes: splicing the continuous or discrete recombinant fragments present on the same minor parent X into recombination regions, proofreading each recombination boundary during splicing, and, with the maximum recombination region size set to [formula image 006], generating for each minor parent X the recombination regions R, denoted d(X, R).
Wherein, step 6 comprises the following steps:
step 61, extracting from F the fragments [formula image 005] belonging to the same lineage, sorting them by S value and adding them to Z, and marking the first fragment in Z as the initial fragment [formula image 007];
step 62, combining each of the other fragments [formula image 008] in Z with the initial fragment [formula image 007] to form a recombination region [formula image 009], whose left boundary is [formula image 010] and whose right boundary is [formula image 011], where All is the total number of sites in the alignment file; the size of the recombination region [formula image 009] is [formula image 012];
step 63, calculating the contribution value $\overline{WIC}$ of each lineage within each recombination region [formula image 009], and judging whether the contribution value $\overline{WIC}$ of the lineage is the largest and whether [formula image 013] is greater than the specified threshold ratio [formula image 014], which takes values from 0 to 1;
step 64, iterating steps 62 and 63 until the initial fragment [formula image 015] is the last fragment in Z.
The site recombination contribution values WIC of the minor parent and of the major parent within each recombination region R are tested for a significant difference using the two-tailed Mann-Whitney rank-sum test.
The method for obtaining the recombination breakpoints in step 7 comprises:
step 71, running a sliding window over all potential parents, starting from the first polymorphic site; for the point S at the center of the window, calculating the P value of the Mann-Whitney rank-sum test comparing the site recombination contribution values WIC of the regions on the two sides of S, taking the negative base-10 logarithm of the P value, and recording it as b(X, S, $-\log_{10}P$), where X is the lineage name and S is the position on the aligned sequence of the window's center point;
step 72, plotting all b(X, S, $-\log_{10}P$); the peak with the highest $-\log_{10}P$ value is the recombination breakpoint.
The scheme of the invention has the following beneficial effects:
Information theory and Shannon entropy are introduced into the identification of viral recombination; the transfer of fragments during recombination is treated as a transmission of 'information', and the lineage-derived weighted information content is used as a contribution value to quantify the recombination contribution and measure the probability that recombination has occurred, so that recombination events and recombination regions among highly similar sequences of different novel coronavirus lineages can be identified sensitively and efficiently.
Other advantages of the present invention will be described in detail in the detailed description that follows.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of the calculation of the site recombination contribution value WIC according to an embodiment of the present invention;
FIG. 3 is a distribution curve of the recombination contribution values WIC of all polymorphic sites in an embodiment of the present invention;
FIG. 4 (a) is a scatter plot of the recombination contribution values WIC of all polymorphic sites contributed by BA.1, taking the recombinant SARS-CoV-2 XE variant as an example; FIG. 4 (b) is the corresponding scatter plot for BA.2; FIG. 4 (c) is the corresponding scatter plot for Delta;
FIG. 5 shows the calculated average recombination contribution value $\overline{WIC}$ of the polymorphic sites within each sliding window for each lineage according to an embodiment of the present invention;
FIG. 6 (a) is a plot of $-\log_{10}P$ at each polymorphic site for BA.1, taking the recombinant SARS-CoV-2 XE variant as an example; FIG. 6 (b) is the corresponding plot for BA.2; FIG. 6 (c) is the corresponding plot for Delta.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted", "connected" and "connected" are to be understood broadly, for example, as being either a locked connection, a detachable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
To address the existing problems, the invention provides an information-theory-based method for detecting recombination among novel coronavirus lineages. It introduces information theory and Shannon entropy into the identification of viral recombination, treats the transfer of fragments during recombination as a transmission of 'information', and uses a lineage-derived weighted information content as a contribution value to quantify the recombination contribution and measure the probability that recombination has occurred.
As shown in FIG. 1, an embodiment of the present invention provides an information-theory-based method for detecting recombination between novel coronavirus lineages, comprising:
Step 1, constructing a consensus sequence of the query lineage sequences, and recording, for each site of the consensus sequence, the most abundant base $x_m$ and its proportion $p_q(x_m)$. This serves to filter variants when searching for recombination. If the query lineage contains only one sequence, the proportion $p_q(x_m)$ at every site is always equal to 1.
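As an illustration of step 1, the following is a minimal sketch rather than the patented implementation. It assumes the query lineage sequences are already of equal length (for example, already aligned to one another); the function name `consensus_with_proportions` and the handling of columns without a usable base are hypothetical choices made for the example.

```python
from collections import Counter

def consensus_with_proportions(seqs, alphabet="ACGT-"):
    """For each column, return (most abundant base, its proportion).

    Assumption: all sequences have equal length. Characters outside
    `alphabet` (e.g. N) are ignored when counting.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal length")
    result = []
    for column in zip(*seqs):
        counts = Counter(c.upper() for c in column if c.upper() in alphabet)
        if not counts:
            result.append(("N", 0.0))  # hypothetical fallback: no usable base
            continue
        base, n = counts.most_common(1)[0]
        result.append((base, n / sum(counts.values())))
    return result

# With a single query sequence every proportion is 1, as the text states.
single = consensus_with_proportions(["ACGT"])
```

With several query sequences the proportion drops below 1 wherever the lineage is not uniform, which is what later down-weights those sites.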
Step 2, merging the sequences of the different lineages and performing multiple sequence alignment to obtain an alignment file. The query lineage sequences and the reference lineage sequences are first labeled with their lineage names; the sequence files of the different lineages are then merged into one file, and multiple sequence alignment is performed on the merged file to obtain the alignment file. This is done by calling the MAFFT software with its automatic strategy selection (--auto). Alignment is also possible when each lineage contains only a single sequence.
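The labeling, merging and alignment of step 2 could be scripted roughly as follows. This is a sketch under assumptions: the helper names, the `lineage_` header prefix with an underscore separator, and the file paths are hypothetical; the only MAFFT behavior relied on is that `mafft --auto input` writes the alignment to standard output.

```python
import subprocess
from pathlib import Path

def label_and_merge(lineage_files, merged_path):
    """Prefix every FASTA header with its lineage name and merge the files.

    `lineage_files` maps a lineage name to a FASTA path (hypothetical layout).
    """
    with open(merged_path, "w") as out:
        for lineage, path in lineage_files.items():
            for line in Path(path).read_text().splitlines():
                if line.startswith(">"):
                    out.write(f">{lineage}_{line[1:]}\n")
                elif line.strip():
                    out.write(line.strip() + "\n")

def mafft_command(merged_path):
    """Command line for MAFFT with automatic strategy selection."""
    return ["mafft", "--auto", str(merged_path)]

def run_mafft(merged_path, aligned_path):
    """Run MAFFT (must be installed) and save its stdout as the alignment."""
    with open(aligned_path, "w") as out:
        subprocess.run(mafft_command(merged_path), stdout=out, check=True)
```

Writing the lineage name into the header keeps the grouping recoverable after alignment, so later steps can extract each lineage by name.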
Step 3, choosing whether to retain gap sites in the alignment file, and extracting the polymorphic sites. If the influence of gap sites on recombination is to be considered, the gap sites are retained. If the lineages are highly similar, only polymorphic sites are used to search for recombination, that is, the non-polymorphic (conserved) sites in the alignment file are deleted; if the lineages differ greatly, the conserved sites in the alignment file are retained, and both polymorphic and conserved sites are used for recombination identification.
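The site filtering of step 3 can be sketched as below. This is one possible reading, not the patent's code: a column is treated as polymorphic when more than one state occurs in it, a gap counts as a state only when gaps are retained, and ambiguity codes are not given special treatment. The function name is hypothetical.

```python
def polymorphic_columns(aligned, keep_gaps=False):
    """Return indices of alignment columns with more than one state.

    Assumptions: `aligned` holds equal-length strings; with keep_gaps=False
    any column containing '-' is dropped, otherwise '-' counts as a state.
    """
    keep = []
    for i, column in enumerate(zip(*aligned)):
        states = {c.upper() for c in column}
        if not keep_gaps and "-" in states:
            continue  # gap sites deleted
        if len(states) > 1:
            keep.append(i)  # polymorphic site
    return keep

example = ["ACGT-A", "ACCTAA", "ACGTAT"]
sites_no_gaps = polymorphic_columns(example)
sites_with_gaps = polymorphic_columns(example, keep_gaps=True)
```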
Step 4, calculating, at each polymorphic site, the recombination contribution value WIC (weighted information content) that each reference lineage provides to the query lineage sequences; the similarity between sites is identified from the weighted information content of each site in the alignment file. The calculation principle is as follows:
the sequences corresponding to each reference lineage are extracted from the alignment file by lineage name; the base types x at each site of each reference lineage and their corresponding proportions $p_x$ are counted; the base information content IC (information content) of each site is calculated; and the recombination contribution value WIC of each site is calculated.
Specifically, the base information content IC is
$IC = EIC - Ent$
$Ent = -\sum_{x} p_x \log_2 p_x$
where EIC is the expected total base information content and Ent is the Shannon entropy of the site.
If the gap sites of the alignment file are retained in step 3, the expected total base information content EIC of a site is equal to $\log_2 5$ (four bases plus the gap state).
If the gap sites of the alignment file are deleted in step 3, the expected total base information content EIC of a site is equal to $\log_2 4$, i.e. equal to 2.
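To make the information-content calculation concrete, here is a small sketch. It assumes that IC is the expected total EIC minus the Shannon entropy of the column, that EIC is log2(4) = 2 without gaps (as stated above), and that EIC is log2(5) when the gap is kept as a fifth state; the last point and the function name are assumptions of this example.

```python
import math
from collections import Counter

def information_content(column, keep_gaps=False):
    """Return (IC, proportions) for one site of one reference lineage.

    IC = EIC - Ent, with Ent the Shannon entropy (base 2) of the observed
    base proportions. Characters outside the state set are ignored.
    """
    states = "ACGT-" if keep_gaps else "ACGT"
    counts = Counter(c.upper() for c in column if c.upper() in states)
    total = sum(counts.values())
    if total == 0:
        return 0.0, {}  # hypothetical fallback for an empty column
    props = {x: n / total for x, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in props.values())
    eic = math.log2(len(states))
    return eic - entropy, props

ic_uniform, _ = information_content("AAAA")  # fully conserved within lineage
ic_split, _ = information_content("AACC")    # two bases at 50% each
```

A site that is uniform within the lineage carries the full 2 bits, while a site split evenly between two bases carries 1 bit, so noisy sites contribute less.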
In particular, the recombination contribution value WIC is equal to the proportion $p_q(x_m)$ of the base $x_m$ that is most abundant at the site in the query lineage sequences, multiplied by the proportion $p_r(x_m)$ of base $x_m$ at the site in the reference lineage sequences, and then multiplied by the base information content IC of the site in the reference lineage sequences, i.e.
$WIC = p_q(x_m) \times p_r(x_m) \times IC$
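The product described above can be written out for a single site as follows. This is a sketch assuming gap sites were deleted (so the expected total is log2(4) = 2 by default); the function name, the `n_states` parameter and the toy columns are hypothetical.

```python
import math
from collections import Counter

def site_wic(query_base, query_prop, ref_column, n_states=4):
    """WIC at one site: query proportion x reference proportion x IC.

    `query_base` is the most abundant base of the query lineage at the site,
    `query_prop` its proportion there, and `ref_column` the (non-empty)
    column of the reference lineage at the same site.
    """
    counts = Counter(c.upper() for c in ref_column)
    total = sum(counts.values())
    props = {x: n / total for x, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in props.values())
    ic = math.log2(n_states) - entropy
    return query_prop * props.get(query_base, 0.0) * ic

# Toy values: the query carries A; one reference lineage is all A, another all G.
wic_matching = site_wic("A", 1.0, "AAAA")
wic_mismatch = site_wic("A", 1.0, "GGGG")
```

A reference lineage that uniformly carries the query's base reaches the maximum of 2, and one that never carries it contributes 0, which is what lets the per-site values separate the candidate parents.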
Step 5, calculating, within a sliding window, the average recombination contribution value $\overline{WIC}$ of each lineage over the sites in the window, and detecting recombinant fragments from potential parents;
specifically, each reference lineage is scanned using the same window size w and step size s, the average recombination contribution value $\overline{WIC}$ of all sites in each sliding window is calculated, and all points (X, S, $\overline{WIC}$) are plotted, where X is the lineage name and S is the position on the aligned sequence of the point representing the sliding window; the sliding window with the largest $\overline{WIC}$ is labeled a potential recombinant fragment, the lineage to which that window belongs is labeled a potential parent, and the fragment is recorded as [formula image 005] and stored in F. It should be noted that a larger window size reduces interference but also reduces the sensitivity of the analysis, while a smaller window size increases sensitivity but also increases the probability of false positives.
The total size of the recombination regions on each potential parent is calculated, and the lineage with the largest cumulative recombination interval is labeled the potential major parent, denoted major (M).
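The window scan of step 5 might look like the sketch below. It rests on assumptions not stated in the text: the window position S is taken as the window's center index, a final shorter window is included at the end of the sequence, and for each window the lineage with the highest mean is taken as the potential parent. Function names are hypothetical.

```python
def window_means(wic_values, w=100, s=20):
    """Mean WIC per sliding window as a list of (center_index, mean)."""
    out = []
    n = len(wic_values)
    start = 0
    while start < n:
        chunk = wic_values[start:start + w]
        out.append((start + len(chunk) // 2, sum(chunk) / len(chunk)))
        if start + w >= n:
            break  # this window already reaches the last site
        start += s
    return out

def potential_fragments(means_by_lineage):
    """For each window, pick the lineage with the highest mean WIC."""
    lineages = list(means_by_lineage)
    n_windows = len(means_by_lineage[lineages[0]])
    frags = []
    for k in range(n_windows):
        best = max(lineages, key=lambda x: means_by_lineage[x][k][1])
        pos, mean = means_by_lineage[best][k]
        frags.append((best, pos, mean))
    return frags

n_windows_example = len(window_means([0.0] * 5288, w=100, s=20))
```

With 5288 sites, a window of 100 and a step of 20, this windowing convention yields 261 windows, which agrees with the count reported in the worked example of this document.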
Step 6, splicing the continuous and discrete recombinant fragments on each potential parent into recombination regions, and checking whether each recombination region is a false positive;
specifically, the multiple continuous or discrete recombinant fragments that may exist on the same minor parent X are spliced into recombination regions, each recombination boundary is proofread during splicing, and, with the maximum recombination region size set to [formula image 006], the recombination regions R of each minor parent X, denoted d(X, R), are generated by the following steps and algorithm:
step 61, extracting from F the fragments [formula image 005] belonging to the same lineage, sorting them by S value and adding them to Z, and marking the first fragment in Z as the initial fragment [formula image 007];
step 62, combining each of the other fragments [formula image 008] in Z with the initial fragment [formula image 007] to form a recombination region [formula image 009], whose left boundary is [formula image 010] and whose right boundary is [formula image 011], where All is the total number of sites in the alignment file; the size of the recombination region [formula image 009] is [formula image 012];
step 63, calculating the contribution value $\overline{WIC}$ of each lineage within each recombination region [formula image 009], and judging whether the contribution value $\overline{WIC}$ of the lineage is the largest and whether [formula image 013] is greater than the specified threshold ratio [formula image 014], which takes values from 0 to 1; according to the outcome, the recombination region is marked as [formula image 025] or [formula image 026];
I) if all [formula image 027] are [formula image 028], the fragment following [formula image 029] is set as the new initial fragment [formula image 015], and the procedure returns to step 62;
II) if a region marked [formula image 025] exists, then, under the condition [formula image 030], the region marked [formula image 025] with the largest [formula image 031] is sought and marked as [formula image 032], and is labeled d(X, [formula image 009]); the fragment following [formula image 032] is then set as the new initial fragment [formula image 015], and the procedure returns to step 62;
step 64, iterating steps 62 and 63 and storing all d(X, [formula image 009]) in D, until the initial fragment [formula image 015] is the last fragment in Z.
The site recombination contribution values WIC of the minor parent and of the major parent within each recombination region R are tested for a significant difference using the two-tailed Mann-Whitney rank-sum test; a p-value below 0.05 indicates a significant difference.
If D' is empty, it is reported that no possible recombination event was detected. Otherwise, the possible major parent M, all minor parents X and the recombination regions R are reported, and a statistical test value is provided for each recombination region.
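The significance check on a recombination region can be illustrated with a pure-Python two-sided Mann-Whitney test. This is a sketch using the normal approximation with tie and continuity corrections, which is an assumption of the example; the text does not say which variant of the test is used, and small samples would normally call for an exact test. The function name and toy values are hypothetical.

```python
import math

def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value (normal approximation)."""
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = n1 + n2
    rank_sum_a, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] is one group of ties
        avg_rank = (i + 1 + j) / 2.0    # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0                      # every value tied: no evidence
    z = max(abs(u1 - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    return math.erfc(z / math.sqrt(2.0))

# Toy region: the minor parent's site WIC values are clearly higher.
p_region = mann_whitney_p([2.0 - 0.01 * k for k in range(20)],
                          [0.01 * k for k in range(20)])
significant = p_region < 0.05
```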
Step 7, searching for potential recombination breakpoints at the boundaries of the recombination regions with reference to the polymorphic sites and recombination contribution values WIC of the lineage sequences. Because the recombination regions are spliced from recombinant fragments, their boundaries are approximate, and it is sometimes necessary to locate the potential recombination breakpoints near those boundaries; the breakpoints are therefore identified more precisely from the polymorphic sites and recombination contribution values WIC of the aligned sequences, using a variable window and a fixed step size s = 1.
Specifically, the steps and algorithm for obtaining the recombination breakpoints include:
step 71, running a sliding window over all potential parents, starting from the first polymorphic site; for the point S at the center of the window, calculating the P value of the Mann-Whitney rank-sum test comparing the site recombination contribution values WIC of the regions on the two sides of S, taking the negative base-10 logarithm of the P value, and recording it as b(X, S, $-\log_{10}P$), where X is the lineage name and S is the position on the aligned sequence of the window's center point;
step 72, plotting all b(X, S, $-\log_{10}P$); the peak with the highest $-\log_{10}P$ value is the recombination breakpoint.
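Steps 71 and 72 can be sketched as a scan with step 1 that compares the WIC values on the two sides of each candidate point. Several choices here are assumptions: a fixed half-window is used in place of the variable window mentioned in the text, the p-value comes from a normal-approximation Mann-Whitney test, and the function names are hypothetical.

```python
import math

def _mw_p(a, b):
    """Two-sided Mann-Whitney p-value, normal approximation (assumed variant)."""
    n1, n2 = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n, i, rank_sum_a, ties = n1 + n2, 0, 0.0, 0.0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i
        ties += t ** 3 - t
        in_a = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_a += (i + 1 + j) / 2.0 * in_a  # average rank of the tie group
        i = j
    var = n1 * n2 / 12.0 * ((n + 1) - ties / (n * (n - 1)))
    if var <= 0:
        return 1.0
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    z = max(abs(u - n1 * n2 / 2.0) - 0.5, 0.0) / math.sqrt(var)
    return math.erfc(z / math.sqrt(2.0))

def breakpoint_scores(wic, half_window=50):
    """Return (S, -log10 P) for every point S, comparing the two sides of S."""
    scores = []
    for s in range(1, len(wic)):
        left = wic[max(0, s - half_window):s]
        right = wic[s:s + half_window]
        p = max(_mw_p(left, right), 1e-300)  # guard against log10(0)
        scores.append((s, -math.log10(p)))
    return scores

# Toy profile for one lineage: WIC jumps from 0 to 2 at site index 60.
profile = [0.0] * 60 + [2.0] * 60
best_site, best_score = max(breakpoint_scores(profile), key=lambda t: t[1])
```

On the toy profile the highest peak falls at the index where the values change, which is the behavior the text relies on when it reads the breakpoint off the highest peak.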
This example demonstrates the invention on the novel coronavirus. Early-sampled Delta lineage and Omicron lineage sequences were downloaded from the GISAID database (https://www.gisaid.org/) as reference lineages: 93 Delta lineage strains, 75 Omicron BA.1 sublineage strains and 70 Omicron BA.2 sublineage strains. Two recently sampled XE recombinant strains were used as the recombination query lineage.
The Delta lineage, the Omicron BA.1 sublineage, the Omicron BA.2 sublineage and the query lineage XE were labeled by lineage, with the lineage name written in front of each sequence name. MAFFT was then called with the automatic (--auto) parameter to perform the alignment and generate the alignment file.
For the alignment file, the gap sites were deleted. Considering that the differences between SARS-CoV-2 lineages are small, this embodiment detects inter-lineage recombination using the recombination contribution values WIC of polymorphic sites only; the non-polymorphic sites in the aligned sequences were therefore deleted, leaving 5288 polymorphic sites in the alignment file.
Following the algorithm shown in FIG. 2, the recombination contribution values WIC contributed by each reference lineage to the query lineage XE were calculated at the 5288 polymorphic sites in the alignment file; the values for the first 29 polymorphic sites are shown in Table 1 below:
[Table 1, image 033]
The distribution curve of the recombination contribution values WIC of all polymorphic sites is shown in FIG. 3; the scatter plot of the recombination contribution values WIC of all polymorphic sites is shown in FIG. 4.
A sliding window was then run over the 5288 polymorphic sites of each lineage, with the window size set to 100 polymorphic sites and the step size set to 20 polymorphic sites, and the average recombination contribution value (mean WIC) of the sites in each window was calculated for each lineage. A total of 261 windows and their corresponding mean WIC values were obtained; the results are shown in FIG. 5. If, within a sliding window, the mean WIC of a lineage is larger than that of the major parent, the window is considered a potential recombinant fragment.
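The windowed averaging can be sketched as below. The sketch uses full windows only and reports, for each window, the reference lineage with the largest mean WIC; how the original implementation handles a trailing partial window or ties is not stated in the text, so those details are assumptions, as are the function names.

```python
def window_means(wic_values: list[float], window: int = 100, step: int = 20) -> list[tuple[int, float]]:
    """Return (start_index, mean WIC) for each full window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    out = []
    for start in range(0, len(wic_values) - window + 1, step):
        chunk = wic_values[start:start + window]
        out.append((start, sum(chunk) / window))
    return out


def leading_lineage_per_window(
    wic_by_lineage: dict[str, list[float]], window: int = 100, step: int = 20
) -> list[tuple[int, str]]:
    """For each window, name the reference lineage with the largest mean WIC."""
    means = {name: window_means(v, window, step) for name, v in wic_by_lineage.items()}
    n_windows = min(len(m) for m in means.values())
    result = []
    for i in range(n_windows):
        # Ties resolve to the first lineage in dictionary order (an assumption).
        best = max(means, key=lambda name: means[name][i][1])
        result.append((means[best][i][0], best))
    return result
```

Runs of windows led by a lineage other than the major parent are the candidates that the next step tries to splice into recombination regions.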
For the potential recombinant fragments of each lineage described above, the maximum recombination region range parameter was set to 1200 bp (number of polymorphic sites) and the threshold for the ratio of the mean WIC within a recombination region was set to 95%, in an attempt to splice the potential recombinant fragments into reasonable recombination regions. The lineage with the largest cumulative recombination region was judged to be the major parent, and the other lineages containing recombination regions were potential minor parents. These minor parents were then tested for significance in the corresponding recombination regions using the two-tailed Mann-Whitney rank sum test. The P value is calculated by comparing the recombination contribution values WIC of the major parent and the minor parent at the sites of the recombination region; a P value below 0.05 indicates a significant difference. The P value thus indicates whether a recombination region may be a false positive. The recombination identification results, recombination regions, and P values for the query lineage XE are shown in Table 2 below:
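[Table 2: recombination identification results, recombination regions, and P values for the query lineage XE; table image not reproduced]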
Based on the test results and P values, the major parent of the recombinant query lineage XE is BA.2 and BA.1 is the most likely minor parent. The recombination regions are located at positions 181 to 11291 and 19347 to 21323 of the aligned sequences, which is essentially consistent with previously reported results.
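The significance test applied to each recombination region can be illustrated with a self-contained two-tailed Mann-Whitney rank sum test. The text does not say which variant of the test was used, so this sketch assumes the normal approximation with tie and continuity corrections; the function name is hypothetical.

```python
import math


def mann_whitney_two_sided(x: list[float], y: list[float]) -> float:
    """Two-tailed P value of the Mann-Whitney rank sum test (normal approximation)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                    # size of this group of tied values
        avg_rank = (i + 1 + j) / 2   # mean of the ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0                   # all values identical: no evidence of a difference
    z = max((abs(u - mean_u) - 0.5) / math.sqrt(var_u), 0.0)  # continuity correction
    return math.erfc(z / math.sqrt(2))
```

Here `x` and `y` would be the per-site WIC values of the major parent and of a candidate minor parent within one recombination region; a returned value below 0.05 corresponds to the significance criterion stated above.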
To locate breakpoints, a sliding window was run over the alignment file of 5288 polymorphic sites in this example, with the window size set to 200 polymorphic sites and the step size set to 1 polymorphic site. Each window was divided evenly into two sub-windows, the difference between the site recombination contribution values WIC of the two sub-windows was tested using the two-tailed rank sum test to obtain a P value, and -log10(P) was then calculated. Plotting -log10(P) against each polymorphic site gives the distribution shown in FIG. 6. The higher the peak, that is, the larger the -log10(P) value and the smaller the P value, the higher the probability of a recombination breakpoint. In addition, the region between peaks is more likely to be a recombination region.
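The breakpoint scan can be sketched as follows. To keep the example short, the rank sum test is passed in as a callable `p_value` (for instance a wrapper around `scipy.stats.mannwhitneyu(left, right, alternative="two-sided").pvalue`); the function name, the 0-based center index, and the floor that guards against a P value of exactly zero are assumptions, not details from the patent.

```python
import math
from typing import Callable, Sequence


def breakpoint_scan(
    wic_values: Sequence[float],
    p_value: Callable[[Sequence[float], Sequence[float]], float],
    window: int = 200,
    floor: float = 1e-300,
) -> list[tuple[int, float]]:
    """Slide a window one site at a time and score the point at its center.

    Returns (center_index, -log10 P) pairs, where P compares the WIC values
    in the left and right halves of the window.
    """
    if window < 2 or window % 2:
        raise ValueError("window must be an even number >= 2")
    half = window // 2
    scores = []
    for start in range(0, len(wic_values) - window + 1):
        center = start + half
        left = wic_values[start:center]
        right = wic_values[center:start + window]
        p = max(p_value(left, right), floor)  # guard against log10(0)
        scores.append((center, -math.log10(p)))
    return scores
```

Peaks in the returned scores mark sites where the WIC values on the two sides of the window center differ most strongly, which is the criterion the text uses for candidate breakpoints.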
The -log10(P) values of BA.1, BA.2, and Delta at sites near the recombination region 181 to 11291 are shown in the following table:
[Table: -log10(P) values of BA.1, BA.2, and Delta at sites near the recombination region 181 to 11291; table image not reproduced]
In the table above, the values for the recombinant query lineage XE against BA.2 and against BA.1 show that the highest peak for the recombination region 181 to 11291 lies near position 11300, indicating that this position and nearby positions are the most likely recombination breakpoints.
The embodiment of the invention introduces information theory and Shannon information entropy into the identification of virus recombination. It treats the fragment transfer event in virus recombination as a process of "information" transmission, quantifies the recombination contribution as a lineage-weighted information content, and uses this contribution value to measure the probability that recombination occurred. It can thereby sensitively and efficiently identify recombination events and recombination regions of the novel coronavirus among different, highly similar lineages.
The invention is intended for identifying recombination among lineages; it is not limited to the novel coronavirus and is also applicable to other viruses.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A method for detecting recombination between new coronavirus lineages based on information theory, comprising the following steps:
step 1, constructing a consensus sequence of the query lineage sequences, and recording, for each site of the consensus sequence, the most abundant base and its proportion;
step 2, merging the query lineage sequences with the other reference lineage sequences, and then performing multiple sequence alignment to obtain an alignment sequence file;
step 3, choosing whether to retain gap sites in the alignment sequence file, and extracting polymorphic sites;
step 4, calculating, at each polymorphic site, the recombination contribution value WIC that each reference lineage sequence provides to the query lineage sequence;
the recombination contribution value WIC is equal to the proportion of the most abundant base at the site in the query lineage sequence, multiplied by the proportion of that base in the reference lineage sequence, multiplied by the information content of the bases at that site in the reference lineage sequence;
step 5, calculating the average recombination contribution value (mean WIC) of each reference lineage sequence over the sites within a sliding window, and detecting recombinant fragments from potential parents;
step 6, splicing the continuous and discrete recombinant fragments on each potential parent into recombination regions, and checking whether each recombination region is a false positive;
step 7, searching for potential recombination breakpoints at the boundaries of the recombination regions with reference to the polymorphic sites in the lineage sequences and the recombination contribution value WIC.
2. The method for detecting recombination between new coronavirus lineages according to claim 1, wherein the step 4 comprises:
extracting the sequences corresponding to each reference lineage from the alignment sequence file according to the lineage name;
counting the base types x at each site of each reference lineage sequence and their corresponding proportions [symbol image not reproduced];
calculating the base information content IC of each site;
calculating the weighted information content WIC of each site, the WIC representing the recombination contribution value.
3. The method for detecting recombination between new coronavirus lineages according to claim 2, wherein the base information content IC is given by:
[formula images not reproduced]
where EIC is the total expected base information content and Ent is the Shannon information entropy at the base site.
4. The method for detecting recombination between new coronavirus lineages according to claim 1, wherein the step 5 comprises:
each reference lineage sequence is scanned using the same sliding window w and step size s, the average recombination contribution value (mean WIC) of all sites in each sliding window is calculated, and all (X, S, mean WIC) points are plotted, where X represents the lineage name and S represents the position on the aligned sequence of the point in the sliding window; the sliding window with the largest mean WIC is labeled as a potential recombinant fragment, the lineage sequence within that sliding window is labeled as a potential parent, and the fragment, denoted [symbol image not reproduced], is stored in F.
5. The method for detecting recombination between new coronavirus lineages according to claim 4, wherein the total size of recombination regions on each potential parent is calculated, and the lineage sequence with the largest cumulative area of recombination regions is labeled as the potential major parent and designated as Major (M).
6. The method for detecting recombination between new coronavirus lineages according to claim 4, wherein the step 6 comprises:
splicing the continuous or discrete recombinant fragments existing on the same potential parent X into recombination regions, proofreading each recombination boundary during splicing, and setting the maximum recombination region range [symbol image not reproduced]; a recombination region R, denoted d(X, R), is generated for each potential parent X.
7. The method for detecting recombination between new coronavirus lineages according to claim 6, which comprises:
step 61, extracting from F the fragments [symbol image not reproduced] of the same lineage sequence, sorting them by increasing S value, and placing them in Z; then marking the first fragment in Z as the initial fragment [symbol image not reproduced];
step 62, combining each other fragment [symbol image not reproduced] in Z with the initial fragment to form a recombination region [symbol image not reproduced], with its left boundary [formula image not reproduced], right boundary [formula image not reproduced], and size [formula image not reproduced], where All is the total number of sites in the alignment sequence file;
step 63, calculating the average recombination contribution value (mean WIC) of the lineage sequences within each recombination region, and judging whether the mean WIC of the lineage sequence is the largest and greater than the specified threshold ratio [symbol and formula images not reproduced];
step 64, iterating steps 62 and 63 until the initial fragment [symbol image not reproduced] is the last fragment in Z.
8. The method for detecting recombination between new coronavirus lineages based on information theory according to claim 7, wherein the Mann-Whitney rank sum test with two-tailed probability is used to test whether the site recombination contribution values WIC of each recombination region R from the minor parent and from the major parent are significantly different.
9. The method for detecting recombination between new coronavirus lineages based on the information theory as claimed in claim 1, wherein the method for obtaining recombination breakpoints in the step 7 comprises:
step 71, running a sliding window over all potential parents, starting from the first polymorphic site; calculating, under the Mann-Whitney rank sum test, the P value of the site recombination contribution values WIC of the regions on the two sides of the point S at the center of the window; and taking the negative base-10 logarithm of the P value, recorded as b(X, S, -log10(P)), where X represents the lineage sequence name and S represents the position on the aligned sequence of the point in the window;
step 72, plotting all b(X, S, -log10(P)) points, the peak with the highest -log10(P) value corresponding to the recombination breakpoint.
CN202210665351.9A 2022-06-14 2022-06-14 Method for detecting recombination among new coronavirus pedigrees based on information theory Active CN114743598B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210665351.9A CN114743598B (en) 2022-06-14 2022-06-14 Method for detecting recombination among new coronavirus pedigrees based on information theory
PCT/CN2022/137508 WO2023240947A1 (en) 2022-06-14 2022-12-08 Method for detecting recombination between sars-cov-2 lineages on the basis of information theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210665351.9A CN114743598B (en) 2022-06-14 2022-06-14 Method for detecting recombination among new coronavirus pedigrees based on information theory

Publications (2)

Publication Number Publication Date
CN114743598A CN114743598A (en) 2022-07-12
CN114743598B true CN114743598B (en) 2022-09-02

Family

ID=82287387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665351.9A Active CN114743598B (en) 2022-06-14 2022-06-14 Method for detecting recombination among new coronavirus pedigrees based on information theory

Country Status (2)

Country Link
CN (1) CN114743598B (en)
WO (1) WO2023240947A1 (en)

Also Published As

Publication number Publication date
WO2023240947A1 (en) 2023-12-21
CN114743598A (en) 2022-07-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant