CN114743598B

CN114743598B - Method for detecting recombination among novel coronavirus lineages based on information theory

Info

Publication number: CN114743598B
Application number: CN202210665351.9A
Authority: CN
Inventors: 葛行义; 周秩建; 叶生宝; 邱烨
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2022-06-14
Filing date: 2022-06-14
Publication date: 2022-09-02
Anticipated expiration: 2042-06-14
Also published as: WO2023240947A1; CN114743598A

Abstract

The invention provides a method for detecting recombination among novel coronavirus (SARS-CoV-2) lineages based on information theory, comprising the following steps: constructing a consensus sequence of the lineage sequences to be queried, and marking the most abundant base at each site together with its proportion; merging the query lineage sequences with the other reference lineage sequences and performing multiple sequence alignment to obtain an alignment sequence file; choosing whether to retain gap sites in the alignment sequence file, and extracting polymorphic sites; calculating, at each polymorphic site, the recombination contribution value that each reference lineage provides to the query lineage; calculating the average recombination contribution value of each lineage over the sites within a sliding window, and detecting recombination fragments from potential parents; splicing the continuous and discrete recombination fragments on each potential parent into recombination regions, and checking whether each recombination region is a false positive; and searching for potential recombination breakpoints at the boundaries of the recombination regions by reference to the polymorphic sites and recombination contribution values in the lineage sequences.

Description

Method for detecting recombination among novel coronavirus lineages based on information theory

Technical Field

The invention relates to the technical field of biological gene detection, and in particular to a method for detecting recombination between novel coronavirus lineages based on information theory.

Background

The novel coronavirus (SARS-CoV-2), which is circulating worldwide, poses great challenges to public health security and social stability. SARS-CoV-2 is an RNA virus; because of its long-term circulation in the human population, it evolves and mutates rapidly and continually generates new lineages. The Omicron and Delta lineage variants, currently designated by the World Health Organization (WHO) as requiring close attention, remain prevalent worldwide, as do previously prevalent variants requiring attention such as the Alpha, Beta, Gamma, and Epsilon lineages.

Related research shows that SARS-CoV-2 variants from different lineages undergo recombination of gene fragments, thereby generating new strains. Such recombination accelerates viral variation and can even confer new properties on the virus. For example, the XD variant arose from recombination between Delta subtype AY.4 and Omicron subtype BA.1, the XE variant arose from recombination between Omicron subtypes BA.1 and BA.2, and the XE variant has a community transmission advantage of about 10% over its parent BA.1. More recombinant strains are still being tracked.

Two software packages, RDP and Simplot, are generally used to identify recombination in viruses. However, because the genomic differences between different SARS-CoV-2 lineages, and between different subtypes of the same lineage, are small, RDP and Simplot have difficulty identifying recombination events among highly similar SARS-CoV-2 strains. Furthermore, although RDP has multiple built-in algorithms and programs, it is mainly used to detect recombination between individual sequences and is not suited to detecting inter-lineage recombination in which each lineage comprises multiple sequences. Simplot can detect recombination for given grouped sequences, but it mainly plots similarity based on the overall sequence similarity within a sliding window, which makes it difficult to identify recombination events among highly similar SARS-CoV-2 lineages.

Disclosure of Invention

The invention provides a method for detecting recombination among novel coronavirus lineages based on information theory. It introduces information theory and Shannon information entropy into the identification of virus recombination, treats the fragment transfer event in virus recombination as a process of transmitting "information", and uses a contribution value derived from the lineage-weighted information content to quantify the recombination contribution and measure the probability that recombination occurred, so that recombination events and recombination regions among different, highly similar SARS-CoV-2 lineage sequences can be identified sensitively and efficiently.

To achieve the above object, the present invention provides a method for detecting recombination between novel coronavirus lineages based on information theory, comprising:

step 1, constructing a consensus sequence of the lineage sequences to be queried, and marking the most abundant base at each site of the consensus sequence together with its corresponding proportion;

step 2, merging the query lineage sequences with the other reference lineage sequences and then performing multiple sequence alignment to obtain an alignment sequence file;

step 3, choosing whether to retain gap sites in the alignment sequence file, and extracting polymorphic sites;

step 4, calculating, at each polymorphic site, the recombination contribution value WIC that each reference lineage provides to the query lineage sequence;

the recombination contribution value WIC is equal to the proportion of the base that is most abundant at the site in the query lineage sequences, multiplied by the proportion of that base at the site in the reference lineage sequences, and multiplied by the base information content of the site in the reference lineage sequences;

step 5, calculating the average recombination contribution value WIC of each reference lineage sequence over the sites within a sliding window, and detecting recombination fragments from potential parents;

step 6, splicing the continuous and discrete recombination fragments on each potential parent into recombination regions, and checking whether each recombination region is a false positive;

step 7, searching for potential recombination breakpoints at the recombination region boundaries by reference to the polymorphic sites in the lineage sequences and the recombination contribution values WIC.

Wherein, step 4 comprises:

extracting the sequences corresponding to each reference lineage from the alignment sequence file according to the lineage name;

counting the base types x at each site of each reference lineage sequence and their corresponding proportions $p(x)$;

calculating the base information content IC of each site;

calculating the weighted information content WIC of each site, where WIC represents the recombination contribution value.

The base information content IC is

$$IC = EIC - Ent, \qquad Ent = -\sum_{x} p(x)\log_2 p(x)$$

wherein EIC is the expected total base information content and Ent is the Shannon information entropy of the base site.

Wherein, step 5 comprises:

sliding over each reference lineage sequence with the same sliding window w and step size s, calculating the average recombination contribution value of all sites in each sliding window, and plotting all points (X, S, average recombination contribution value), where X represents the lineage name and S represents the position on the aligned sequence of the point in the sliding window; the sliding window with the largest average recombination contribution value is labeled as a potential recombination fragment, the lineage sequence in that sliding window is labeled as a potential parent, and the corresponding record is stored to F.

The total size of the recombination regions on each potential parent is calculated, and the lineage sequence with the largest cumulative recombination region is labeled as the potential major parent, designated Major (M).

Wherein, step 6 comprises: splicing the continuous or discrete recombination fragments existing on the same minor parent X into recombination regions, proofreading each recombination boundary during splicing, and setting a maximum recombination region size; a recombination region R, denoted d(X, R), is generated for each minor parent X.

Wherein, step 6 comprises the following steps:

step 61, extracting from F the fragments that belong to the same lineage sequence, sorting them by S value, adding them to Z, and marking the first fragment in Z as the initial fragment;

step 62, combining each of the other fragments in Z with the initial fragment to constitute a candidate recombination region, and determining the left boundary, the right boundary, and the size of each candidate region, where All is the total site count of the alignment sequence file;

step 63, calculating the contribution value of the lineage sequence within each candidate recombination region, and judging whether that contribution value is the maximum and is greater than a specified threshold ratio, the threshold ratio taking a value between 0 and 1;

step 64, iterating steps 62 and 63 until the initial fragment is the last fragment in Z.

The site recombination contribution values WIC of the minor parent and the major parent in each recombination region R are tested for a significant difference using the two-tailed Mann-Whitney rank sum test.

The method for obtaining the recombination breakpoint in step 7 comprises:

step 71, running a sliding window over all potential parents, starting from the first polymorphic site; calculating, under the Mann-Whitney rank sum test, the P value for the site recombination contribution values WIC of the regions on the two sides of the point S at the center of the window; and taking the negative base-10 logarithm of the P value, recorded as b(X, S, $-\log_{10} P$), where X represents the lineage sequence name and S represents the position on the aligned sequence of the point in the window;

step 72, plotting all b(X, S, $-\log_{10} P$); the peak with the highest $-\log_{10} P$ value corresponds to the recombination breakpoint.

The scheme of the invention has the following beneficial effects:

Information theory and Shannon information entropy are introduced into the identification of virus recombination; the fragment transfer event in virus recombination is treated as a process of transmitting "information"; and the contribution value derived from the lineage-weighted information content is used to quantify the recombination contribution and measure the probability that recombination occurred, so that recombination events and recombination regions among different, highly similar SARS-CoV-2 lineages can be identified sensitively and efficiently.

Other advantages of the present invention will be described in detail in the detailed description that follows.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention;

FIG. 2 is a schematic diagram of the calculation of the site recombination contribution value WIC in an embodiment of the present invention;

FIG. 3 is a graph showing the distribution of the recombination contribution values WIC of all polymorphic sites in an embodiment of the present invention;

FIG. 4 (a) is a scatter plot of the recombination contribution values WIC of all polymorphic sites from BA.1, taking recombination of the SARS-CoV-2 XE variant as an example; FIG. 4 (b) is the corresponding scatter plot for BA.2; FIG. 4 (c) is the corresponding scatter plot for Delta;

FIG. 5 is a graph showing the calculated average recombination contribution value of the polymorphic sites within the sliding window for each lineage sequence in an embodiment of the present invention;

FIG. 6 (a) is a distribution graph of $-\log_{10} P$ at each polymorphic site for BA.1, taking recombination of the SARS-CoV-2 XE variant as an example; FIG. 6 (b) is the corresponding distribution graph for BA.2; FIG. 6 (c) is the corresponding distribution graph for Delta.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted", "joined" and "connected" are to be understood broadly; for example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.

In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

To address the existing problems, the invention provides a method for detecting recombination among novel coronavirus lineages based on information theory, which introduces information theory and Shannon information entropy into the identification of virus recombination, treats the fragment transfer event in virus recombination as a process of transmitting "information", and uses the contribution value derived from the lineage-weighted information content to quantify the recombination contribution and measure the probability that recombination occurred.

As shown in fig. 1, an embodiment of the present invention provides a method for detecting recombination between new coronavirus lineages based on information theory, comprising:

step 1, constructing a consensus sequence of the lineage sequences to be queried, and marking the most abundant base at each site of the consensus sequence together with its corresponding proportion; variants are filtered for the recombinant search, and if the query lineage contains only one sequence, the corresponding proportion at each site is constantly equal to 1.
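Step 1 can be illustrated with a minimal sketch. It assumes the query lineage sequences are already aligned to equal length; the function name `consensus_with_proportions` and the toy sequences are hypothetical and are not taken from the patent.

```python
from collections import Counter


def consensus_with_proportions(sequences):
    """Return (consensus, proportions) for equal-length query sequences.

    proportions[i] is the fraction of sequences that carry the most
    abundant base at site i.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must have equal length (assumed pre-aligned)")

    consensus, proportions = [], []
    for column in zip(*sequences):
        # Most abundant base at this site and how often it occurs.
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        proportions.append(count / len(sequences))
    return "".join(consensus), proportions


# Toy query lineage with three sequences (illustrative data only).
cons, props = consensus_with_proportions(["ACGT", "ACGA", "ACTA"])
print(cons, props)
```

With a single query sequence every proportion is 1, which matches the special case stated in step 1.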

Step 2, merging the sequences of the different lineages and then performing multiple sequence alignment to obtain an alignment sequence file. The lineage sequences to be queried and the other reference lineage sequences are labeled by lineage, the sequence files of the different lineages are merged into one file, and multiple sequence alignment is performed on the merged file to obtain the alignment sequence file. This is done by calling the MAFFT software with its automatic parameter (--auto); the alignment can also be performed when each lineage has only a single sequence.
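The labeling and merging in step 2 might look like the sketch below. The header format `>lineage_name`, the underscore separator, and the helper names are assumptions made for illustration; the MAFFT command is only constructed here, not executed.

```python
def label_fasta(fasta_text, lineage):
    """Prefix every FASTA header with the lineage name, e.g. '>BA.1_seq1'.

    The '_' separator is an assumption; the source only says the lineage
    name is written in front of each sequence name.
    """
    lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            lines.append(f">{lineage}_{line[1:].strip()}")
        elif line.strip():
            lines.append(line.strip())
    return "\n".join(lines) + "\n"


def merge_lineages(lineage_to_fasta):
    """Concatenate the labeled FASTA text of all lineages into one file body."""
    return "".join(label_fasta(text, name) for name, text in lineage_to_fasta.items())


def mafft_command(merged_path):
    """Argument list for `mafft --auto <input>`; MAFFT writes the alignment to stdout."""
    return ["mafft", "--auto", merged_path]


merged = merge_lineages({"XE": ">s1\nACGT\n", "BA.1": ">s2\nACGA\n"})
print(merged)
print(" ".join(mafft_command("merged.fasta")))
```

The argument list can be passed to `subprocess.run` with stdout redirected to the alignment file when MAFFT is installed.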

Step 3, choosing whether to retain gap sites in the alignment sequence file, and extracting polymorphic sites. If the influence of gap sites on recombination is to be considered, the gap sites are retained. If the lineages are highly similar, only polymorphic sites are used to search for recombination, that is, the non-polymorphic (conserved) sites in the alignment sequence file are deleted; if the lineages differ greatly, the conserved sites in the alignment sequence file are retained, and both conserved and polymorphic sites are used for recombination identification.
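A minimal sketch of the site selection in step 3 follows. It assumes that "deleting gap sites" means dropping every alignment column that contains a gap character `-` in any sequence, and that a site is polymorphic when more than one character occurs in its column; the function name is hypothetical.

```python
def polymorphic_sites(alignment, keep_gaps=False):
    """Return the 0-based column indices of polymorphic sites.

    alignment: list of equal-length aligned sequences.
    keep_gaps: if False, columns containing '-' are discarded first
               (assumed meaning of deleting gap sites).
    """
    sites = []
    for i in range(len(alignment[0])):
        column = [seq[i].upper() for seq in alignment]
        if not keep_gaps and "-" in column:
            continue  # gap column removed before polymorphism is assessed
        if len(set(column)) > 1:
            sites.append(i)
    return sites


aln = ["ACGT-A", "ACTTAA", "ACGTAC"]  # toy alignment, illustrative only
print(polymorphic_sites(aln))                  # gap columns dropped
print(polymorphic_sites(aln, keep_gaps=True))  # gap treated as a fifth state
```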

Step 4, calculating, at each polymorphic site, the recombination contribution value WIC (weighted information content) that each reference lineage provides to the query lineage sequence; the similarity between sites is identified from the recombination contribution value, which is based on the weighted information content of each site in the alignment sequence file. The calculation principle is as follows:

the sequences corresponding to each reference lineage are extracted from the alignment sequence file according to the lineage name; the base types x at each site of each reference lineage sequence and their corresponding proportions are counted; the base information content IC of each site is calculated; and the recombination contribution value WIC of each site is calculated.

Specifically, the base information content IC is

$$IC = EIC - Ent, \qquad Ent = -\sum_{x} p(x)\log_2 p(x)$$

where $p(x)$ is the proportion of base x at the site. If step 3 retains the gap sites of the alignment sequence file, the expected total base information content EIC of the site is $\log_2 5$; if step 3 deletes the gap sites of the alignment sequence file, the EIC of the site is $\log_2 4$, i.e. equal to 2.
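The information content of one site can be sketched as follows, assuming IC is the expected total EIC minus the Shannon entropy Ent in bits, with EIC taken as log2(4) = 2 when gap sites are deleted and log2(5) when gaps are kept as a fifth state. The function name is hypothetical.

```python
import math
from collections import Counter


def information_content(column, keep_gaps=False):
    """IC = EIC - Ent for one alignment column of a reference lineage.

    column: the bases of all sequences of the lineage at one site.
    """
    eic = math.log2(5 if keep_gaps else 4)  # expected total information
    counts = Counter(column)
    total = sum(counts.values())
    # Shannon entropy of the observed base proportions, in bits.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return eic - entropy


print(information_content("AAAA"))  # fully conserved site
print(information_content("AACC"))  # two bases in equal proportion
print(information_content("ACGT"))  # four bases in equal proportion
```

A fully conserved site carries the full 2 bits, and a site where all four bases are equally frequent carries none, so IC down-weights sites that are noisy within the reference lineage.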

In particular, the recombination contribution value WIC is equal to the proportion $p_q(b)$ of the base $b$ that is most abundant at the site in the query lineage sequences, multiplied by the proportion $p_r(b)$ of base $b$ at the site in the reference lineage sequences, and then multiplied by the base information content IC of the site in the reference lineage sequences, i.e.

$$WIC = p_q(b) \cdot p_r(b) \cdot IC$$
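The product described above can be sketched for a single site as below. The sketch assumes the reading that the reference proportion is taken for the same base that is most abundant in the query lineage, and that IC is EIC minus the Shannon entropy of the reference column; the name `site_wic` is hypothetical.

```python
import math
from collections import Counter


def site_wic(query_column, reference_column, keep_gaps=False):
    """Recombination contribution of one reference lineage at one site."""
    # Most abundant base in the query lineage and its proportion.
    base, q_count = Counter(query_column).most_common(1)[0]
    p_query = q_count / len(query_column)

    # Proportion of that same base in the reference lineage.
    ref_counts = Counter(reference_column)
    n_ref = len(reference_column)
    p_ref = ref_counts[base] / n_ref

    # Information content of the site in the reference lineage.
    eic = math.log2(5 if keep_gaps else 4)
    entropy = -sum((c / n_ref) * math.log2(c / n_ref) for c in ref_counts.values())
    return p_query * p_ref * (eic - entropy)


print(site_wic("AAAA", "AAAA"))  # reference agrees and is conserved
print(site_wic("AAAA", "CCCC"))  # reference never carries the query base
print(site_wic("AAAC", "AACC"))  # partial agreement, noisy reference site
```

A reference lineage scores highest where it is both conserved and identical to the query consensus, which is the signal a parental lineage is expected to leave.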

Step 5, calculating the average recombination contribution value of each lineage sequence over the sites within a sliding window, and detecting recombination fragments from potential parents;

specifically, each reference lineage sequence is traversed with the same sliding window w and step size s, the average recombination contribution value of all sites in each sliding window is calculated, and all records (X, S, average recombination contribution value) are stored to F. Note that a larger window reduces interference but also reduces the sensitivity of the analysis, while a smaller window increases sensitivity but also increases the probability of false positives.
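The window averaging in step 5 can be sketched as below. The sketch assumes that S is the index at the centre of each window, that only complete windows are evaluated, and that the lineage with the largest window mean is the one recorded as the potential parent; the function names and toy values are hypothetical.

```python
def sliding_mean_wic(wic, window, step):
    """Return [(S, mean WIC)] for each complete window; S is the window centre."""
    results = []
    for start in range(0, len(wic) - window + 1, step):
        chunk = wic[start:start + window]
        results.append((start + window // 2, sum(chunk) / window))
    return results


def best_lineage_per_window(per_lineage):
    """Pick, for each window, the lineage with the largest mean WIC.

    per_lineage: {lineage name: [(S, mean), ...]} computed with the same
    window and step for every lineage. Returns [(X, S, mean), ...].
    """
    names = list(per_lineage)
    best = []
    for i in range(len(per_lineage[names[0]])):
        name = max(names, key=lambda x: per_lineage[x][i][1])
        s, mean = per_lineage[name][i]
        best.append((name, s, mean))
    return best


profiles = {
    "BA.1": sliding_mean_wic([2.0, 2.0, 0.0, 0.0, 0.0, 0.0], window=2, step=2),
    "BA.2": sliding_mean_wic([0.0, 0.0, 2.0, 2.0, 2.0, 2.0], window=2, step=2),
}
print(best_lineage_per_window(profiles))
```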

Step 6, splicing the continuous and discrete recombination fragments on each potential parent into recombination regions, and checking whether each recombination region is a false positive;

specifically, the multiple continuous or discrete recombination fragments that may exist on the same minor parent X are spliced into recombination regions, each recombination boundary is corrected during splicing, and a maximum recombination region size is set; a recombination region R, denoted d(X, R), is generated for each minor parent X by the following steps and algorithm:

step 61, extracting from F the fragments that belong to the same lineage sequence, sorting them by S value, adding them to Z, and marking the first fragment in Z as the initial fragment;

step 62, combining each of the other fragments in Z with the initial fragment to constitute a candidate recombination region, and determining the left boundary, the right boundary, and the size of each candidate region, where All is the total site count of the alignment sequence file;

step 63, calculating the contribution value of the lineage sequence within each candidate recombination region, and judging whether that contribution value is the maximum and is greater than a specified threshold ratio, the threshold ratio taking a value between 0 and 1; each candidate region is marked according to whether or not it satisfies this condition;

I) if no candidate region satisfies the condition, the fragment following the initial fragment is set as the new initial fragment, and the procedure returns to step 62;

II) if candidate regions satisfying the condition exist, the one with the greatest value among them is selected and marked as R, the label at this time being d(X, R); the fragment following R is then set as the new initial fragment, and the procedure returns to step 62;

step 64, iterating steps 62 and 63 and storing all d(X, R) to D, until the initial fragment is the last fragment in Z.

The site recombination contribution values WIC of the minor parent and the major parent in each recombination region R are tested for a significant difference using the two-tailed Mann-Whitney rank sum test; if the P value is less than 0.05, the difference is significant.
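The significance check can be sketched in pure Python as below. This is a normal-approximation version of the two-sided Mann-Whitney rank sum test with tie and continuity corrections, which is an assumption about the variant used; the function name and the toy WIC values are hypothetical. SciPy's `scipy.stats.mannwhitneyu` with `alternative="two-sided"` is an alternative where SciPy is available.

```python
import math


def mann_whitney_two_sided(x, y):
    """Return (U, P) for a two-sided Mann-Whitney rank sum test.

    Uses the normal approximation with tie correction and a 0.5
    continuity correction.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] share one value
        avg_rank = (i + 1 + j) / 2      # mean of ranks i+1 .. j
        in_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += avg_rank * in_x
        tie_term += (j - i) ** 3 - (j - i)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0                   # all values identical
    z = max(abs(u - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(var_u)
    return u, min(math.erfc(z / math.sqrt(2)), 1.0)


# Hypothetical site WIC values of a minor and a major parent in one region.
minor = [1.9, 2.0, 1.8, 2.0, 1.7, 2.0, 1.9, 2.0]
major = [0.1, 0.0, 0.3, 0.0, 0.2, 0.0, 0.1, 0.4]
u, p = mann_whitney_two_sided(minor, major)
print(u, p, "significant" if p < 0.05 else "not significant")
```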

If D' is empty, it is reported that no possible recombination event was detected. Otherwise, the possible major parents M and all minor parents X and recombination regions R are reported and a statistical test value is provided for each recombination region.

Step 7, searching for potential recombination breakpoints at the recombination region boundaries by reference to the polymorphic sites in the lineage sequences and the recombination contribution values WIC. Because the recombination regions are spliced from recombination fragments, their boundaries are approximate, and the potential recombination breakpoints near the boundaries sometimes need to be located; the breakpoints are therefore identified more accurately using the polymorphic sites and recombination contribution values WIC in the aligned sequences, with a variable window and a fixed step size s = 1.

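Specifically, the steps and algorithm for obtaining the recombination breakpoint comprise:

step 71, running a sliding window over all potential parents, starting from the first polymorphic site; calculating, under the Mann-Whitney rank sum test, the P value for the site recombination contribution values WIC of the regions on the two sides of the point S at the center of the window; and taking the negative base-10 logarithm of the P value, recorded as b(X, S, $-\log_{10} P$), where X represents the lineage sequence name and S represents the position on the aligned sequence of the point in the window;

step 72, plotting all b(X, S, $-\log_{10} P$); the peak with the highest $-\log_{10} P$ value corresponds to the recombination breakpoint.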
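The breakpoint scan can be sketched for one lineage as follows. The sketch assumes a fixed window split into two equal halves around S, a step of 1, and a normal-approximation rank sum test with tie correction; the helper names and the toy WIC profile are hypothetical.

```python
import math


def rank_sum_p(x, y):
    """Two-sided Mann-Whitney P value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        in_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += in_x * (i + 1 + j) / 2   # average rank of the tie group
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0
    z = max(abs(u - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(var_u)
    return min(math.erfc(z / math.sqrt(2)), 1.0)


def breakpoint_scan(wic, window):
    """Return [(S, -log10 P)] comparing the two half-windows around each S."""
    half = window // 2
    scores = []
    for s in range(half, len(wic) - half + 1):
        left, right = wic[s - half:s], wic[s:s + half]
        p = max(rank_sum_p(left, right), 1e-300)  # guard against log10(0)
        scores.append((s, -math.log10(p)))
    return scores


def best_breakpoint(scores):
    """Position S whose -log10 P peak is highest."""
    return max(scores, key=lambda item: item[1])[0]


# Toy profile: low contribution for 30 sites, then high for 30 sites.
profile = [0.1] * 30 + [1.9] * 30
print(best_breakpoint(breakpoint_scan(profile, window=20)))
```

In the toy profile the two halves differ most when the window is centred on the change point, so the highest peak falls at the boundary between the two segments.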

This embodiment demonstrates the invention using SARS-CoV-2 as an example. Early-sampled Delta lineage and Omicron lineage sequences were downloaded from the GISAID database (https://www.gisaid.org/) as reference lineages: 93 Delta lineage strains, 75 Omicron lineage BA.1 subtype strains, and 70 Omicron lineage BA.2 subtype strains. Two recently sampled XE recombinant strains were used as the recombinant query lineage.

The Delta lineage, the Omicron BA.1 subtype, the Omicron BA.2 subtype, and the recombinant query lineage XE described above were labeled by lineage, with the lineage name written in front of each sequence name. The MAFFT software was then called with the automatic parameter (--auto) to perform the alignment and generate the alignment sequence file.

For the alignment sequence file, deletion of the gap sites was selected. Because the differences between SARS-CoV-2 lineages are small, this embodiment detects inter-lineage recombination using the recombination contribution value WIC at polymorphic sites only, so the non-polymorphic (conserved) sites in the aligned sequences were deleted, leaving 5288 polymorphic sites in the alignment sequence file.

Following the algorithm shown in FIG. 2, the recombination contribution values WIC that each reference lineage in the alignment file contributes to the recombinant query lineage XE were calculated at the 5288 polymorphic sites; the values for the first 29 polymorphic sites are shown in Table 1 below:

The distribution curve of the recombination contribution values WIC of all polymorphic sites is shown in FIG. 3, and the scatter plots of their distribution are shown in FIG. 4.

For the 5288 polymorphic sites described above, a sliding window was used with a window size of 100 polymorphic sites and a step size of 20 polymorphic sites; the window was slid along the polymorphic sites of each lineage, and the average recombination contribution value of the sites in each window was calculated for each lineage. A total of 261 windows and their corresponding average recombination contribution values were obtained, and the results are shown in FIG. 5. If the average recombination contribution value of a lineage within a sliding window is larger than that of the major parent, the window is considered a potential recombination fragment.

For the potential recombination fragments of each lineage described above, the maximum recombination region size was set to 1200 bp (counted as a number of polymorphic sites) and the threshold for the ratio of the average recombination contribution value within a recombination region was set to 95%, in order to splice these potential recombination fragments into reasonable recombination regions. The lineage with the largest cumulative recombination region was judged to be the major parent, and the other lineages containing recombination regions were potential minor parents. The minor parents were then tested for significance in the corresponding recombination regions using the two-tailed Mann-Whitney rank sum test. The P value is calculated by comparing the recombination contribution values WIC of the major parent and the minor parent at the sites of the recombination region, and the difference is significant when the P value is less than 0.05. Whether a recombination region is a possible false positive can be determined from the P value. The recombination identification results for query lineage XE, with the recombination regions and P values, are shown in Table 2 below:

Based on the test results and P values, the major parent of the recombinant query lineage XE is BA.2, BA.1 is the most likely minor parent, and the recombination regions are located at positions 181 to 11291 and 19347 to 21323 of the aligned sequences, which is essentially consistent with previously reported results.

For breakpoint detection, this embodiment uses the alignment file containing the 5288 polymorphic sites with a sliding window whose size is set to 200 polymorphic sites and whose step size is set to 1 polymorphic site. The sliding window is divided evenly into two sub-windows, the degree of difference between the site recombination contribution values WIC of the two sub-windows is tested using the two-tailed rank sum test, the P value is calculated, and $-\log_{10} P$ is then computed. The resulting distribution of $-\log_{10} P$ at each polymorphic site is shown in FIG. 6: the higher the peak, that is, the larger the $-\log_{10} P$ value and the smaller the P value, the higher the probability of a recombination breakpoint. In addition, the region between peaks is more likely to be a recombination region.

The $-\log_{10} P$ values of BA.1, BA.2 and Delta at sites near the recombination region 181 to 11291 are shown in the following table:

From the above table, the comparisons of the recombinant query lineage XE with BA.2 and with BA.1 show that the highest peak for the recombination region 181 to 11291 is near position 11300, indicating that this position and those nearby are the most likely recombination breakpoints.

The embodiment of the invention introduces information theory and Shannon information entropy into the identification of virus recombination, treats the fragment transfer event in virus recombination as a process of transmitting "information", and uses the contribution value derived from the lineage-weighted information content to quantify the recombination contribution and measure the probability that recombination occurred, so that recombination events and recombination regions among different, highly similar SARS-CoV-2 lineages can be identified sensitively and efficiently.

The invention is used for testing recombination identification among lineages, is not limited to the novel coronavirus, and is also applicable to other viruses.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for detecting recombination between novel coronavirus lineages based on information theory, comprising the following steps:

step 2, merging the lineage sequences to be queried with the other reference lineage sequences, and then performing multiple sequence alignment to obtain an alignment sequence file;

step 4, calculating, at each polymorphic site, the recombination contribution value WIC that each reference lineage sequence provides to the lineage sequence to be queried;

step 5, calculating the average recombination contribution value of each reference lineage sequence over the sites within a sliding window, and detecting recombination fragments from potential parents;

step 7, searching for potential recombination breakpoints at the boundaries of the recombination regions by reference to the polymorphic sites in the lineage sequences and the recombination contribution values WIC.

2. The method for detecting recombination between novel coronavirus lineages according to claim 1, wherein step 4 comprises:

counting the base types x at each site of each reference lineage sequence and their corresponding proportions;

calculating the base information content IC of each site;

calculating the weighted information content WIC of each site, where WIC represents the recombination contribution value.

3. The method for detecting recombination between novel coronavirus lineages according to claim 2, wherein the base information content IC is

$$IC = EIC - Ent$$

where EIC is the expected total base information content and Ent is the Shannon information entropy of the base site.

4. The method for detecting recombination between novel coronavirus lineages according to claim 1, wherein step 5 comprises:

plotting all points (X, S, average recombination contribution value), where X represents the lineage name and S represents the position on the aligned sequence of the point in the sliding window; labeling the sliding window with the largest average recombination contribution value as a potential recombination fragment, labeling the lineage sequence in that sliding window as a potential parent, and storing the corresponding record to F.

5. The method for detecting recombination between novel coronavirus lineages according to claim 4, wherein the total size of the recombination regions on each potential parent is calculated, and the lineage sequence with the largest cumulative recombination region is labeled as the potential major parent and designated Major (M).

6. The method for detecting recombination between novel coronavirus lineages according to claim 4, wherein step 6 comprises:

splicing the continuous or discrete recombination fragments existing on the same potential parent X into recombination regions, proofreading each recombination boundary during splicing, and setting a maximum recombination region size; a recombination region R, denoted d(X, R), is generated for each potential parent X.

7. The method for detecting recombination between novel coronavirus lineages according to claim 6, comprising:

step 61, extracting from F the fragments that belong to the same lineage sequence, sorting them by increasing S value, placing them in Z, and marking the first fragment in Z as the initial fragment;

step 62, combining each of the other fragments in Z with the initial fragment to constitute a candidate recombination region, and determining the left boundary, the right boundary, and the size of each candidate region, where All is the total site count of the alignment sequence file;

step 63, calculating the recombination contribution value of the lineage sequence within each candidate recombination region, and judging whether that recombination contribution value is the maximum and is greater than a specified threshold ratio;

step 64, iterating steps 62 and 63 until the initial fragment is the last fragment in Z.

8. The method for detecting recombination between novel coronavirus lineages based on information theory according to claim 7, wherein the site recombination contribution values WIC of the minor parent and the major parent in each recombination region R are tested for a significant difference using the two-tailed Mann-Whitney rank sum test.

9. The method for detecting recombination between novel coronavirus lineages based on information theory according to claim 1, wherein the method for obtaining recombination breakpoints in step 7 comprises:

step 72, plotting all b(X, S, $-\log_{10} P$); the peak with the highest $-\log_{10} P$ value corresponds to the recombination breakpoint.