CN115798578A

CN115798578A - Device and method for analyzing and detecting virus new epidemic variant strain

Info

Publication number: CN115798578A
Application number: CN202211554331.0A
Authority: CN
Inventors: 任洪广; 胡明达; 王辛; 靳远; 王博千; 赵云祥; 岳俊杰; 梁龙
Original assignee: Academy of Military Medical Sciences AMMS of PLA
Current assignee: Academy of Military Medical Sciences AMMS of PLA
Priority date: 2022-12-06
Filing date: 2022-12-06
Publication date: 2023-03-14

Abstract

The invention discloses a device for detecting and analyzing new epidemic variant strains of a virus, comprising parameter acquisition equipment and a computer readable carrier. Rapid, efficient and automatic detection and early warning of new epidemic variant strains of the novel coronavirus is achieved through mutation site analysis, core mutation statistics, high-frequency core mutation combination statistics, variant strain identification, propagation trend regression analysis and related methods. The method can analyze a large volume of genome data, runs efficiently, detects new epidemic variant strains accurately, and helps track the variation trend and history of the virus quickly and comprehensively. The invention automatically and rapidly detects new epidemic variant strains of the novel coronavirus and their core mutation characteristics without manual participation, and can provide important technical support for real-time monitoring of epidemic development, variant strain identification, and vaccine and drug research and development.

Description

Device and method for analyzing and detecting virus new epidemic variant strain

Technical Field

The invention relates to the technical field of biology, in particular to an automatic detection and early warning method for a new epidemic variant strain of a virus based on a mutation combination distribution trend.

Background

The novel coronavirus has produced a number of variant strains with enhanced transmissibility. The four new epidemic variant strains posing the greatest threat have been named the alpha, beta, gamma and delta variant strains by the World Health Organization; of these, the delta variant strain appeared late, spread rapidly and is highly harmful. Insufficient capability to identify variant strains and give early warning causes epidemic prevention strategy to lag seriously behind the spread of the epidemic. Therefore, given how rapidly the virus mutates, automatically and rapidly detecting new epidemic variant strains of the novel coronavirus is a very important problem.

At present, new epidemic variant strains of a virus are mainly screened by constructing a phylogenetic tree and then manually observing the growth of each branch of the tree to determine whether a new epidemic strain exists. Strains are given standardized designations according to their position in the evolutionary tree; for example, the delta variant strain of the novel coronavirus is designated B.1.617.2. This approach has several disadvantages. First, a phylogenetic tree must be constructed for the virus, and the computing resources required grow with the number of virus samples. Because the number of novel coronavirus samples exceeds 4 million, the required computing resources and time are enormous, and conventional computers cannot meet the requirement of constructing a phylogenetic tree of the novel coronavirus. Second, conventional methods rely on manual observation; the current evolutionary tree of the novel coronavirus is extremely complex, with more than 4 million leaf nodes to inspect, so a new epidemic variant strain is difficult to detect rapidly by manual observation. Third, conventional methods cannot directly reveal the mutation characteristics of new epidemic variant strains and therefore cannot provide technical support for rapid variant strain identification and vaccine design, which ultimately delays the implementation of epidemic prevention strategy. In summary, conventional methods for detecting new epidemic variant strains can hardly meet the needs of epidemic prevention and control while the novel coronavirus is rampant. An automatic, rapid detection method for new epidemic variant strains of the novel coronavirus is urgently needed to assist epidemic early warning and vaccine design.

Disclosure of Invention

It is an object of the present invention to provide an apparatus for detecting and analyzing a new epidemic variant of a virus.

The invention provides a device for detecting and analyzing new epidemic variant strains of a virus, comprising parameter acquisition equipment and a computer readable carrier; the parameter acquisition equipment continuously acquires amino acid sequence data of the virus to be detected in a fixed area over a period of time;

the computer readable carrier comprises:

the data storage module, which stores the collected amino acid sequence data of the virus to be detected and the amino acid sequence of the reference strain; the amino acid sequence data of the virus to be detected may be collected directly or taken from sequence data in a database;

the quality control module, which performs quality control and length screening on the sequences to be detected stored in the data storage module to obtain a data set to be detected, and transmits the data set to the statistical module;

the statistical module, which receives the data set to be detected from the quality control module, compares the sequences to be detected against the reference sequence, and counts mutation site information and mutation frequency to obtain an amino acid mutation frequency table;

the screening module, which receives the amino acid mutation frequency table from the statistical module and screens it further to obtain a core mutation site key mutation set and a high-frequency amino acid mutation set;

the identification module, which identifies the sequences of the data set to be detected using the core mutation site key mutation set to obtain a virus variant mutation feature set;

and the analysis module, which receives the virus variant mutation feature set from the identification module and performs propagation trend analysis and epidemic variant strain identification on it to obtain the epidemic variant strains.

The quality control module performs quality control on the sequences to be detected stored in the data storage module as follows: the proportion of abnormal characters P is calculated for each sequence to be detected, and the sequence is deleted from the amino acid sequence data set if P > 0.1, where the proportion of abnormal characters is calculated as:

$$P = \frac{\mu}{L}$$

where L is the total length of the sequence to be detected and μ is the number of abnormal characters in the sequence;

the quality control module performs length screening on the sequences to be detected stored in the data storage module as follows: the length integrity LP is calculated for each sequence to be detected, and the sequence is deleted from the amino acid sequence data set if LP < 0.8, where the length integrity is calculated as:

$$LP = \frac{L}{L_0}$$

where L is the total length of the sequence to be detected and $L_0$ is the total length of the reference strain sequence.
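The two screening rules above can be sketched in Python as follows. This is only an illustrative sketch: the text does not define which characters count as "abnormal", so the code assumes that any symbol outside the 20 standard one-letter amino acid codes is abnormal, and the function names are hypothetical.

```python
# Assumption: an "abnormal character" is any symbol outside the 20 standard
# one-letter amino acid codes (for example X or *). The source does not define this.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def abnormal_ratio(seq):
    """Proportion of abnormal characters, P = mu / L."""
    if not seq:
        return 1.0  # treat an empty sequence as entirely abnormal
    mu = sum(1 for ch in seq.upper() if ch not in STANDARD_AMINO_ACIDS)
    return mu / len(seq)


def length_integrity(seq, reference):
    """Length integrity, LP = L / L0."""
    return len(seq) / len(reference)


def quality_control(sequences, reference, max_abnormal=0.1, min_integrity=0.8):
    """Keep sequences with P <= 0.1 and LP >= 0.8 (thresholds from the text)."""
    kept = {}
    for name, seq in sequences.items():
        if abnormal_ratio(seq) > max_abnormal:
            continue  # too many abnormal characters
        if length_integrity(seq, reference) < min_integrity:
            continue  # sequence too short relative to the reference
        kept[name] = seq
    return kept
```

A sequence whose P is exactly 0.1 or whose LP is exactly 0.8 is retained, because the text deletes only on P > 0.1 and LP < 0.8.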

The method for acquiring the amino acid mutation frequency table in the statistical module comprises the following steps:

(1) Aligning each sequence in the data set to be detected with the amino acid sequence of the reference strain, one by one, using a multiple sequence alignment tool, and calibrating the position of each amino acid site in the sequence to be detected;

(2) Based on the alignment result, comparing the amino acid sequence to be detected with the reference strain sequence position by position, recording the amino acid mutations in the sequence to be detected, and generating the mutation information set of that sequence:

$$\mathrm{MSet}[\mathrm{Seq}_m] = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \dots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}_{ln}\}$$

where $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ denotes an amino acid mutation, l1 is the position of the mutation site, $\mathrm{Amino}_{l1}$ is the amino acid code at position l1 of the reference strain sequence, $\mathrm{Mutation}_{l1}$ is the amino acid code at position l1 of the sequence to be detected, MSet is the mutation information dictionary recording the mutation information sets of all sequences, and $\mathrm{MSet}[\mathrm{Seq}_m]$ is the mutation information set of sequence $\mathrm{Seq}_m$;

(3) Analyzing the amino acid mutation records of all sequences to be detected from step (2), counting how often each amino acid mutation occurs across all sequences, and establishing the amino acid mutation frequency table of the virus to be detected.
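Steps (2) and (3) can be illustrated with a short Python sketch. It assumes the alignment of step (1) has already been produced by an external tool and that each sequence comes as a pair of equal-length aligned strings with `-` for gaps; positions are numbered in reference coordinates. Recording deletions as a mutation to `-` and skipping insertions are assumptions of this sketch, as are the function names.

```python
from collections import Counter


def call_mutations(aligned_ref, aligned_seq):
    """Compare one aligned sequence with the aligned reference, position by position.

    Returns a set of tokens such as "D614G" (reference residue, position, new residue).
    """
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")
    mutations = set()
    pos = 0  # 1-based position in reference coordinates
    for ref_aa, seq_aa in zip(aligned_ref, aligned_seq):
        if ref_aa == "-":
            continue  # insertion relative to the reference: no reference position
        pos += 1
        if seq_aa != ref_aa:
            mutations.add(f"{ref_aa}{pos}{seq_aa}")
    return mutations


def build_mutation_table(alignments):
    """alignments maps a sequence name to (aligned_ref, aligned_seq).

    Returns (mset, freq): the mutation information dictionary MSet and the
    amino acid mutation frequency table as a Counter.
    """
    mset = {name: call_mutations(ref, seq) for name, (ref, seq) in alignments.items()}
    freq = Counter(m for muts in mset.values() for m in muts)
    return mset, freq
```

Each mutation is counted at most once per sequence because MSet stores a set per sequence, so a table entry is the number of sequences carrying that mutation.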

The method for screening the high-frequency amino acid mutation set in the screening module comprises the following steps:

(1) When the occurrence frequency of the amino acid mutation is more than one thousandth of the total number N of the sequences contained in the data set to be detected, the amino acid mutation is high-frequency amino acid mutation;

(2) Judging all amino acid mutations in the amino acid mutation frequency table in the statistical module one by one according to the method in the step (1), recording all screened high-frequency amino acid mutations, and establishing a high-frequency amino acid mutation set HFMS:

$$\mathrm{HFMS} = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \dots, \mathrm{Amino}_{lm}\_lm\_\mathrm{Mutation}_{lm}\}$$
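As a minimal sketch of this screening rule, assuming the amino acid mutation frequency table is available as a mapping from mutation token to occurrence count (the function name and data layout are illustrative, not taken from the text):

```python
def high_frequency_mutations(freq, n_sequences, threshold=0.001):
    """Return the set HFMS of mutations whose occurrence count exceeds
    one thousandth of the total number of sequences N."""
    return {m for m, count in freq.items() if count / n_sequences > threshold}
```

The comparison is strict, so a mutation seen in exactly one of 1000 sequences is not retained.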

The method for screening the core mutation site key mutation set in the screening module comprises the following steps:

(1) Counting the amino acid frequencies at each amino acid site to obtain $AH^{ln}$, the amino acid frequency set of the ln-th amino acid site. Here N is the total number of sequences contained in the data set to be detected; for each amino acid $\mathrm{Amino}_m$ observed at site ln, the number of its occurrences at that site is counted, and its frequency is that count divided by N;

(2) Calculating the Shannon entropy $H^{ln}$ of each amino acid site from the frequencies in $AH^{ln}$:

$$H^{ln} = -\sum_{p \in AH^{ln}} p \log p$$

If $H^{ln} > 0.1$, the amino acid site ln is recorded as a core mutation site, and the mutation with the highest mutation frequency at the ln-th amino acid site, $\mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{\max}_{ln}$, is selected as the key mutation of that site;

(3) According to the method in step (2), collecting the key mutations of all core mutation sites to obtain the core mutation site key mutation set KLMS:

$$\mathrm{KLMS} = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}^{\max}_{l1}, \dots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{\max}_{ln}\}$$
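The core mutation site screening can be sketched as below. Several points are assumptions of the sketch rather than statements of the text: the sequences are taken to be already aligned to reference coordinates and of equal length, the Shannon entropy is computed as $-\sum p \log p$ over the residue frequencies at a site, and the logarithm base (not stated in the text) is exposed as a parameter defaulting to 2.

```python
import math
from collections import Counter


def site_entropy(column, base=2.0):
    """Shannon entropy of the residues observed at one site.

    The logarithm base is an assumption; the source does not state it.
    """
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def key_mutations(reference, sequences, entropy_threshold=0.1, base=2.0):
    """Return the key mutation set KLMS for all core mutation sites.

    reference: reference strain sequence.
    sequences: list of sequences aligned to the reference (same length).
    """
    klms = set()
    for i, ref_aa in enumerate(reference):
        column = [seq[i] for seq in sequences]
        if site_entropy(column, base) <= entropy_threshold:
            continue  # not a core mutation site
        # Most frequent residue that differs from the reference at this site.
        mutant_counts = Counter(aa for aa in column if aa != ref_aa)
        if mutant_counts:
            top_aa, _ = mutant_counts.most_common(1)[0]
            klms.add(f"{ref_aa}{i + 1}{top_aa}")
    return klms
```

Because the threshold of 0.1 depends on the logarithm base, a different base changes which sites qualify; the base should be matched to whatever the original implementation used.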

wherein each virus data in the data storage module comprises sampling time, sampling place and sequence information.

The method for obtaining the virus variant mutation feature set comprises the steps of identifying a core mutation combination in a mutation information dictionary MSet and performing propagation trend analysis and epidemic variant identification in an analysis module.

The method for identifying the core mutation combination in the mutation information dictionary MSet comprises the following steps:

(1) Establishing a core mutation set: performing union operation on the high-frequency amino acid mutation set HFMS and the key mutation set KLMS of the core mutation site obtained in the step 5) to obtain a core mutation set KMS, wherein the calculation formula is as follows:

KMS＝HFMS∪KLMS；

(2) Identifying core mutation combinations in the mutation information dictionary MSet: performing an intersection operation between the core mutation set KMS and the mutation information set of each sequence in the mutation information dictionary MSet to obtain the core mutation combination of each sequence, where the calculation formula is:

$$\mathrm{KMSet}[\mathrm{Seq}_m] = \mathrm{KMS} \cap \mathrm{MSet}[\mathrm{Seq}_m]$$

where KMSet is the core mutation combination information dictionary, which records the core mutation combination contained in each sequence, and $\mathrm{KMSet}[\mathrm{Seq}_m]$ is the core mutation combination contained in sequence $\mathrm{Seq}_m$;

(3) Performing high-frequency screening on the core mutation combination dictionary from step (2): when the number of sequences in KMSet carrying the same core mutation combination as $\mathrm{KMSet}[\mathrm{Seq}_m]$ is greater than one thousandth of the total number N of sequences contained in the data set to be detected, that core mutation combination is a high-frequency core mutation combination; the virus strain containing it is referred to as variant strain $MC_m$ and is recorded in the virus variant strain list MCL;
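The set operations in steps (1) to (3) map directly onto Python sets. The sketch below assumes MSet is a dictionary from sequence name to a set of mutation tokens; the function name and the use of `frozenset` as the identifier of a variant strain are illustrative choices.

```python
from collections import Counter


def identify_variants(mset, hfms, klms, threshold=0.001):
    """Return (kmset, mcl).

    kmset maps each sequence name to its core mutation combination
    (KMS intersected with the sequence's mutation set).
    mcl lists the combinations carried by more than one thousandth
    of all sequences, i.e. the variant strain list.
    """
    kms = hfms | klms  # KMS = HFMS union KLMS
    kmset = {name: frozenset(muts & kms) for name, muts in mset.items()}
    n = len(mset)
    combo_counts = Counter(kmset.values())
    # Note: the empty combination (no core mutation) is kept if frequent;
    # whether it should count as a variant strain is not stated in the text.
    mcl = [combo for combo, count in combo_counts.items() if count / n > threshold]
    return kmset, mcl
```

Using immutable sets as dictionary keys makes two sequences belong to the same variant strain exactly when their core mutation combinations are identical.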

the method for analyzing the propagation trend and identifying the epidemic variant strains in the analysis module comprises the following steps:

(1) Counting the proportion of each variant strain during virus transmission according to the sampling time information of the sequences: for each variant strain MC ∈ MCL and each month n, the proportion of variant strain MC in month n is the number of sequences of variant strain MC collected in month n divided by $AN_n$, the number of all sequences in the sequence set collected in month n;

(2) Fitting a propagation curve to the monthly proportions of each variant strain calculated in step (1) using a linear regression tool, where the calculation formula is:

$$Y = \beta_n X + \varepsilon$$

where X is the time variable matrix, Y is the propagation proportion matrix of variant strain MC, $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, and ε is the intercept of the fitted curve;

(3) Identifying epidemic variant strains of the virus by a threshold test on the slope $\beta_n$ of the propagation curve fitted in step (2), where the calculation formula is:

$$T = 3 \times \beta_n \times M$$

where $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, M is the total number of variant strains in the variant strain list MCL, and T is the propagation rise index of the epidemic trend of variant strain MC in month n; if T ≥ 1, variant strain MC is judged to be an epidemic variant strain in month n.
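A minimal Python sketch of the proportion, regression and threshold steps follows. It assumes months are encoded as consecutive integers and that each sample is a pair of month and variant label; the ordinary least squares slope is computed by hand so the example needs no external library. All function names are hypothetical.

```python
from collections import Counter


def monthly_proportions(samples, variant):
    """samples: iterable of (month, variant_label) pairs.

    Returns {month: share of that month's sequences belonging to `variant`}.
    """
    totals, hits = Counter(), Counter()
    for month, label in samples:
        totals[month] += 1
        if label == variant:
            hits[month] += 1
    return {m: hits[m] / totals[m] for m in sorted(totals)}


def ols_slope(xs, ys):
    """Least squares slope beta of the line Y = beta * X + epsilon."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def rise_index(slope, n_variants):
    """T = 3 * beta_n * M, as given in the text."""
    return 3 * slope * n_variants


def is_epidemic(slope, n_variants):
    """A variant strain is judged epidemic when T >= 1."""
    return rise_index(slope, n_variants) >= 1
```

For instance, a variant whose share grows by 0.1 per month has T = 1.2 when the list holds four variant strains, which meets the threshold, and T = 0.9 when it holds three, which does not.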

The invention also provides a method for automatically detecting and giving early warning of new epidemic variant strains of the novel coronavirus based on the mutation combination distribution trend, comprising the following steps:

1) Obtaining from an original database the amino acid sequences of all complete genomes to be detected and of the reference strain;

2) Performing quality control and length screening on each genome sequence obtained in step 1);

3) Performing mutation site statistics on the data set to be detected, DataBase_1, obtained in step 2) to obtain an amino acid mutation frequency table and a mutation information dictionary;

4) Screening high-frequency amino acid mutation from the amino acid mutation frequency table obtained in the step 3);

5) Screening core mutation sites of the amino acid mutation frequency table in the step 3);

6) Identifying the core mutation combination in the mutation information dictionary in the step 3), naming the virus strains containing the core mutation combination as variant strains, and storing all identified variant strains to obtain a novel coronavirus variant strain list.

Wherein, the quality control and length screening method in the step 2) comprises the following steps:

(1) Calculating the proportion of abnormal characters in the sequence; denoting the total length of the sequence as L and the number of abnormal characters in the sequence as μ, the calculation formula is:

$$P = \frac{\mu}{L}$$

If P > 0.1, the sequence is deleted from the amino acid sequence data set;

(2) Calculating the sequence length integrity; denoting the total length of the sequence as L and the length of the reference strain sequence as $L_0$, the calculation formula is:

$$LP = \frac{L}{L_0}$$

If LP < 0.8, the sequence is deleted from the amino acid sequence data set; the amino acid sequences remaining after screening are retained as the data set to be detected, denoted DataBase_1.

Wherein, the mutation site statistical method in the step 3) comprises the following steps:

(1) Aligning the sequences in the data set to be detected, DataBase_1, with the amino acid sequence of the reference strain using a multiple sequence alignment tool, and calibrating the position of each amino acid site in the sequence to be detected;

(2) Based on the alignment result, comparing the amino acid sequence to be detected with the reference strain sequence position by position, recording the amino acid mutations in the sequence to be detected, and generating the mutation information set of that sequence:

$$\mathrm{MSet}[\mathrm{Seq}_m] = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \dots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}_{ln}\}$$

where $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ denotes an amino acid mutation, l1 is the position of the mutation site, $\mathrm{Amino}_{l1}$ is the amino acid code at position l1 of the reference strain sequence, and $\mathrm{Mutation}_{l1}$ is the amino acid code at position l1 of the sequence to be detected. For example, the amino acid mutation D614G indicates a mutation at position 614, where the reference sequence has D (aspartic acid) and the sequence to be detected has G (glycine). MSet is the mutation information dictionary recording the mutation information sets of all sequences, and $\mathrm{MSet}[\mathrm{Seq}_m]$ is the mutation information set of sequence $\mathrm{Seq}_m$;

(3) Performing frequency statistics on the amino acid mutation records of all sequences to be detected from step (2) and establishing the amino acid mutation frequency table of the target gene.
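The D614G example suggests a compact string encoding for each mutation: reference residue, position, new residue. A small parser for that encoding might look as follows; the regular expression, the acceptance of `-` and `*` as the new residue, and the function name are assumptions of this sketch.

```python
import re

# Reference residue, 1-based position, observed residue.
# Accepting "-" (deletion) and "*" (stop) as the observed residue is an assumption.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z\-*])$")


def parse_mutation(token):
    """Split a token such as "D614G" into ("D", 614, "G")."""
    match = MUTATION_RE.match(token)
    if match is None:
        raise ValueError(f"not a valid mutation token: {token!r}")
    ref_aa, pos, new_aa = match.groups()
    return ref_aa, int(pos), new_aa
```

Parsing the tokens back into their parts makes it straightforward to group mutations by site, for example when tallying residue frequencies per position.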

Wherein, the specific method for screening high-frequency amino acid mutation from the amino acid mutation frequency list obtained in the step 3) in the step 4) comprises the following steps:

(1) Denoting the total number of sequences contained in the data set to be detected, DataBase_1, as N and the number of occurrences of the amino acid mutation $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ as $S_{l1}$, the calculation formula is:

$$SP_{l1} = \frac{S_{l1}}{N}$$

If $SP_{l1} > 0.001$, the amino acid mutation $\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}$ is identified as a high-frequency amino acid mutation;

(2) Identifying the amino acid mutations in the amino acid mutation frequency table of step 3) one by one according to the method in step (1), recording all screened high-frequency amino acid mutations, and establishing the high-frequency amino acid mutation set HFMS:

$$\mathrm{HFMS} = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}_{l1}, \dots, \mathrm{Amino}_{lm}\_lm\_\mathrm{Mutation}_{lm}\}$$

wherein, the method for screening the core mutation sites of the amino acid mutation frequency table in the step 3) in the step 5) comprises the following steps:

(1) Counting the amino acid frequencies at each amino acid site to obtain $AH^{ln}$, the amino acid frequency set of the ln-th amino acid site. Here N is the total number of sequences contained in the data set to be detected, DataBase_1; for each amino acid $\mathrm{Amino}_m$ observed at site ln, the number of its occurrences at that site is counted, and its frequency is that count divided by N;

(2) Calculating the Shannon entropy $H^{ln}$ of each amino acid site from the frequencies in $AH^{ln}$:

$$H^{ln} = -\sum_{p \in AH^{ln}} p \log p$$

If $H^{ln} > 0.1$, the amino acid site ln is recorded as a core mutation site, and the mutation with the highest mutation frequency at the ln-th amino acid site, $\mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{\max}_{ln}$, is selected as the key mutation of that site;

(3) Establishing the core mutation site key mutation set KLMS:

$$\mathrm{KLMS} = \{\mathrm{Amino}_{l1}\_l1\_\mathrm{Mutation}^{\max}_{l1}, \dots, \mathrm{Amino}_{ln}\_ln\_\mathrm{Mutation}^{\max}_{ln}\}$$

the method for identifying the core mutation combination in the mutation information dictionary MSet in the step 3) in the step 6) comprises the following steps:

(1) Establishing a core mutation set: performing union operation on the high-frequency amino acid mutation set HFMS obtained in the step 4) and the key mutation set KLMS of the core mutation site obtained in the step 5) to obtain a core mutation set KMS, wherein the calculation formula is as follows:

KMS＝HFMS∪KLMS；

(2) Identifying core mutation combinations in the mutation information dictionary MSet: performing an intersection operation between the core mutation set KMS and the mutation information set of each sequence in the mutation information dictionary MSet to obtain the core mutation combination $\mathrm{KMSet}[\mathrm{Seq}_m]$ of each sequence, where the calculation formula is:

$$\mathrm{KMSet}[\mathrm{Seq}_m] = \mathrm{KMS} \cap \mathrm{MSet}[\mathrm{Seq}_m]$$

(3) Performing high-frequency screening on the core mutation combinations from step (2): counting the number of core mutation combinations in KMSet that are identical to $\mathrm{KMSet}[\mathrm{Seq}_m]$ and, with N the total number of sequences contained in the data set to be detected, DataBase_1, dividing that number by N. If the resulting ratio is greater than 0.001, the core mutation combination $\mathrm{KMSet}[\mathrm{Seq}_m]$ is identified as a high-frequency core mutation combination; all virus strains containing the core mutation combination $\mathrm{KMSet}[\mathrm{Seq}_m]$ are referred to as variant strain $MC_m$, and all identified variant strains are saved to the novel coronavirus variant strain list MCL.

Based on this method, the complete variant strain list MCL can be obtained, providing data support for subsequent analysis and detection of epidemic strains of the novel coronavirus and for subsequent vaccine design and epidemic control.

Wherein, the method also comprises a step 7) of carrying out transmission trend analysis and epidemic variant strain identification on the novel coronavirus variant strain list obtained in the step 6) according to the following method:

(1) Counting the proportion of each variant strain during virus propagation according to the spatiotemporal annotation information of the sequences: for each variant strain MC ∈ MCL and each month n, the proportion of variant strain MC in month n is the number of sequences of variant strain MC collected in month n divided by $AN_n$, the number of all sequences in the sequence set collected in month n;

(2) Fitting a propagation curve to the monthly proportions of each variant strain calculated in step (1) using a linear regression tool, where the calculation formula is:

$$Y = \beta_n X + \varepsilon$$

where X is the time variable matrix, Y is the propagation proportion matrix of variant strain MC, $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, and ε is the intercept of the fitted curve;

(3) Identifying epidemic variant strains of the novel coronavirus by a threshold test on the slope $\beta_n$ of the propagation curve fitted in step (2). The calculation formula is:

$$T = 3 \times \beta_n \times M$$

where $\beta_n$ is the slope of the fitted propagation curve of variant strain MC in month n, M is the total number of variant strains in the variant strain list MCL, and T is the propagation rise index of the epidemic trend of variant strain MC in month n. If T ≥ 1, variant strain MC is judged to be an epidemic variant strain in month n;

(4) Collecting the epidemic variant strains identified in step (3) and marking those that have not appeared before as new epidemic variant strains. This completes the automatic detection of new epidemic variant strains of the novel coronavirus.

Wherein, the propagation trend analysis of the variant strain comprises the following four steps:

1) Calculating the proportion of each variant strain in each month and drawing a propagation curve.

2) Using linear regression, calculating the rising slope of the propagation curve of each variant strain in the current month, with the data of the three months before the current month as fitting data.

3) Applying a threshold test to the monthly rising slope of the propagation curve of each variant strain, and defining any variant strain exceeding the threshold as an epidemic strain in that month.

4) Recording the epidemic strains of the current month, and marking any epidemic strain that has not appeared before as a new epidemic variant strain of the novel coronavirus.
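The four steps above amount to a sliding-window scan over the monthly proportions. The sketch below is one possible reading: it assumes "the three months before the current month" means the three preceding months and excludes the current one, it applies the threshold T = 3 × β × M ≥ 1 stated elsewhere in the text, and the data layout (one list of monthly proportions per variant strain, indexed by month) and function names are hypothetical.

```python
def ols_slope(xs, ys):
    """Least squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def detect_new_epidemic(proportions, n_variants, window=3):
    """proportions: {variant: [share in month 0, share in month 1, ...]}.

    Returns {month: [variants first judged epidemic in that month]}.
    Assumption: the fit for a month uses the `window` months preceding it.
    """
    if not proportions:
        return {}
    seen = set()  # variants already reported as epidemic
    new_by_month = {}
    n_months = min(len(series) for series in proportions.values())
    for month in range(window, n_months):
        xs = list(range(month - window, month))
        for variant, series in proportions.items():
            slope = ols_slope(xs, series[month - window:month])
            if 3 * slope * n_variants >= 1 and variant not in seen:
                seen.add(variant)
                new_by_month.setdefault(month, []).append(variant)
    return new_by_month
```

Keeping a set of variants that have already been reported is what distinguishes a new epidemic variant strain from one that is merely still epidemic in a later month.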

Wherein, the virus genome data is large-scale novel coronavirus genome amino acid sequence data which is complete as much as possible.

The high-frequency amino acid mutation is a novel coronavirus genome sequence amino acid mutation with mutation rate exceeding one thousandth obtained through statistical screening.

Wherein, the core mutation site is a novel coronavirus genome sequence amino acid site with a Shannon entropy value of more than 0.1 obtained by Shannon entropy calculation and screening. The key mutation of the core mutation site is the amino acid mutation with the most occurrence on the amino acid site.

The core mutation is an important amino acid mutation with a high occurrence frequency in the novel coronavirus genome, and is obtained by performing a union operation on the high-frequency amino acid mutation set and the core mutation site key mutation set.

Wherein, the core mutation combination is a specific case of simultaneous occurrence between core mutations in the genome sequence of the novel coronavirus, and is also used as a gene mutation characteristic for describing a variant strain.

The application of the method in the identification and early warning operation of the new-epidemic variant strain of the novel coronavirus or the preparation of the identification and early warning operation product of the new-epidemic variant strain of the novel coronavirus also falls within the protection scope of the invention.

Experiments show that, compared with the prior art, the invention has the following advantages: 1. For large-scale genome data, analyzing the distribution trend of core mutation site combinations in novel coronavirus sequences greatly accelerates the identification of new epidemic variant strains of the novel coronavirus. 2. Analyzing core mutation combinations accurately locates the evolutionary characteristics of new epidemic variant strains of the novel coronavirus, providing an effective data basis for variant strain identification, vaccine design and drug research and development. 3. Using linear regression, prediction and early warning of the propagation trend of new epidemic variant strains of the novel coronavirus can be completed more efficiently, providing a strong data basis for formulating epidemic prevention and control strategies. 4. The method for detecting new epidemic variant strains of the novel coronavirus based on the mutation combination distribution trend can be automated throughout, which facilitates building an efficient analysis and early warning system.

In conclusion, the invention achieves rapid, efficient and automatic detection and early warning of new epidemic variant strains of the novel coronavirus through mutation site analysis, core mutation statistics, high-frequency core mutation combination statistics, variant strain identification, propagation trend regression analysis and related methods. The method can analyze a large volume of genome data, runs efficiently, detects new epidemic variant strains accurately, and helps track the variation trend and history of the virus quickly and comprehensively. It addresses the insufficient capacity of conventional methods to analyze and process large-scale genome data and the low efficiency of identifying variant strains by constructing an evolutionary tree. Its main advantage is that it automatically and rapidly detects new epidemic variant strains of the novel coronavirus and their core mutation characteristics without manual participation, and can thereby provide important technical support for real-time monitoring of epidemic development, variant strain identification, and vaccine and drug research and development.

Drawings

FIG. 1 is a schematic flow chart of a method for automatically detecting a new pandemic variant of a coronavirus based on a distribution trend of mutation combinations;

FIG. 2 is a statistical chart of amino acid mutation frequencies of novel coronavirus Spike proteins;

FIG. 3 is a Shannon entropy statistic chart of amino acid sites of novel coronavirus Spike protein;

FIG. 4 is a schematic flow chart of core mutation combinatorial screening and variant strain identification;

FIG. 5 is a graph showing the analysis of the distribution trend of the core mutation combinations;

FIG. 6 is a graph of regression analysis of the identification and propagation curves of newly circulating variants.

Detailed Description

The present invention is described in further detail below with reference to specific embodiments, and the examples are given only for illustrating the present invention and not for limiting the scope of the present invention. The examples provided below serve as a guide for further modifications by a person skilled in the art and do not constitute a limitation of the invention in any way.

The experimental procedures in the following examples, unless otherwise indicated, are conventional and are carried out according to the techniques or conditions described in the literature in the field or according to the instructions of the products. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.

Example 1

The invention provides a device for detecting and analyzing a virus new epidemic variant strain, which comprises parameter acquisition equipment and a computer readable carrier; the parameter acquisition equipment comprises equipment for acquiring or collecting the amino acid sequence of the virus to be detected, such as equipment for inputting the amino acid sequence of the virus to be detected or equipment for detecting the amino acid sequence of the virus to be detected;

the computer readable carrier comprises:

the data storage module is used for storing the collected amino acid sequences of the virus gene to be tested and the reference gene;

and the analysis module receives the virus variant mutation characteristic set obtained by the identification module, performs propagation trend analysis on the virus variant mutation characteristic set, and identifies the epidemic variant strain to obtain a new epidemic variant strain.

The quality control module performs quality control on the sequences to be detected stored in the data storage module as follows: the proportion of abnormal characters P is calculated for each sequence to be detected, and the sequence is deleted from the amino acid sequence data set if P > 0.1, where the proportion of abnormal characters is calculated as:

$$P = \frac{\mu}{L}$$

where L is the total length of the sequence to be detected and μ is the number of abnormal characters in the sequence;

the quality control module performs length screening on the sequences to be detected stored in the data storage module as follows: the length integrity LP is calculated for each sequence to be detected, and the sequence is deleted from the amino acid sequence data set if LP < 0.8, where the length integrity is calculated as:

$$LP = \frac{L}{L_0}$$

where $L_0$ is the total length of the reference strain sequence.

(1) Performing multiple sequence alignment of the sequences in the data set to be tested DataBase_1 against the amino acid sequence of the reference strain using a multiple sequence alignment tool, and calibrating the position information of each amino acid site in the sequences to be tested;

$$MSet[Seq_m] = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}_{l_n}\}$$

wherein $\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}$ represents an amino acid mutation, $l_1$ is the position of the amino acid mutation site, $\mathrm{Amino}_{l_1}$ is the amino acid code at position $l_1$ of the reference strain sequence, $\mathrm{Mutation}_{l_1}$ is the amino acid code at position $l_1$ of the sequence of the strain to be tested, MSet is the mutation information dictionary recording the mutation information sets of all sequences, and $MSet[Seq_m]$ is the mutation information set of sequence $Seq_m$;

if $SP_{l_1} > 0.001$, the amino acid mutation $\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}$ is identified as a high-frequency amino acid mutation;

(2) Evaluating the amino acid mutations in the amino acid mutation frequency table of the statistical module one by one according to the method in step (1), recording all high-frequency amino acid mutations that are screened out, and establishing the high-frequency amino acid mutation set HFMS:

$$HFMS = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_m}\_l_m\_\mathrm{Mutation}_{l_m}\}$$

the method for screening the key mutation set of the core mutation sites in the screening module comprises the following steps:

wherein $AH^{l_n}$ represents the amino acid frequency set of the $l_n$-th amino acid position, and N is the total number of sequences contained in the data set to be tested DataBase_1;

if $H^{l_n} > 0.1$, the amino acid position $l_n$ is recorded as a core mutation position, and the mutation with the highest mutation frequency at the $l_n$-th amino acid site, $\mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{max}_{l_n}$, is selected as the key mutation at that site;

(3) Establishing a key mutation set KLMS of a core mutation site,

the method by which the identification module uses the core mutation site key mutation set to identify the gene sequences of the data set to be tested and obtain the virus variant mutation characteristic set comprises the following steps:

KMS＝HFMS∪KLMS；

(2) Identifying core mutation combinations in the mutation information dictionary MSet: the core mutation set KMS is intersected with the mutation information set of each sequence in the mutation information dictionary MSet to obtain the core mutation combination of each sequence. The calculation formula is as follows:

$$KMSet[Seq_m] = KMS \cap MSet[Seq_m]$$

If the number of occurrences of a core mutation combination in KMSet is more than one thousandth of the total number of sequences N, that core mutation combination is identified as a high-frequency core mutation combination, and a virus strain carrying it is a variant strain $MC_m$. All identified variant strains are stored in the novel coronavirus variant strain list MCL;

the method for analyzing the propagation trend and identifying the popular variant strains in the analysis module comprises the following steps:

(1) Counting the proportion of each variant strain during virus propagation according to the spatio-temporal annotation information of the sequences: for each variant strain MC ∈ MCL and each month n, the proportion is the number of sequences of variant strain MC in the n-th month of the sequence set divided by $AN_n$, the number of all sequences collected in the n-th month of the sequence set;

$$Y = \beta_n X + \varepsilon$$

wherein X is the time variable matrix, Y is the propagation proportion matrix of the variant strain MC, $\beta_n$ is the slope of the fitted propagation curve of the variant strain MC in the n-th month, and $\varepsilon$ is the intercept of the fitted curve;

(3) Completing the identification of novel coronavirus epidemic variant strains by applying a threshold to the slope $\beta_n$ of the fitted propagation curve from step (2). The calculation formula is as follows:

$$T = 3 \times \beta_n \times M$$

wherein $\beta_n$ is the slope of the fitted propagation curve of the variant strain MC in the n-th month, M is the total number of variant strains in the variant strain list MCL, and T is the propagation rising index of the epidemic trend of the variant strain MC in the n-th month; if T ≥ 1, the variant strain MC is judged to be an epidemic variant strain in the n-th month;

(4) Counting the epidemic variant strains identified in step (3), and marking those that have not appeared before as new epidemic variant strains, thereby realizing automatic detection of new epidemic variant strains of the novel coronavirus.
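As a worked reading of the threshold in step (3), the condition T ≥ 1 can be rearranged into a minimum slope. The numbers below assume M = 16, the number of variant strains identified in Example 2, and assume that the regression uses months as the time unit and proportions between 0 and 1; neither assumption is stated alongside the formula itself.

```latex
T = 3\,\beta_n M \ge 1
\quad\Longleftrightarrow\quad
\beta_n \ge \frac{1}{3M}

% Assumed M = 16 (variant strains identified in Example 2):
\beta_n \ge \frac{1}{48} \approx 0.0208

% Slope implied by the reported index T = 3.12 under the same assumption:
\beta_n = \frac{T}{3M} = \frac{3.12}{48} = 0.065
```

Under these assumptions a variant strain would be flagged once its share of monthly sequences grows by roughly 2.1 percentage points per month, and the reported T = 3.12 would correspond to growth of about 6.5 percentage points per month.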

Example 2 Identification of new epidemic variant strains of the novel coronavirus based on the distribution trend of mutation combinations in the S protein of the novel coronavirus

Using the device described in Example 1, new epidemic variant strains of the novel coronavirus are identified from the distribution trend of mutation combinations in the S protein of the novel coronavirus, as shown in FIG. 1. First, quality control and length screening are performed on the amino acid sequence set of the novel coronavirus genes to be tested. Mutation site statistics are then computed on the screened data set, and the core mutations present in the virus are analyzed by counting mutation frequencies and calculating Shannon entropy. Next, high-frequency core mutation combinations are screened out according to the distribution of the core mutations in the sequences to be tested, and each variant strain is identified and named accordingly. The propagation trend of each variant strain is then analyzed by linear regression, and finally new epidemic variant strains of the novel coronavirus are detected automatically by screening on the rising slope of the propagation curve. The specific steps are as follows:

1. Preparation of novel coronavirus Spike protein genome samples

The preparation of the genome sequence sample aiming at the novel coronavirus Spike protein comprises the following steps:

1. Construction of the novel coronavirus Spike protein genome database

Amino acid sequences of the novel coronavirus Spike protein from all complete, full-length genomes were obtained from public databases; as of September 1, 2021, about 3.2 million novel coronavirus Spike protein sequences had been collected.

2. Sequence standardization naming and reference strain sequence extraction

The amino acid sequences in the novel coronavirus S protein genome database obtained in step 1 are renamed one by one in a standardized way according to the sequence naming rule.

The sequence naming rule is:

>Gene name|sequence name|time information|sequence number|geographic information

According to the reference strain recommended in the database, the Spike sequence of hCoV-19/Wuhan/Hu-1/2019 (sequence number EPI_ISL_402125, China) is extracted and used as the reference strain sequence of the novel coronavirus Spike protein genome database.

2. Quality control and length screening of novel coronavirus Spike protein genome samples

Quality control and length screening against the novel coronavirus Spike protein genome database included the following steps:

1. Calculating the abnormal character ratio of each sequence in the database

The calculation formula is as follows:

$$P = \frac{\mu}{L}$$

where $\mu$ is the number of abnormal characters 'X' in the sequence and $L$ is the total length of the sequence. If P > 0.1, the sequence is deleted.

2. Calculating the integrity of the genome sequence length

The total length of the sequence is denoted $L$ and the total length of the reference strain sequence is denoted $L_0$. The calculation formula is as follows:

$$LP = \frac{L}{L_0}$$

wherein the total length of the novel coronavirus Spike protein reference sequence is $L_0 = 1274$. If LP < 0.8, the sequence is deleted from the genome sequence data set. After screening, about 2.46 million sequences remain in the novel coronavirus S protein genome database; these form the novel coronavirus Spike protein data set to be tested, DataBase_1.
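The two filters above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, sequences are assumed to be plain amino acid strings, and an empty sequence is assumed to fail quality control.

```python
REF_LEN = 1274  # L0, the reference Spike length stated in the text


def abnormal_ratio(seq: str) -> float:
    """P = (number of 'X' characters) / (sequence length)."""
    if not seq:
        return 1.0  # assumption: treat an empty sequence as fully abnormal
    return seq.count("X") / len(seq)


def length_integrity(seq: str, ref_len: int = REF_LEN) -> float:
    """LP = (sequence length) / (reference length)."""
    return len(seq) / ref_len


def passes_qc(seq: str, ref_len: int = REF_LEN,
              max_abnormal: float = 0.1, min_integrity: float = 0.8) -> bool:
    """Keep a sequence unless P > 0.1 or LP < 0.8, as described in the text."""
    if abnormal_ratio(seq) > max_abnormal:
        return False
    return length_integrity(seq, ref_len) >= min_integrity
```

A data set such as DataBase_1 would then be the subset of input sequences for which `passes_qc` returns `True`.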

3. Statistics of mutation sites of novel coronavirus Spike protein genomes

Mutation site statistics is carried out on the data set DataBase _1 to be detected of the novel coronavirus Spike protein genome.

The method comprises the following specific steps:

1. The sequences in the data set to be tested DataBase_1 are aligned against the reference strain sequence using a multiple sequence alignment tool, and the position information of each amino acid site in the sequences to be tested is calibrated.

2. According to the multiple sequence alignment result, the amino acid at each site of the sequence to be tested is compared with the amino acid at the corresponding site of the reference sequence. The amino acid mutations in the sequence to be tested are recorded and a mutation information set is generated.

3. The frequencies of the amino acid mutations in the mutation information sets are counted, and a Spike protein amino acid mutation frequency table is established.
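Steps 2 and 3 can be sketched as below, assuming each sequence has already been aligned pairwise against the reference so that both strings have equal length and use '-' for gaps. The function names are hypothetical, and two choices are assumptions not stated in the text: insertions relative to the reference are skipped, and the abnormal character 'X' is not counted as a mutation.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def mutation_set(ref_aln: str, qry_aln: str) -> Set[str]:
    """Return mutations named like 'D614G' or 'H69-' for one aligned sequence."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    mutations = set()
    pos = 0  # 1-based position in the ungapped reference
    for ref_aa, qry_aa in zip(ref_aln, qry_aln):
        if ref_aa == "-":
            continue  # assumption: insertions relative to the reference are ignored
        pos += 1
        if qry_aa != ref_aa and qry_aa != "X":  # assumption: 'X' is not a mutation
            mutations.add(f"{ref_aa}{pos}{qry_aa}")
    return mutations


def mutation_table(ref_aln: str, aligned: Dict[str, str]
                   ) -> Tuple[Dict[str, Set[str]], Counter]:
    """Build MSet (name -> mutation set) and the mutation count table."""
    mset = {name: mutation_set(ref_aln, seq) for name, seq in aligned.items()}
    counts = Counter(m for muts in mset.values() for m in muts)
    return mset, counts
```

Here the returned dictionary plays the role of MSet, and the counter holds the number of sequences carrying each mutation.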

4. High-frequency amino acid mutation screening of novel coronavirus Spike protein

Analyzing the Spike protein amino acid mutation frequency table, and screening the amino acid mutation with high frequency. The method comprises the following specific steps:

1. The number of occurrences of each amino acid mutation in the Spike protein amino acid mutation frequency table is denoted $S_{l_1}$, and the total number of sequences in the data set to be tested DataBase_1 is N (about 2.46 million). The calculation formula is as follows:

$$SP_{l_1} = \frac{S_{l_1}}{N}$$

All amino acid mutations with $SP_{l_1} > 0.001$ are screened out as high-frequency amino acid mutations.

2. All high-frequency amino acid mutations from step 1 are recorded to establish the high-frequency amino acid mutation set HFMS, which contains 151 high-frequency amino acid mutations. The experimental result is shown in FIG. 2, a statistical graph of the amino acid mutation frequencies of the novel coronavirus Spike protein, in which the abscissa is the name of the amino acid mutation and the ordinate is the amino acid mutation frequency. The 151 high-frequency amino acid mutations are L5F, S12F, S13I, L18F, T19R, T20N, T20I, R21I, T22I, P26S, A27S, T29I, H49Y, Q52R, L54F, A67V, I68-, I68V, H69-, V70-, V70I, G75V, G75R, D80A, D80G, D80Y, S94F, T95I, S98F, D138Y, D138H, G142D, V143-, V143F, Y144-, Y144V, Y145-, W152R, W152L, W152C, M153I, M153T, E154K, E154-, S155-, E156G, E156-, F157-, F157L, F157S, R158-, R158G, L176F, D178H, G181V, L189F, R190S, D215H, D215G, L216F, S221L, A222V, L241-, L242-, A243-, D253G, S254F, S255F, S256L, W258L, A262S, P272L, V367F, K417T, K417N, N439K, N440K, L452R, S477N, T478K, E484K, E484Q, F490S, S494P, N501T, N501Y, A520S, A522S, K558N, A570D, T572I, E583D, Q613H, D614G, V622F, S640F, A653V, H655Y, Q675R, Q675H, Q677P, Q677H, N679K, P681H, P681R, A688V, A701V, S704L, T716I, T732A, G769V, V772I, E780Q, T791I, D796H, P809S, P812L, A845S, T859I, T859N, F888L, A899S, D936Y, S939F, D950N, D950H, Q957R, S982A, A1020S, T1027I, Q1071L, Q1071H, K1073N, A1078S, V1104L, D1118H, G1124V, P1162S, D1163Y, G1167V, V1176F, K1191N, G1219C, G1219V, V1228L, M1229I, M1237I, S1252F, P1263L, V1264L. Here the number indicates the position in the sequence, the letter before the number is the original amino acid, the letter after the number is the mutated amino acid, and "-" indicates an amino acid deletion.
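The screening in steps 1 and 2 amounts to a single threshold on relative frequency. A minimal sketch, assuming the mutation counts are held in a dictionary (the function name and the example counts are hypothetical):

```python
from typing import Dict, Set


def high_frequency_mutations(counts: Dict[str, int], n_sequences: int,
                             threshold: float = 0.001) -> Set[str]:
    """Return the mutations whose share of sequences exceeds the threshold."""
    return {mutation for mutation, count in counts.items()
            if count / n_sequences > threshold}


# Invented counts, for illustration only.
example_counts = {"D614G": 1800, "L5F": 4, "A27S": 1}
example_hfms = high_frequency_mutations(example_counts, n_sequences=2000)
```

With about 2.46 million sequences, the 0.001 threshold corresponds to a mutation being observed in more than roughly 2,460 sequences.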

5. Screening of core mutation sites of novel coronavirus Spike protein

Analyzing the Spike protein amino acid mutation frequency table, and screening the core mutation sites in the Spike protein amino acid sequence. The method comprises the following specific steps:

1. The frequency of each amino acid at each amino acid position is calculated: the frequency of $\mathrm{Amino}_m$ at the $l_n$-th amino acid position is the number of sequences carrying $\mathrm{Amino}_m$ at that position divided by the total number of sequences N.

Here $\mathrm{Amino}_m$ is to be read as a whole and denotes one of the different amino acid symbols appearing at position $l_n$; the subscript m has no meaning of its own and only distinguishes the different amino acid symbols.

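2. The Shannon entropy of each amino acid position is calculated. The calculation formula is as follows:

$$H^{l_n} = -\sum_{p \in AH^{l_n}} p \log p$$

where $AH^{l_n}$ is the set of amino acid frequencies at the $l_n$-th position. If $H^{l_n} > 0.1$, the amino acid position $l_n$ is recorded as a core mutation position, and the mutation with the highest mutation frequency at the $l_n$-th amino acid site, $\mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{max}_{l_n}$, is selected as the key mutation at this site.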

3. The key mutations of the core mutation sites screened in step 2 are recorded to establish the set KLMS, which contains 33 key mutations. The experimental result is shown in FIG. 3, a Shannon entropy statistical chart of the amino acid sites of the novel coronavirus Spike protein, in which the abscissa is the amino acid mutation site and the ordinate is the Shannon entropy value. The 33 key mutations are L5F, L18F, T19R, T20N, P26S, I68-, H69-, V70-, T95I, D138Y, G142D, Y144-, W152C, E156-, F157-, R158G, R190S, A222V, K417T, L452R, S477N, T478K, E484K, N501Y, A570D, H655Y, P681H, T716I, D950N, S982A, T1027I, D1118H, V6F. Here the number indicates the position in the sequence, the letter before the number is the original amino acid, the letter after the number is the mutated amino acid, and "-" indicates an amino acid deletion.
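The per-site entropy screen can be sketched as follows. The text does not state the logarithm base, so the natural logarithm is an assumption here, as is the simplification that all sequences are already aligned to the reference with no gaps in the reference; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Set


def site_entropy(column: Sequence[str]) -> float:
    """Shannon entropy of the amino acid frequencies in one alignment column."""
    n = len(column)
    counts = Counter(column)
    # Assumption: natural logarithm; the source does not give the base.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def key_mutations(ref: str, seqs: Sequence[str], threshold: float = 0.1) -> Set[str]:
    """For each site with entropy above the threshold, keep its most frequent mutation."""
    klms = set()
    for i, ref_aa in enumerate(ref):
        column = [seq[i] for seq in seqs]
        if site_entropy(column) > threshold:
            # Entropy > 0 implies at least two symbols, so one differs from the reference.
            mutants = Counter(aa for aa in column if aa != ref_aa)
            top_aa, _ = mutants.most_common(1)[0]
            klms.add(f"{ref_aa}{i + 1}{top_aa}")
    return klms
```

A site where nearly every sequence carries the same residue, mutated or not, has entropy close to zero, so the entropy screen selects sites where the population is genuinely split.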

6. Statistics and screening of novel coronavirus Spike protein core mutation combination

Comprehensively analyzing the amino acid mutation characteristics in the Spike protein genome, and screening the core mutation combination in the Spike protein by counting the occurrence frequency of different amino acid mutation combinations. The method comprises the following specific steps:

1. The high-frequency amino acid mutation set HFMS and the key mutation set KLMS obtained in sections 4 and 5 above are merged to establish the core mutation set KMS, which contains 151 core mutations.

2. The mutations in each amino acid sequence are screened by intersecting the mutation set of the sequence with the core mutation set KMS, and a core mutation combination information dictionary KMSet is established. KMSet records the combination of core mutations contained in each sequence.

3. The core mutation combination information dictionary KMSet is counted, and the core mutation combinations with high frequency are selected. The frequency of a core mutation combination is its number of occurrences in KMSet divided by N, the total number of sequences. If this frequency is greater than 0.001, the core mutation combination is identified as a high-frequency core mutation combination.

4. The high-frequency core mutation combinations are recorded and a high-frequency core mutation combination information table is established. Each high-frequency core mutation combination is regarded as the gene mutation characteristic of one variant strain, and sequences containing the same high-frequency core mutation combination are regarded as the same variant strain. The high-frequency core mutation combination information table contains 16 high-frequency core mutation combinations. The experimental result is shown in FIG. 4, a frequency statistical chart of the high-frequency core mutation combinations, in which the abscissa is the mutation combination frequency and the ordinate is the name of the high-frequency core mutation combination. The 16 high-frequency core mutation combinations are { H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H }; { L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R }; { H69-, A570D, S982A, N501Y, T716I, D614G, D1118H, V70-, P681H }; { L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R }; { D253G, T95I, Q957R, L5F, D614G, S477N }; { A262S, P272L, A222V, D614G }; { G769V, W152L, D614G, E484K }; { H69-, A570D, S982A, N501Y, L5F, T716I, D614G, D1118H, Y144-, V70-, P681H }; { H69-, A570D, S982A, N501Y, K1191N, T716I, D614G, D1118H, Y144-, V70-, P681H }; { D253G, A701V, T95I, L5F, D614G, E484K }; { D614G, T732A, T478K, P681H }; { L452R, T95I, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R }; { L452R, S13I, W152C, D614G }; { D138Y, P26S, T20N, T1027I, K417T, R190S, N501Y, L18F, D614G, V1176F, H655Y, E484K }; { A222V, L18F, D614G }; { H69-, A570D, S982A, N501Y, T716I, D614G, D1118H, Y144-, V70-, P681H }. The experimental result shows that the device identifies 16 virus variant strains by calculation, and each variant strain is named by its corresponding high-frequency core mutation combination.
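The intersection and counting in steps 2 to 4 can be sketched as below. The function names are hypothetical, and skipping sequences whose core mutation combination is empty is an assumption that the text does not address.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set


def core_combinations(mset: Dict[str, Set[str]], kms: Set[str]
                      ) -> Dict[str, FrozenSet[str]]:
    """KMSet: for each sequence, the core mutations it carries (MSet[seq] & KMS)."""
    return {name: frozenset(muts & kms) for name, muts in mset.items()}


def high_frequency_combinations(kmset: Dict[str, FrozenSet[str]],
                                threshold: float = 0.001
                                ) -> Dict[FrozenSet[str], int]:
    """Return combinations carried by more than `threshold` of all sequences."""
    n = len(kmset)
    counts = Counter(kmset.values())
    return {combo: count for combo, count in counts.items()
            # Assumption: an empty combination is not treated as a variant strain.
            if combo and count / n > threshold}
```

Using `frozenset` makes each combination hashable, so sequences with the same set of core mutations are counted together regardless of the order in which the mutations were recorded.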

7. High-frequency core mutation combination distribution trend analysis and recognition early warning of new popular variant strains

The distribution trend of the high-frequency core mutation combinations obtained in section 6 is analyzed, and the identification of new epidemic variant strains is completed in combination with propagation curve fitting. The specific steps are as follows:

1. According to the spatio-temporal annotation information of the sequences, the monthly proportion of each variant strain during virus propagation is counted, and a distribution trend graph of the high-frequency core mutation combinations is drawn using the Matplotlib module of Python, as shown in FIG. 5. FIG. 5 is a distribution trend analysis graph of the core mutation combinations, in which the abscissa is the date and the ordinate is the frequency of occurrence of the mutation combination in that month. FIG. 5 shows how the monthly propagation proportion of each variant strain (high-frequency core mutation combination) changes over time. A variant strain whose propagation proportion gradually increases is at risk of becoming an epidemic variant strain, and this possibility can be predicted by fitting the propagation curve.

2. The propagation curve of the monthly proportion of each variant strain is fitted using the linear regression tool of the Python module scikit-learn (sklearn).

3. The slope of the fitted propagation curve of each variant strain is calculated, and variant strains exceeding the defined threshold are regarded as epidemic variant strains in that month. The threshold index is calculated as follows:

$$T = 3 \times \beta_n \times M$$

wherein $\beta_n$ is the slope of the fitted propagation curve of the variant strain in the n-th month, and M is the total number of all variant strains. If T ≥ 1, the variant strain is judged to be an epidemic variant strain in the n-th month. The results are shown in FIG. 6, a regression analysis chart of the identification and propagation curves of the new epidemic variant strains, in which the abscissa is the date, the ordinate is the propagation proportion of the variant strain, and each dotted straight line is the regression line for the month in which the corresponding variant strain reaches its maximum propagation slope. The circular marker is the propagation slope of variant strain { H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H } in January 2021, with T = 3.12; the device judges this variant strain to be an epidemic variant strain in January 2021. The triangular marker is the propagation slope of variant strain { L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R } in June 2021, with T = 2.16; the device judges this variant strain to be an epidemic variant strain in June 2021. The square marker is the propagation slope of variant strain { L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R } in June 2021, with T = 1.21; the device judges this variant strain to be an epidemic variant strain in June 2021.

4. The epidemic variant strains identified for each month are consolidated and recorded, and those that have not appeared before are marked as new epidemic variant strains. This completes the automatic detection and early warning of new epidemic variant strains of the novel coronavirus.
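Steps 1 to 4 can be sketched end to end as follows. The text names scikit-learn's linear regression; plain least squares is swapped in here so the sketch needs only the standard library. The function names are hypothetical, and the choice of fitting each month's slope over a sliding window of the three most recent months is an assumption, since the text does not say which months enter each regression.

```python
from typing import Dict, List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def monthly_share(variant_counts: Sequence[int], totals: Sequence[int]) -> List[float]:
    """Share of one variant strain among all sequences collected in each month."""
    return [c / t if t else 0.0 for c, t in zip(variant_counts, totals)]


def rise_index(shares: Sequence[float], n_variants: int) -> float:
    """T = 3 * slope * M, with months indexed 0, 1, 2, ... as the time variable."""
    slope, _ = fit_line(range(len(shares)), shares)
    return 3 * slope * n_variants


def detect_new_epidemic(shares_by_variant: Dict[str, List[float]],
                        window: int = 3) -> List[Tuple[int, str]]:
    """Return (month index, variant) the first time each variant reaches T >= 1."""
    if not shares_by_variant:
        return []
    n_variants = len(shares_by_variant)
    n_months = len(next(iter(shares_by_variant.values())))
    seen = set()
    alerts = []
    for end in range(window, n_months + 1):
        for variant, shares in shares_by_variant.items():
            # Assumption: regress over the `window` months ending at month end - 1.
            if rise_index(shares[end - window:end], n_variants) >= 1 and variant not in seen:
                seen.add(variant)
                alerts.append((end - 1, variant))
    return alerts
```

The `seen` set implements step 4: a variant strain is reported only in the first month in which it crosses the threshold, which is what distinguishes a new epidemic variant strain from one already flagged.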

8. Experimental results and verification

1. Through the above experimental steps, the method automatically identified 3 new epidemic variant strains, as follows:

(1) New epidemic variant strain of the novel coronavirus in January 2021, with Spike protein amino acid mutation characteristics { H69-, A570D, S982A, N501Y, W152R, T716I, D614G, D1118H, Y144-, V70-, P681H };

(2) New epidemic variant strain of the novel coronavirus in June 2021, with Spike protein amino acid mutation characteristics { L452R, T19R, F157-, E156-, R158G, D950N, D614G, G142D, T478K, P681R };

(3) New epidemic variant strain of the novel coronavirus in June 2021, with Spike protein amino acid mutation characteristics { L452R, T95I, T19R, F157-, E156G, R158-, D950N, D614G, G142D, T478K, P681R }.

2. According to the information on the novel coronavirus published on the WHO official website and the GISAID data website, the new epidemic variant strain (1) identified by the method matches the genetic variation characteristics of the novel coronavirus Alpha variant strain, and the new epidemic variant strains (2) and (3) match the genetic variation characteristics of the novel coronavirus Delta variant strain. In conclusion, the method can accurately and effectively identify new epidemic variant strains of the novel coronavirus and give early warning of whether they will become epidemic on a large scale.

The present invention has been described in detail above. It will be apparent to those skilled in the art that the invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with reference to specific embodiments, it will be appreciated that the invention can be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. The use of some of the essential features is made possible within the scope of the claims attached below.

Claims

1. Apparatus for detecting and analyzing a new epidemic variant strain of a virus, comprising parameter acquisition equipment and a computer readable carrier; the parameter acquisition equipment is used for continuously acquiring, over a period of time, amino acid sequence data of the virus to be detected in a fixed area;

the readable carrier comprises:

the data storage module is used for storing the collected amino acid sequence data of the virus to be detected and the amino acid sequence of the reference strain;

the quality control module is used for performing quality control and length screening on the genes to be tested stored in the data storage module, obtaining a data set to be tested after screening and transmitting the data set to the statistical module;

the statistic module is used for receiving the data set to be tested transmitted by the quality control module, combining the genes to be tested in the data set to be tested with the reference genes, and counting mutation site information and mutation frequency to obtain an amino acid mutation frequency table;

the screening module receives the amino acid mutation frequency table transmitted by the counting module and screens the amino acid mutation frequency table to obtain a core mutation site key mutation set and a high-frequency amino acid mutation set;

the identification module is used for receiving the key mutation set of the core mutation site transmitted by the screening module to identify the gene sequence of the data set to be detected, so as to obtain a virus variant mutation feature set;

and the analysis module receives the virus variant mutation characteristic set obtained by the identification module, performs propagation trend analysis on it, and identifies epidemic variant strains to obtain the epidemic variant strains.

2. The device of claim 1, wherein the quality control module performs quality control on the genes to be tested stored in the data storage module by: calculating the proportion of abnormal characters P in each sequence to be tested, and deleting the sequence from the amino acid sequence data set if P > 0.1, wherein the proportion of abnormal characters is calculated as follows:

$$P = \frac{\mu}{L}$$

where $\mu$ is the number of abnormal characters 'X' in the sequence and $L$ is the total length of the sequence;

and the quality control module performs length screening on the genes to be tested stored in the data storage module by: calculating the length integrity LP of each sequence to be tested, and deleting the sequence from the amino acid sequence data set if LP < 0.8, wherein the length integrity is calculated as follows:

$$LP = \frac{L}{L_0}$$

where $L_0$ is the total length of the reference strain sequence.

3. The apparatus of claim 1, wherein the table of the frequency of the amino acid mutations in the statistical module is obtained by:

(2) According to the result of the multiple sequence alignment, the amino acid sequence to be tested is compared with the reference strain sequence position by position, the amino acid mutations in the sequence to be tested are recorded, and the mutation information set of the amino acid sequence to be tested is generated;

$$MSet[Seq_m] = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}_{l_n}\}$$

wherein $\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}$ represents an amino acid mutation, $l_1$ is the position of the amino acid mutation site, $\mathrm{Amino}_{l_1}$ is the amino acid code at position $l_1$ of the reference strain sequence, $\mathrm{Mutation}_{l_1}$ is the amino acid code at position $l_1$ of the sequence of the strain to be tested, MSet is the mutation information dictionary recording the mutation information sets of all sequences, and $MSet[Seq_m]$ is the mutation information set of sequence $Seq_m$;

4. The device of claim 1, wherein the screening module screens the high-frequency amino acid mutation set by:

$$HFMS = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}_{l_1}, \ldots, \mathrm{Amino}_{l_m}\_l_m\_\mathrm{Mutation}_{l_m}\}.$$

5. The apparatus according to claim 1, wherein the screening module screens the key mutation set of the core mutation sites by:

if $H^{l_n} > 0.1$, recording the amino acid position $l_n$ as a core mutation position, and selecting the mutation with the highest mutation frequency at the $l_n$-th amino acid site, $\mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{max}_{l_n}$, as the key mutation at this site;

$$KLMS = \{\mathrm{Amino}_{l_1}\_l_1\_\mathrm{Mutation}^{max}_{l_1}, \ldots, \mathrm{Amino}_{l_n}\_l_n\_\mathrm{Mutation}^{max}_{l_n}\}.$$

6. The apparatus of claim 1, wherein each virus data record in the data storage module comprises sampling time, sampling location and sequence information.

7. The device according to claim 6, wherein the identification module is used for identifying the gene sequences of the data set to be tested by combining the core mutation site key mutation set and the high-frequency amino acid mutation set, and the method for obtaining the virus variant mutation feature set comprises the steps of identifying the core mutation combination in the mutation information dictionary MSet and performing propagation trend analysis and epidemic variant identification in the analysis module.

8. The apparatus of claim 7, wherein the means for identifying core mutation combinations in the mutation information dictionary MSet comprises:

KMS＝HFMS∪KLMS；

wherein $KMSet[Seq_m]$ is the combination of core mutations contained in sequence $Seq_m$;

the occurrences of each core mutation combination in the core mutation combination dictionary are counted; when the number of occurrences of a core mutation combination is more than one thousandth of the total number N of sequences contained in the data set to be tested, the corresponding core mutation combination is a high-frequency core mutation combination, and a virus strain comprising that core mutation combination is referred to as a variant strain $MC_m$ and recorded in the virus variant list MCL.

9. The apparatus of claim 7, wherein the analysis module performs a propagation trend analysis and a popular variant identification by:

(1) Counting the proportion of each variant strain during virus propagation according to the sampling time information of the sequences: for each variant strain MC ∈ MCL and each month n, the proportion is the number of sequences of variant strain MC in the n-th month divided by the number of all sequences collected in that month;

$$Y = \beta_n X + \varepsilon$$

wherein X is the time variable matrix, Y is the propagation proportion matrix of the variant strain MC, $\beta_n$ is the slope of the fitted propagation curve of the variant strain MC in the n-th month, and $\varepsilon$ is the intercept of the fitted curve;

$$T = 3 \times \beta_n \times M$$